License: CC BY-NC-SA 4.0. Rights belong to Javad Taghia (taghia.javad@gmail.com).

Tulu Laptop Finetune + W&B

Minimal setup to finetune a laptop-friendly Tulu checkpoint with QLoRA and track runs in Weights & Biases.

Prereqs

  • A recent NVIDIA GPU with CUDA if you want 4-bit (bitsandbytes); set --use_4bit true. On CPU/MPS (the fallback otherwise), set --use_4bit false, but expect much slower and more limited runs. A quick accelerator check is sketched after this list.
  • Conda (Miniconda/Anaconda).
  • A Weights & Biases account + API key.
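
To see which accelerator PyTorch can use before choosing --device and --use_4bit, a quick check (assuming torch is already installed in the env):

python -c "import torch; print('cuda:', torch.cuda.is_available(), 'mps:', torch.backends.mps.is_available())"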

Setup

  1. Create the env (Conda)
conda env create -f environment.yml
conda activate deeai
  2. Add secrets (keep .env out of git; an example layout is shown at the end of this section)
cp .env.example .env
# Edit .env with your WANDB_API_KEY / project / entity
# Optionally set BASE_MODEL_CACHE to choose where HF downloads models
  3. Verify packages (optional; use pip if you prefer it over Conda)
pip install -r requirements.txt
  • If you see LlamaTokenizer requires the SentencePiece library, install it in the env:
pip install sentencepiece
  • If you get a torch.load vulnerability error, either upgrade torch (>=2.6 when available for your platform) or ensure safetensors is installed; this repo prefers safetensors by default:
pip install safetensors
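
For reference, a minimal .env might look like this (placeholder values; follow the keys in .env.example):

WANDB_API_KEY=xxxxxxxxxxxxxxxx
WANDB_PROJECT=your-project
WANDB_ENTITY=your-entity
BASE_MODEL_CACHE=/path/to/downloaded_base_models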

Run a quick finetune

The defaults use allenai/tulu-2-7b with a small instruction dataset (mlabonne/guanaco-llama2-1k) and 4-bit QLoRA. This keeps memory requirements within reach of laptop GPUs.

python train_tulu.py \
  --output_dir outputs/tulu-lora \
  --offload_folder offload \
  --device cpu \
  --max_seq_length 512 \
  --per_device_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --no-use_4bit \
  --instruction_field instruction \
  --input_field input \
  --output_field output

Key flags:

  • --no-use_4bit disables 4-bit loading when bitsandbytes/CUDA are unavailable; keep 4-bit off on Mac (CPU/MPS only). A QLoRA sketch follows this list.
  • --dataset_name to try another instruction set (any HF dataset with instruction/input/output fields).
  • --model_name if you want a different Tulu variant (e.g., allenai/tulu-2-dpo-7b) or a smaller model for constrained hardware (e.g., TinyLlama/TinyLlama-1.1B-Chat-v1.0 on Mac MPS).
  • --offload_folder sets where to offload weights when device_map="auto" (ensure it has space). Default offload/ lives in this repo so it stays alongside the project.
  • --instruction_field/--input_field/--output_field let you match custom dataset column names; defaults assume instruction/input/output. For text-only datasets, set --instruction_field text --output_field text.
  • --device can force cpu, mps, cuda, or auto (default). Use --device mps with a smaller fp16 model (e.g., TinyLlama) to fit memory; offloading is disabled on MPS/CPU.
  • --torch_dtype can force the dtype (float16/float32/bfloat16); on MPS use float16 to avoid unsupported bf16 weights.
  • --cpu_threads limits CPU threads (default 4) when running on CPU so you don’t overload your machine.
  • MPS (Mac) note: mixed precision isn’t supported for bfloat16, so the script falls back to fp32 automatically on MPS. Keep --no-use_4bit on Mac; offloading is disabled on MPS (the model stays on device).
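
For reference, the 4-bit QLoRA path described above roughly corresponds to the sketch below. The LoRA hyperparameters (r=16, alpha=32, target modules) are illustrative assumptions, not necessarily what train_tulu.py uses:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantize the frozen base model to 4-bit NF4 (CUDA + bitsandbytes only).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base = AutoModelForCausalLM.from_pretrained(
    "allenai/tulu-2-7b", quantization_config=bnb_config, device_map="auto"
)
base = prepare_model_for_kbit_training(base)

# Train only small LoRA adapter matrices on top of the frozen base.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()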

How W&B is used

  • train_tulu.py loads .env, logs into W&B, and reports through Trainer(report_to=["wandb"]); a minimal sketch follows this list.
  • Ensure WANDB_API_KEY, WANDB_PROJECT, and (optionally) WANDB_ENTITY are set in .env.
  • Each run captures hyperparameters and metrics; check the W&B UI for live loss curves and checkpoints.
  • Additional summaries are logged: train_duration_seconds, train_examples, estimated_tokens, precision_mode (bf16/fp16/fp32), use_4bit, model_name, dataset_name, per_device_batch_size, gradient_accumulation_steps, and max_seq_length.
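
The W&B wiring works along these lines (a minimal, self-contained sketch; not the exact code in train_tulu.py, which lets Trainer start the run via report_to=["wandb"]):

import os
import wandb
from dotenv import load_dotenv

load_dotenv()  # reads WANDB_API_KEY / WANDB_PROJECT / WANDB_ENTITY from .env
wandb.login(key=os.environ["WANDB_API_KEY"])

# Trainer(report_to=["wandb"]) starts a run like this under the hood and
# streams loss curves; extra summary fields can be attached to the same run.
run = wandb.init(project=os.environ.get("WANDB_PROJECT", "your-project"),
                 entity=os.environ.get("WANDB_ENTITY"))
run.summary["train_examples"] = 1000   # illustrative values
run.summary["precision_mode"] = "fp16"
run.finish()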

Training objective and base model

  • Objective: standard causal LM cross-entropy. The model predicts the next token; cross-entropy measures how much probability mass it assigns to the true token. Minimizing it (maximum likelihood) encourages the model to imitate the target outputs in your instruction data. No rewards/RLHF here—pure supervised finetuning.
  • Base model: a Tulu checkpoint (LLaMA-style architecture) from the Hub (default allenai/tulu-2-7b). We train LoRA adapters on top of the frozen base (optionally 4-bit on CUDA), keeping the adapter small and the base intact.
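
Concretely, for causal LM finetuning the labels are the input ids shifted by one position; a toy sketch of the loss with made-up numbers (not the actual training code):

import torch
import torch.nn.functional as F

# Toy shapes: batch of 1, sequence of 5 tokens, vocab of 10.
logits = torch.randn(1, 5, 10)            # model outputs, one row per position
input_ids = torch.randint(0, 10, (1, 5))  # the target sequence

# Predict token t+1 from positions <= t: drop the last logit, shift labels left.
shift_logits = logits[:, :-1, :].reshape(-1, 10)
shift_labels = input_ids[:, 1:].reshape(-1)

# Cross-entropy = negative log-likelihood of the true next token.
loss = F.cross_entropy(shift_logits, shift_labels)
print(loss)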

Model cache location

  • Base model weights download to the Hugging Face cache. You can point downloads to an external directory by setting BASE_MODEL_CACHE in .env (e.g., /Volumes/JTQ-s/______GITLAB____/downloaded_base_models); the script maps this to HF_HOME/TRANSFORMERS_CACHE before loading models.
  • If BASE_MODEL_CACHE is not set, the default HF cache is used (typically ~/.cache/huggingface/hub).
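
The cache redirection works along these lines (a sketch; the real script may differ in details):

import os
from dotenv import load_dotenv

load_dotenv()
cache = os.getenv("BASE_MODEL_CACHE")
if cache:
    # Point Hugging Face downloads at the external directory before any
    # model loading triggers a download.
    os.environ["HF_HOME"] = cache
    os.environ["TRANSFORMERS_CACHE"] = cache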

Output

  • Finetuned adapters + tokenizer are written to outputs/tulu-lora (configurable via --output_dir).
  • outputs/ is tracked via Git LFS (.gitattributes), so weights can be committed and pushed to the Hub. Run git lfs install once, then git add outputs/... before committing.

Evaluation (inference/compare)

  • Quick smoke test with the saved adapter (edit lora_dir or pass flags):
python evaluation/simple_inference.py \
  --lora_dir outputs/tinyllama-lora \
  --device auto \
  --torch_dtype auto \
  --max_new_tokens 128 \
  --temperature 0.7 \
  --top_p 0.9
  • Compare base vs. LoRA outputs side-by-side:
python evaluation/compare_lora.py \
  --base_model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --lora_dir outputs/tinyllama-lora \
  --prompt "Explain LoRA in one sentence."

For CPU or constrained machines, force CPU + fp32 (and add --offload_dir offload if using device_map=auto):

python evaluation/compare_lora.py \
  --base_model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --lora_dir outputs/tinyllama-lora \
  --prompt "Explain LoRA in one sentence." \
  --device cpu \
  --torch_dtype float32

Optional flags: --max_new_tokens, --temperature, --top_p, --torch_dtype, --device, --offload_dir.
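
Under the hood, loading the adapter for inference looks roughly like this (a sketch, assuming the TinyLlama example above; the evaluation scripts may differ):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
lora_dir = "outputs/tinyllama-lora"

tok = AutoTokenizer.from_pretrained(lora_dir)  # tokenizer is saved with the adapter
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.float32)
model = PeftModel.from_pretrained(base, lora_dir)  # attach the LoRA adapter

inputs = tok("Explain LoRA in one sentence.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128, do_sample=True,
                     temperature=0.7, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))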

Troubleshooting

  • OOM? Reduce max_seq_length, increase gradient_accumulation_steps, or shrink the data: use a tiny instruction set like mlabonne/guanaco-llama2-1k, or subset your own dataset (--dataset_name your/dataset plus a sample cap such as --max_train_samples 500, if the script exposes it). A subsetting sketch follows this list.
  • bitsandbytes import errors on macOS/CPU: run with --use_4bit false or use a Linux+CUDA machine.
  • bitsandbytes install error? We pin to 0.42.0, the latest widely distributed wheel. If you cannot install it (CPU-only/MPS), remove it from requirements.txt and set --use_4bit false.
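
A subsetting sketch (illustrative; whether train_tulu.py exposes a sample cap depends on the script):

from datasets import load_dataset

# Take only the first 500 examples to keep memory and runtime small.
ds = load_dataset("mlabonne/guanaco-llama2-1k", split="train")
small = ds.select(range(min(500, len(ds))))
print(small)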

CUDA setup (Linux/NVIDIA)

pip install --upgrade "torch==2.2.*" "torchvision==0.17.*" "torchaudio==2.2.*" \
  --index-url https://download.pytorch.org/whl/cu121
pip install --upgrade "bitsandbytes>=0.43.1"
pip install --upgrade "transformers>=4.40.0"

Then run a GPU (CUDA) finetune with TinyLlama and 4-bit QLoRA:

python train_tulu.py \
  --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --output_dir outputs/tinyllama-lora \
  --offload_folder offload \
  --device cuda \
  --torch_dtype auto \
  --max_seq_length 512 \
  --per_device_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --num_train_epochs 1 \
  --use_4bit \
  --instruction_field instruction \
  --input_field input \
  --output_field output

CPU-only run

python train_tulu.py \
  --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --output_dir outputs/tinyllama-lora \
  --offload_folder offload \
  --device cpu \
  --torch_dtype float32 \
  --max_seq_length 512 \
  --per_device_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --num_train_epochs 1 \
  --no-use_4bit \
  --instruction_field instruction \
  --input_field input \
  --output_field output
