License: CC BY-NC-SA 4.0. Rights belong to Javad Taghia (taghia.javad@gmail.com).

Tulu Laptop Finetune + W&B

Minimal setup to finetune a laptop-friendly Tulu checkpoint with QLoRA and track runs in Weights & Biases.

Prereqs

  • A recent NVIDIA GPU with CUDA if you want 4-bit (bitsandbytes); set --use_4bit true. On CPU/MPS (the fallback otherwise), set --use_4bit false, but expect much slower and more limited runs. A quick accelerator check is sketched after this list.
  • Conda (Miniconda/Anaconda).
  • A Weights & Biases account + API key.
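
To see which accelerator PyTorch can use before choosing --device and --use_4bit, a quick check (assuming torch is already installed in the env):

python -c "import torch; print('cuda:', torch.cuda.is_available(), 'mps:', torch.backends.mps.is_available())"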

Setup

  1. Create the env (Conda)
conda env create -f environment.yml
conda activate deeai
  2. Add secrets (keep .env out of git; an example layout is shown at the end of this section)
cp .env.example .env
# Edit .env with your WANDB_API_KEY / project / entity
# Optionally set BASE_MODEL_CACHE to choose where HF downloads models
  3. Verify packages (optional; use pip if you prefer it over Conda)
pip install -r requirements.txt
  • If you see LlamaTokenizer requires the SentencePiece library, install it in the env:
pip install sentencepiece
  • If you get a torch.load vulnerability error, either upgrade torch (>=2.6 when available for your platform) or ensure safetensors is installed; this repo prefers safetensors by default:
pip install safetensors
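
For reference, a minimal .env might look like this (placeholder values; follow the keys in .env.example):

WANDB_API_KEY=xxxxxxxxxxxxxxxx
WANDB_PROJECT=your-project
WANDB_ENTITY=your-entity
BASE_MODEL_CACHE=/path/to/downloaded_base_models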

Run a quick finetune

The defaults use allenai/tulu-2-7b with a small instruction dataset (mlabonne/guanaco-llama2-1k) and 4-bit QLoRA. This keeps memory requirements within reach of laptop GPUs.

python train_tulu.py \
  --output_dir outputs/tulu-lora \
  --offload_folder offload \
  --device cpu \
  --max_seq_length 512 \
  --per_device_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --no-use_4bit \
  --instruction_field instruction \
  --input_field input \
  --output_field output

Key flags:

  • --no-use_4bit disables 4-bit loading when bitsandbytes/CUDA are unavailable; keep 4-bit off on Mac (CPU/MPS only). A QLoRA sketch follows this list.
  • --dataset_name to try another instruction set (any HF dataset with instruction/input/output fields).
  • --model_name if you want a different Tulu variant (e.g., allenai/tulu-2-dpo-7b) or a smaller model for constrained hardware (e.g., TinyLlama/TinyLlama-1.1B-Chat-v1.0 on Mac MPS).
  • --offload_folder sets where to offload weights when device_map="auto" (ensure it has space). Default offload/ lives in this repo so it stays alongside the project.
  • --instruction_field/--input_field/--output_field let you match custom dataset column names; defaults assume instruction/input/output. For text-only datasets, set --instruction_field text --output_field text.
  • --device can force cpu, mps, cuda, or auto (default). Use --device mps with a smaller fp16 model (e.g., TinyLlama) to fit memory; offloading is disabled on MPS/CPU.
  • --torch_dtype can force the dtype (float16/float32/bfloat16); on MPS use float16 to avoid unsupported bf16 weights.
  • --cpu_threads limits CPU threads (default 4) when running on CPU so you don’t overload your machine.
  • MPS (Mac) note: mixed precision isn’t supported for bfloat16, so the script falls back to fp32 automatically on MPS. Keep --no-use_4bit on Mac; offloading is disabled on MPS (the model stays on device).
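
For reference, the 4-bit QLoRA path described above roughly corresponds to the sketch below. The LoRA hyperparameters (r=16, alpha=32, target modules) are illustrative assumptions, not necessarily what train_tulu.py uses:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Quantize the frozen base model to 4-bit NF4 (CUDA + bitsandbytes only).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base = AutoModelForCausalLM.from_pretrained(
    "allenai/tulu-2-7b", quantization_config=bnb_config, device_map="auto"
)
base = prepare_model_for_kbit_training(base)

# Train only small LoRA adapter matrices on top of the frozen base.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()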

How W&B is used

  • train_tulu.py loads .env, logs into W&B, and reports through Trainer(report_to=["wandb"]); a minimal sketch follows this list.
  • Ensure WANDB_API_KEY, WANDB_PROJECT, and (optionally) WANDB_ENTITY are set in .env.
  • Each run captures hyperparameters and metrics; check the W&B UI for live loss curves and checkpoints.
  • Additional summaries are logged: train_duration_seconds, train_examples, estimated_tokens, precision_mode (bf16/fp16/fp32), use_4bit, model_name, dataset_name, per_device_batch_size, gradient_accumulation_steps, and max_seq_length.
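
The W&B wiring works along these lines (a minimal, self-contained sketch; not the exact code in train_tulu.py, which lets Trainer start the run via report_to=["wandb"]):

import os
import wandb
from dotenv import load_dotenv

load_dotenv()  # reads WANDB_API_KEY / WANDB_PROJECT / WANDB_ENTITY from .env
wandb.login(key=os.environ["WANDB_API_KEY"])

# Trainer(report_to=["wandb"]) starts a run like this under the hood and
# streams loss curves; extra summary fields can be attached to the same run.
run = wandb.init(project=os.environ.get("WANDB_PROJECT", "your-project"),
                 entity=os.environ.get("WANDB_ENTITY"))
run.summary["train_examples"] = 1000   # illustrative values
run.summary["precision_mode"] = "fp16"
run.finish()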

Training objective and base model

  • Objective: standard causal LM cross-entropy. The model predicts the next token; cross-entropy measures how much probability mass it assigns to the true token. Minimizing it (maximum likelihood) encourages the model to imitate the target outputs in your instruction data. No rewards/RLHF here—pure supervised finetuning.
  • Base model: a Tulu checkpoint (LLaMA-style architecture) from the Hub (default allenai/tulu-2-7b). We train LoRA adapters on top of the frozen base (optionally 4-bit on CUDA), keeping the adapter small and the base intact.
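
Concretely, for causal LM finetuning the labels are the input ids shifted by one position; a toy sketch of the loss with made-up numbers (not the actual training code):

import torch
import torch.nn.functional as F

# Toy shapes: batch of 1, sequence of 5 tokens, vocab of 10.
logits = torch.randn(1, 5, 10)            # model outputs, one row per position
input_ids = torch.randint(0, 10, (1, 5))  # the target sequence

# Predict token t+1 from positions <= t: drop the last logit, shift labels left.
shift_logits = logits[:, :-1, :].reshape(-1, 10)
shift_labels = input_ids[:, 1:].reshape(-1)

# Cross-entropy = negative log-likelihood of the true next token.
loss = F.cross_entropy(shift_logits, shift_labels)
print(loss)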

Model cache location

  • Base model weights download to the Hugging Face cache. You can point downloads to an external directory by setting BASE_MODEL_CACHE in .env (e.g., /Volumes/JTQ-s/______GITLAB____/downloaded_base_models); the script maps this to HF_HOME/TRANSFORMERS_CACHE before loading models.
  • If BASE_MODEL_CACHE is not set, the default HF cache is used (typically ~/.cache/huggingface/hub).
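
The cache redirection works along these lines (a sketch; the real script may differ in details):

import os
from dotenv import load_dotenv

load_dotenv()
cache = os.getenv("BASE_MODEL_CACHE")
if cache:
    # Point Hugging Face downloads at the external directory before any
    # model loading triggers a download.
    os.environ["HF_HOME"] = cache
    os.environ["TRANSFORMERS_CACHE"] = cache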

Output

  • Finetuned adapters + tokenizer are written to outputs/tulu-lora (configurable via --output_dir).
  • outputs/ is tracked via Git LFS (.gitattributes), so weights can be committed and pushed to the Hub. Run git lfs install once, then git add outputs/... before committing.

Evaluation (inference/compare)

  • Quick smoke test with the saved adapter (edit lora_dir or pass flags):
python evaluation/simple_inference.py \
  --lora_dir outputs/tinyllama-lora \
  --device auto \
  --torch_dtype auto \
  --max_new_tokens 128 \
  --temperature 0.7 \
  --top_p 0.9
  • Compare base vs. LoRA outputs side-by-side:
python evaluation/compare_lora.py \
  --base_model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --lora_dir outputs/tinyllama-lora \
  --prompt "Explain LoRA in one sentence."

For CPU or constrained machines, force CPU + fp32 (and add --offload_dir offload if using device_map=auto):

python evaluation/compare_lora.py \
  --base_model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --lora_dir outputs/tinyllama-lora \
  --prompt "Explain LoRA in one sentence." \
  --device cpu \
  --torch_dtype float32

Optional flags: --max_new_tokens, --temperature, --top_p, --torch_dtype, --device, --offload_dir.
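
Under the hood, loading the adapter for inference looks roughly like this (a sketch, assuming the TinyLlama example above; the evaluation scripts may differ):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
lora_dir = "outputs/tinyllama-lora"

tok = AutoTokenizer.from_pretrained(lora_dir)  # tokenizer is saved with the adapter
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.float32)
model = PeftModel.from_pretrained(base, lora_dir)  # attach the LoRA adapter

inputs = tok("Explain LoRA in one sentence.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=128, do_sample=True,
                     temperature=0.7, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))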

Troubleshooting

  • OOM? Reduce max_seq_length, increase gradient_accumulation_steps, or shrink the data: use a tiny instruction set like mlabonne/guanaco-llama2-1k, or subset your own dataset (--dataset_name your/dataset plus a sample cap such as --max_train_samples 500, if the script exposes it). A subsetting sketch follows this list.
  • bitsandbytes import errors on macOS/CPU: run with --use_4bit false or use a Linux+CUDA machine.
  • bitsandbytes install error? We pin to 0.42.0, the latest widely distributed wheel. If you cannot install it (CPU-only/MPS), remove it from requirements.txt and set --use_4bit false.
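
A subsetting sketch (illustrative; whether train_tulu.py exposes a sample cap depends on the script):

from datasets import load_dataset

# Take only the first 500 examples to keep memory and runtime small.
ds = load_dataset("mlabonne/guanaco-llama2-1k", split="train")
small = ds.select(range(min(500, len(ds))))
print(small)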

CUDA setup (Linux/NVIDIA)

pip install --upgrade "torch==2.2.*" "torchvision==0.17.*" "torchaudio==2.2.*" \
  --index-url https://download.pytorch.org/whl/cu121
pip install --upgrade "bitsandbytes>=0.43.1"
pip install --upgrade "transformers>=4.40.0"

Then run a GPU (CUDA) finetune with TinyLlama and 4-bit QLoRA:

python train_tulu.py \
  --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --output_dir outputs/tinyllama-lora \
  --offload_folder offload \
  --device cuda \
  --torch_dtype auto \
  --max_seq_length 512 \
  --per_device_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --num_train_epochs 1 \
  --use_4bit \
  --instruction_field instruction \
  --input_field input \
  --output_field output

CPU-only run

python train_tulu.py \
  --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --output_dir outputs/tinyllama-lora \
  --offload_folder offload \
  --device cpu \
  --torch_dtype float32 \
  --max_seq_length 512 \
  --per_device_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --num_train_epochs 1 \
  --no-use_4bit \
  --instruction_field instruction \
  --input_field input \
  --output_field output
