Llama-3.2-1B-Instruct (4-bit Quantized)

This repository contains a 4-bit quantized version of the Llama-3.2-1B-Instruct model. It was quantized with bitsandbytes NF4 for very low VRAM consumption and fast inference, making it well suited to edge devices, low-resource systems, and fast evaluation pipelines (e.g., interview Thinker modules).
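
The exact quantization settings are not recorded in this card; the sketch below shows a typical bitsandbytes NF4 setup, assuming float16 compute and double quantization (both unconfirmed for this checkpoint).

import torch
from transformers import BitsAndBytesConfig

# Assumed NF4 settings; compute dtype and double quantization are
# illustrative defaults, not values confirmed by this repository.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)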


Model Features

  • Base model: Llama-3.2-1B-Instruct
  • Quantization: 4-bit (NF4) using bitsandbytes
  • VRAM requirement: ~1.0 GB
  • Perfect for:
    • Lightweight chatbots
    • Reasoning/evaluation agents
    • Interview Thinker modules
    • Local inference on small GPUs
    • Low-latency systems
  • Compatible with:
    • LoRA fine-tuning (see the sketch after this list)
    • HuggingFace Transformers
    • Text-generation inference engines
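
As an illustration of the LoRA compatibility noted above, the sketch below attaches LoRA adapters to the 4-bit model with the PEFT library; the rank, alpha, dropout, and target modules are illustrative choices, not settings shipped with this repository.

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# `model` is the 4-bit model loaded as shown in "How To Load This Model" below.
model = prepare_model_for_kbit_training(model)

# Illustrative LoRA hyperparameters; adjust for your own task.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()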

Files Included

  • config.json
  • generation_config.json
  • model.safetensors (4-bit quantized weights)
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json
  • chat_template.jinja

These files allow you to load the model directly with load_in_4bit=True.
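
If the checkpoint was saved with its quantization metadata, the settings can be inspected directly from config.json; a minimal sketch:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Shlok307/llama-1b-4bit")
# Prints the bitsandbytes 4-bit settings if they were saved with the model.
print(getattr(config, "quantization_config", None))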


How To Load This Model

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Shlok307/llama-1b-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# load_in_4bit=True requests bitsandbytes 4-bit loading; device_map="auto"
# places the weights on the available device(s). Newer transformers releases
# prefer passing a BitsAndBytesConfig via the quantization_config argument
# instead of the load_in_4bit flag.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto",
)
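
For a quick check, the snippet below formats a chat prompt with the bundled chat template and generates a reply; the prompt text and generation settings are arbitrary examples, not recommendations from this repository.

messages = [{"role": "user", "content": "Explain 4-bit quantization in one sentence."}]

# The tokenizer applies chat_template.jinja to build the prompt.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))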