Llama-3.2-1B-Instruct (4-bit Quantized)

This repository contains a 4-bit quantized version of the Llama-3.2-1B-Instruct model. It was quantized with bitsandbytes NF4 for very low VRAM consumption and fast inference, making it well suited to edge devices, low-resource systems, and fast evaluation pipelines (e.g., interview Thinker modules).
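
The exact quantization settings are not recorded in this card; the sketch below shows a typical bitsandbytes NF4 setup, assuming float16 compute and double quantization (both unconfirmed for this checkpoint).

import torch
from transformers import BitsAndBytesConfig

# Assumed NF4 settings; compute dtype and double quantization are
# illustrative defaults, not values confirmed by this repository.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)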


Model Features

  • Base model: Llama-3.2-1B-Instruct
  • Quantization: 4-bit (NF4) using bitsandbytes
  • VRAM requirement: ~1.0 GB
  • Perfect for:
    • Lightweight chatbots
    • Reasoning/evaluation agents
    • Interview Thinker modules
    • Local inference on small GPUs
    • Low-latency systems
  • Compatible with:
    • LoRA fine-tuning (see the sketch after this list)
    • HuggingFace Transformers
    • Text-generation inference engines
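
As an illustration of the LoRA compatibility noted above, the sketch below attaches LoRA adapters to the 4-bit model with the PEFT library; the rank, alpha, dropout, and target modules are illustrative choices, not settings shipped with this repository.

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# `model` is the 4-bit model loaded as shown in "How To Load This Model" below.
model = prepare_model_for_kbit_training(model)

# Illustrative LoRA hyperparameters; adjust for your own task.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()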

Files Included

  • config.json
  • generation_config.json
  • model.safetensors (4-bit quantized weights)
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json
  • chat_template.jinja

These files allow you to load the model directly with load_in_4bit=True.
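
If the checkpoint was saved with its quantization metadata, the settings can be inspected directly from config.json; a minimal sketch:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Shlok307/llama-1b-4bit")
# Prints the bitsandbytes 4-bit settings if they were saved with the model.
print(getattr(config, "quantization_config", None))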


How To Load This Model

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Shlok307/llama-1b-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# load_in_4bit=True requests bitsandbytes 4-bit loading; device_map="auto"
# places the weights on the available device(s). Newer transformers releases
# prefer passing a BitsAndBytesConfig via the quantization_config argument
# instead of the load_in_4bit flag.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto",
)
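
For a quick check, the snippet below formats a chat prompt with the bundled chat template and generates a reply; the prompt text and generation settings are arbitrary examples, not recommendations from this repository.

messages = [{"role": "user", "content": "Explain 4-bit quantization in one sentence."}]

# The tokenizer applies chat_template.jinja to build the prompt.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))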