---
license: apache-2.0
base_model: ibm-granite/granite-20b-code-instruct-8k
tags:
- fp8
- quantized
- code
- granite
- ibm
- llmcompressor
- vllm
library_name: transformers
pipeline_tag: text-generation
---
# granite-20b-code-instruct-8k-FP8
**FP8 quantized version of IBM's Granite 20B Code model for efficient inference**
This is an FP8 (E4M3) quantized version of [ibm-granite/granite-20b-code-instruct-8k](https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k) using compressed_tensors format. Quantized by [TevunahAi](https://huggingface.co/TevunahAi) on enterprise-grade hardware.
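If you want to confirm what you are downloading, compressed_tensors checkpoints carry their quantization scheme in `config.json`. A minimal sketch for inspecting it without pulling the weights (this assumes the standard `quantization_config` entry is present, as it is for llm-compressor outputs):
```python
from transformers import AutoConfig

# Reads only config.json, not the ~20GB of weight shards.
cfg = AutoConfig.from_pretrained("TevunahAi/granite-20b-code-instruct-8k-FP8")
print(getattr(cfg, "quantization_config", None))  # compressed-tensors FP8 scheme details
```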
## 🎯 Recommended Usage: vLLM
For optimal performance with **full FP8 benefits** (2x memory savings + faster inference), use **vLLM** or **TensorRT-LLM**:
### Quick Start with vLLM
```bash
pip install vllm
```
**Python API:**
```python
from vllm import LLM, SamplingParams
# vLLM auto-detects FP8 from model config
llm = LLM(model="TevunahAi/granite-20b-code-instruct-8k-FP8", dtype="auto")
# Generate
prompt = "Write a Python function to calculate fibonacci numbers:"
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate([prompt], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
**OpenAI-Compatible API Server:**
```bash
vllm serve TevunahAi/granite-20b-code-instruct-8k-FP8 \
    --dtype auto \
    --max-model-len 8192
```
Then use with OpenAI client:
```python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",  # dummy key
)
response = client.chat.completions.create(
    model="TevunahAi/granite-20b-code-instruct-8k-FP8",
    messages=[
        {"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```
### vLLM Benefits
- ✅ **Weights, activations, and KV cache in FP8** (the FP8 KV cache is enabled explicitly; see the sketch below)
- ✅ **~20GB VRAM** (50% reduction vs BF16)
- ✅ **Native FP8 tensor core acceleration** on Ada/Hopper GPUs
- ✅ **Faster inference** with optimized CUDA kernels
- ✅ **Production-grade performance**
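Note that the FP8 KV cache is opt-in: vLLM loads the FP8 weights automatically, but the KV cache stays at the model dtype unless you pass `kv_cache_dtype`. A minimal sketch, using vLLM's documented options:
```python
from vllm import LLM, SamplingParams

# FP8 weights are detected from the checkpoint; kv_cache_dtype="fp8"
# additionally stores the KV cache in FP8 for the full memory savings.
llm = LLM(
    model="TevunahAi/granite-20b-code-instruct-8k-FP8",
    dtype="auto",
    kv_cache_dtype="fp8",   # opt-in FP8 KV cache
    max_model_len=8192,     # Granite's 8K context window
)

params = SamplingParams(temperature=0.2, max_tokens=128)
print(llm.generate(["def fibonacci(n):"], params)[0].outputs[0].text)
```
The server equivalent is the `--kv-cache-dtype fp8` flag on `vllm serve`.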
## βš™οΈ Alternative: Transformers (Not Recommended)
This model can be loaded with `transformers`, but **will decompress FP8 β†’ BF16 during inference**, requiring ~40GB+ VRAM. For 20B models, **vLLM is strongly recommended**.
<details>
<summary>Transformers Example (Click to expand)</summary>
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Loads FP8 weights but decompresses to BF16 during compute
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/granite-20b-code-instruct-8k-FP8",
    device_map="auto",
    torch_dtype="auto",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained("TevunahAi/granite-20b-code-instruct-8k-FP8")
# Generate
prompt = "Write a Python function to calculate fibonacci numbers:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
**Requirements:**
```bash
pip install "torch>=2.1.0" "transformers>=4.40.0" accelerate compressed-tensors
```
**System Requirements:**
- **~40GB+ VRAM** (decompressed to BF16)
- Multi-GPU setup or A100/H100
- CUDA 11.8 or newer
**⚠️ Warning:** vLLM is the recommended deployment method for 20B models.
</details>
## 📊 Quantization Details
| Property | Value |
|----------|-------|
| **Base Model** | [ibm-granite/granite-20b-code-instruct-8k](https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k) |
| **Quantization Method** | FP8 E4M3 weight-only |
| **Framework** | llm-compressor + compressed_tensors |
| **Calibration Dataset** | open_platypus (512 samples) |
| **Storage Size** | ~20GB (sharded safetensors) |
| **VRAM (vLLM)** | ~20GB |
| **VRAM (Transformers)** | ~40GB+ (decompressed to BF16) |
| **Target Hardware** | NVIDIA Ada (RTX 4000/5000) or Hopper (H100/GH200) |
| **Quantization Time** | 46.5 minutes |
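The exact recipe for this release is not published here, but a comparable FP8 quantization can be reproduced with llm-compressor roughly as sketched below. The `oneshot` entry point, `QuantizationModifier` arguments, and the `FP8` scheme name follow llm-compressor's examples and may differ between versions; treat this as an illustration rather than the recipe actually used:
```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot  # import path varies by version

# Quantize all Linear layers to FP8 (E4M3), leaving the LM head in higher precision,
# calibrated on 512 open_platypus samples to match the table above.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

oneshot(
    model="ibm-granite/granite-20b-code-instruct-8k",
    dataset="open_platypus",
    recipe=recipe,
    num_calibration_samples=512,
    output_dir="granite-20b-code-instruct-8k-FP8",
)
```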
### Quantization Infrastructure
Quantization was performed on the following dedicated hardware and software stack:
- **CPUs:** Dual Intel Xeon Max 9480 (112 cores / 224 threads, 128GB HBM2e)
- **GPU:** NVIDIA RTX 5000 Ada Generation (32GB VRAM, native FP8 support)
- **Memory:** 256GB DDR5 + 128GB HBM2e = 384GB total system memory
- **Software Stack:** Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13.0 | llm-compressor
## 🔧 Why FP8 for 20B Models?
### With vLLM/TensorRT-LLM:
- ✅ **50% memory reduction** vs BF16 (weights + activations + KV cache)
- ✅ **Single GPU deployment** on RTX 4090 (24GB) or RTX 5000 Ada (32GB)
- ✅ **Faster inference** via native FP8 tensor cores
- ✅ **Better throughput** with optimized kernels
- ✅ **Minimal quality loss** for code generation tasks
### With Transformers:
- ✅ **Smaller download size** (~20GB vs ~40GB BF16)
- ✅ **Compatible** with standard transformers workflow
- ⚠️ **Decompresses to BF16** during inference (no runtime memory benefit)
- ❌ **Requires 40GB+ VRAM** - impractical for most setups
**For 20B models, vLLM is essential for practical deployment.**
## 💾 Model Files
This model is sharded into multiple safetensors files (all required for inference). The compressed format enables efficient storage and faster downloads.
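If you want the shards on disk before starting a server (for example on an air-gapped inference box), a minimal sketch with `huggingface_hub`, which both vLLM and transformers use for downloads:
```python
from huggingface_hub import snapshot_download

# Fetches every safetensors shard plus config/tokenizer files and returns the local path.
local_dir = snapshot_download("TevunahAi/granite-20b-code-instruct-8k-FP8")
print(local_dir)  # pass this path to vLLM or transformers in place of the repo id
```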
## 🔬 IBM Granite Code Models
Granite Code models are specifically trained for code generation, editing, and explanation tasks. This 20B parameter version offers strong performance on:
- Code completion and generation
- Bug fixing and refactoring
- Code explanation and documentation
- Multiple programming languages
- 8K context window
## 📚 Original Model
This quantization is based on [ibm-granite/granite-20b-code-instruct-8k](https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k) by IBM.
For comprehensive information about:
- Model architecture and training methodology
- Supported programming languages
- Evaluation benchmarks and results
- Ethical considerations and responsible AI guidelines
Please refer to the [original model card](https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k).
## 🔧 Hardware Requirements
### Minimum (vLLM):
- **GPU:** NVIDIA RTX 4090 (24GB) or RTX 5000 Ada (32GB)
- **VRAM:** 20GB minimum, 24GB+ recommended
- **CUDA:** 11.8 or newer
### Recommended (vLLM):
- **GPU:** NVIDIA RTX 5000 Ada (32GB) / H100 (80GB)
- **VRAM:** 24GB+
- **CUDA:** 12.0+
### Transformers:
- **GPU:** Multi-GPU setup or A100 (40GB+)
- **VRAM:** 40GB+ (single GPU) or distributed across multiple GPUs
- **Not recommended** for practical deployment
## 📖 Additional Resources
- **vLLM Documentation:** [docs.vllm.ai](https://docs.vllm.ai/)
- **TensorRT-LLM:** [github.com/NVIDIA/TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
- **TevunahAi Models:** [huggingface.co/TevunahAi](https://huggingface.co/TevunahAi)
- **llm-compressor:** [github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)
- **IBM Granite:** [github.com/ibm-granite](https://github.com/ibm-granite)
## 📄 License
This model inherits the **Apache 2.0 License** from the original Granite model.
## πŸ™ Acknowledgments
- **Original Model:** IBM Granite team
- **Quantization Framework:** Neural Magic's llm-compressor
- **Quantized by:** [TevunahAi](https://huggingface.co/TevunahAi)
## πŸ“ Citation
If you use this model, please cite the original Granite work:
```bibtex
@misc{granite2024,
  title={Granite Code Models},
  author={IBM Research},
  year={2024},
  url={https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k}
}
```
---
<div align="center">
**Professional AI Model Quantization by TevunahAi**
*Enterprise-grade quantization on specialized hardware*
[View all models](https://huggingface.co/TevunahAi) | [Contact for custom quantization](https://huggingface.co/TevunahAi)
</div>