---
license: apache-2.0
base_model: Kwaipilot/KAT-Dev-72B-Exp
tags:
- gptq
- quantized
- vllm
- 4bit
- group_size_32
quantization_config:
quant_method: gptq
bits: 4
group_size: 32
damp_percent: 0.1
---
# KAT-Dev-72B-Exp - GPTQ INT4 (group_size=32)
This is a GPTQ quantized version of [Kwaipilot/KAT-Dev-72B-Exp](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp).
## Quantization Details
- **Method**: GPTQ (one-shot post-training weight quantization)
- **Bits**: 4
- **Group Size**: 32
- **Quantization Type**: INT
- **Symmetric**: True
- **Calibration Samples**: 128
- **Calibration Dataset**: allenai/c4
- **Max Sequence Length**: 512
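For reference, a configuration along these lines can be approximated with llm-compressor. The sketch below is illustrative only: it assumes a recent llm-compressor release, and the exact `oneshot` / `GPTQModifier` arguments (in particular the `config_groups` layout used to request group_size=32) may differ between versions.

```python
# Illustrative sketch only -- argument names follow llm-compressor /
# compressed-tensors conventions and may vary between releases.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Kwaipilot/KAT-Dev-72B-Exp"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 4-bit symmetric INT weights, group size 32, lm_head kept in full precision
recipe = GPTQModifier(
    ignore=["lm_head"],
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": True,
                "strategy": "group",
                "group_size": 32,
            },
        }
    },
)

# One-shot calibration on 128 C4 samples at sequence length 512
oneshot(
    model=model,
    dataset="allenai/c4",  # assumed to be resolvable by oneshot's dataset loader
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=128,
    output_dir="KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
)
```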
## Hardware Used for Quantization
- 6x NVIDIA GeForce RTX 5090 (32GB each)
- CUDA 12.8+
- Sequential layer-by-layer processing (OOM-safe)
## Usage
### With vLLM
```python
from vllm import LLM, SamplingParams
# Initialize the model
llm = LLM(model="Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32", trust_remote_code=True)
# Create sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)
# Generate text
prompts = ["Hello, how are you?"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
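vLLM can also serve the model behind an OpenAI-compatible HTTP API (for example, started with `vllm serve Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32`). A minimal client sketch, assuming such a server is already listening on the default port 8000:

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already running locally
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```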
### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
device_map="auto",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
"Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
trust_remote_code=True
)
# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
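If the checkpoint ships a chat template (not verified here), the standard chat-template API can be used instead of a raw prompt, reusing the `model` and `tokenizer` loaded above:

```python
# Hypothetical chat-style usage -- only works if the checkpoint provides a chat template
messages = [{"role": "user", "content": "Explain what GPTQ quantization does."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```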
## Inference Performance
This quantized model offers:
- ~4x memory reduction compared to FP16
- Faster inference on compatible hardware
- Maintained accuracy through GPTQ quantization
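As a rough back-of-the-envelope check on the memory claim above (assuming ~72B parameters, FP16 group scales, and one scale per group of 32 weights):

```python
# Rough weight-memory estimate; ignores activations, KV cache, and framework overhead
params = 72e9

fp16_gb = params * 2 / 1e9               # 2 bytes per weight
int4_gb = params * 0.5 / 1e9             # 4 bits = 0.5 bytes per weight
scales_gb = (params / 32) * 2 / 1e9      # one FP16 scale per 32-weight group

print(f"FP16 weights : ~{fp16_gb:.0f} GB")
print(f"INT4 weights : ~{int4_gb + scales_gb:.0f} GB (incl. group scales)")
print(f"Reduction    : ~{fp16_gb / (int4_gb + scales_gb):.1f}x")
```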
### Recommended Hardware
- NVIDIA GPUs with compute capability 7.5+ (RTX 20-series or newer)
- Roughly 40 GB of VRAM for the quantized weights alone, so a single 48 GB-class GPU or a multi-GPU setup is required
- Multi-GPU (tensor-parallel) setup for larger batch sizes and longer contexts (see the example below)
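For multi-GPU inference, vLLM's tensor parallelism shards the quantized weights across devices; a short sketch (the GPU count is illustrative):

```python
from vllm import LLM

# Shard the quantized weights across 2 GPUs via tensor parallelism
llm = LLM(
    model="Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    tensor_parallel_size=2,  # adjust to the number of available GPUs
    trust_remote_code=True,
)
```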
## Model Details
- **Base Model**: Kwaipilot/KAT-Dev-72B-Exp
- **Quantization Tool**: llm-compressor
- **Compatible Inference Engines**: vLLM, TGI (Text Generation Inference)
## Limitations
- Quantization may affect model accuracy on certain tasks
- Requires vLLM or compatible inference engine for optimal performance
## Acknowledgements
- Base model: [Kwaipilot/KAT-Dev-72B-Exp](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp)
- Quantization: [llm-compressor](https://github.com/vllm-project/llm-compressor)
- Inference: [vLLM](https://github.com/vllm-project/vllm)