---
license: apache-2.0
base_model: Kwaipilot/KAT-Dev-72B-Exp
tags:
- gptq
- quantized
- vllm
- 4bit
- group_size_32
quantization_config:
  quant_method: gptq
  bits: 4
  group_size: 32
  damp_percent: 0.1
---

# KAT-Dev-72B-Exp - GPTQ INT4 (group_size=32)

This is a GPTQ-quantized version of [Kwaipilot/KAT-Dev-72B-Exp](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp).

## Quantization Details

- **Method**: GPTQ (post-training weight quantization)
- **Bits**: 4
- **Group Size**: 32
- **Quantization Type**: INT
- **Symmetric**: True
- **Calibration Samples**: 128
- **Calibration Dataset**: allenai/c4
- **Max Sequence Length**: 512
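
The settings above correspond roughly to an [llm-compressor](https://github.com/vllm-project/llm-compressor) one-shot GPTQ run like the sketch below. This is an illustrative reconstruction, not the exact script used for this checkpoint; import paths, the `config_groups` schema, and dataset handling vary between llm-compressor versions.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# 4-bit symmetric, group-wise (group_size=32) integer weight quantization.
# Leaving lm_head unquantized is an assumption (a common default), not stated on this card.
recipe = GPTQModifier(
    ignore=["lm_head"],
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": True,
                "strategy": "group",
                "group_size": 32,
            },
        }
    },
)

# One-shot calibration: 128 samples from C4, 512 tokens each.
# Depending on the llm-compressor version, C4 may need to be loaded and tokenized manually.
oneshot(
    model="Kwaipilot/KAT-Dev-72B-Exp",
    dataset="allenai/c4",
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=128,
    output_dir="KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
)
```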

## Hardware Used for Quantization

- 6x NVIDIA GeForce RTX 5090 (32GB each)
- CUDA 12.8+
- Sequential layer-by-layer processing (OOM-safe)

## Usage

### With vLLM

```python
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32", trust_remote_code=True)

# Create sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Generate text
prompts = ["Hello, how are you?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```
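
Even at 4 bits, the 72B weights occupy roughly 40 GB, so the model will not fit on a single consumer GPU; vLLM can shard it across devices with tensor parallelism. A minimal sketch (the `tensor_parallel_size` value is only an example, set it to your GPU count):

```python
from vllm import LLM

# Example: shard the quantized model across 2 GPUs (adjust to your hardware)
llm = LLM(
    model="Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    tensor_parallel_size=2,
    trust_remote_code=True,
)
```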

### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    trust_remote_code=True
)

# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
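
The base model is a chat-style model for software-engineering tasks, so prompts are usually formatted with the tokenizer's chat template rather than passed as raw text. A minimal sketch that reuses `model` and `tokenizer` from the snippet above (assumes the checkpoint ships a chat template):

```python
messages = [
    {"role": "user", "content": "Write a Python function that reverses a string."},
]

# Apply the model's chat template and generate a response
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```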

## Inference Performance

This quantized model offers:

- ~4x memory reduction compared to FP16 (72B parameters are ~144 GB in FP16 vs. roughly 36-40 GB as 4-bit weights plus group scales)
- Faster inference on compatible hardware, since decoding is typically memory-bandwidth bound
- Accuracy close to the FP16 baseline, as GPTQ calibrates each layer to minimize quantization error (some task-dependent degradation is still possible)

### Recommended Hardware

- NVIDIA GPUs with compute capability 7.5+ (RTX 20-series or newer)
- Roughly 40 GB of VRAM for the quantized weights alone, plus headroom for the KV cache, so single-GPU inference needs a 48 GB-class card
- Multi-GPU setup (tensor parallelism) for smaller cards or larger batch sizes

## Model Details

- **Base Model**: Kwaipilot/KAT-Dev-72B-Exp
- **Quantization Tool**: llm-compressor
- **Compatible Inference Engines**: vLLM, TGI (Text Generation Inference)
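
The quantization settings that ship with this checkpoint can be checked without downloading the weights, since they are embedded in `config.json`. A small sketch (assumes the exported config exposes a `quantization_config` entry, which is standard for quantized checkpoints):

```python
from transformers import AutoConfig

# Load only the model config and print the embedded quantization settings
config = AutoConfig.from_pretrained(
    "Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    trust_remote_code=True,
)
print(config.quantization_config)  # expected to show quant_method, bits, group_size, ...
```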

## Limitations

- Quantization may affect model accuracy on certain tasks
- Requires vLLM or a compatible inference engine for optimal performance

## Acknowledgements

- Base model: [Kwaipilot/KAT-Dev-72B-Exp](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp)
- Quantization: [llm-compressor](https://github.com/vllm-project/llm-compressor)
- Inference: [vLLM](https://github.com/vllm-project/vllm)