---
license: apache-2.0
base_model: Kwaipilot/KAT-Dev-72B-Exp
tags:
  - gptq
  - quantized
  - vllm
  - 4bit
  - group_size_32
quantization_config:
  quant_method: gptq
  bits: 4
  group_size: 32
  damp_percent: 0.1
---

# KAT-Dev-72B-Exp - GPTQ INT4 (group_size=32)

This is a GPTQ-quantized version of [Kwaipilot/KAT-Dev-72B-Exp](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp).

## Quantization Details

- **Method**: GPTQ (accurate post-training weight quantization)
- **Bits**: 4
- **Group Size**: 32
- **Quantization Type**: INT
- **Symmetric**: True
- **Calibration Samples**: 128
- **Calibration Dataset**: allenai/c4
- **Max Sequence Length**: 512

## Hardware Used for Quantization

- 6x NVIDIA GeForce RTX 5090 (32 GB each)
- CUDA 12.8+
- Sequential layer-by-layer processing (OOM-safe)

## Usage

### With vLLM

```python
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32", trust_remote_code=True)

# Create sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Generate text
prompts = ["Hello, how are you?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    trust_remote_code=True,
)

# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Inference Performance

This quantized model offers:

- ~4x reduction in weight memory compared to FP16 (roughly 145 GB of weights down to about 40 GB)
- Faster inference on compatible hardware, since decoding is typically bound by memory bandwidth
- Accuracy largely preserved through GPTQ's calibrated, layer-wise quantization (see Limitations)

### Recommended Hardware

- NVIDIA GPUs with compute capability 7.5+ (RTX 20-series or newer)
- Roughly 40 GB of VRAM for the 4-bit weights alone, plus headroom for the KV cache; in practice a single 48 GB+ GPU or a multi-GPU setup
- Multiple GPUs for larger batch sizes and longer contexts

## Model Details

- **Base Model**: Kwaipilot/KAT-Dev-72B-Exp
- **Quantization Tool**: llm-compressor
- **Compatible Inference Engines**: vLLM, TGI (Text Generation Inference)

## Limitations

- Quantization may affect model accuracy on certain tasks
- Requires vLLM or a compatible inference engine for optimal performance

## Acknowledgements

- Base model: [Kwaipilot/KAT-Dev-72B-Exp](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp)
- Quantization: [llm-compressor](https://github.com/vllm-project/llm-compressor)
- Inference: [vLLM](https://github.com/vllm-project/vllm)
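
## Appendix: Quantization Recipe (sketch)

The exact script used to produce this checkpoint is not included in this repo. The sketch below shows how the settings listed under **Quantization Details** (4-bit symmetric INT weights, group_size=32, 128 calibration samples from allenai/c4 at 512 tokens, dampening 0.1) would map onto an llm-compressor one-shot run. The `config_groups` layout, the `"c4"` dataset alias, and the import paths are assumptions and may differ between llm-compressor versions; treat this as a starting point, not the authoritative recipe.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Kwaipilot/KAT-Dev-72B-Exp"
SAVE_DIR = "KAT-Dev-72B-Exp-GPTQ-INT4-gs32"

# Assumed recipe: 4-bit symmetric weight quantization of all Linear layers
# (except the output head) with a group size of 32, matching the card above.
recipe = GPTQModifier(
    targets="Linear",
    ignore=["lm_head"],
    dampening_frac=0.1,
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": True,
                "strategy": "group",
                "group_size": 32,
            },
        }
    },
)

# One-shot GPTQ calibration: 128 samples, 512 tokens each.
oneshot(
    model=MODEL_ID,
    dataset="c4",  # assumed registered alias for allenai/c4
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=128,
    output_dir=SAVE_DIR,
)
```

The resulting directory should then be loadable by vLLM or Transformers as shown in the **Usage** section.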