---
license: apache-2.0
base_model: Kwaipilot/KAT-Dev-72B-Exp
tags:
- gptq
- quantized
- vllm
- 4bit
- group_size_32
quantization_config:
  quant_method: gptq
  bits: 4
  group_size: 32
  damp_percent: 0.1
---

# KAT-Dev-72B-Exp - GPTQ INT4 (group_size=32)

This is a GPTQ-quantized version of [Kwaipilot/KAT-Dev-72B-Exp](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp).

## Quantization Details

- **Method**: GPTQ (post-training weight quantization)
- **Bits**: 4
- **Group Size**: 32
- **Quantization Type**: INT
- **Symmetric**: True
- **Calibration Samples**: 128
- **Calibration Dataset**: allenai/c4
- **Max Sequence Length**: 512
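
The settings above correspond roughly to an [llm-compressor](https://github.com/vllm-project/llm-compressor) one-shot GPTQ run like the sketch below. This is an illustrative reconstruction, not the exact script used for this checkpoint; import paths, the `config_groups` schema, and dataset handling vary between llm-compressor versions.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# 4-bit symmetric, group-wise (group_size=32) integer weight quantization.
# Leaving lm_head unquantized is an assumption (a common default), not stated on this card.
recipe = GPTQModifier(
    ignore=["lm_head"],
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": True,
                "strategy": "group",
                "group_size": 32,
            },
        }
    },
)

# One-shot calibration: 128 samples from C4, 512 tokens each.
# Depending on the llm-compressor version, C4 may need to be loaded and tokenized manually.
oneshot(
    model="Kwaipilot/KAT-Dev-72B-Exp",
    dataset="allenai/c4",
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=128,
    output_dir="KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
)
```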

## Hardware Used for Quantization

- 6x NVIDIA GeForce RTX 5090 (32GB each)
- CUDA 12.8+
- Sequential layer-by-layer processing (OOM-safe)

## Usage

### With vLLM

```python
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32", trust_remote_code=True)

# Create sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Generate text
prompts = ["Hello, how are you?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```
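
Even at 4 bits, the 72B weights occupy roughly 40 GB, so the model will not fit on a single consumer GPU; vLLM can shard it across devices with tensor parallelism. A minimal sketch (the `tensor_parallel_size` value is only an example, set it to your GPU count):

```python
from vllm import LLM

# Example: shard the quantized model across 2 GPUs (adjust to your hardware)
llm = LLM(
    model="Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    tensor_parallel_size=2,
    trust_remote_code=True,
)
```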

### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    trust_remote_code=True
)

# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
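
The base model is a chat-style model for software-engineering tasks, so prompts are usually formatted with the tokenizer's chat template rather than passed as raw text. A minimal sketch that reuses `model` and `tokenizer` from the snippet above (assumes the checkpoint ships a chat template):

```python
messages = [
    {"role": "user", "content": "Write a Python function that reverses a string."},
]

# Apply the model's chat template and generate a response
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```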

## Inference Performance

This quantized model offers:

- ~4x memory reduction compared to FP16 (72B parameters are ~144 GB in FP16 vs. roughly 36-40 GB as 4-bit weights plus group scales)
- Faster inference on compatible hardware, since decoding is typically memory-bandwidth bound
- Accuracy close to the FP16 baseline, as GPTQ calibrates each layer to minimize quantization error (some task-dependent degradation is still possible)

### Recommended Hardware

- NVIDIA GPUs with compute capability 7.5+ (RTX 20-series or newer)
- Roughly 40 GB of VRAM for the quantized weights alone, plus headroom for the KV cache, so single-GPU inference needs a 48 GB-class card
- Multi-GPU setup (tensor parallelism) for smaller cards or larger batch sizes

## Model Details

- **Base Model**: Kwaipilot/KAT-Dev-72B-Exp
- **Quantization Tool**: llm-compressor
- **Compatible Inference Engines**: vLLM, TGI (Text Generation Inference)
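
The quantization settings that ship with this checkpoint can be checked without downloading the weights, since they are embedded in `config.json`. A small sketch (assumes the exported config exposes a `quantization_config` entry, which is standard for quantized checkpoints):

```python
from transformers import AutoConfig

# Load only the model config and print the embedded quantization settings
config = AutoConfig.from_pretrained(
    "Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    trust_remote_code=True,
)
print(config.quantization_config)  # expected to show quant_method, bits, group_size, ...
```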

## Limitations

- Quantization may affect model accuracy on certain tasks
- Requires vLLM or a compatible inference engine for optimal performance

## Acknowledgements

- Base model: [Kwaipilot/KAT-Dev-72B-Exp](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp)
- Quantization: [llm-compressor](https://github.com/vllm-project/llm-compressor)
- Inference: [vLLM](https://github.com/vllm-project/vllm)