---
license: other
base_model: TheDrummer/Behemoth-R1-123B-v2
tags:
- nvfp4
- modelopt
- quantized
- blackwell
- b200
library_name: transformers
---

# Behemoth-R1-V2 ModelOpt NVFP4

NVFP4-quantized version of [TheDrummer/Behemoth-R1-123B-v2](https://huggingface.co/TheDrummer/Behemoth-R1-123B-v2), produced with NVIDIA Model Optimizer (ModelOpt).
## Quantization Details

| Property | Value |
|----------|-------|
| **Original Model** | TheDrummer/Behemoth-R1-123B-v2 |
| **Quantization** | NVFP4 (FP4 weights, FP16 activations) |
| **Method** | NVIDIA ModelOpt post-training quantization (PTQ) |
| **Calibration Samples** | 512 |
| **Max Sequence Length** | 4096 |
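
To give an intuition for the format: NVFP4 stores weights as 4-bit E2M1 values with a shared scale per small block of elements (16 in NVFP4). The sketch below illustrates that arithmetic in plain Python; it is a simplified illustration of the encoding, not ModelOpt's actual kernel, and it keeps the block scale as a plain float rather than the FP8 (E4M3) scale the real format uses.

```python
# Illustrative sketch of FP4 (E2M1) block quantization -- NOT ModelOpt's
# implementation. Real NVFP4 uses 16-element blocks with FP8 block scales;
# the scale here is kept as a plain float for clarity.

# Positive magnitudes representable in E2M1 (sign stored separately).
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block of floats to E2M1 values plus a per-block scale."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # map the largest magnitude onto 6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        # Round to the nearest representable E2M1 magnitude.
        level = min(E2M1_LEVELS, key=lambda v: abs(v - target))
        codes.append(-level if x < 0 else level)
    return codes, scale

def dequantize_block(codes, scale):
    return [c * scale for c in codes]

block = [0.01, -0.3, 0.12, 0.6]  # toy block (real NVFP4 blocks hold 16 values)
codes, scale = quantize_block(block)
print(codes, scale)
print(dequantize_block(codes, scale))
```

Note how the per-block scale lets a handful of FP4 levels cover very different magnitude ranges across the tensor, which is why block-scaled FP4 loses much less accuracy than a single global FP4 scale would.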
## Hardware Requirements

- **Optimal**: NVIDIA Blackwell GPUs (B100, B200, RTX PRO 6000 Blackwell), which have native FP4 support
- **Compatible**: Hopper/Ampere GPUs (fall back to weight-only mode)
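
Which mode you get depends on the GPU's CUDA compute capability: Blackwell is SM 10.x/12.x, Hopper 9.x, Ampere 8.x. A minimal helper for checking this is sketched below; the tier labels are informal descriptions for this card, not vLLM or ModelOpt API names.

```python
# Informal mapping from CUDA compute capability to the expected NVFP4
# execution mode. The labels are illustrative, not vLLM/ModelOpt terminology.

def nvfp4_mode(major: int, minor: int) -> str:
    if major >= 10:                # Blackwell (SM 10.x / 12.x): native FP4
        return "native"
    if (major, minor) >= (8, 0):   # Ampere / Hopper (SM 8.x / 9.x)
        return "weight-only"
    return "unsupported"

# On a real machine you would query the device, e.g. with PyTorch:
#   import torch
#   print(nvfp4_mode(*torch.cuda.get_device_capability()))
print(nvfp4_mode(10, 0))  # B200
print(nvfp4_mode(9, 0))   # H100
```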
## Usage with vLLM

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheHouseOfTheDude/Behemoth-R1-V2_ModelOpt-NVFP4",
    quantization="modelopt",
    trust_remote_code=True,
)

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)
outputs = llm.generate(["Write a story about..."], sampling_params)
print(outputs[0].outputs[0].text)
```

## Chat Template

Uses the Mistral v7 (non-Tekken) format. See the original model card for usage details.
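
The authoritative template ships with the tokenizer, so in practice you should let `tokenizer.apply_chat_template(...)` from `transformers` build prompts. Purely as an illustration of the general Mistral `[INST]` turn structure, a hand-rolled builder might look like the sketch below; the exact v7 details (e.g. system-prompt handling) may differ, and this repo's tokenizer config is the source of truth.

```python
# Illustrative prompt builder using the classic Mistral [INST] turn pattern.
# Prefer tokenizer.apply_chat_template(...); v7 (non-Tekken) specifics such
# as system-prompt tokens may differ from this simplified sketch.

def build_prompt(turns):
    """turns: list of (user, assistant) pairs; the final assistant may be None
    to leave the prompt open for generation."""
    out = "<s>"
    for user, assistant in turns:
        out += f"[INST] {user}[/INST]"
        if assistant is not None:
            out += f" {assistant}</s>"
    return out

prompt = build_prompt([("Hello!", "Hi there."), ("Tell me a story.", None)])
print(prompt)
```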
## Credits

- Original model: [TheDrummer](https://huggingface.co/TheDrummer)
- Quantization: TheHouseOfTheDude
- Quantization framework: NVIDIA ModelOpt