NVIDIA Qwen3.5-397B-A17B-NVFP4 Model Card
Model Overview
Description:
The NVIDIA Qwen3.5-397B-A17B-NVFP4 model is a quantized version of Qwen's Qwen3.5-397B-A17B model, an autoregressive multimodal language model that uses an optimized Transformer architecture with Mixture of Experts (MoE) and vision-language capabilities. For more information, refer to the Qwen3.5-397B-A17B model card. The NVIDIA Qwen3.5-397B-A17B-NVFP4 model was quantized using the TensorRT Model Optimizer.
This model is ready for commercial/non-commercial use.
Third-Party Community Consideration
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see link to Non-NVIDIA (Qwen3.5-397B-A17B) Model Card.
License/Terms of Use:
Deployment Geography:
Global
Use Case:
Developers looking to take off-the-shelf, pre-quantized models and deploy them in AI agent systems, chatbots, RAG systems, and other AI-powered applications.
Release Date:
Hugging Face via https://huggingface.co/nvidia/Qwen3.5-397B-A17B-NVFP4
Model Architecture:
Architecture Type: Transformer (Hybrid)
Network Architecture: Qwen3_5MoeForConditionalGeneration
Model Details:
- Total Parameters: 397B
- Active Parameters: 17B (Sparse Mixture-of-Experts)
- Hidden Size: 4096
- Number of Layers: 60 (15 blocks of 3x Gated DeltaNet + 1x Gated Attention, each followed by MoE)
- Expert Configuration: 512 total experts, 10 activated per token + 1 shared expert, expert intermediate dimension 1024.
- Gated DeltaNet (Linear Attention): 64 value heads, 16 QK heads, head dimension 128. Provides long-context efficiency.
- Gated Attention (Full Attention): 32 Q heads, 2 KV heads, head dimension 256, with 64-dim rotary position embedding.
- Context Window: 262,144 tokens (native), extensible to 1,010,000 tokens with YaRN scaling.
- Vocabulary Size: 248,320
- Multi-Token Prediction (MTP): 1 additional prediction layer for speculative decoding.
- Thinking Mode: Default behavior with <think>...</think> blocks before responses.
- Multilingual: Supports 201 languages and dialects.
- Vision Encoder: 27-layer ViT with 1152 hidden size, patch size 16, spatial merge size 2.
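For orientation, the hyperparameters above can be collected into a config-style sketch. The field names below are illustrative and do not necessarily match the keys in the model's actual config.json:

# Illustrative summary of the architecture hyperparameters listed above.
# Field names are hypothetical, not the exact Hugging Face config keys.
QWEN3_5_397B_A17B = {
    "total_params": 397e9,
    "active_params": 17e9,
    "hidden_size": 4096,
    "num_layers": 60,             # 15 blocks of (3x Gated DeltaNet + 1x Gated Attention)
    "num_experts": 512,
    "num_experts_per_token": 10,  # plus 1 shared expert
    "moe_intermediate_size": 1024,
    "linear_attn": {"num_v_heads": 64, "num_qk_heads": 16, "head_dim": 128},
    "full_attn": {"num_q_heads": 32, "num_kv_heads": 2, "head_dim": 256, "rope_dim": 64},
    "max_position_embeddings": 262144,  # extensible to 1,010,000 with YaRN
    "vocab_size": 248320,
    "vision": {"depth": 27, "hidden_size": 1152, "patch_size": 16, "spatial_merge_size": 2},
}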
Input:
Input Type(s): Text, Image, Video
Input Format(s): String, Image, Video
Input Parameters: 1D (One-Dimensional): Sequences, 2D (Two-Dimensional): Images, 3D (Three-Dimensional): Video
Output:
Output Type(s): Text
Output Format: String
Output Parameters: 1D (One-Dimensional): Sequences
Other Properties Related to Output: N/A
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engine(s):
- SGLang
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Blackwell (B300, B200, RTX PRO 6000 Blackwell)
Preferred Operating System(s):
- Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s):
The model is quantized with nvidia-modelopt 0.42.0rc1.dev21+g421985313.
Training, Testing, and Evaluation Datasets:
Calibration Dataset:
- Link: Nemotron-Post-Training-Dataset-v2
- Data collection method: Automated.
- Labeling method: Automated.
Training Datasets:
- Data Collection Method by Dataset: Undisclosed
- Labeling Method by Dataset: Undisclosed
- Properties: Undisclosed
Testing Dataset:
- Data Collection Method by Dataset: Undisclosed
- Labeling Method by Dataset: Undisclosed
- Properties: Undisclosed
Evaluation Dataset:
- Data collection method: Hybrid: Automated, Human
- Labeling method: Hybrid: Human, Automated
Inference:
Acceleration Engine: SGLang
Test Hardware: B300
Recommended Hardware
| GPU | Architecture | VRAM | Memory Type | Memory Bandwidth | TDP |
|---|---|---|---|---|---|
| NVIDIA B300 (SXM) | Blackwell Ultra (GB110) | 288 GB HBM3e | HBM3e | 4.1 TB/s | 1400 W |
| NVIDIA B200 (SXM) | Blackwell (GB100) | 192 GB HBM3e | HBM3e | 4.1 TB/s | 1000 W |
| NVIDIA RTX PRO 6000 Blackwell (PCIe) | Blackwell (GB202) | 96 GB GDDR7 | GDDR7 | 1.8 TB/s | 600 W |
B300: 288 GB HBM3e per GPU, 4096-bit memory bus, 18,944 CUDA cores, 592 Tensor Cores, up to 2032 MHz boost. Datacenter SXM module.
B200: 192 GB HBM3e per GPU, 4096-bit memory bus, 18,944 CUDA cores, 592 Tensor Cores, up to 1965 MHz boost. Datacenter SXM module.
RTX PRO 6000 Blackwell: 96 GB GDDR7 per GPU, 512-bit memory bus, 24,064 CUDA cores, 752 Tensor Cores, up to 2617 MHz boost. Professional workstation GPU (PCIe 5.0 x16).
Post Training Quantization
This model was obtained by quantizing the weights and activations of Qwen3.5-397B-A17B to the NVFP4 data type, ready for inference with SGLang. Only the weights and activations of the linear operators within the transformer blocks are quantized to NVFP4; the KV cache is additionally quantized to FP8. Vision encoder weights are not quantized. This optimization reduces the number of bits per parameter from 16 to 4, cutting disk size and GPU memory requirements by approximately 4x.
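As a rough back-of-the-envelope check of that reduction (a minimal sketch; the overhead attributed to block scales and unquantized components is an assumption, not a measured breakdown):

# Illustrative footprint estimate for the 397B-parameter checkpoint.
total_params = 397e9
bf16_gb = total_params * 2 / 1e9     # 16-bit baseline: ~794 GB
nvfp4_gb = total_params * 0.5 / 1e9  # 4-bit weights alone: ~199 GB
# NVFP4 block scaling factors plus unquantized components (embeddings,
# vision encoder, norms) plausibly account for the gap up to the ~224 GB
# checkpoint size reported below, i.e. roughly a 3.5x overall reduction.
print(f"BF16: {bf16_gb:.0f} GB, NVFP4: {nvfp4_gb:.0f} GB, ratio: {bf16_gb / nvfp4_gb:.1f}x")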
Usage
Deploy with SGLang
The total quantized checkpoint size is ~224 GB (397B total parameters, 17B active). On NVIDIA Blackwell GPUs:
| Configuration | GPUs | VRAM per GPU | Total VRAM | Throughput |
|---|---|---|---|---|
| B300 TP=4 | 4x B300 | 288 GB | 1,152 GB | ~120 tok/s |
| B300 TP=8 | 8x B300 | 288 GB | 2,304 GB | - |
| B200 TP=4 | 4x B200 | 192 GB | 768 GB | - |
| B200 TP=8 | 8x B200 | 192 GB | 1,536 GB | - |
| RTX PRO 6000 TP=4 | 4x RTX PRO 6000 | 96 GB | 384 GB | - |
| RTX PRO 6000 TP=8 | 8x RTX PRO 6000 | 96 GB | 768 GB | - |
The model has a default context length of 262,144 tokens. If you encounter out-of-memory (OOM) errors, consider reducing the context window. However, because Qwen3.5 leverages extended context for complex tasks, we advise maintaining a context length of at least 128K tokens to preserve thinking capabilities.
# TP=4 (recommended, ~120 tok/s on 4x B300)
python3 -m sglang.launch_server \
--model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
--quantization modelopt_fp4 \
--tp 4 \
--context-length 262144 \
--reasoning-parser qwen3
# TP=8 (if you have less VRAM per GPU, e.g. RTX PRO 6000)
python3 -m sglang.launch_server \
--model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
--quantization modelopt_fp4 \
--tp 8 \
--context-length 262144 \
--reasoning-parser qwen3
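Once the server is running (SGLang listens on port 30000 by default), it exposes an OpenAI-compatible API. A minimal client sketch using the openai Python package; the host, port, and prompt are illustrative:

import openai  # pip install openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="vincentzed-hf/Qwen3.5-397B-A17B-NVFP4",
    messages=[{"role": "user", "content": "Summarize NVFP4 quantization in two sentences."}],
    max_tokens=2048,
)
message = response.choices[0].message
# With --reasoning-parser qwen3, the server separates the <think>...</think>
# content from the final answer (exposed as reasoning_content in recent
# SGLang releases; the field name may vary by version).
print(getattr(message, "reasoning_content", None))
print(message.content)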
Speculative Decoding (Experimental)
Qwen3.5 includes a built-in Multi-Token Prediction (MTP) head that can be used for speculative decoding via the NEXTN algorithm. This feature is experimental and may not work reliably at this time:
python3 -m sglang.launch_server \
--model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
--quantization modelopt_fp4 \
--tp 8 \
--context-length 262144 \
--reasoning-parser qwen3 \
--speculative-algo NEXTN \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
Installation
Important: You must install SGLang from the bzhng-development vz/qwen3-5 branch, which includes a fix ensuring that vision encoder weights are correctly excluded from quantization during inference. Without this fix, the vision weights may be handled incorrectly.
git clone -b vz/qwen3-5 git@github.com:bzhng-development/sglang.git
cd sglang
uv pip install -e "python"
uv pip install transformers==5.2.0
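A quick sanity check that the patched build and the pinned transformers version are the ones actually being imported (a minimal sketch; the exact version string of the branch build will differ):

import sglang
import transformers

# transformers is pinned above; sglang should come from the vz/qwen3-5 branch.
print("sglang:", sglang.__version__)
print("transformers:", transformers.__version__)
assert transformers.__version__ == "5.2.0"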
When a release is cut with this fix, we will update this model card.
Reproduce with ModelOpt
To reproduce the NVFP4 quantized checkpoint yourself using TensorRT Model Optimizer:
python3 examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path Qwen/Qwen3.5-397B-A17B \
--qformat nvfp4 \
--export_path ./qwen3-5-nvfp4
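Under the hood, hf_ptq.py drives the Model Optimizer PTQ API. A simplified sketch of the equivalent flow, with the calibration loop elided; the config and export helpers follow current modelopt releases, and the model class shown is illustrative for this multimodal checkpoint:

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-397B-A17B")

def forward_loop(model):
    # Run calibration batches (e.g. samples from
    # Nemotron-Post-Training-Dataset-v2) through the model here.
    ...

# NVFP4_DEFAULT_CFG quantizes the linear weights and activations to NVFP4.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
export_hf_checkpoint(model, export_dir="./qwen3-5-nvfp4")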
Note: NVFP4 weights combined with an FP8 KV cache provide a significant memory footprint reduction (~3.5x vs. BF16) with negligible accuracy degradation.
Baseline: Qwen3.5-397B-A17B.
Model Limitations:
The base model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when given toxic prompts. The model may also generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable output even if the prompt itself contains nothing explicitly offensive.
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.