NVIDIA Qwen3.5-397B-A17B-NVFP4 Model Card

Model Overview

Description:

The NVIDIA Qwen3.5-397B-A17B-NVFP4 model is a quantized version of Qwen's Qwen3.5-397B-A17B model, an autoregressive multimodal language model that uses an optimized Transformer architecture with Mixture of Experts (MoE) and vision-language capabilities. For more information, refer to the Qwen3.5-397B-A17B model card. The NVIDIA Qwen3.5-397B-A17B-NVFP4 model was quantized using the TensorRT Model Optimizer.

This model is ready for commercial/non-commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see link to Non-NVIDIA (Qwen3.5-397B-A17B) Model Card.

License/Terms of Use:

Apache 2.0

Deployment Geography:

Global

Use Case:

Developers looking to use off-the-shelf, pre-quantized models for deployment in AI Agent systems, chatbots, RAG systems, and other AI-powered applications.

Release Date:

Hugging Face via https://huggingface.co/nvidia/Qwen3.5-397B-A17B-NVFP4

Model Architecture:

Architecture Type: Transformer (Hybrid)
Network Architecture: Qwen3_5MoeForConditionalGeneration
Model Details:

  • Total Parameters: 397B
  • Active Parameters: 17B (Sparse Mixture-of-Experts)
  • Hidden Size: 4096
  • Number of Layers: 60 (15 blocks of 3x Gated DeltaNet + 1x Gated Attention, each followed by MoE)
  • Expert Configuration: 512 total experts, 10 activated per token + 1 shared expert, expert intermediate dimension 1024.
  • Gated DeltaNet (Linear Attention): 64 value heads, 16 QK heads, head dimension 128. Provides long-context efficiency.
  • Gated Attention (Full Attention): 32 Q heads, 2 KV heads, head dimension 256, with 64-dim rotary position embedding.
  • Context Window: 262,144 tokens (native), extensible to 1,010,000 tokens with YaRN scaling.
  • Vocabulary Size: 248,320
  • Multi-Token Prediction (MTP): 1 additional prediction layer for speculative decoding.
  • Thinking Mode: Default behavior with <think>...</think> blocks before responses.
  • Multilingual: Supports 201 languages and dialects.
  • Vision Encoder: 27-layer ViT with 1152 hidden size, patch size 16, spatial merge size 2.
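
Many of these hyperparameters can be read directly from the checkpoint's configuration. A minimal sketch, assuming the attribute names below (they are assumptions based on common Qwen MoE configs; inspect config.to_dict() for the actual keys):

from transformers import AutoConfig

# trust_remote_code may be required if this architecture is not yet in transformers.
config = AutoConfig.from_pretrained("vincentzed-hf/Qwen3.5-397B-A17B-NVFP4")

# Key names below are assumptions; check config.to_dict() for the real ones.
print(config.hidden_size)              # expected: 4096
print(config.num_hidden_layers)        # expected: 60
print(config.num_experts)              # expected: 512
print(config.num_experts_per_tok)      # expected: 10
print(config.vocab_size)               # expected: 248320
print(config.max_position_embeddings)  # expected: 262144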

Input:

Input Type(s): Text, Image, Video
Input Format(s): String, Image, Video
Input Parameters: 1D (One-Dimensional): Sequences, 2D (Two-Dimensional): Images, 3D (Three-Dimensional): Video

Output:

Output Type(s): Text
Output Format: String
Output Parameters: 1D (One-Dimensional): Sequences
Other Properties Related to Output: N/A

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • SGLang

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Blackwell (B300, B200, RTX PRO 6000 Blackwell)

Preferred Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

The model is quantized with nvidia-modelopt 0.42.0rc1.dev21+g421985313.

Training, Testing, and Evaluation Datasets:

Calibration Dataset:

Training Datasets:

  • Data Collection Method by Dataset: Undisclosed
  • Labeling Method by Dataset: Undisclosed
  • Properties: Undisclosed

Testing Dataset:

  • Data Collection Method by Dataset: Undisclosed
  • Labeling Method by Dataset: Undisclosed
  • Properties: Undisclosed

Evaluation Dataset:

  • Data collection method: Hybrid: Automated, Human
  • Labeling method: Hybrid: Human, Automated

Inference:

Acceleration Engine: SGLang
Test Hardware: B300

Recommended Hardware

| GPU | Architecture | VRAM | Memory Type | Memory Bandwidth | TDP |
|---|---|---|---|---|---|
| NVIDIA B300 (SXM) | Blackwell Ultra (GB110) | 288 GB | HBM3e | 4.1 TB/s | 1400 W |
| NVIDIA B200 (SXM) | Blackwell (GB100) | 192 GB | HBM3e | 4.1 TB/s | 1000 W |
| NVIDIA RTX PRO 6000 Blackwell (PCIe) | Blackwell (GB202) | 96 GB | GDDR7 | 1.8 TB/s | 600 W |

B300: 288 GB HBM3e per GPU, 4096-bit memory bus, 18,944 CUDA cores, 592 Tensor Cores, up to 2032 MHz boost. Datacenter SXM module.
B200: 192 GB HBM3e per GPU, 4096-bit memory bus, 18,944 CUDA cores, 592 Tensor Cores, up to 1965 MHz boost. Datacenter SXM module.
RTX PRO 6000 Blackwell: 96 GB GDDR7 per GPU, 512-bit memory bus, 24,064 CUDA cores, 752 Tensor Cores, up to 2617 MHz boost. Professional workstation GPU (PCIe 5.0 x16).

Post Training Quantization

This model was obtained by quantizing the weights and activations of Qwen3.5-397B-A17B to the NVFP4 data type, ready for inference with SGLang. Only the weights and activations of the linear operators within the transformer blocks are quantized to NVFP4; the KV cache is additionally quantized to FP8. Vision encoder weights are not quantized. This optimization reduces the number of bits per parameter from 16 to 4, reducing disk size and GPU memory requirements by approximately 4x.
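
As a rough sanity check on the savings, here is a back-of-the-envelope sketch. It assumes NVFP4's 16-element micro-blocks, each carrying one FP8 scale factor (~0.5 extra bits per parameter):

# Approximate weight storage for 397B parameters.
params = 397e9

bf16_gb = params * 16 / 8 / 1e9           # 16 bits/param -> ~794 GB
nvfp4_gb = params * (4 + 0.5) / 8 / 1e9   # 4-bit values + block scales -> ~223 GB

print(f"BF16:  ~{bf16_gb:.0f} GB")
print(f"NVFP4: ~{nvfp4_gb:.0f} GB")
print(f"Reduction: ~{bf16_gb / nvfp4_gb:.1f}x")

The unquantized vision encoder and embeddings add a little on top of this estimate, which is consistent with the ~224 GB checkpoint size reported below.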

Usage

Deploy with SGLang

The total quantized checkpoint size is ~224GB (397B total parameters, 17B active). On NVIDIA Blackwell GPUs:

| Configuration | GPUs | VRAM per GPU | Total VRAM | Throughput |
|---|---|---|---|---|
| B300 TP=4 | 4x B300 | 288 GB | 1,152 GB | ~120 tok/s |
| B300 TP=8 | 8x B300 | 288 GB | 2,304 GB | - |
| B200 TP=4 | 4x B200 | 192 GB | 768 GB | - |
| B200 TP=8 | 8x B200 | 192 GB | 1,536 GB | - |
| RTX PRO 6000 TP=4 | 4x RTX PRO 6000 | 96 GB | 384 GB | - |
| RTX PRO 6000 TP=8 | 8x RTX PRO 6000 | 96 GB | 768 GB | - |
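
As a rough feasibility check, assuming the ~224 GB of weights shard evenly across tensor-parallel ranks (a simplification: per-GPU usage also includes the KV cache, activations, and CUDA graph buffers):

# Per-GPU weight share for a few of the configurations above.
checkpoint_gb = 224
for name, gpus, vram_gb in [("B300 TP=4", 4, 288), ("B200 TP=4", 4, 192),
                            ("RTX PRO 6000 TP=4", 4, 96), ("RTX PRO 6000 TP=8", 8, 96)]:
    weights = checkpoint_gb / gpus
    print(f"{name}: ~{weights:.0f} GB weights/GPU, ~{vram_gb - weights:.0f} GB headroom")

This is why TP=8 is the safer choice on the 96 GB RTX PRO 6000: TP=4 leaves only ~40 GB per GPU for everything else.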

The model has a default context length of 262,144 tokens. If you encounter out-of-memory (OOM) errors, consider reducing the context window. However, because Qwen3.5 leverages extended context for complex tasks, we advise maintaining a context length of at least 128K tokens to preserve thinking capabilities.

# TP=4 (recommended, ~120 tok/s on 4x B300)
python3 -m sglang.launch_server \
    --model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
    --quantization modelopt_fp4 \
    --tp 4 \
    --context-length 262144 \
    --reasoning-parser qwen3

# TP=8 (if you have less VRAM per GPU, e.g. RTX PRO 6000)
python3 -m sglang.launch_server \
    --model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
    --quantization modelopt_fp4 \
    --tp 8 \
    --context-length 262144 \
    --reasoning-parser qwen3
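
Once the server is running, it exposes an OpenAI-compatible API. A minimal client sketch (the port assumes SGLang's default of 30000, and the image URL is a placeholder):

from openai import OpenAI

# Point the OpenAI client at the local SGLang server.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="vincentzed-hf/Qwen3.5-397B-A17B-NVFP4",
    messages=[{
        "role": "user",
        "content": [
            # Placeholder image URL; replace with your own.
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "Summarize the trend shown in this chart."},
        ],
    }],
)
print(response.choices[0].message.content)

With --reasoning-parser qwen3 enabled, the model's <think>...</think> content is parsed out of the main response and may be surfaced separately (e.g., as reasoning_content), depending on your client version.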

Speculative Decoding (Experimental)

Qwen3.5 includes a built-in Multi-Token Prediction (MTP) head that can be used for speculative decoding via the NEXTN algorithm: the MTP head drafts tokens ahead of the main model, which then verifies them in parallel, so the realized speedup depends on how often drafts are accepted. This feature is experimental and may not yet work reliably:

python3 -m sglang.launch_server \
    --model-path vincentzed-hf/Qwen3.5-397B-A17B-NVFP4 \
    --quantization modelopt_fp4 \
    --tp 8 \
    --context-length 262144 \
    --reasoning-parser qwen3 \
    --speculative-algo NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4

Installation

Important: You must install SGLang from the bzhng-development:vz/qwen3-5 branch, which includes a fix that correctly excludes the vision encoder weights from quantization during inference. Without this fix, the visual weights may be handled incorrectly.

git clone -b vz/qwen3-5 git@github.com:bzhng-development/sglang.git
cd sglang
uv pip install -e "python"
uv pip install transformers==5.2.0

When a release is cut with this fix, we will update this model card.
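
A quick sanity check that the editable install and the pinned transformers version are active (this only verifies imports and versions, not model loading):

import sglang
import transformers

print("sglang:", sglang.__version__)
print("transformers:", transformers.__version__)
assert transformers.__version__ == "5.2.0", "unexpected transformers version"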

Reproduce with ModelOpt

To reproduce the NVFP4 quantized checkpoint yourself using the TensorRT Model Optimizer:

python3 examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path Qwen/Qwen3.5-397B-A17B \
    --qformat nvfp4 \
    --export_path ./qwen3-5-nvfp4

Note: NVFP4 weights combined with the FP8 KV cache provide a significant memory footprint reduction (~3.5x vs. BF16) with negligible accuracy degradation.
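
The FP8 KV cache savings compound with the hybrid layout: assuming only the 15 gated (full) attention layers allocate a growing KV cache, while the Gated DeltaNet layers keep a fixed-size recurrent state, the per-token footprint stays small. A sketch using the head counts listed above:

# Per-token KV cache across the 15 full-attention layers (2 KV heads, head dim 256).
layers, kv_heads, head_dim = 15, 2, 256

per_token_fp8 = layers * kv_heads * head_dim * 2 * 1   # K and V, 1 byte each (FP8)
per_token_bf16 = layers * kv_heads * head_dim * 2 * 2  # K and V, 2 bytes each (BF16)

print(per_token_fp8, "bytes/token")   # 15360 (~15 KB)
print(per_token_bf16, "bytes/token")  # 30720 (~30 KB)
# A full 262,144-token context: ~4.0 GB in FP8 vs ~8.1 GB in BF16.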

Baseline: Qwen3.5-397B-A17B.

Model Limitations:

The base model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic inputs. The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, and it may produce socially unacceptable or undesirable output even when the prompt contains nothing explicitly offensive.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
