Missing <think> tag
Successfully running on RTX 4080S 32G cards with vLLM 0.16.0rc2.dev315+g648951a9c, but the <think> </think> tags are missing from the output.
export CONTEXT_LENGTH=60000
export CUDA_DEVICE_WAIT_POLICY=1
export VLLM_NVFP4_GEMM_BACKEND=marlin
export VLLM_MARLIN_USE_ATOMIC_ADD=1
# --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
# --language-model-only
CUDA_DEVICE_ORDER=PCI_BUS_ID \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,8 \
VLLM_USE_MODELSCOPE=true vllm serve \
/mnt/models/hf_home/Qwen3.5-397B-A17B-NVFP4 \
--port 8000 \
--tensor-parallel-size 8 \
--max-model-len $CONTEXT_LENGTH \
--served-model-name qwen3.5-plus \
--enable-auto-tool-choice \
--kv-cache-dtype fp8 \
--swap-space 16 \
--max-num-seqs 6 \
--gpu-memory-utilization 0.95 \
--language-model-only \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder
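A quick way to see whether the reasoning is being separated at all (a minimal sketch; the prompt is a placeholder and jq is assumed to be installed) is to hit the OpenAI-compatible endpoint and inspect both fields. With a working reasoning parser, the thinking should land in reasoning_content rather than between <think> tags in content; if neither carries it, the tags are being lost upstream of the parser.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3.5-plus", "messages": [{"role": "user", "content": "Briefly, why is the sky blue?"}], "max_tokens": 512}' \
  | jq '.choices[0].message | {reasoning_content, content}'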
Hello,
I have discovered an issue in the quantization; I will requantize and re-upload later today with some fixes.
Have you tried --reasoning-parser deepseek_r1?
The following PR solved my problem: https://github.com/vllm-project/vllm/pull/34779
Thank you. AWQ 4-bit is more widely used on pre-Blackwell GPUs. Would you consider it?
+1
Due to some kernel problems, INT W4A16 currently always gives better performance than FP4 W4A4...
But my preference is GPTQ: it seems to give slightly better performance than AWQ, and it is easier to use for mixed quantization.
May I ask what throughput you are getting?
I have re-uploaded the weights with some issues fixed.
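For anyone picking up the fixed weights, re-downloading over the existing local path should be enough. A minimal sketch (the repo id below is a placeholder for this quant's actual repository):
huggingface-cli download <repo-id-of-this-quant> \
  --local-dir /mnt/models/hf_home/Qwen3.5-397B-A17B-NVFP4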
Logs for a single request:
(APIServer pid=28860) INFO: 192.168.100.150:39576 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=28860) INFO 02-21 12:29:52 [loggers.py:259] Engine 000: Avg prompt throughput: 2.0 tokens/s, Avg generation throughput: 40.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.7%, Prefix cache hit rate: 0.0%
(APIServer pid=28860) INFO 02-21 12:30:02 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 93.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.1%, Prefix cache hit rate: 0.0%
(APIServer pid=28860) INFO 02-21 12:30:12 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 93.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.5%, Prefix cache hit rate: 0.0%
(APIServer pid=28860) INFO 02-21 12:30:22 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 93.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.9%, Prefix cache hit rate: 0.0%
(APIServer pid=28860) INFO 02-21 12:30:32 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 93.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.9%, Prefix cache hit rate: 0.0%
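For a fuller throughput picture than a single streamed request, vLLM's bundled serving benchmark can drive the running server at higher concurrency. A rough sketch, assuming a recent build with the vllm bench subcommand (older builds take the same flags via benchmarks/benchmark_serving.py); input/output lengths and prompt count are arbitrary and should be adjusted to your workload:
vllm bench serve \
  --base-url http://localhost:8000 \
  --model qwen3.5-plus \
  --tokenizer /mnt/models/hf_home/Qwen3.5-397B-A17B-NVFP4 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 512 \
  --num-prompts 64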