Missing <think> tag
Successfully running on RTX 4080S 32G cards with vLLM 0.16.0rc2.dev315+g648951a9c, but the <think> </think> tags are missing from the output.
export CONTEXT_LENGTH=60000
export CUDA_DEVICE_WAIT_POLICY=1
export VLLM_NVFP4_GEMM_BACKEND=marlin
export VLLM_MARLIN_USE_ATOMIC_ADD=1
# --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
# --language-model-only
CUDA_DEVICE_ORDER=PCI_BUS_ID \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,8 \
VLLM_USE_MODELSCOPE=true vllm serve \
/mnt/models/hf_home/Qwen3.5-397B-A17B-NVFP4 \
--port 8000 \
--tensor-parallel-size 8 \
--max-model-len $CONTEXT_LENGTH \
--served-model-name qwen3.5-plus \
--enable-auto-tool-choice \
--kv-cache-dtype fp8 \
--swap-space 16 \
--max-num-seqs 6 \
--gpu-memory-utilization 0.95 \
--language-model-only \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder
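A quick way to see whether the reasoning is being separated at all (a minimal sketch; the prompt is a placeholder and jq is assumed to be installed) is to hit the OpenAI-compatible endpoint and inspect both fields. With a working reasoning parser, the thinking should land in reasoning_content rather than between <think> tags in content; if neither carries it, the tags are being lost upstream of the parser.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3.5-plus", "messages": [{"role": "user", "content": "Briefly, why is the sky blue?"}], "max_tokens": 512}' \
  | jq '.choices[0].message | {reasoning_content, content}'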
Hello,
I have discovered an issue in the quantization; I will requantize and re-upload later today with some fixes.
Have you tried --reasoning-parser deepseek_r1?
The following PR solved my problem: https://github.com/vllm-project/vllm/pull/34779
Thank you. AWQ 4-bit is more widely used on pre-Blackwell GPUs. Would you consider it?
+1
Due to some kernel problems, INT W4A16 currently always gives better performance than FP4 W4A4...
But my preference is GPTQ: it seems to give slightly better performance than AWQ, and it is easier to use for mixed quantization.
May I ask what throughput you are getting?
I have re-uploaded the weights with some issues fixed.
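For anyone picking up the fixed weights, re-downloading over the existing local path should be enough. A minimal sketch (the repo id below is a placeholder for this quant's actual repository):
huggingface-cli download <repo-id-of-this-quant> \
  --local-dir /mnt/models/hf_home/Qwen3.5-397B-A17B-NVFP4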
Logs for a single request:
(APIServer pid=28860) INFO: 192.168.100.150:39576 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=28860) INFO 02-21 12:29:52 [loggers.py:259] Engine 000: Avg prompt throughput: 2.0 tokens/s, Avg generation throughput: 40.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.7%, Prefix cache hit rate: 0.0%
(APIServer pid=28860) INFO 02-21 12:30:02 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 93.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.1%, Prefix cache hit rate: 0.0%
(APIServer pid=28860) INFO 02-21 12:30:12 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 93.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.5%, Prefix cache hit rate: 0.0%
(APIServer pid=28860) INFO 02-21 12:30:22 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 93.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.9%, Prefix cache hit rate: 0.0%
(APIServer pid=28860) INFO 02-21 12:30:32 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 93.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.9%, Prefix cache hit rate: 0.0%
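For a fuller throughput picture than a single streamed request, vLLM's bundled serving benchmark can drive the running server at higher concurrency. A rough sketch, assuming a recent build with the vllm bench subcommand (older builds take the same flags via benchmarks/benchmark_serving.py); input/output lengths and prompt count are arbitrary and should be adjusted to your workload:
vllm bench serve \
  --base-url http://localhost:8000 \
  --model qwen3.5-plus \
  --tokenizer /mnt/models/hf_home/Qwen3.5-397B-A17B-NVFP4 \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 512 \
  --num-prompts 64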