# Intern-S1-Pro Deployment Guide

The Intern-S1-Pro release is a 1T-parameter model stored in FP8 format. Deployment requires at least **two 8-GPU H200** nodes, using either of the following parallelism configurations:

- Tensor Parallelism (TP)
- Data Parallelism (DP) + Expert Parallelism (EP)
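
Before launching either configuration, it is worth confirming that each node actually exposes eight healthy GPUs. A minimal check, assuming the standard NVIDIA driver utilities are available on the host:

```bash
# list every visible GPU with its name and total memory (run on each node)
nvidia-smi --query-gpu=index,name,memory.total --format=csv

# optional: inspect the NVLink/PCIe topology between GPUs
nvidia-smi topo -m
```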

> NOTE: The deployment examples in this guide are provided for reference only and may not represent the latest or most optimized configurations. Inference frameworks are under active development — always consult the official documentation from each framework’s maintainers to ensure peak performance and compatibility.

## LMDeploy

Requires `lmdeploy>=0.12.1`.
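
A quick way to install a compatible release and confirm the version (standard `pip` workflow; adjust to your own environment or image):

```bash
# install or upgrade LMDeploy to a compatible release
pip install -U "lmdeploy>=0.12.1"

# confirm the installed version
pip show lmdeploy | grep Version
```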

- Tensor Parallelism

```bash
# first bring up a Ray cluster spanning node 0 and node 1 (see the sketch below)

# node 0: launch the API server with tensor parallelism across all 16 GPUs
lmdeploy serve api_server internlm/Intern-S1-Pro --backend pytorch --tp 16
```
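
The first comment above refers to a Ray cluster that must span both nodes before the server is launched. A minimal sketch of bringing one up (the port is an arbitrary choice and `${node0_ip}` is a placeholder, as elsewhere in this guide):

```bash
# node 0: start the Ray head
ray start --head --port=6379

# node 1: join the cluster by pointing at the head node
ray start --address=${node0_ip}:6379

# either node: confirm both nodes (16 GPUs in total) are registered
ray status
```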

- Data Parallelism + Expert Parallelism

```bash
# node 0, proxy server
lmdeploy serve proxy --server-name ${proxy_server_ip} --server-port ${proxy_server_port} --routing-strategy 'min_expected_latency' --serving-strategy Hybrid

# node 0
export LMDEPLOY_DP_MASTER_ADDR=${node0_ip}
export LMDEPLOY_DP_MASTER_PORT=29555
lmdeploy serve api_server \
    internlm/Intern-S1-Pro \
    --backend pytorch \
    --tp 1 \
    --dp 16 \
    --ep 16 \
    --proxy-url http://${proxy_server_ip}:${proxy_server_port} \
    --nnodes 2 \
    --node-rank 0 \
    --reasoning-parser intern-s1 \
    --tool-call-parser qwen3

# node 1
export LMDEPLOY_DP_MASTER_ADDR=${node0_ip}
export LMDEPLOY_DP_MASTER_PORT=29555
lmdeploy serve api_server \
    internlm/Intern-S1-Pro \
    --backend pytorch \
    --tp 1 \
    --dp 16 \
    --ep 16 \
    --proxy-url http://${proxy_server_ip}:${proxy_server_port} \
    --nnodes 2 \
    --node-rank 1 \
    --reasoning-parser intern-s1 \
    --tool-call-parser qwen3
```
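
Once both ranks have joined, client traffic goes through the proxy server. Assuming the proxy exposes the same OpenAI-compatible routes as the LMDeploy API server, a sanity check could look like the following (model name and routes are illustrative; verify against the LMDeploy documentation):

```bash
# list the model instances registered behind the proxy
curl http://${proxy_server_ip}:${proxy_server_port}/v1/models

# send a minimal chat completion request
curl http://${proxy_server_ip}:${proxy_server_port}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "internlm/Intern-S1-Pro",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }'
```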

## vLLM

You can deploy with the vLLM nightly Docker image `vllm/vllm-openai:nightly`. Refer to [using-docker](https://docs.vllm.ai/en/latest/deployment/docker/?h=docker) for more details.
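
A typical way to start a container from that image on each node is sketched below; the exact mounts and flags (shared memory size, RDMA devices for the DeepEP backend) depend on your environment, so treat this as a starting point rather than a definitive invocation:

```bash
# pull the nightly image and open a shell inside it; the `vllm serve`
# commands below are then run inside the container on each node
docker pull vllm/vllm-openai:nightly

docker run --gpus all -it --rm \
  --network host \
  --ipc host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --entrypoint /bin/bash \
  vllm/vllm-openai:nightly
```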

```bash
# node 0
export VLLM_ENGINE_READY_TIMEOUT_S=10000
vllm serve internlm/Intern-S1-Pro \
    --all2all-backend deepep_low_latency \
    --tensor-parallel-size 1 \
    --enable-expert-parallel \
    --data-parallel-size 16 \
    --data-parallel-size-local 8 \
    --data-parallel-address ${node0_ip} \
    --data-parallel-rpc-port 13345 \
    --gpu_memory_utilization 0.8 \
    --mm_processor_cache_gb=0 \
    --media-io-kwargs '{"video": {"num_frames": 768, "fps": 2}}' \
    --max-model-len 65536 \
    --trust-remote-code \
    --api-server-count=8 \
    --reasoning-parser deepseek_r1 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes

# node 1
export VLLM_ENGINE_READY_TIMEOUT_S=10000
vllm serve internlm/Intern-S1-Pro \
    --all2all-backend deepep_low_latency \
    --tensor-parallel-size 1 \
    --enable-expert-parallel \
    --data-parallel-size 16 \
    --data-parallel-size-local 8 \
    --data-parallel-start-rank 8 \
    --data-parallel-address ${node0_ip} \
    --data-parallel-rpc-port 13345 \
    --gpu_memory_utilization 0.8 \
    --mm_processor_cache_gb=0 \
    --media-io-kwargs '{"video": {"num_frames": 768, "fps": 2}}' \
    --max-model-len 65536 \
    --trust-remote-code \
    --headless \
    --reasoning-parser deepseek_r1 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes
```

> NOTE: To prevent out-of-memory (OOM) errors, we limit the context length using `--max-model-len 65536`. For datasets requiring longer responses, you may increase this value as needed. Additionally, video inference can consume substantial memory in vLLM API server processes; we therefore recommend setting `--media-io-kwargs '{"video": {"num_frames": 768, "fps": 2}}'` to constrain preprocessing memory usage during video benchmarking.
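
Once the engines on both nodes report ready, the OpenAI-compatible API is served from node 0 (port 8000 by default, unless `--port` is passed). A minimal smoke test:

```bash
curl http://${node0_ip}:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "internlm/Intern-S1-Pro",
        "messages": [{"role": "user", "content": "Briefly introduce yourself."}],
        "max_tokens": 128
      }'
```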

## SGLang

You can deploy with the Docker image `lmsysorg/sglang:dev`. Refer to [using-docker](https://docs.sglang.io/get_started/install.html#method-3-using-docker) for more details.

- Tensor Parallelism + Expert Parallelism

```bash
export DIST_ADDR=${master_node_ip}:${master_node_port}

# node 0
python3 -m sglang.launch_server \
  --model-path internlm/Intern-S1-Pro \
  --tp 16 \
  --ep 16 \
  --mem-fraction-static 0.85 \
  --trust-remote-code \
  --dist-init-addr ${DIST_ADDR} \
  --nnodes 2 \
  --attention-backend fa3 \
  --mm-attention-backend fa3 \
  --keep-mm-feature-on-device \
  --node-rank 0 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen

# node 1
python3 -m sglang.launch_server \
  --model-path internlm/Intern-S1-Pro \
  --tp 16 \
  --ep 16 \
  --mem-fraction-static 0.85 \
  --trust-remote-code \
  --dist-init-addr ${DIST_ADDR} \
  --nnodes 2 \
  --attention-backend fa3 \
  --mm-attention-backend fa3 \
  --keep-mm-feature-on-device \
  --node-rank 1 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen
```
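
After both ranks finish loading, node 0 serves the HTTP API (port 30000 by default, unless `--port` is set). A quick sanity check against the health endpoint and the OpenAI-compatible chat route:

```bash
# confirm the server is up
curl http://${master_node_ip}:30000/health

# send a minimal chat completion request
curl http://${master_node_ip}:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "internlm/Intern-S1-Pro",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }'
```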