Commit 285b7b5
Parent(s): 7cad2ba

add KTransformers deployment guide (#30)

Co-authored-by: unicorn chan <UnicornChan@users.noreply.huggingface.co>
README.md CHANGED
````diff
@@ -14,7 +14,7 @@ pipeline_tag: image-text-to-text
 > [!Note]
 > This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.
 >
-> These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, etc.
+> These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.

 > [!Tip]
 > For users seeking managed, scalable inference without infrastructure maintenance, the official Qwen API service is provided by [Alibaba Cloud Model Studio](https://modelstudio.alibabacloud.com/).
@@ -920,7 +920,7 @@ In the following, we show example commands to launch OpenAI-Compatible API serve
 > [!Important]
 > Inference efficiency and throughput vary significantly across frameworks.
 > We recommend using the latest framework versions to ensure optimal performance and compatibility.
-> For production workloads or high-throughput scenarios, dedicated serving engines such as SGLang or vLLM are strongly recommended.
+> For production workloads or high-throughput scenarios, dedicated serving engines such as SGLang, KTransformers or vLLM are strongly recommended.

 > [!Important]
 > The model has a default context length of 262,144 tokens.
@@ -993,6 +993,11 @@ The following will create API endpoints at `http://localhost:8000/v1`:
 vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --language-model-only
 ```

+#### KTransformers
+
+[KTransformers](https://github.com/kvcache-ai/ktransformers) is a flexible framework for experiencing cutting-edge LLM inference optimizations with CPU-GPU heterogeneous computing.
+For running Qwen3.5 with KTransformers, see the [KTransformers Deployment Guide](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/Qwen3.5.md).
+
 #### Hugging Face Transformers

 Hugging Face Transformers contains a _lightweight_ server which can be used for quick testing and moderate load deployment.
````
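With any of these backends (vLLM, SGLang, or the newly documented KTransformers) serving the model, the hunk above notes the API endpoints appear at `http://localhost:8000/v1`. As a quick smoke test, a minimal OpenAI-compatible request could look like the sketch below; the model name and port are taken from the `vllm serve` command in the hunk, while the prompt is a placeholder:

```shell
# Sketch: query the OpenAI-compatible chat endpoint from the hunk above.
# Assumes a server (vLLM/SGLang/KTransformers) is listening on localhost:8000.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3.5-397B-A17B",
        "messages": [
          {"role": "user", "content": "Give me a short introduction to large language models."}
        ]
      }'
```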
````diff
@@ -1276,7 +1281,7 @@ For more information, please refer to [Qwen Code](https://qwenlm.github.io/qwen-
 Qwen3.5 natively supports context lengths of up to 262,144 tokens.
 For long-horizon tasks where the total length (including both input and output) exceeds this limit, we recommend using RoPE scaling techniques, e.g., YaRN, to handle long texts effectively.

-YaRN is currently supported by several inference frameworks, e.g., `transformers`, `vllm` and `sglang`.
+YaRN is currently supported by several inference frameworks, e.g., `transformers`, `vllm`, `ktransformers` and `sglang`.
 In general, there are two approaches to enabling YaRN for supported frameworks:

 - Modifying the model configuration file:
@@ -1304,7 +1309,7 @@ In general, there are two approaches to enabling YaRN for supported frameworks:
 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve ... --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --max-model-len 1010000
 ```

-For `sglang`, you can use
+For `sglang` and `ktransformers`, you can use
 ```shell
 SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server ... --json-model-override-args '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --context-length 1010000
 ```
````
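The first approach listed in the hunk above, modifying the model configuration file, is elided from this diff. As an illustration only, assuming the `config.json` keys mirror the `--hf-overrides`/`--json-model-override-args` payloads shown in the commands, a sketch of applying the same `rope_parameters` block to a local checkout:

```shell
# Sketch (assumption): write the override payload from the commands above into
# text_config.rope_parameters of a local config.json. The path is hypothetical,
# and the keys are inferred from the --hf-overrides JSON, not from the README's
# elided snippet.
python - <<'EOF'
import json

path = "Qwen3.5-397B-A17B/config.json"  # hypothetical local checkout path

with open(path) as f:
    cfg = json.load(f)

# Same parameters as the vLLM/SGLang override strings in the hunk above.
cfg.setdefault("text_config", {})["rope_parameters"] = {
    "mrope_interleaved": True,
    "mrope_section": [11, 11, 10],
    "rope_type": "yarn",
    "rope_theta": 10000000,
    "partial_rotary_factor": 0.25,
    "factor": 4.0,
    "original_max_position_embeddings": 262144,
}

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
EOF
```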