littlebird13 and UnicornChan committed
Commit 285b7b5 · 1 Parent(s): 7cad2ba

add KTransformers deployment guide (#30)


- add KTransformers deployment guide (c677b81f9c6420136b0c8960e9b2ac37279d108a)


Co-authored-by: unicorn chan <UnicornChan@users.noreply.huggingface.co>

Files changed (1):
  1. README.md (+9 -4)
README.md CHANGED
@@ -14,7 +14,7 @@ pipeline_tag: image-text-to-text
 > [!Note]
 > This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.
 >
-> These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, etc.
+> These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.
 
 > [!Tip]
 > For users seeking managed, scalable inference without infrastructure maintenance, the official Qwen API service is provided by [Alibaba Cloud Model Studio](https://modelstudio.alibabacloud.com/).
@@ -920,7 +920,7 @@ In the following, we show example commands to launch OpenAI-Compatible API serve
 > [!Important]
 > Inference efficiency and throughput vary significantly across frameworks.
 > We recommend using the latest framework versions to ensure optimal performance and compatibility.
-> For production workloads or high-throughput scenarios, dedicated serving engines such as SGLang or vLLM are strongly recommended.
+> For production workloads or high-throughput scenarios, dedicated serving engines such as SGLang, KTransformers, or vLLM are strongly recommended.
 
 > [!Important]
 > The model has a default context length of 262,144 tokens.
@@ -993,6 +993,11 @@ The following will create API endpoints at `http://localhost:8000/v1`:
 vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --language-model-only
 ```
 
+#### KTransformers
+
+[KTransformers](https://github.com/kvcache-ai/ktransformers) is a flexible framework for experiencing cutting-edge LLM inference optimizations with CPU-GPU heterogeneous computing.
+For running Qwen3.5 with KTransformers, see the [KTransformers Deployment Guide](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/Qwen3.5.md).
+
 #### Hugging Face Transformers
 
 Hugging Face Transformers contains a _lightweight_ server which can be used for quick testing and moderate load deployment.
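Once the server from the `vllm serve` command above is running, the endpoint can be smoke-tested with any OpenAI-compatible client; a minimal `curl` sketch (it assumes a server on port 8000 and the default served model name, which vLLM derives from the model path):

```shell
# send one chat completion request to the vLLM server started above
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3.5-397B-A17B",
        "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}],
        "max_tokens": 512
      }'
```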
@@ -1276,7 +1281,7 @@ For more information, please refer to [Qwen Code](https://qwenlm.github.io/qwen-
 Qwen3.5 natively supports context lengths of up to 262,144 tokens.
 For long-horizon tasks where the total length (including both input and output) exceeds this limit, we recommend using RoPE scaling techniques, e.g., YaRN, to handle long texts effectively.
 
-YaRN is currently supported by several inference frameworks, e.g., `transformers`, `vllm` and `sglang`.
+YaRN is currently supported by several inference frameworks, e.g., `transformers`, `vllm`, `ktransformers`, and `sglang`.
 In general, there are two approaches to enabling YaRN for supported frameworks:
 
 - Modifying the model configuration file:
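For the first approach, the `rope_parameters` object used in the framework commands of the next hunk can equally be written into the checkpoint's `config.json`; a sketch of just that fragment (values copied verbatim from the commands below; the real `config.json` contains many other keys, which are omitted here):

```json
{
  "text_config": {
    "rope_parameters": {
      "mrope_interleaved": true,
      "mrope_section": [11, 11, 10],
      "rope_type": "yarn",
      "rope_theta": 10000000,
      "partial_rotary_factor": 0.25,
      "factor": 4.0,
      "original_max_position_embeddings": 262144
    }
  }
}
```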
@@ -1304,7 +1309,7 @@ In general, there are two approaches to enabling YaRN for supported frameworks:
 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve ... --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --max-model-len 1010000
 ```
 
-For `sglang`, you can use
+For `sglang` and `ktransformers`, you can use
 ```shell
 SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server ... --json-model-override-args '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --context-length 1010000
 ```
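The requested context length is consistent with the scaling parameters: YaRN with `factor` 4.0 extends the original 262,144-token window to at most 4.0 × 262144 = 1,048,576 tokens, so the `--max-model-len`/`--context-length` value of 1,010,000 in the commands above fits (a quick arithmetic check):

```shell
# scaled window = factor * original_max_position_embeddings
echo $((4 * 262144))   # 1048576, above the requested 1010000
```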
 