Commit 285b7b5
Parent(s): 7cad2ba

add KTransformers deployment guide (#30)

Co-authored-by: unicorn chan <UnicornChan@users.noreply.huggingface.co>
README.md CHANGED
````diff
@@ -14,7 +14,7 @@ pipeline_tag: image-text-to-text
 > [!Note]
 > This repository contains model weights and configuration files for the post-trained model in the Hugging Face Transformers format.
 >
-> These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, etc.
+> These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, KTransformers, etc.

 > [!Tip]
 > For users seeking managed, scalable inference without infrastructure maintenance, the official Qwen API service is provided by [Alibaba Cloud Model Studio](https://modelstudio.alibabacloud.com/).
@@ -920,7 +920,7 @@ In the following, we show example commands to launch OpenAI-Compatible API serve
 > [!Important]
 > Inference efficiency and throughput vary significantly across frameworks.
 > We recommend using the latest framework versions to ensure optimal performance and compatibility.
-> For production workloads or high-throughput scenarios, dedicated serving engines such as SGLang or vLLM are strongly recommended.
+> For production workloads or high-throughput scenarios, dedicated serving engines such as SGLang, KTransformers or vLLM are strongly recommended.

 > [!Important]
 > The model has a default context length of 262,144 tokens.
@@ -993,6 +993,11 @@ The following will create API endpoints at `http://localhost:8000/v1`:
 vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --language-model-only
 ```

+#### KTransformers
+
+[KTransformers](https://github.com/kvcache-ai/ktransformers) is a flexible framework for experiencing cutting-edge LLM inference optimizations with CPU-GPU heterogeneous computing.
+For running Qwen3.5 with KTransformers, see the [KTransformers Deployment Guide](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/Qwen3.5.md).
+
 #### Hugging Face Transformers

 Hugging Face Transformers contains a _lightweight_ server which can be used for quick testing and moderate load deployment.
````
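With any of these backends (vLLM, SGLang, or the newly documented KTransformers) serving the model, the hunk above notes the API endpoints appear at `http://localhost:8000/v1`. As a quick smoke test, a minimal OpenAI-compatible request could look like the sketch below; the model name and port are taken from the `vllm serve` command in the hunk, while the prompt is a placeholder:

```shell
# Sketch: query the OpenAI-compatible chat endpoint from the hunk above.
# Assumes a server (vLLM/SGLang/KTransformers) is listening on localhost:8000.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3.5-397B-A17B",
        "messages": [
          {"role": "user", "content": "Give me a short introduction to large language models."}
        ]
      }'
```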
````diff
@@ -1276,7 +1281,7 @@ For more information, please refer to [Qwen Code](https://qwenlm.github.io/qwen-
 Qwen3.5 natively supports context lengths of up to 262,144 tokens.
 For long-horizon tasks where the total length (including both input and output) exceeds this limit, we recommend using RoPE scaling techniques, e.g., YaRN, to handle long texts effectively.

-YaRN is currently supported by several inference frameworks, e.g., `transformers`, `vllm` and `sglang`.
+YaRN is currently supported by several inference frameworks, e.g., `transformers`, `vllm`, `ktransformers` and `sglang`.
 In general, there are two approaches to enabling YaRN for supported frameworks:

 - Modifying the model configuration file:
@@ -1304,7 +1309,7 @@ In general, there are two approaches to enabling YaRN for supported frameworks:
 VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve ... --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --max-model-len 1010000
 ```

-For `sglang`, you can use
+For `sglang` and `ktransformers`, you can use
 ```shell
 SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 python -m sglang.launch_server ... --json-model-override-args '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' --context-length 1010000
 ```
````
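The first approach listed in the hunk above, modifying the model configuration file, is elided from this diff. As an illustration only, assuming the `config.json` keys mirror the `--hf-overrides`/`--json-model-override-args` payloads shown in the commands, a sketch of applying the same `rope_parameters` block to a local checkout:

```shell
# Sketch (assumption): write the override payload from the commands above into
# text_config.rope_parameters of a local config.json. The path is hypothetical,
# and the keys are inferred from the --hf-overrides JSON, not from the README's
# elided snippet.
python - <<'EOF'
import json

path = "Qwen3.5-397B-A17B/config.json"  # hypothetical local checkout path

with open(path) as f:
    cfg = json.load(f)

# Same parameters as the vLLM/SGLang override strings in the hunk above.
cfg.setdefault("text_config", {})["rope_parameters"] = {
    "mrope_interleaved": True,
    "mrope_section": [11, 11, 10],
    "rope_type": "yarn",
    "rope_theta": 10000000,
    "partial_rotary_factor": 0.25,
    "factor": 4.0,
    "original_max_position_embeddings": 262144,
}

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
EOF
```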