jiaqiz committed
Commit 2f63d7e · verified · 1 Parent(s): cb891ad

Update README.md

Files changed (1): README.md (+20 -0)

README.md CHANGED
@@ -110,6 +110,7 @@ Llama-3.1-Nemotron-Ultra-253B-v1 is a general purpose reasoning and chat model i
 
 (Coming soon) You can try this model out through the preview API, using this link: [Llama-3_1-Nemotron-Ultra-253B-v1](https://build.nvidia.com/nvidia/llama-3_1-nemotron-ultra-253b-v1).
 
+### Use It with Transformers
 See the snippet below for usage with the [Hugging Face Transformers](https://huggingface.co/docs/transformers/main/en/index) library. Reasoning mode (ON/OFF) is controlled via the system prompt; see the example below.
 
 We recommend using the *transformers* package with version 4.48.3.
@@ -166,6 +167,25 @@ thinking = "off"
 print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"},{"role": "user", "content": "Solve x*(sin(x)+2)=0"}]))
 ```
 
+### Use It with vLLM
+
+```
+pip install vllm==0.8.3
+```
+An example of how to serve with vLLM:
+```
+python3 -m vllm.entrypoints.openai.api_server \
+--model "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1" \
+--trust-remote-code \
+--seed=1 \
+--host="0.0.0.0" \
+--port=5000 \
+--served-model-name "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1" \
+--tensor-parallel-size=8 \
+--max-model-len=32768 \
+--gpu-memory-utilization 0.95 \
+--enforce-eager
+```
 ## Inference:
 **Engine:**
 
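For context on the Transformers path in the diff above: reasoning mode is toggled purely through a `detailed thinking on/off` system prompt. A minimal sketch of that message construction follows; the `build_messages` helper is ours, not part of the README, and the actual pipeline call (which needs multi-GPU hardware for a 253B model) is left commented.

```python
# Sketch of the "detailed thinking on/off" convention from the diff above.
# build_messages is a hypothetical helper; only the system-prompt format
# comes from the README snippet.

def build_messages(prompt: str, thinking: str = "on") -> list:
    """Return a chat message list with reasoning mode set via the system prompt."""
    if thinking not in ("on", "off"):
        raise ValueError("thinking must be 'on' or 'off'")
    return [
        {"role": "system", "content": f"detailed thinking {thinking}"},
        {"role": "user", "content": prompt},
    ]

print(build_messages("Solve x*(sin(x)+2)=0", thinking="off"))

# With transformers 4.48.3 installed and enough GPUs, the README's pipeline
# would consume these messages, e.g.:
# import transformers
# pipeline = transformers.pipeline(
#     "text-generation",
#     model="nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",
#     device_map="auto",
# )
# print(pipeline(build_messages("Solve x*(sin(x)+2)=0", thinking="off")))
```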
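Once started, `vllm.entrypoints.openai.api_server` exposes an OpenAI-compatible endpoint. A hedged client-side sketch, assuming the serve command above is running on localhost:5000; `chat_payload` is our illustrative helper, and the actual HTTP call is commented out since it needs a live server.

```python
import json

# Sketch of a request to the OpenAI-compatible /v1/chat/completions route
# served by vLLM. Assumptions: host/port match the serve command above;
# chat_payload is a hypothetical helper, not part of the README.

def chat_payload(prompt: str, thinking: str = "off") -> dict:
    # Reasoning mode uses the same system-prompt convention as the
    # Transformers snippet.
    return {
        "model": "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",
        "messages": [
            {"role": "system", "content": f"detailed thinking {thinking}"},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": 512,
    }

print(json.dumps(chat_payload("Solve x*(sin(x)+2)=0"), indent=2))

# With the server running:
# import requests
# r = requests.post("http://localhost:5000/v1/chat/completions",
#                   json=chat_payload("Solve x*(sin(x)+2)=0"))
# print(r.json()["choices"][0]["message"]["content"])
```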