Update README.md
README.md CHANGED
@@ -110,6 +110,7 @@ Llama-3.1-Nemotron-Ultra-253B-v1 is a general purpose reasoning and chat model i
(Coming soon) You can try this model out through the preview API, using this link: [Llama-3_1-Nemotron-Ultra-253B-v1](https://build.nvidia.com/nvidia/llama-3_1-nemotron-ultra-253b-v1).

+### Use It with Transformers
See the snippet below for usage with the [Hugging Face Transformers](https://huggingface.co/docs/transformers/main/en/index) library. Reasoning mode (ON/OFF) is controlled via the system prompt. Please see the example below.

We recommend using the *transformers* package with version 4.48.3.

@@ -166,6 +167,25 @@ thinking = "off"

print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"},{"role": "user", "content": "Solve x*(sin(x)+2)=0"}]))
```
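Only the tail of the Transformers snippet (the `print(pipeline(...))` call above) falls inside this hunk. For readers without the full model card at hand, a minimal sketch of how such a pipeline could be set up is shown below; the dtype, `device_map`, and token budget are illustrative assumptions, not the README's exact values.

```
# Minimal sketch (assumptions noted inline), not the model card's exact snippet.
import torch
import transformers

model_id = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,  # assumption: bf16 weights
    device_map="auto",           # assumption: shard across available GPUs
    trust_remote_code=True,      # the repository ships custom model code
    max_new_tokens=32768,        # assumption: generous budget for reasoning traces
)

# Reasoning mode is toggled through the "detailed thinking" system prompt.
thinking = "on"  # or "off"
print(pipeline([
    {"role": "system", "content": f"detailed thinking {thinking}"},
    {"role": "user", "content": "Solve x*(sin(x)+2)=0"},
]))
```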

+### Use It with vLLM
+
+```
+pip install vllm==0.8.3
+```
+An example of how to serve with vLLM:
+```
+python3 -m vllm.entrypoints.openai.api_server \
+  --model "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1" \
+  --trust-remote-code \
+  --seed=1 \
+  --host="0.0.0.0" \
+  --port=5000 \
+  --served-model-name "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1" \
+  --tensor-parallel-size=8 \
+  --max-model-len=32768 \
+  --gpu-memory-utilization 0.95 \
+  --enforce-eager
```
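Once the server is running, it exposes an OpenAI-compatible API on the host and port configured above, so it can be queried with any OpenAI-style client. A sketch using the `openai` Python package follows; the API key is a placeholder (vLLM ignores it unless the server is started with one) and the sampling settings are illustrative, not prescriptive.

```
# Sketch: query the vLLM server started above via its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:5000/v1", api_key="dummy")  # port from --port above

response = client.chat.completions.create(
    model="nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",  # must match --served-model-name
    messages=[
        {"role": "system", "content": "detailed thinking on"},  # reasoning mode ON
        {"role": "user", "content": "Solve x*(sin(x)+2)=0"},
    ],
    temperature=0.6,  # illustrative sampling settings
    top_p=0.95,
    max_tokens=8192,  # leave headroom within --max-model-len
)
print(response.choices[0].message.content)
```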
## Inference:
**Engine:**