jiaqiz committed
Commit 2f63d7e · verified · 1 Parent(s): cb891ad

Update README.md

Files changed (1): README.md (+20 -0)

README.md CHANGED
@@ -110,6 +110,7 @@ Llama-3.1-Nemotron-Ultra-253B-v1 is a general purpose reasoning and chat model i
 
 (Coming soon) You can try this model out through the preview API, using this link: [Llama-3_1-Nemotron-Ultra-253B-v1](https://build.nvidia.com/nvidia/llama-3_1-nemotron-ultra-253b-v1).
 
+### Use It with Transformers
 See the snippet below for usage with the [Hugging Face Transformers](https://huggingface.co/docs/transformers/main/en/index) library. Reasoning mode (ON/OFF) is controlled via the system prompt; see the example below.
 
 We recommend using the *transformers* package with version 4.48.3.
@@ -166,6 +167,25 @@ thinking = "off"
 print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"},{"role": "user", "content": "Solve x*(sin(x)+2)=0"}]))
 ```
 
+### Use It with vLLM
+
+```
+pip install vllm==0.8.3
+```
+An example of how to serve with vLLM:
+```
+python3 -m vllm.entrypoints.openai.api_server \
+--model "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1" \
+--trust-remote-code \
+--seed=1 \
+--host="0.0.0.0" \
+--port=5000 \
+--served-model-name "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1" \
+--tensor-parallel-size=8 \
+--max-model-len=32768 \
+--gpu-memory-utilization 0.95 \
+--enforce-eager
+```
 ## Inference:
 **Engine:**
 
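For context on the Transformers path in the diff above: reasoning mode is toggled purely through a `detailed thinking on/off` system prompt. A minimal sketch of that message construction follows; the `build_messages` helper is ours, not part of the README, and the actual pipeline call (which needs multi-GPU hardware for a 253B model) is left commented.

```python
# Sketch of the "detailed thinking on/off" convention from the diff above.
# build_messages is a hypothetical helper; only the system-prompt format
# comes from the README snippet.

def build_messages(prompt: str, thinking: str = "on") -> list:
    """Return a chat message list with reasoning mode set via the system prompt."""
    if thinking not in ("on", "off"):
        raise ValueError("thinking must be 'on' or 'off'")
    return [
        {"role": "system", "content": f"detailed thinking {thinking}"},
        {"role": "user", "content": prompt},
    ]

print(build_messages("Solve x*(sin(x)+2)=0", thinking="off"))

# With transformers 4.48.3 installed and enough GPUs, the README's pipeline
# would consume these messages, e.g.:
# import transformers
# pipeline = transformers.pipeline(
#     "text-generation",
#     model="nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",
#     device_map="auto",
# )
# print(pipeline(build_messages("Solve x*(sin(x)+2)=0", thinking="off")))
```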
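Once started, `vllm.entrypoints.openai.api_server` exposes an OpenAI-compatible endpoint. A hedged client-side sketch, assuming the serve command above is running on localhost:5000; `chat_payload` is our illustrative helper, and the actual HTTP call is commented out since it needs a live server.

```python
import json

# Sketch of a request to the OpenAI-compatible /v1/chat/completions route
# served by vLLM. Assumptions: host/port match the serve command above;
# chat_payload is a hypothetical helper, not part of the README.

def chat_payload(prompt: str, thinking: str = "off") -> dict:
    # Reasoning mode uses the same system-prompt convention as the
    # Transformers snippet.
    return {
        "model": "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",
        "messages": [
            {"role": "system", "content": f"detailed thinking {thinking}"},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": 512,
    }

print(json.dumps(chat_payload("Solve x*(sin(x)+2)=0"), indent=2))

# With the server running:
# import requests
# r = requests.post("http://localhost:5000/v1/chat/completions",
#                   json=chat_payload("Solve x*(sin(x)+2)=0"))
# print(r.json()["choices"][0]["message"]["content"])
```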