mshojaei77 committed
Commit bfe8b6c · verified · 1 Parent(s): 827fff7

Update README.md

Files changed (1):
  1. README.md +89 -119

README.md CHANGED
@@ -1,168 +1,138 @@
  ---
  license: apache-2.0
- pipeline_tag: text-generation
- library_name: transformers
  tags:
  - vllm
  ---

- <p align="center">
- <img alt="gpt-oss-120b" src="https://raw.githubusercontent.com/openai/gpt-oss/main/docs/gpt-oss-120b.svg">
- </p>

- <p align="center">
- <a href="https://gpt-oss.com"><strong>Try gpt-oss</strong></a> ·
- <a href="https://cookbook.openai.com/topic/gpt-oss"><strong>Guides</strong></a> ·
- <a href="https://openai.com/index/gpt-oss-model-card"><strong>Model card</strong></a> ·
- <a href="https://openai.com/index/introducing-gpt-oss/"><strong>OpenAI blog</strong></a>
- </p>

- <br>

- Welcome to the gpt-oss series, [OpenAI’s open-weight models](https://openai.com/open-models) designed for powerful reasoning, agentic tasks, and versatile developer use cases.

- We’re releasing two flavors of these open models:
- - `gpt-oss-120b` — for production, general purpose, high reasoning use cases that fit into a single 80GB GPU (like NVIDIA H100 or AMD MI300X) (117B parameters with 5.1B active parameters)
- - `gpt-oss-20b` — for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters)

- Both models were trained on our [harmony response format](https://github.com/openai/harmony) and should only be used with the harmony format as it will not work correctly otherwise.
-
-
- > [!NOTE]
- > This model card is dedicated to the larger `gpt-oss-120b` model. Check out [`gpt-oss-20b`](https://huggingface.co/openai/gpt-oss-20b) for the smaller model.
-
- # Highlights
-
- * **Permissive Apache 2.0 license:** Build freely without copyleft restrictions or patent risk—ideal for experimentation, customization, and commercial deployment.
- * **Configurable reasoning effort:** Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
- * **Full chain-of-thought:** Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs. It’s not intended to be shown to end users.
- * **Fine-tunable:** Fully customize models to your specific use case through parameter fine-tuning.
- * **Agentic capabilities:** Use the models’ native capabilities for function calling, [web browsing](https://github.com/openai/gpt-oss/tree/main?tab=readme-ov-file#browser), [Python code execution](https://github.com/openai/gpt-oss/tree/main?tab=readme-ov-file#python), and Structured Outputs.
- * **Native MXFP4 quantization:** The models are trained with native MXFP4 precision for the MoE layer, making `gpt-oss-120b` run on a single 80GB GPU (like NVIDIA H100 or AMD MI300X) and the `gpt-oss-20b` model run within 16GB of memory.

  ---

- # Inference examples
-
- ## Transformers
-
- You can use `gpt-oss-120b` and `gpt-oss-20b` with Transformers. If you use the Transformers chat template, it will automatically apply the [harmony response format](https://github.com/openai/harmony). If you use `model.generate` directly, you need to apply the harmony format manually using the chat template or use our [openai-harmony](https://github.com/openai/harmony) package.
-
- To get started, install the necessary dependencies to set up your environment:
-
- ```
- pip install -U transformers kernels torch
- ```

- Once set up, you can run the model with the snippet below:

- ```py
- from transformers import pipeline
- import torch

- model_id = "openai/gpt-oss-120b"

- pipe = pipeline(
-     "text-generation",
-     model=model_id,
-     torch_dtype="auto",
-     device_map="auto",
- )

- messages = [
-     {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
- ]

- outputs = pipe(
-     messages,
-     max_new_tokens=256,
- )
- print(outputs[0]["generated_text"][-1])
- ```

- Alternatively, you can run the model via [`Transformers Serve`](https://huggingface.co/docs/transformers/main/serving) to spin up an OpenAI-compatible webserver:

- ```
- transformers serve
- transformers chat localhost:8000 --model-name-or-path openai/gpt-oss-120b
- ```

- [Learn more about how to use gpt-oss with Transformers.](https://cookbook.openai.com/articles/gpt-oss/run-transformers)
-
- ## vLLM
-
- vLLM recommends using [uv](https://docs.astral.sh/uv/) for Python dependency management. You can use vLLM to spin up an OpenAI-compatible webserver. The following command will automatically download the model and start the server.

  ```bash
- uv pip install --pre vllm==0.10.1+gptoss \
-     --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
-     --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
-     --index-strategy unsafe-best-match
-
- vllm serve openai/gpt-oss-120b
  ```

- [Learn more about how to use gpt-oss with vLLM.](https://cookbook.openai.com/articles/gpt-oss/run-vllm)
-
- ## PyTorch / Triton
-
- To learn about how to use this model with PyTorch and Triton, check out our [reference implementations in the gpt-oss repository](https://github.com/openai/gpt-oss?tab=readme-ov-file#reference-pytorch-implementation).
-
- ## Ollama
-
- If you are trying to run gpt-oss on consumer hardware, you can use Ollama by running the following commands after [installing Ollama](https://ollama.com/download).

  ```bash
- # gpt-oss-120b
- ollama pull gpt-oss:120b
- ollama run gpt-oss:120b
- ```
-
- [Learn more about how to use gpt-oss with Ollama.](https://cookbook.openai.com/articles/gpt-oss/run-locally-ollama)

- #### LM Studio

- If you are using [LM Studio](https://lmstudio.ai/) you can use the following commands to download.

  ```bash
- # gpt-oss-120b
- lms get openai/gpt-oss-120b
  ```

- Check out our [awesome list](https://github.com/openai/gpt-oss/blob/main/awesome-gpt-oss.md) for a broader collection of gpt-oss resources and inference partners.

- ---

- # Download the model

- You can download the model weights from the [Hugging Face Hub](https://huggingface.co/collections/openai/gpt-oss-68911959590a1634ba11c7a4) directly using the Hugging Face CLI:

- ```shell
- # gpt-oss-120b
- huggingface-cli download openai/gpt-oss-120b --include "original/*" --local-dir gpt-oss-120b/
- pip install gpt-oss
- python -m gpt_oss.chat model/
- ```

- # Reasoning levels

- You can adjust the reasoning level that suits your task across three levels:

- * **Low:** Fast responses for general dialogue.
- * **Medium:** Balanced speed and detail.
- * **High:** Deep and detailed analysis.

- The reasoning level can be set in the system prompts, e.g., "Reasoning: high".

- # Tool use

- The gpt-oss models are excellent for:
- * Web browsing (using built-in browsing tools)
- * Function calling with defined schemas
- * Agentic operations like browser tasks

- # Fine-tuning

- Both gpt-oss models can be fine-tuned for a variety of specialized use cases.

- This larger model `gpt-oss-120b` can be fine-tuned on a single H100 node, whereas the smaller [`gpt-oss-20b`](https://huggingface.co/openai/gpt-oss-20b) can even be fine-tuned on consumer hardware.

  ---
  license: apache-2.0
  tags:
+ - gpt_oss
  - vllm
+ - conversational
+ - mxfp4
  ---

+ # gpt-oss-120b (Clean Fork for Stable Deployment)

+ This repository, `mshojaei77/gpt-oss-120b`, is a clean, deployment-focused fork of the official `openai/gpt-oss-120b` model.

+ ## Why Does This Fork Exist?

+ The original `openai/gpt-oss-120b` repository contains large, non-essential folders (`/original` and `/metal`), which can cause issues with automated downloaders like the one used by vLLM. These extra folders can lead to "Disk quota exceeded" errors during deployment, even on systems with sufficient disk space for the core model.

+ **This fork solves that problem by containing *only* the essential files required for inference.**

+ By using this repository, you download only the ~65 GB of model weights and configuration files needed for inference, which keeps deployment smooth and reliable.
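
+ If you want to fetch the weights ahead of time (for example, to confirm the ~65 GB footprint before starting a server), the sketch below uses `huggingface_hub` to pull this fork into the same cache directory the deployment guide below points vLLM at. This is an optional step, not part of the original guide, and it assumes `huggingface_hub` is installed:

+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Optional pre-fetch of this fork into the cache directory used later by
+ # `vllm serve --download-dir /workspace/hf-cache`, so the server should not
+ # need to download anything at startup.
+ path = snapshot_download(
+     repo_id="mshojaei77/gpt-oss-120b",
+     cache_dir="/workspace/hf-cache",
+ )
+ print(f"Snapshot stored at: {path}")  # expect roughly ~65 GB on disk
+ ```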
 
  ---

+ ## Model Quick Facts

+ - **Original Model:** `openai/gpt-oss-120b`
+ - **Parameters:** ~117B (~5.1B active per forward pass)
+ - **Architecture:** Mixture of Experts (MoE)
+ - **Quantization:** Pre-quantized with **MXFP4** for MoE layers.
+ - **License:** Apache-2.0
+ - **Format:** The model was trained for the **Harmony** response format. vLLM's chat template handles this automatically.
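
+ If you want to see what the Harmony format looks like in practice, you can render a prompt with the chat template that ships in this repository. This is a small illustrative sketch (not required for deployment) and assumes `transformers` is installed:

+ ```python
+ from transformers import AutoTokenizer
+
+ # Load the tokenizer and chat template bundled with this fork.
+ tok = AutoTokenizer.from_pretrained("mshojaei77/gpt-oss-120b")
+
+ messages = [{"role": "user", "content": "Say hello."}]
+
+ # Render the Harmony-formatted prompt as a string, without tokenizing it.
+ prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+ print(prompt)
+ ```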
 
+ ---

+ ## 🚀 Production-Grade Deployment with vLLM on RunPod

+ This is a battle-tested guide for deploying this model on a single **NVIDIA H100 (80GB)** GPU using vLLM and RunPod.

+ ### Step 1: Configure Your RunPod Pod

+ A correct disk configuration is the most critical step.

+ 1. **GPU:** Select `1 x H100 80GB`.
+ 2. **Template:** Use a standard PyTorch image (e.g., `runpod/pytorch`).
+ 3. **Disks (Important!):**
+    - **Container Disk:** `30 GB` (This is temporary).
+    - **Volume Disk:** **`90 GB` or more**. This is your persistent storage.
+    - **Volume Mount Path:** Set to `/workspace`.

+ ### Step 2: Set Up the Environment

+ Connect to your pod and run the following commands to install dependencies in a persistent virtual environment.

  ```bash
+ # Install uv, a fast package manager
+ pip install uv
+
+ # Create and activate a virtual environment inside our persistent /workspace
+ uv venv --python 3.12 --seed /workspace/.venv
+ source /workspace/.venv/bin/activate
+
+ # Install the specialized vLLM build for gpt-oss
+ uv pip install --pre "vllm==0.10.1+gptoss" \
+     --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
+     --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
+     --index-strategy unsafe-best-match
  ```
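
+ As an optional check (not part of the original guide), you can confirm that the specialized build landed in the virtual environment by printing its version from Python:

+ ```python
+ import vllm
+
+ # Expect a 0.10.1+gptoss build if the wheel came from the gpt-oss index above.
+ print(vllm.__version__)
+ ```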

+ ### Step 3: Launch the vLLM Server

+ This command will download the model from **this repository** and start an OpenAI-compatible API server.

+ First, configure your shell session:
  ```bash
+ # Create cache directories inside the persistent volume
+ mkdir -p /workspace/hf-cache /workspace/tmp
+
+ # Point all caching and temp operations to the 90GB volume
+ export HF_HOME="/workspace/hf-cache"
+ export TMPDIR="/workspace/tmp"
+
+ # Force vLLM to use the highly optimized FlashAttention-3 kernel
+ export VLLM_FLASH_ATTN_VERSION=3
+ ```

+ Now, launch the server. The first launch will download the ~65 GB model.
  ```bash
+ vllm serve mshojaei77/gpt-oss-120b \
+     --trust-remote-code \
+     --dtype bfloat16 \
+     --port 8000 \
+     --gpu-memory-utilization 0.90 \
+     --max-model-len 32768 \
+     --max-num-seqs 16 \
+     --download-dir /workspace/hf-cache
  ```

+ **Why these flags?**
+ - `--gpu-memory-utilization 0.90`: Safely uses 90% of the H100's VRAM.
+ - `--max-model-len 32768`: Limits the context window to 32k tokens, which the H100 can serve with the KV cache kept in `bfloat16` (see the caution below).
+ - `--download-dir /workspace/hf-cache`: **Crucial flag.** Forces vLLM to use your persistent volume, avoiding bugs where it might default to the small container disk.

+ > **⚠️ CAUTION:** Do not use `--kv-cache-dtype fp8` with this setup on an H100/H200. There is a known kernel incompatibility in this vLLM build that can cause a runtime error. The H100 has sufficient VRAM to handle the 32k context in full `bfloat16` precision.
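
+ Once `vllm serve` reports that it is ready, a quick sanity check (not part of the original guide) is to query the OpenAI-compatible `/v1/models` endpoint from inside the pod. The sketch below uses only the Python standard library:

+ ```python
+ import json
+ from urllib.request import urlopen
+
+ # Ask the server which models it is serving; expect "mshojaei77/gpt-oss-120b".
+ with urlopen("http://localhost:8000/v1/models") as resp:
+     data = json.load(resp)
+
+ print([m["id"] for m in data["data"]])
+ ```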

+ ### Step 4: Use the API

+ Once the server is running, you can connect to it using any OpenAI-compatible client. If using RunPod's public proxy, find your URL in the pod's "Ports" section.

+ ```python
+ from openai import OpenAI

+ # Replace with your RunPod proxy URL or http://localhost:8000/v1 if testing internally
+ client = OpenAI(
+     base_url="<YOUR_RUNPOD_PROXY_URL>/v1",
+     api_key="EMPTY"
+ )

+ response = client.chat.completions.create(
+     model="mshojaei77/gpt-oss-120b",
+     messages=[
+         {"role": "system", "content": "You are a helpful assistant."},
+         {"role": "user", "content": "Explain what MXFP4 quantization is."}
+     ]
+ )

+ print(response.choices[0].message.content)
+ ```
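
+ For interactive use you may prefer to stream tokens as they are generated. A minimal variant of the request above (same assumptions about the proxy URL):

+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="<YOUR_RUNPOD_PROXY_URL>/v1", api_key="EMPTY")
+
+ stream = client.chat.completions.create(
+     model="mshojaei77/gpt-oss-120b",
+     messages=[{"role": "user", "content": "Summarize MXFP4 quantization in two sentences."}],
+     stream=True,  # receive the reply incrementally
+ )
+
+ for chunk in stream:
+     delta = chunk.choices[0].delta.content  # may be None for control chunks
+     if delta:
+         print(delta, end="", flush=True)
+ print()
+ ```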
 
+ ---

+ ### Original Model

+ This model is a fork of [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b). Please refer to the original model card for all details regarding its architecture, training, and intended use.

+ ### License

+ This model is licensed under the Apache-2.0 License, consistent with the original repository.