mshojaei77 committed
Commit bfe8b6c · verified · 1 Parent(s): 827fff7

Update README.md

Files changed (1):
  1. README.md +89 -119

README.md CHANGED
@@ -1,168 +1,138 @@
  ---
  license: apache-2.0
- pipeline_tag: text-generation
- library_name: transformers
  tags:
  - vllm
  ---

- <p align="center">
- <img alt="gpt-oss-120b" src="https://raw.githubusercontent.com/openai/gpt-oss/main/docs/gpt-oss-120b.svg">
- </p>

- <p align="center">
- <a href="https://gpt-oss.com"><strong>Try gpt-oss</strong></a> ·
- <a href="https://cookbook.openai.com/topic/gpt-oss"><strong>Guides</strong></a> ·
- <a href="https://openai.com/index/gpt-oss-model-card"><strong>Model card</strong></a> ·
- <a href="https://openai.com/index/introducing-gpt-oss/"><strong>OpenAI blog</strong></a>
- </p>

- <br>

- Welcome to the gpt-oss series, [OpenAI’s open-weight models](https://openai.com/open-models) designed for powerful reasoning, agentic tasks, and versatile developer use cases.

- We’re releasing two flavors of these open models:
- - `gpt-oss-120b` — for production, general purpose, high reasoning use cases that fit into a single 80GB GPU (like NVIDIA H100 or AMD MI300X) (117B parameters with 5.1B active parameters)
- - `gpt-oss-20b` — for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters)

- Both models were trained on our [harmony response format](https://github.com/openai/harmony) and should only be used with the harmony format as it will not work correctly otherwise.
-
-
- > [!NOTE]
- > This model card is dedicated to the larger `gpt-oss-120b` model. Check out [`gpt-oss-20b`](https://huggingface.co/openai/gpt-oss-20b) for the smaller model.
-
- # Highlights
-
- * **Permissive Apache 2.0 license:** Build freely without copyleft restrictions or patent risk—ideal for experimentation, customization, and commercial deployment.
- * **Configurable reasoning effort:** Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
- * **Full chain-of-thought:** Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs. It’s not intended to be shown to end users.
- * **Fine-tunable:** Fully customize models to your specific use case through parameter fine-tuning.
- * **Agentic capabilities:** Use the models’ native capabilities for function calling, [web browsing](https://github.com/openai/gpt-oss/tree/main?tab=readme-ov-file#browser), [Python code execution](https://github.com/openai/gpt-oss/tree/main?tab=readme-ov-file#python), and Structured Outputs.
- * **Native MXFP4 quantization:** The models are trained with native MXFP4 precision for the MoE layer, making `gpt-oss-120b` run on a single 80GB GPU (like NVIDIA H100 or AMD MI300X) and the `gpt-oss-20b` model run within 16GB of memory.

  ---

- # Inference examples
-
- ## Transformers
-
- You can use `gpt-oss-120b` and `gpt-oss-20b` with Transformers. If you use the Transformers chat template, it will automatically apply the [harmony response format](https://github.com/openai/harmony). If you use `model.generate` directly, you need to apply the harmony format manually using the chat template or use our [openai-harmony](https://github.com/openai/harmony) package.
-
- To get started, install the necessary dependencies to set up your environment:
-
- ```
- pip install -U transformers kernels torch
- ```

- Once set up, you can run the model with the snippet below:

- ```py
- from transformers import pipeline
- import torch

- model_id = "openai/gpt-oss-120b"

- pipe = pipeline(
-     "text-generation",
-     model=model_id,
-     torch_dtype="auto",
-     device_map="auto",
- )

- messages = [
-     {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
- ]

- outputs = pipe(
-     messages,
-     max_new_tokens=256,
- )
- print(outputs[0]["generated_text"][-1])
- ```

- Alternatively, you can run the model via [`Transformers Serve`](https://huggingface.co/docs/transformers/main/serving) to spin up an OpenAI-compatible webserver:

- ```
- transformers serve
- transformers chat localhost:8000 --model-name-or-path openai/gpt-oss-120b
- ```

- [Learn more about how to use gpt-oss with Transformers.](https://cookbook.openai.com/articles/gpt-oss/run-transformers)
-
- ## vLLM
-
- vLLM recommends using [uv](https://docs.astral.sh/uv/) for Python dependency management. You can use vLLM to spin up an OpenAI-compatible webserver. The following command will automatically download the model and start the server.

  ```bash
- uv pip install --pre vllm==0.10.1+gptoss \
-     --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
-     --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
-     --index-strategy unsafe-best-match
-
- vllm serve openai/gpt-oss-120b
  ```

- [Learn more about how to use gpt-oss with vLLM.](https://cookbook.openai.com/articles/gpt-oss/run-vllm)
-
- ## PyTorch / Triton
-
- To learn about how to use this model with PyTorch and Triton, check out our [reference implementations in the gpt-oss repository](https://github.com/openai/gpt-oss?tab=readme-ov-file#reference-pytorch-implementation).
-
- ## Ollama
-
- If you are trying to run gpt-oss on consumer hardware, you can use Ollama by running the following commands after [installing Ollama](https://ollama.com/download).

  ```bash
- # gpt-oss-120b
- ollama pull gpt-oss:120b
- ollama run gpt-oss:120b
- ```
-
- [Learn more about how to use gpt-oss with Ollama.](https://cookbook.openai.com/articles/gpt-oss/run-locally-ollama)

- #### LM Studio

- If you are using [LM Studio](https://lmstudio.ai/) you can use the following commands to download.

  ```bash
- # gpt-oss-120b
- lms get openai/gpt-oss-120b
  ```

- Check out our [awesome list](https://github.com/openai/gpt-oss/blob/main/awesome-gpt-oss.md) for a broader collection of gpt-oss resources and inference partners.

- ---

- # Download the model

- You can download the model weights from the [Hugging Face Hub](https://huggingface.co/collections/openai/gpt-oss-68911959590a1634ba11c7a4) directly using the Hugging Face CLI:

- ```shell
- # gpt-oss-120b
- huggingface-cli download openai/gpt-oss-120b --include "original/*" --local-dir gpt-oss-120b/
- pip install gpt-oss
- python -m gpt_oss.chat model/
- ```

- # Reasoning levels

- You can adjust the reasoning level that suits your task across three levels:

- * **Low:** Fast responses for general dialogue.
- * **Medium:** Balanced speed and detail.
- * **High:** Deep and detailed analysis.

- The reasoning level can be set in the system prompts, e.g., "Reasoning: high".

- # Tool use

- The gpt-oss models are excellent for:
- * Web browsing (using built-in browsing tools)
- * Function calling with defined schemas
- * Agentic operations like browser tasks

- # Fine-tuning

- Both gpt-oss models can be fine-tuned for a variety of specialized use cases.

- This larger model `gpt-oss-120b` can be fine-tuned on a single H100 node, whereas the smaller [`gpt-oss-20b`](https://huggingface.co/openai/gpt-oss-20b) can even be fine-tuned on consumer hardware.

  ---
  license: apache-2.0
  tags:
+ - gpt_oss
  - vllm
+ - conversational
+ - mxfp4
  ---

+ # gpt-oss-120b (Clean Fork for Stable Deployment)

+ This repository, `mshojaei77/gpt-oss-120b`, is a clean, deployment-focused fork of the official `openai/gpt-oss-120b` model.

+ ## Why Does This Fork Exist?

+ The original `openai/gpt-oss-120b` repository contains large, non-essential folders (`/original` and `/metal`), which can cause issues with automated downloaders like the one used by vLLM. These extra folders can lead to "Disk quota exceeded" errors during deployment, even on systems with sufficient disk space for the core model.

+ **This fork solves that problem by containing *only* the essential files required for inference.**

+ By using this repository, you download only the ~65 GB of model weights and configuration files needed for inference, which keeps deployment smooth and reliable.
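
+ If you want to fetch the weights ahead of time (for example, to confirm the ~65 GB footprint before starting a server), the sketch below uses `huggingface_hub` to pull this fork into the same cache directory the deployment guide below points vLLM at. This is an optional step, not part of the original guide, and it assumes `huggingface_hub` is installed:

+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Optional pre-fetch of this fork into the cache directory used later by
+ # `vllm serve --download-dir /workspace/hf-cache`, so the server should not
+ # need to download anything at startup.
+ path = snapshot_download(
+     repo_id="mshojaei77/gpt-oss-120b",
+     cache_dir="/workspace/hf-cache",
+ )
+ print(f"Snapshot stored at: {path}")  # expect roughly ~65 GB on disk
+ ```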
 
  ---

+ ## Model Quick Facts

+ - **Original Model:** `openai/gpt-oss-120b`
+ - **Parameters:** ~117B (~5.1B active per forward pass)
+ - **Architecture:** Mixture of Experts (MoE)
+ - **Quantization:** Pre-quantized with **MXFP4** for MoE layers.
+ - **License:** Apache-2.0
+ - **Format:** The model was trained for the **Harmony** response format. vLLM's chat template handles this automatically.
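
+ If you want to see what the Harmony format looks like in practice, you can render a prompt with the chat template that ships in this repository. This is a small illustrative sketch (not required for deployment) and assumes `transformers` is installed:

+ ```python
+ from transformers import AutoTokenizer
+
+ # Load the tokenizer and chat template bundled with this fork.
+ tok = AutoTokenizer.from_pretrained("mshojaei77/gpt-oss-120b")
+
+ messages = [{"role": "user", "content": "Say hello."}]
+
+ # Render the Harmony-formatted prompt as a string, without tokenizing it.
+ prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
+ print(prompt)
+ ```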
 
+ ---

+ ## 🚀 Production-Grade Deployment with vLLM on RunPod

+ This is a battle-tested guide for deploying this model on a single **NVIDIA H100 (80GB)** GPU using vLLM and RunPod.

+ ### Step 1: Configure Your RunPod Pod

+ A correct disk configuration is the most critical step.

+ 1. **GPU:** Select `1 x H100 80GB`.
+ 2. **Template:** Use a standard PyTorch image (e.g., `runpod/pytorch`).
+ 3. **Disks (Important!):**
+    - **Container Disk:** `30 GB` (This is temporary).
+    - **Volume Disk:** **`90 GB` or more**. This is your persistent storage.
+    - **Volume Mount Path:** Set to `/workspace`.

+ ### Step 2: Set Up the Environment

+ Connect to your pod and run the following commands to install dependencies in a persistent virtual environment.

  ```bash
+ # Install uv, a fast package manager
+ pip install uv
+
+ # Create and activate a virtual environment inside our persistent /workspace
+ uv venv --python 3.12 --seed /workspace/.venv
+ source /workspace/.venv/bin/activate
+
+ # Install the specialized vLLM build for gpt-oss
+ uv pip install --pre "vllm==0.10.1+gptoss" \
+     --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
+     --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
+     --index-strategy unsafe-best-match
  ```
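
+ As an optional check (not part of the original guide), you can confirm that the specialized build landed in the virtual environment by printing its version from Python:

+ ```python
+ import vllm
+
+ # Expect a 0.10.1+gptoss build if the wheel came from the gpt-oss index above.
+ print(vllm.__version__)
+ ```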

+ ### Step 3: Launch the vLLM Server

+ This command will download the model from **this repository** and start an OpenAI-compatible API server.

+ First, configure your shell session:
  ```bash
+ # Create cache directories inside the persistent volume
+ mkdir -p /workspace/hf-cache /workspace/tmp
+
+ # Point all caching and temp operations to the 90GB volume
+ export HF_HOME="/workspace/hf-cache"
+ export TMPDIR="/workspace/tmp"
+
+ # Force vLLM to use the highly optimized FlashAttention-3 kernel
+ export VLLM_FLASH_ATTN_VERSION=3
+ ```

+ Now, launch the server. The first launch will download the ~65 GB model.
  ```bash
+ vllm serve mshojaei77/gpt-oss-120b \
+     --trust-remote-code \
+     --dtype bfloat16 \
+     --port 8000 \
+     --gpu-memory-utilization 0.90 \
+     --max-model-len 32768 \
+     --max-num-seqs 16 \
+     --download-dir /workspace/hf-cache
  ```

+ **Why these flags?**
+ - `--gpu-memory-utilization 0.90`: Safely uses 90% of the H100's VRAM.
+ - `--max-model-len 32768`: Limits the context window to 32k tokens, which the H100 can serve with the KV cache kept in `bfloat16` (see the caution below).
+ - `--download-dir /workspace/hf-cache`: **Crucial flag.** Forces vLLM to use your persistent volume, avoiding bugs where it might default to the small container disk.

+ > **⚠️ CAUTION:** Do not use `--kv-cache-dtype fp8` with this setup on an H100/H200. There is a known kernel incompatibility in this vLLM build that can cause a runtime error. The H100 has sufficient VRAM to handle the 32k context in full `bfloat16` precision.
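
+ Once `vllm serve` reports that it is ready, a quick sanity check (not part of the original guide) is to query the OpenAI-compatible `/v1/models` endpoint from inside the pod. The sketch below uses only the Python standard library:

+ ```python
+ import json
+ from urllib.request import urlopen
+
+ # Ask the server which models it is serving; expect "mshojaei77/gpt-oss-120b".
+ with urlopen("http://localhost:8000/v1/models") as resp:
+     data = json.load(resp)
+
+ print([m["id"] for m in data["data"]])
+ ```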

+ ### Step 4: Use the API

+ Once the server is running, you can connect to it using any OpenAI-compatible client. If using RunPod's public proxy, find your URL in the pod's "Ports" section.

+ ```python
+ from openai import OpenAI

+ # Replace with your RunPod proxy URL or http://localhost:8000/v1 if testing internally
+ client = OpenAI(
+     base_url="<YOUR_RUNPOD_PROXY_URL>/v1",
+     api_key="EMPTY"
+ )

+ response = client.chat.completions.create(
+     model="mshojaei77/gpt-oss-120b",
+     messages=[
+         {"role": "system", "content": "You are a helpful assistant."},
+         {"role": "user", "content": "Explain what MXFP4 quantization is."}
+     ]
+ )

+ print(response.choices[0].message.content)
+ ```
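
+ For interactive use you may prefer to stream tokens as they are generated. A minimal variant of the request above (same assumptions about the proxy URL):

+ ```python
+ from openai import OpenAI
+
+ client = OpenAI(base_url="<YOUR_RUNPOD_PROXY_URL>/v1", api_key="EMPTY")
+
+ stream = client.chat.completions.create(
+     model="mshojaei77/gpt-oss-120b",
+     messages=[{"role": "user", "content": "Summarize MXFP4 quantization in two sentences."}],
+     stream=True,  # receive the reply incrementally
+ )
+
+ for chunk in stream:
+     delta = chunk.choices[0].delta.content  # may be None for control chunks
+     if delta:
+         print(delta, end="", flush=True)
+ print()
+ ```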
 
+ ---

+ ### Original Model

+ This model is a fork of [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b). Please refer to the original model card for all details regarding its architecture, training, and intended use.

+ ### License

+ This model is licensed under the Apache-2.0 License, consistent with the original repository.