Add BoolQ evaluation results via inspect-ai on HF Jobs
**Description:**
This PR adds BoolQ evaluation results for `openai/gpt-oss-20b`, following the Hugging Face Skills evaluation workflow.
- Benchmark: BoolQ (google/boolq, validation split)
- Task: `inspect_evals/boolq`
- Framework: `inspect-ai` + `inspect-evals`
- Infra: `hf jobs uv run` on `a10g-small`, Inference Providers
- Metric: accuracy = 89.1% (stderr = 0.005)
The command used was:
```bash
hf jobs uv run hf_model_evaluation/scripts/inspect_eval_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN \
  -- \
  --model "openai/gpt-oss-20b" \
  --task "inspect_evals/boolq"
```
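
For reference, a roughly equivalent local invocation (a sketch only, assuming `inspect-ai` and `inspect-evals` are already installed and `HF_TOKEN` is set; the exact options baked into `inspect_eval_uv.py` may differ):

```bash
# Run the same BoolQ task directly with the inspect CLI, pointing it at the
# provider model string used in this evaluation (served via Inference Providers)
inspect eval inspect_evals/boolq \
  --model hf-inference-providers/openai/gpt-oss-20b
```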
**Changed file:** `README.md`

````diff
@@ -4,6 +4,21 @@ pipeline_tag: text-generation
 library_name: transformers
 tags:
 - vllm
+model-index:
+- name: gpt-oss-20b
+  results:
+  - task:
+      name: BoolQ
+      type: boolq
+    dataset:
+      name: BoolQ
+      type: google/boolq
+      config: default
+      split: validation
+    metrics:
+    - name: accuracy
+      type: accuracy
+      value: 89.1
 ---
 
 <p align="center">
@@ -179,4 +194,25 @@ This smaller model `gpt-oss-20b` can be fine-tuned on consumer hardware, whereas
       primaryClass={cs.CL},
       url={https://arxiv.org/abs/2508.10925},
 }
-```
+```
+
+## Evaluation
+
+This model was evaluated on the **BoolQ** benchmark using the `inspect-ai` framework and `inspect-evals`, run on Hugging Face Jobs with Inference Providers.
+
+**Benchmark:** BoolQ (google/boolq, validation split, 3,270 examples)
+**Task:** `inspect_evals/boolq`
+**Framework:** `inspect-ai` + `inspect-evals`
+**Infrastructure:** `hf jobs uv run` on an `a10g-small` GPU
+**Provider model:** `hf-inference-providers/openai/gpt-oss-20b`
+**Metric:** accuracy = **89.1%** (stderr = 0.005)
+
+**Command used:**
+
+```bash
+hf jobs uv run hf_model_evaluation/scripts/inspect_eval_uv.py \
+  --flavor a10g-small \
+  --secrets HF_TOKEN \
+  -- \
+  --model "openai/gpt-oss-20b" \
+  --task "inspect_evals/boolq"
````
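
Once merged, the `model-index` block above is what the Hub uses to display evaluation results on the model page. A minimal sketch for checking that the metadata parses as expected (assuming `huggingface_hub` is installed):

```bash
python - <<'EOF'
from huggingface_hub import ModelCard

# Load the model card and print the eval results parsed from its model-index
card = ModelCard.load("openai/gpt-oss-20b")
print(card.data.eval_results)
EOF
```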