Add BoolQ evaluation results via inspect-ai on HF Jobs
#172, opened by mackenzietechdocs
Description:
This PR adds BoolQ evaluation results for openai/gpt-oss-20b, following the Hugging Face Skills evaluation workflow.
- Benchmark: BoolQ (google/boolq, validation split)
- Task: inspect_evals/boolq
- Framework: inspect-ai + inspect-evals
- Infra: hf jobs uv run on a10g-small, Inference Providers
- Metric: accuracy = 89.1% (stderr = 0.005)
The command used was:
hf jobs uv run hf_model_evaluation/scripts/inspect_eval_uv.py \
--flavor a10g-small \
--secrets HF_TOKEN \
-- \
--model "openai/gpt-oss-20b" \
--task "inspect_evals/boolq"
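
As a rough sanity check on the reported stderr, the sketch below computes the standard error of an accuracy score treated as the mean of independent 0/1 outcomes. It rests on two assumptions that this PR does not state: that the run scored the full BoolQ validation split, taken here to be 3,270 examples, and that the reported stderr is the ordinary standard error of the mean. The helper name `binomial_stderr` and the constants are illustrative only and are not part of inspect-ai.

```python
import math


def binomial_stderr(accuracy: float, n: int) -> float:
    """Standard error of the mean of n independent 0/1 scores."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# Assumption: the whole validation split was scored (3,270 examples).
# This count is not stated in the PR description.
N_VALIDATION = 3270
REPORTED_ACCURACY = 0.891  # 89.1%, as reported above

stderr = binomial_stderr(REPORTED_ACCURACY, N_VALIDATION)

# Normal-approximation 95% interval around the reported accuracy.
low = REPORTED_ACCURACY - 1.96 * stderr
high = REPORTED_ACCURACY + 1.96 * stderr

print(f"stderr: {stderr:.4f}")
print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

Under those assumptions the computed value rounds to 0.005, which agrees with the figure reported above. Note that the stderr is on the 0 to 1 scale, so it corresponds to roughly 0.5 percentage points of accuracy.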