Add BoolQ evaluation results via inspect-ai on HF Jobs

#172

Description:

This PR adds BoolQ evaluation results for openai/gpt-oss-20b, following the Hugging Face Skills evaluation workflow.

  • Benchmark: BoolQ (google/boolq, validation split)
  • Task: inspect_evals/boolq
  • Framework: inspect-ai + inspect-evals
  • Infra: HF Jobs (hf jobs uv run, a10g-small flavor) with Inference Providers
  • Metric: accuracy = 89.1% (stderr = 0.5 percentage points)
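
As a sanity check on the reported stderr, the sketch below recomputes the binomial standard error of an accuracy estimate. It assumes the BoolQ validation split has 3,270 examples and that every example was scored. The function name `binomial_stderr` is illustrative and not part of inspect-ai, which may compute its stderr slightly differently (for example with a sample-variance correction).

```python
import math


def binomial_stderr(accuracy: float, n_samples: int) -> float:
    """Standard error of a proportion estimated from n_samples independent trials."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


# Assumption: 3,270 validation examples, all scored.
N_VALIDATION = 3270
ACCURACY = 0.891  # accuracy reported in this PR

stderr = binomial_stderr(ACCURACY, N_VALIDATION)
# Approximate 95% interval under a normal approximation.
ci_low, ci_high = ACCURACY - 1.96 * stderr, ACCURACY + 1.96 * stderr
print(f"stderr ~ {stderr:.4f}, 95% CI ~ [{ci_low:.3f}, {ci_high:.3f}]")
```

Under these assumptions the standard error rounds to 0.005 on the 0 to 1 scale, which is consistent with the figure in this PR.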

The command used was:

hf jobs uv run hf_model_evaluation/scripts/inspect_eval_uv.py \
  --flavor a10g-small \
  --secrets HF_TOKEN \
  -- \
  --model "openai/gpt-oss-20b" \
  --task "inspect_evals/boolq"