| # Mass Evaluations | |
| Simple benchmark tool for running predefined prompts through all checkpoints of a model. | |
| ## Usage | |
| ```bash | |
| python benchmark.py [model_name] [options] | |
| ``` | |
| ## Examples | |
| ```bash | |
| # Benchmark all checkpoints of a model | |
| python benchmark.py pico-decoder-tiny-dolma5M-v1 | |
| # Specify custom output directory | |
| python benchmark.py pico-decoder-tiny-dolma5M-v1 --output my_results/ | |
| # Use custom prompts file | |
| python benchmark.py pico-decoder-tiny-dolma5M-v1 --prompts my_prompts.json | |
| ``` | |
| ## Managing Prompts | |
| Prompts are stored in `prompts.json` as a simple array of strings: | |
| ```json | |
| [ | |
| "Hello, how are you?", | |
| "Complete this story: Once upon a time", | |
| "What is the capital of France?" | |
| ] | |
| ``` | |
| ### Adding New Prompts | |
| Simply edit `prompts.json` and add new prompt strings to the array. Super simple! | |
| ## Features | |
| - **Auto-discovery**: Finds all `step_*` checkpoints automatically | |
| - **JSON-based prompts**: Easily customizable prompts via JSON file | |
| - **Readable output**: Markdown reports with clear structure | |
| - **Error handling**: Continues on failures, logs errors | |
| - **Progress tracking**: Shows real-time progress | |
| - **Metadata logging**: Includes generation time and parameters | |
| ## Output | |
| Results are saved as markdown files in `results/` directory: | |
| ``` | |
| results/ | |
| βββ pico-decoder-tiny-dolma5M-v1_benchmark_20250101_120000.md | |
| βββ pico-decoder-tiny-dolma29k-v3_benchmark_20250101_130000.md | |
| βββ ... | |
| ``` | |
| ## Predefined Prompts | |
| 1. "Hello, how are you?" (conversational) | |
| 2. "Complete this story: Once upon a time" (creative) | |
| 3. "Explain quantum physics in simple terms" (explanatory) | |
| 4. "Write a haiku about coding" (creative + structured) | |
| 5. "What is the capital of France?" (factual) | |
| 6. "The meaning of life is" (philosophical) | |
| 7. "In the year 2050," (futuristic) | |
| 8. "Python programming is" (technical) | |