---
title: Test of Time Accuracy
datasets:
  - baharef/ToT
  - aauss/ToT_separate_instructions
tags:
  - evaluate
  - metric
  - temporal reasoning
description: Accuracy metric for the Test of Time benchmark by Fatemi et al. (2025).
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
pinned: false
emoji: π
colorFrom: gray
colorTo: indigo
---

# Metric Card for Test of Time Accuracy
## Metric Description

This metric is designed for the **Test of Time (ToT)** benchmark (Fatemi et al., 2025). It measures the accuracy of model predictions against reference answers. The metric expects model outputs to be formatted as a JSON object (e.g., `{"answer": "...", "explanation": "..."}`).

It performs the following steps:

1. Extracts the first valid JSON object from the model's prediction string.
2. Processes the JSON based on the specified `subset`:
   - **semantic**: Extracts the value of the `"answer"` field.
   - **arithmetic**: Removes the `"explanation"` field and compares the remaining dictionary (containing the answer) to the reference.
3. Compares the processed prediction with the reference to calculate accuracy; the compared value is a dictionary for the arithmetic subset and a string for the semantic subset.
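For illustration, here is a minimal sketch of that parsing logic in plain Python. It is not the metric's actual implementation, and the helper names `extract_first_json` and `process_prediction` are invented for this example.

```python
import json
from typing import Optional


def extract_first_json(text: str) -> Optional[dict]:
    """Return the first valid JSON object found in `text`, or None (hypothetical helper)."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            continue
    return None


def process_prediction(prediction: str, subset: str):
    """Reduce a raw model output to the value that is compared against the reference."""
    parsed = extract_first_json(prediction)
    if parsed is None:
        return None  # malformed output is scored as incorrect
    if subset == "semantic":
        return parsed.get("answer")      # only the "answer" value is compared
    if subset == "arithmetic":
        parsed.pop("explanation", None)  # drop the explanation, keep the answer structure
        return parsed
    raise ValueError(f"Unknown subset: {subset!r}")
```

Comparing the processed prediction with the reference then yields one boolean per sample, and the reported accuracy is the mean of those booleans when `return_average=True`.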
## How to Use

You can load the metric using the `evaluate` library:

```python
import evaluate

metric = evaluate.load("aauss/test_of_time_accuracy")

# The second prediction lacks an opening curly bracket, so JSON extraction fails.
predictions = [
    '{"explanation": "Some explanation...", "unordered_list": ["London"]}',
    ' "Response without opening curly brackets...", "answer": "2005-04-07"}',
]
references = [
    '{"unordered_list": ["London"]}',
    "{'answer': '2005-04-07'}",
]

print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="arithmetic",
    )
)
# 0.5

print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="arithmetic",
        return_average=False,
    )
)
# [True, False]

predictions = [
    '{"explanation": "Some explanation leading to a wrong answer...", "answer": 1}',
    '{"explanation": "Some explanation ...", "answer": "1985"}',
]
references = ["0", "1985"]

print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="semantic",
    )
)
# 0.5
```
### Inputs

- **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string that contains a JSON object (e.g., generated by an LLM).
- **references** (`list` of `str`): List of reference answers.
- **subset** (`str`): The subset of the benchmark being evaluated. Must be one of:
  - `"arithmetic"`: Used for arithmetic tasks where the answer might need structure preservation (ignores the explanation).
  - `"semantic"`: Used for semantic tasks where only the `"answer"` value is compared.
- **return_average** (`bool`, optional): If `True`, returns the average accuracy. If `False`, returns a list of boolean scores (correct/incorrect) for each sample. Defaults to `True`.
### Output Values

The metric returns a dictionary with the following keys:

- **accuracy** (`float` or `list` of `bool`): The accuracy score (0.0 to 1.0) if `return_average=True`, or a list of booleans indicating correctness per sample if `return_average=False`.

Accuracy can take any value between 0.0 and 1.0, inclusive. Higher scores are better.
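If you need more than the aggregate score, the per-sample results from `return_average=False` can be inspected directly. The snippet below is only an illustration, reusing `metric`, `predictions`, and `references` from the usage example above and assuming the call returns the list of booleans directly, as shown there.

```python
# Per-sample scores make it easy to locate the failing examples.
scores = metric.compute(
    predictions=predictions,
    references=references,
    subset="semantic",
    return_average=False,
)
accuracy = sum(scores) / len(scores)  # same value as return_average=True
failed = [i for i, correct in enumerate(scores) if not correct]
print(f"accuracy={accuracy:.2f}, failed sample indices={failed}")
# e.g. accuracy=0.50, failed sample indices=[0]
```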
#### Values from Popular Papers

Check out the original [paper](https://openreview.net/pdf?id=44CoQe6VCq) for some reference performances.
## Limitations and Bias

- The metric relies on `json.JSONDecoder` to find the first JSON object in the prediction string. If the model output is malformed or does not contain valid JSON, extraction may fail (returning `None`), causing the prediction to be scored as incorrect.
- The extracted JSON must strictly follow the format described in the task (an `"answer"` field for the semantic subset, the expected answer structure for the arithmetic subset), optionally accompanied by an `"explanation"` field, for the logic to work as intended.
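As a quick illustration of the first point, a hypothetical extraction helper mirroring the behavior described above returns `None` for replies that never open a JSON object, while tolerating extra prose around a well-formed one:

```python
import json

decoder = json.JSONDecoder()


def first_json_or_none(text: str):
    # Hypothetical helper mirroring the extraction step described above.
    for start, char in enumerate(text):
        if char == "{":
            try:
                return decoder.raw_decode(text, start)[0]
            except json.JSONDecodeError:
                continue
    return None


# No opening curly bracket: nothing can be parsed, so the sample is scored as incorrect.
print(first_json_or_none(' "Response without opening curly brackets...", "answer": "2005-04-07"}'))
# None

# Surrounding prose is fine as long as one valid JSON object is present.
print(first_json_or_none('Sure! {"answer": "1985"} Hope that helps.'))
# {'answer': '1985'}
```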
## Citation

Evaluation was not described in more detail in the paper, but we can assume that model answers were parsed to allow for a more robust evaluation.

```bibtex
@InProceedings{huggingface:module,
  title   = {Test of Time Accuracy},
  authors = {Auss Abbood},
  year    = {2025}
}
```