---
title: Test of Time Accuracy
datasets:
- baharef/ToT
- aauss/ToT_separate_instructions
tags:
- evaluate
- metric
- temporal reasoning
description: Accuracy metric for the Test of Time benchmark by Fatemi et al. (2024).
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
pinned: false
emoji: πŸ“Š
colorFrom: gray
colorTo: indigo
---
# Metric Card for Test of Time Accuracy
## Metric Description
This metric is designed for the **Test of Time (ToT)** benchmark (Fatemi et al., 2024). It measures the accuracy of model predictions against reference answers. The metric expects model outputs to be formatted as a JSON object (e.g., `{"answer": "...", "explanation": "..."}`).
It performs the following steps:
1. Extracts the first valid JSON object from the model's prediction string.
2. Processes the JSON based on the specified `subset`:
- **semantic**: Extracts the value of the "answer" field.
- **arithmetic**: Removes the "explanation" field and compares the remaining dictionary (containing the answer) to the reference.
3. Compares the processed prediction with the reference to calculate accuracy; the reference is a dictionary for the arithmetic subset and a string for the semantic subset. A minimal sketch of this pipeline is shown below.
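The exact parsing logic lives in the metric module; the following is only a minimal sketch of the pipeline described above. The helper names `extract_first_json` and `score_sample`, and the use of `ast.literal_eval` for dictionary-style references, are illustrative assumptions, not the module's actual API.
```python
import ast
import json


def extract_first_json(text: str):
    """Return the first JSON object embedded in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            continue
    return None


def score_sample(prediction: str, reference: str, subset: str) -> bool:
    """Illustrative per-sample scoring for the two ToT subsets."""
    parsed = extract_first_json(prediction)
    if parsed is None:
        # Malformed output (no parseable JSON object) counts as incorrect.
        return False
    if subset == "semantic":
        # Only the "answer" value is compared against the reference string.
        return str(parsed.get("answer")) == reference
    # Arithmetic: drop the explanation, then compare the remaining dictionary.
    parsed.pop("explanation", None)
    return parsed == ast.literal_eval(reference)  # references may use dict syntax
```
Attempting `raw_decode` at each `{` lets the extraction skip any free-text preamble the model produces before its JSON answer.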
## How to Use
You can load the metric using the `evaluate` library:
```python
import evaluate

metric = evaluate.load("aauss/test_of_time_accuracy")

# Arithmetic subset: the reference is a dictionary-style string; the
# "explanation" field in the prediction is ignored during comparison.
predictions = [
    '{"explanation": "Some explanation...", "unordered_list": ["London"]}',
    ' "Response without opening curly brackets...", "answer": "2005-04-07"}',
]
references = [
    '{"unordered_list": ["London"]}',
    "{'answer': '2005-04-07'}",
]
print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="arithmetic",
    )
)
# {'accuracy': 0.5}

print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="arithmetic",
        return_average=False,
    )
)
# {'accuracy': [True, False]}

# Semantic subset: only the "answer" value is compared to the reference.
predictions = [
    '{"explanation": "Some explanation leading to a wrong answer...", "answer": 1}',
    '{"explanation": "Some explanation ...", "answer": "1985"}',
]
references = ["0", "1985"]
print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="semantic",
    )
)
# {'accuracy': 0.5}
```
### Inputs
- **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string that contains a JSON object (e.g., generated by an LLM).
- **references** (`list` of `str`): List of reference answers: dictionary-style strings for the arithmetic subset, plain strings for the semantic subset.
- **subset** (`str`): The subset of the benchmark being evaluated. Must be one of:
  - `"arithmetic"`: Used for arithmetic tasks where the structure of the answer must be preserved (the `explanation` field is ignored).
  - `"semantic"`: Used for semantic tasks where only the `"answer"` value is compared.
- **return_average** (`bool`, optional): If `True`, returns the average accuracy. If `False`, returns a list of boolean scores (correct/incorrect) for each sample. Defaults to `True`.
### Output Values
The metric returns a dictionary with the following key:
- **accuracy** (`float` or `list` of `bool`): The accuracy score (0.0 to 1.0) if `return_average=True`, or a list of booleans indicating per-sample correctness if `return_average=False`.

Accuracy can take any value between 0.0 and 1.0, inclusive; higher scores are better.
#### Values from Popular Papers
Check out the original [paper](https://openreview.net/pdf?id=44CoQe6VCq) for reference performance numbers.
## Limitations and Bias
- The metric relies on `json.JSONDecoder` to find the first JSON object in the prediction string. If the model output is malformed or contains no valid JSON object, extraction fails (returning `None`) and the sample is scored as incorrect (see the example below).
- The extracted JSON must follow the format described in the task, i.e. the expected answer field(s) plus an optional `explanation` field, for the metric to work as intended.
- The paper does not describe the evaluation procedure in detail, but it is reasonable to assume that model answers were parsed there as well to allow for a more robust evaluation.
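For example, under the illustrative `extract_first_json` helper sketched earlier, a prediction that never opens a JSON object cannot be parsed and is scored as incorrect:
```python
# No "{" anywhere in the output, so no JSON object can be extracted.
print(extract_first_json("The answer is 2005-04-07, but no JSON object follows."))
# None -> the sample is scored as False
```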
## Citation
```bibtex
@InProceedings{huggingface:module,
  title  = {Test of Time Accuracy},
  author = {Auss Abbood},
  year   = {2025}
}
```