---
title: Test of Time Accuracy
datasets:
  - baharef/ToT
  - aauss/ToT_separate_instructions
tags:
  - evaluate
  - metric
  - temporal reasoning
description: Accuracy metric for the Test of Time benchmark by Fatemi et al. (2025).
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
pinned: false
emoji: π
colorFrom: gray
colorTo: indigo
---

# Metric Card for Test of Time Accuracy
## Metric Description

This metric is designed for the **Test of Time (ToT)** benchmark (Fatemi et al., 2025). It measures the accuracy of model predictions against reference answers. The metric expects model outputs to be formatted as a JSON object (e.g., `{"answer": "...", "explanation": "..."}`).

It performs the following steps:

1. Extracts the first valid JSON object from the model's prediction string.
2. Processes the JSON based on the specified `subset`:
   - **semantic**: Extracts the value of the `"answer"` field.
   - **arithmetic**: Removes the `"explanation"` field and compares the remaining dictionary (containing the answer) to the reference.
3. Compares the processed prediction with the reference to calculate accuracy; the compared value is a dictionary for the arithmetic subset and a string for the semantic subset.
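For illustration, here is a minimal sketch of that parsing logic in plain Python. It is not the metric's actual implementation, and the helper names `extract_first_json` and `process_prediction` are invented for this example.

```python
import json
from typing import Optional


def extract_first_json(text: str) -> Optional[dict]:
    """Return the first valid JSON object found in `text`, or None (hypothetical helper)."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            continue
    return None


def process_prediction(prediction: str, subset: str):
    """Reduce a raw model output to the value that is compared against the reference."""
    parsed = extract_first_json(prediction)
    if parsed is None:
        return None  # malformed output is scored as incorrect
    if subset == "semantic":
        return parsed.get("answer")      # only the "answer" value is compared
    if subset == "arithmetic":
        parsed.pop("explanation", None)  # drop the explanation, keep the answer structure
        return parsed
    raise ValueError(f"Unknown subset: {subset!r}")
```

Comparing the processed prediction with the reference then yields one boolean per sample, and the reported accuracy is the mean of those booleans when `return_average=True`.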
## How to Use

You can load the metric using the `evaluate` library:

```python
import evaluate

metric = evaluate.load("aauss/test_of_time_accuracy")

# The second prediction lacks an opening curly bracket, so JSON extraction fails.
predictions = [
    '{"explanation": "Some explanation...", "unordered_list": ["London"]}',
    ' "Response without opening curly brackets...", "answer": "2005-04-07"}',
]
references = [
    '{"unordered_list": ["London"]}',
    "{'answer': '2005-04-07'}",
]

print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="arithmetic",
    )
)
# 0.5

print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="arithmetic",
        return_average=False,
    )
)
# [True, False]

predictions = [
    '{"explanation": "Some explanation leading to a wrong answer...", "answer": 1}',
    '{"explanation": "Some explanation ...", "answer": "1985"}',
]
references = ["0", "1985"]

print(
    metric.compute(
        predictions=predictions,
        references=references,
        subset="semantic",
    )
)
# 0.5
```
### Inputs

- **predictions** (`list` of `str`): List of predictions to score. Each prediction should be a string that contains a JSON object (e.g., generated by an LLM).
- **references** (`list` of `str`): List of reference answers.
- **subset** (`str`): The subset of the benchmark being evaluated. Must be one of:
  - `"arithmetic"`: Used for arithmetic tasks where the answer might need structure preservation (ignores the explanation).
  - `"semantic"`: Used for semantic tasks where only the `"answer"` value is compared.
- **return_average** (`bool`, optional): If `True`, returns the average accuracy. If `False`, returns a list of boolean scores (correct/incorrect) for each sample. Defaults to `True`.
### Output Values

The metric returns a dictionary with the following keys:

- **accuracy** (`float` or `list` of `bool`): The accuracy score (0.0 to 1.0) if `return_average=True`, or a list of booleans indicating correctness per sample if `return_average=False`.

Accuracy can take any value between 0.0 and 1.0, inclusive. Higher scores are better.
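If you need more than the aggregate score, the per-sample results from `return_average=False` can be inspected directly. The snippet below is only an illustration, reusing `metric`, `predictions`, and `references` from the usage example above and assuming the call returns the list of booleans directly, as shown there.

```python
# Per-sample scores make it easy to locate the failing examples.
scores = metric.compute(
    predictions=predictions,
    references=references,
    subset="semantic",
    return_average=False,
)
accuracy = sum(scores) / len(scores)  # same value as return_average=True
failed = [i for i, correct in enumerate(scores) if not correct]
print(f"accuracy={accuracy:.2f}, failed sample indices={failed}")
# e.g. accuracy=0.50, failed sample indices=[0]
```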
#### Values from Popular Papers

Check out the original [paper](https://openreview.net/pdf?id=44CoQe6VCq) for some reference performances.
## Limitations and Bias

- The metric relies on `json.JSONDecoder` to find the first JSON object in the prediction string. If the model output is malformed or does not contain valid JSON, extraction may fail (returning `None`), causing the prediction to be scored as incorrect.
- The extracted JSON must strictly follow the format described in the task (an `"answer"` field for the semantic subset, the expected answer structure for the arithmetic subset), optionally accompanied by an `"explanation"` field, for the logic to work as intended.
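As a quick illustration of the first point, a hypothetical extraction helper mirroring the behavior described above returns `None` for replies that never open a JSON object, while tolerating extra prose around a well-formed one:

```python
import json

decoder = json.JSONDecoder()


def first_json_or_none(text: str):
    # Hypothetical helper mirroring the extraction step described above.
    for start, char in enumerate(text):
        if char == "{":
            try:
                return decoder.raw_decode(text, start)[0]
            except json.JSONDecodeError:
                continue
    return None


# No opening curly bracket: nothing can be parsed, so the sample is scored as incorrect.
print(first_json_or_none(' "Response without opening curly brackets...", "answer": "2005-04-07"}'))
# None

# Surrounding prose is fine as long as one valid JSON object is present.
print(first_json_or_none('Sure! {"answer": "1985"} Hope that helps.'))
# {'answer': '1985'}
```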
## Citation

Evaluation was not described in more detail in the paper, but we can assume that model answers were parsed to allow for a more robust evaluation.

```bibtex
@InProceedings{huggingface:module,
  title   = {Test of Time Accuracy},
  authors = {Auss Abbood},
  year    = {2025}
}
```