Spaces:

huggingface
/

metric-explorer

Build error

App Files Files Community

metric-explorer / rouge_metric_card.md

Sasha

initial metric card explorer commit

0a279d3 almost 4 years ago

preview code

raw

history blame contribute delete

6.07 kB

	# Metric Card for ROUGE

	## Metric Description
	ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation.

	Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters.

	This metrics is a wrapper around the [Google Research reimplementation of ROUGE](https://github.com/google-research/google-research/tree/master/rouge)

	## How to Use
	At minimum, this metric takes as input a list of predictions and a list of references:
	```python
	>>> rouge = datasets.load_metric('rouge')
	>>> predictions = ["hello there", "general kenobi"]
	>>> references = ["hello there", "general kenobi"]
	>>> results = rouge.compute(predictions=predictions,
	... references=references)
	>>> print(list(results.keys()))
	['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
	>>> print(results["rouge1"])
	AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))
	>>> print(results["rouge1"].mid.fmeasure)
	1.0
	```

	### Inputs
	- predictions (`list`): list of predictions to score. Each prediction
	should be a string with tokens separated by spaces.
	- references (`list`): list of reference for each prediction. Each
	reference should be a string with tokens separated by spaces.
	- rouge_types (`list`): A list of rouge types to calculate. Defaults to `['rouge1', 'rouge2', 'rougeL', 'rougeLsum']`.
	- Valid rouge types:
	- `"rouge1"`: unigram (1-gram) based scoring
	- `"rouge2"`: bigram (2-gram) based scoring
	- `"rougeL"`: Longest common subsequence based scoring.
	- `"rougeLSum"`: splits text using `"\n"`
	- See [here](https://github.com/huggingface/datasets/issues/617) for more information
	- use_aggregator (`boolean`): If True, returns aggregates. Defaults to `True`.
	- use_stemmer (`boolean`): If `True`, uses Porter stemmer to strip word suffixes. Defaults to `False`.

	### Output Values
	The output is a dictionary with one entry for each rouge type in the input list `rouge_types`. If `use_aggregator=False`, each dictionary entry is a list of Score objects, with one score for each sentence. Each Score object includes the `precision`, `recall`, and `fmeasure`. E.g. if `rouge_types=['rouge1', 'rouge2']` and `use_aggregator=False`, the output is:

	```python
	{'rouge1': [Score(precision=1.0, recall=0.5, fmeasure=0.6666666666666666), Score(precision=1.0, recall=1.0, fmeasure=1.0)], 'rouge2': [Score(precision=0.0, recall=0.0, fmeasure=0.0), Score(precision=1.0, recall=1.0, fmeasure=1.0)]}
	```

	If `rouge_types=['rouge1', 'rouge2']` and `use_aggregator=True`, the output is of the following format:
	```python
	{'rouge1': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0)), 'rouge2': AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))}
	```

	The `precision`, `recall`, and `fmeasure` values all have a range of 0 to 1.


	#### Values from Popular Papers


	### Examples
	An example without aggregation:
	```python
	>>> rouge = datasets.load_metric('rouge')
	>>> predictions = ["hello goodbye", "ankh morpork"]
	>>> references = ["goodbye", "general kenobi"]
	>>> results = rouge.compute(predictions=predictions,
	... references=references)
	>>> print(list(results.keys()))
	['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
	>>> print(results["rouge1"])
	[Score(precision=0.5, recall=0.5, fmeasure=0.5), Score(precision=0.0, recall=0.0, fmeasure=0.0)]
	```

	The same example, but with aggregation:
	```python
	>>> rouge = datasets.load_metric('rouge')
	>>> predictions = ["hello goodbye", "ankh morpork"]
	>>> references = ["goodbye", "general kenobi"]
	>>> results = rouge.compute(predictions=predictions,
	... references=references,
	... use_aggregator=True)
	>>> print(list(results.keys()))
	['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
	>>> print(results["rouge1"])
	AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.25, recall=0.25, fmeasure=0.25), high=Score(precision=0.5, recall=0.5, fmeasure=0.5))
	```

	The same example, but only calculating `rouge_1`:
	```python
	>>> rouge = datasets.load_metric('rouge')
	>>> predictions = ["hello goodbye", "ankh morpork"]
	>>> references = ["goodbye", "general kenobi"]
	>>> results = rouge.compute(predictions=predictions,
	... references=references,
	... rouge_types=['rouge_1'],
	... use_aggregator=True)
	>>> print(list(results.keys()))
	['rouge1']
	>>> print(results["rouge1"])
	AggregateScore(low=Score(precision=0.0, recall=0.0, fmeasure=0.0), mid=Score(precision=0.25, recall=0.25, fmeasure=0.25), high=Score(precision=0.5, recall=0.5, fmeasure=0.5))
	```

	## Limitations and Bias
	See [Schluter (2017)](https://aclanthology.org/E17-2007/) for an in-depth discussion of many of ROUGE's limits.

	## Citation
	```bibtex
	@inproceedings{lin-2004-rouge,
	title = "{ROUGE}: A Package for Automatic Evaluation of Summaries",
	author = "Lin, Chin-Yew",
	booktitle = "Text Summarization Branches Out",
	month = jul,
	year = "2004",
	address = "Barcelona, Spain",
	publisher = "Association for Computational Linguistics",
	url = "https://www.aclweb.org/anthology/W04-1013",
	pages = "74--81",
	}
	```

	## Further References
	- This metrics is a wrapper around the [Google Research reimplementation of ROUGE](https://github.com/google-research/google-research/tree/master/rouge)