---
library_name: transformers
license: apache-2.0
base_model: google/pegasus-xsum
datasets:
- eilamc14/wikilarge-clean
language:
- en
tags:
- pegasus
- text-simplification
- WikiLarge
model-index:
- name: pegasus-xsum-text-simplification
  results:
  - task:
      type: text2text-generation
      name: Text Simplification
    dataset:
      name: ASSET
      type: facebook/asset
      url: https://huggingface.co/datasets/facebook/asset
      split: test
    metrics:
    - type: SARI
      value: 33.80
    - type: FKGL
      value: 9.23
    - type: BERTScore
      value: 87.54
    - type: LENS
      value: 62.46
    - type: Identical ratio
      value: 0.29
    - type: Identical ratio (ci)
      value: 0.29
  - task:
      type: text2text-generation
      name: Text Simplification
    dataset:
      name: MEDEASI
      type: cbasu/Med-EASi
      url: https://huggingface.co/datasets/cbasu/Med-EASi
      split: test
    metrics:
    - type: SARI
      value: 32.68
    - type: FKGL
      value: 10.98
    - type: BERTScore
      value: 45.14
    - type: LENS
      value: 50.55
    - type: Identical ratio
      value: 0.30
    - type: Identical ratio (ci)
      value: 0.30
  - task:
      type: text2text-generation
      name: Text Simplification
    dataset:
      name: OneStopEnglish
      type: OneStopEnglish
      url: https://github.com/nishkalavallabhi/OneStopEnglishCorpus
      split: advanced→elementary
    metrics:
    - type: SARI
      value: 37.07
    - type: FKGL
      value: 8.66
    - type: BERTScore
      value: 77.77
    - type: LENS
      value: 60.97
    - type: Identical ratio
      value: 0.40
    - type: Identical ratio (ci)
      value: 0.40
---
# Model Card for pegasus-xsum-text-simplification
This is one of the models fine-tuned for English text simplification as part of the [Simplify This](https://github.com/eilamc14/Simplify-This) project.
## Model Details
### Model Description
Fine-tuned **sequence-to-sequence (encoder–decoder) Transformer** for **English text simplification**.
Trained on the dataset **`eilamc14/wikilarge-clean`** (cleaned WikiLarge-style pairs).
- **Model type:** Seq2Seq Transformer (encoder–decoder)
- **Language (NLP):** English
- **License:** `apache-2.0`
- **Finetuned from model:** `google/pegasus-xsum`
### Model Sources
- **Repository (code):** https://github.com/eilamc14/Simplify-This
- **Dataset:** https://huggingface.co/datasets/eilamc14/wikilarge-clean
## Uses
### Direct Use
The model is intended for **English text simplification**.
- **Input format:** `Simplify: <complex sentence>`
- **Output:** `<simplified sentence>`
**Typical uses**
- Research on automatic text simplification
- Benchmarking against other simplification systems
- Demos/prototypes that require simpler English rewrites
### Downstream Use
This repository already contains a **fine-tuned** model specialized for text simplification.
Further fine-tuning is **optional** and mainly relevant when:
- Adapting to a markedly different domain (e.g., medical/legal/news)
- Addressing specific failure modes (e.g., over/under-simplification, factual drops)
- Distilling/quantizing for deployment constraints
When fine-tuning further, keep the same input convention: `Simplify: <...>`.
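As a sketch (not the project's actual training code), continued fine-tuning with `Seq2SeqTrainer` could look like the following; the in-domain example pair, output path, and hyperparameter values are placeholders:
```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_id = "eilamc14/pegasus-xsum-text-simplification"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Hypothetical in-domain pairs; replace with your own corpus.
raw = Dataset.from_dict({
    "complex": ["Hypertension is frequently asymptomatic in its initial stages."],
    "simple": ["High blood pressure often has no symptoms at first."],
})

def to_features(batch):
    # Keep the same input convention used for the original fine-tuning.
    enc = tokenizer(["Simplify: " + s for s in batch["complex"]],
                    truncation=True, max_length=128)
    enc["labels"] = tokenizer(text_target=batch["simple"],
                              truncation=True, max_length=128)["input_ids"]
    return enc

train_ds = raw.map(to_features, batched=True, remove_columns=raw.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="pegasus-simplification-domain",  # placeholder path
        per_device_train_batch_size=8,
        learning_rate=3e-5,
        num_train_epochs=3,
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```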
### Out-of-Scope Use
Not intended for:
- Tasks unrelated to simplification (dialogue, translation, etc.)
- Production use without additional safety filtering (no toxicity/bias mitigation)
- Languages other than English
- High-stakes settings (legal/medical advice, safety-critical decisions)
## Bias, Risks, and Limitations
The model was trained on **Wikipedia and Simple English Wikipedia** alignments (via WikiLarge).
As a result, it inherits the characteristics and limitations of this data:
- **Domain bias:** Simplifications may reflect encyclopedic style; performance may degrade on informal, technical, or domain-specific text (e.g., medical/legal/news).
- **Content bias:** Wikipedia content itself contains biases in coverage, cultural perspective, and phrasing. Simplified outputs may reflect or amplify these.
- **Simplification quality:** The model may:
- Over-simplify (drop important details)
- Under-simplify (retain complex phrasing)
- Produce ungrammatical or awkward rephrasings
- **Language limitation:** Only suitable for English. Applying to other languages is unsupported.
- **Safety limitation:** The model has not been aligned to avoid toxic, biased, or harmful content. If the input text contains such content, the output may reproduce or modify it without safeguards.
### Recommendations
- **Evaluation required:** Always evaluate the model in the target domain before deployment. Benchmark simplification quality (e.g., with SARI, FKGL, BERTScore, LENS, human evaluation).
- **Human oversight:** Use human-in-the-loop review for applications where meaning preservation is critical (education, accessibility tools, etc.).
- **Attribution:** Preserve source attribution where required (Wikipedia → CC BY-SA).
- **Not for high-stakes use:** Avoid legal, medical, or safety-critical applications without extensive validation and domain adaptation.
## How to Get Started with the Model
Load the model and tokenizer directly from the Hugging Face Hub:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
model_id = "eilamc14/bart-base-text-simplification"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
# Example input
PREFIX = "Simplify: "
text = "The committee deemed the proposal unnecessarily complicated."
# Tokenize and generate
inputs = tokenizer(PREFIX+text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Details
### Training Data
The [wikilarge-clean](https://huggingface.co/datasets/eilamc14/wikilarge-clean) dataset (cleaned WikiLarge-style complex–simple sentence pairs).
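The data can be pulled directly from the Hub; the split and column accesses below are only for inspection (check the dataset card for the exact schema):
```python
from datasets import load_dataset

ds = load_dataset("eilamc14/wikilarge-clean")
print(ds)              # available splits and columns
print(ds["train"][0])  # one complex/simple pair (assumes a "train" split)
```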
### Training Procedure
- **Hardware:** NVIDIA L4 GPU on Google Colab
- **Objective:** Standard sequence-to-sequence cross-entropy loss
- **Training type:** Full fine-tuning of all parameters (no LoRA/PEFT used)
- **Batching:** Dynamic padding with Hugging Face `Trainer` / PyTorch DataLoader
- **Evaluation:** Monitored on the `validation` split with SARI and identical-ratio metrics
- **Stopping criteria:** Early stopping (`EarlyStoppingCallback`) based on validation performance
#### Preprocessing
The dataset was preprocessed by prefixing each source sentence with **"Simplify: "** and tokenizing both the source (inputs) and target (labels).
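A minimal sketch of that step, assuming hypothetical `complex`/`simple` column names (the dataset's actual column names may differ):
```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")
PREFIX = "Simplify: "
MAX_LEN = 128  # assumed; matches the eval generation max_length used below

def preprocess(batch):
    # Prefix and tokenize the complex source sentences as model inputs.
    enc = tokenizer([PREFIX + s for s in batch["complex"]],
                    truncation=True, max_length=MAX_LEN)
    # Tokenize the simple target sentences as labels.
    enc["labels"] = tokenizer(text_target=batch["simple"],
                              truncation=True, max_length=MAX_LEN)["input_ids"]
    return enc

ds = load_dataset("eilamc14/wikilarge-clean")
tokenized = ds.map(preprocess, batched=True)
```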
#### Memory & Checkpointing
To reduce VRAM during training, gradient checkpointing was enabled and the KV cache was disabled:
```python
model.config.use_cache = False # required when using gradient checkpointing
model.gradient_checkpointing_enable() # saves memory at the cost of extra compute
```
**Notes**
- Disabling `use_cache` avoids warnings/conflicts with gradient checkpointing and reduces memory usage in the forward pass.
- Gradient checkpointing trades **GPU memory ↓** for **training speed ↓** (extra recomputation).
- For **inference/evaluation**, re-enable the cache for faster generation:
```python
model.config.use_cache = True
```
#### Training Hyperparameters
The models were trained with Hugging Face `Seq2SeqTrainingArguments`.
Hyperparameters varied slightly across models and runs during tuning, and full logs (batch size, steps, exact LR schedule) were not preserved.
Below are the **typical defaults** used (a configuration sketch follows this list):
- **Epochs:** 5
- **Evaluation strategy:** every 300 steps
- **Save strategy:** every 300 steps (keep best model, `eval_loss` as criterion)
- **Learning rate:** ~3e-5
- **Batch size:** ~8–64, depending on model size
- **Optimizer:** `adamw_torch_fused`
- **Precision:** bf16
- **Generation config (during eval):** `max_length=128`, `num_beams=4`, `predict_with_generate=True`
- **Other settings:**
- Weight decay: 0.01
- Label smoothing: 0.1
- Warmup ratio: 0.1
- Max grad norm: 0.5
- Dataloader workers: 8 (L4 GPU)
> Because hyperparameters were adjusted between runs and not all were logged, exact reproduction may differ slightly.
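For reference, a hedged sketch of a `Seq2SeqTrainingArguments` setup matching the typical defaults above; exact values differed between runs, and the output path and early-stopping patience are assumptions:
```python
from transformers import EarlyStoppingCallback, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="pegasus-xsum-text-simplification",  # placeholder path
    num_train_epochs=5,
    eval_strategy="steps",        # `evaluation_strategy` on older transformers versions
    eval_steps=300,
    save_strategy="steps",
    save_steps=300,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    learning_rate=3e-5,
    per_device_train_batch_size=8,  # 8-64 depending on model size
    optim="adamw_torch_fused",
    bf16=True,
    predict_with_generate=True,
    generation_max_length=128,
    generation_num_beams=4,
    weight_decay=0.01,
    label_smoothing_factor=0.1,
    warmup_ratio=0.1,
    max_grad_norm=0.5,
    dataloader_num_workers=8,
)

# Early stopping on validation performance (patience value is an assumption).
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
```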
## Evaluation
### Testing Data
- [**ASSET**](https://huggingface.co/datasets/facebook/asset) (test subset)
- [**MEDEASI**](https://huggingface.co/datasets/cbasu/Med-EASi) (test subset)
- [**OneStopEnglish**](https://github.com/nishkalavallabhi/OneStopEnglishCorpus) (advanced → elementary)
### Metrics
- **Identical ratio** — share of outputs identical to their source after basic, language-agnostic normalization (strip, NFKC, collapse whitespace); a computation sketch follows this list
- **Identical ratio (ci)** — case-insensitive variant of the identical ratio
- **SARI** — main simplification metric (higher is better)
- **FKGL** — readability grade level (lower is simpler)
- **BERTScore (F1)** — semantic similarity (higher is better)
- **LENS** — composite simplification quality score (higher is better)
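A minimal sketch of how the identical ratio and SARI could be computed with the `evaluate` library; the normalization helper and example sentences are illustrative, not the project's exact evaluation code:
```python
import unicodedata
import evaluate

def normalize(text: str) -> str:
    # Basic, language-agnostic normalization: strip, NFKC, collapse whitespace.
    return " ".join(unicodedata.normalize("NFKC", text.strip()).split())

sources     = ["The committee deemed the proposal unnecessarily complicated."]
predictions = ["The committee thought the proposal was too complicated."]
references  = [["The committee thought the plan was too complicated."]]

# Identical ratio: share of outputs equal to their source after normalization.
identical = sum(normalize(p) == normalize(s)
                for p, s in zip(predictions, sources)) / len(sources)
identical_ci = sum(normalize(p).lower() == normalize(s).lower()
                   for p, s in zip(predictions, sources)) / len(sources)

# SARI via the evaluate library (higher is better).
sari = evaluate.load("sari")
print(sari.compute(sources=sources, predictions=predictions, references=references))
print({"identical_ratio": identical, "identical_ratio_ci": identical_ci})
```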
### Generation Arguments
```python
gen_args = dict(
max_new_tokens=64,
num_beams=4,
length_penalty=1.0,
no_repeat_ngram_size=3,
early_stopping=True,
do_sample=False,
)
```
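These arguments are unpacked directly into `model.generate`, reusing the `model`, `tokenizer`, and `PREFIX` from the quick-start snippet above:
```python
inputs = tokenizer(PREFIX + "The committee deemed the proposal unnecessarily complicated.",
                   return_tensors="pt")
outputs = model.generate(**inputs, **gen_args)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```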
### Results
| Dataset | Identical ratio | Identical ratio (ci) | SARI | FKGL | BERTScore | LENS |
|--------------------|----------------:|---------------------:|------:|-----:|----------:|------:|
| **ASSET** | 0.29 | 0.29 | 33.80 | 9.23 | 87.54 | 62.46 |
| **MEDEASI** | 0.30 | 0.30 | 32.68 | 10.98| 45.14 | 50.55 |
| **OneStopEnglish** | 0.40 | 0.40 | 37.07 | 8.66 | 77.77 | 60.97 |
## Environmental Impact
- **Hardware Type:** Single NVIDIA L4 GPU (Google Colab)
- **Hours used:** Approx. 5–10
- **Cloud Provider:** Google Cloud (via Colab)
- **Compute Region:** Unknown (Google Colab dynamic allocation)
- **Carbon Emitted:** Estimated to be very low (on the order of a few kg CO₂eq or less), since training was limited to a single GPU for a small number of hours.