---
library_name: transformers
license: apache-2.0
base_model: google/pegasus-xsum
datasets:
- eilamc14/wikilarge-clean
language:
- en
tags:
- pegasus
- text-simplification
- WikiLarge
model-index:
- name: pegasus-xsum-text-simplification
  results:
  - task:
      type: text2text-generation
      name: Text Simplification
    dataset:
      name: ASSET
      type: facebook/asset
      url: https://huggingface.co/datasets/facebook/asset
      split: test
    metrics:
    - type: SARI
      value: 33.80
    - type: FKGL
      value: 9.23
    - type: BERTScore
      value: 87.54
    - type: LENS
      value: 62.46
    - type: Identical ratio
      value: 0.29
    - type: Identical ratio (ci)
      value: 0.29

  - task:
      type: text2text-generation
      name: Text Simplification
    dataset:
      name: MEDEASI
      type: cbasu/Med-EASi
      url: https://huggingface.co/datasets/cbasu/Med-EASi
      split: test
    metrics:
    - type: SARI
      value: 32.68
    - type: FKGL
      value: 10.98
    - type: BERTScore
      value: 45.14
    - type: LENS
      value: 50.55
    - type: Identical ratio
      value: 0.30
    - type: Identical ratio (ci)
      value: 0.30

  - task:
      type: text2text-generation
      name: Text Simplification
    dataset:
      name: OneStopEnglish
      type: OneStopEnglish
      url: https://github.com/nishkalavallabhi/OneStopEnglishCorpus
      split: advanced→elementary
    metrics:
    - type: SARI
      value: 37.07
    - type: FKGL
      value: 8.66
    - type: BERTScore
      value: 77.77
    - type: LENS
      value: 60.97
    - type: Identical ratio
      value: 0.40
    - type: Identical ratio (ci)
      value: 0.40
---

# Model Card for pegasus-xsum-text-simplification

This is one of the models fine-tuned for text simplification as part of the [Simplify This](https://github.com/eilamc14/Simplify-This) project.

## Model Details

### Model Description

Fine-tuned **sequence-to-sequence (encoder–decoder) Transformer** for **English text simplification**.  
Trained on the dataset **`eilamc14/wikilarge-clean`** (cleaned WikiLarge-style pairs).

- **Model type:** Seq2Seq Transformer (encoder–decoder)
- **Language (NLP):** English
- **License:** `apache-2.0`
- **Finetuned from model:** `google/pegasus-xsum`

### Model Sources

- **Repository (code):** https://github.com/eilamc14/Simplify-This
- **Dataset:** https://huggingface.co/datasets/eilamc14/wikilarge-clean

## Uses

### Direct Use

The model is intended for **English text simplification**.

- **Input format:** `Simplify: <complex sentence>`
- **Output:** `<simplified sentence>`

**Typical uses**
- Research on automatic text simplification
- Benchmarking against other simplification systems
- Demos/prototypes that require simpler English rewrites

### Downstream Use

This repository already contains a **fine-tuned** model specialized for text simplification.

Further fine-tuning is **optional** and mainly relevant when:
- Adapting to a markedly different domain (e.g., medical/legal/news)
- Addressing specific failure modes (e.g., over/under-simplification, factual drops)
- Distilling/quantizing for deployment constraints

When fine-tuning further, keep the same input convention: `Simplify: <...>`.

### Out-of-Scope Use

Not intended for:
- Tasks unrelated to simplification (dialogue, translation, etc.)
- Production use without additional safety filtering (no toxicity/bias mitigation)
- Languages other than English
- High-stakes settings (legal/medical advice, safety-critical decisions)


## Bias, Risks, and Limitations

The model was trained on **Wikipedia and Simple English Wikipedia** alignments (via WikiLarge).  
As a result, it inherits the characteristics and limitations of this data:

- **Domain bias:** Simplifications may reflect encyclopedic style; performance may degrade on informal, technical, or domain-specific text (e.g., medical/legal/news).
- **Content bias:** Wikipedia content itself contains biases in coverage, cultural perspective, and phrasing. Simplified outputs may reflect or amplify these.
- **Simplification quality:** The model may:
  - Over-simplify (drop important details)
  - Under-simplify (retain complex phrasing)
  - Produce ungrammatical or awkward rephrasings
- **Language limitation:** Only suitable for English. Applying to other languages is unsupported.
- **Safety limitation:** The model has not been aligned to avoid toxic, biased, or harmful content. If the input text contains such content, the output may reproduce or modify it without safeguards.


### Recommendations

- **Evaluation required:** Always evaluate the model in the target domain before deployment. Benchmark simplification quality (e.g., with SARI, FKGL, BERTScore, LENS, human evaluation).
- **Human oversight:** Use human-in-the-loop review for applications where meaning preservation is critical (education, accessibility tools, etc.).
- **Attribution:** Preserve source attribution where required (Wikipedia → CC BY-SA).
- **Not for high-stakes use:** Avoid legal, medical, or safety-critical applications without extensive validation and domain adaptation.

## How to Get Started with the Model

Load the model and tokenizer directly from the Hugging Face Hub:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "eilamc14/pegasus-xsum-text-simplification"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Example input
PREFIX = "Simplify: "
text = "The committee deemed the proposal unnecessarily complicated."

# Tokenize and generate
inputs = tokenizer(PREFIX+text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

### Training Data

The [WikiLarge-clean](https://huggingface.co/datasets/eilamc14/wikilarge-clean) dataset.

### Training Procedure

- **Hardware:** NVIDIA L4 GPU on Google Colab
- **Objective:** Standard sequence-to-sequence cross-entropy loss
- **Training type:** Full fine-tuning of all parameters (no LoRA/PEFT used)
- **Batching:** Dynamic padding with Hugging Face `Trainer` / PyTorch DataLoader
- **Evaluation:** Monitored on the `validation` split with metrics (SARI and identical_ratio)
- **Stopping criteria:** Early stopping callback based on validation performance
  
#### Preprocessing

The dataset was preprocessed by prefixing each source sentence with **"Simplify: "** and tokenizing both the source (inputs) and target (labels).
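
A minimal sketch of this step using the Hugging Face `datasets` library. The column names (`"complex"`, `"simple"`) and the truncation length are assumptions for illustration, not taken from the actual preprocessing script:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

PREFIX = "Simplify: "
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")

def preprocess(batch):
    # "complex" / "simple" are assumed column names; check the dataset card for the real ones
    inputs = [PREFIX + sentence for sentence in batch["complex"]]
    model_inputs = tokenizer(inputs, truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["simple"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

raw = load_dataset("eilamc14/wikilarge-clean")
tokenized_ds = raw.map(preprocess, batched=True, remove_columns=raw["train"].column_names)
```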

#### Memory & Checkpointing

To reduce VRAM during training, gradient checkpointing was enabled and the KV cache was disabled:

```python
model.config.use_cache = False          # required when using gradient checkpointing
model.gradient_checkpointing_enable()   # saves memory at the cost of extra compute
```

**Notes**
- Disabling `use_cache` avoids warnings/conflicts with gradient checkpointing and reduces memory usage in the forward pass.
- Gradient checkpointing trades **GPU memory ↓** for **training speed ↓** (extra recomputation).
- For **inference/evaluation**, re-enable the cache for faster generation:

```python
model.config.use_cache = True
```

#### Training Hyperparameters

The models were trained with Hugging Face `Seq2SeqTrainingArguments`.  
Hyperparameters varied slightly across models and runs during tuning, and full logs (batch size, steps, exact LR schedule) were not preserved.  
Below are the **typical defaults** used:

- **Epochs:** 5  
- **Evaluation strategy:** every 300 steps  
- **Save strategy:** every 300 steps (keep best model, `eval_loss` as criterion)  
- **Learning rate:** ~3e-5  
- **Batch size:** ~8-64, depending on model size
- **Optimizer:** `adamw_torch_fused`
- **Precision:** bf16 
- **Generation config (during eval):** `max_length=128`, `num_beams=4`, `predict_with_generate=True`  
- **Other settings:**  
  - Weight decay: 0.01  
  - Label smoothing: 0.1  
  - Warmup ratio: 0.1  
  - Max grad norm: 0.5  
  - Dataloader workers: 8 (L4 GPU)  

> Because hyperparameters were adjusted between runs and not all were logged, exact reproduction may differ slightly.
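
As a rough reconstruction, the defaults above could be wired up as follows. This is a sketch, not the exact training script: the output directory, early-stopping patience, and concrete batch size are assumptions, and `tokenizer` / `tokenized_ds` come from the preprocessing sketch above.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-xsum")
model.config.use_cache = False          # see "Memory & Checkpointing" above
model.gradient_checkpointing_enable()

training_args = Seq2SeqTrainingArguments(
    output_dir="pegasus-xsum-text-simplification",  # assumed name
    num_train_epochs=5,
    eval_strategy="steps",              # "evaluation_strategy" on older transformers releases
    eval_steps=300,
    save_strategy="steps",
    save_steps=300,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    learning_rate=3e-5,
    per_device_train_batch_size=16,     # actual value varied (~8-64) with model size
    weight_decay=0.01,
    label_smoothing_factor=0.1,
    warmup_ratio=0.1,
    max_grad_norm=0.5,
    optim="adamw_torch_fused",
    bf16=True,
    predict_with_generate=True,
    generation_max_length=128,
    generation_num_beams=4,
    dataloader_num_workers=8,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),   # dynamic padding per batch
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],   # patience is an assumption
    # compute_metrics for SARI / identical_ratio omitted for brevity
)
trainer.train()
```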

## Evaluation

### Testing Data

- [**ASSET**](https://huggingface.co/datasets/facebook/asset) (test subset)
- [**MEDEASI**](https://huggingface.co/datasets/cbasu/Med-EASi) (test subset)
- [**OneStopEnglish**](https://github.com/nishkalavallabhi/OneStopEnglishCorpus) (advanced → elementary)

### Metrics

- **Identical ratio** — share of outputs identical to the source after basic, language-agnostic normalization (strip, NFKC, collapse whitespace); see the sketch below
- **Identical ratio (ci)** — the same ratio computed case-insensitively
- **SARI** — main simplification metric (higher is better)
- **FKGL** — readability grade level (lower is simpler)
- **BERTScore (F1)** — semantic similarity (higher is better)
- **LENS** — composite simplification quality score (higher is better)
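
A small sketch of how the identical ratio can be computed, following the normalization described above (my reading of the metric description, not the exact evaluation code):

```python
import unicodedata

def normalize(text: str, case_insensitive: bool = False) -> str:
    # Basic, language-agnostic normalization: strip, NFKC, collapse whitespace
    text = unicodedata.normalize("NFKC", text.strip())
    text = " ".join(text.split())
    return text.casefold() if case_insensitive else text

def identical_ratio(sources, outputs, case_insensitive: bool = False) -> float:
    same = sum(
        normalize(src, case_insensitive) == normalize(out, case_insensitive)
        for src, out in zip(sources, outputs)
    )
    return same / len(sources)
```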
  
### Generation Arguments

```python
gen_args = dict(
    max_new_tokens=64,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
    early_stopping=True,
    do_sample=False,
)
```
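
These arguments can be unpacked directly into generation, e.g. `model.generate(**inputs, **gen_args)`, with the same `Simplify: ` prefix as in the quick-start snippet above.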

### Results

| Dataset            | Identical ratio | Identical ratio (ci) |  SARI | FKGL | BERTScore |  LENS |
|--------------------|----------------:|---------------------:|------:|-----:|----------:|------:|
| **ASSET**          |            0.29 |                 0.29 | 33.80 | 9.23 |     87.54 | 62.46 |
| **MEDEASI**        |            0.30 |                 0.30 | 32.68 | 10.98|     45.14 | 50.55 |
| **OneStopEnglish** |            0.40 |                 0.40 | 37.07 | 8.66 |     77.77 | 60.97 |


## Environmental Impact

- **Hardware Type:** Single NVIDIA L4 GPU (Google Colab)
- **Hours used:** Approx. 5–10
- **Cloud Provider:** Google Cloud (via Colab)
- **Compute Region:** Unknown (Google Colab dynamic allocation)
- **Carbon Emitted:** Estimated to be very low (< a few kg CO₂eq), since training was limited to a single GPU for a small number of hours.

## Citation

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]