---
{}
---

# ERC Classifiers

This repository contains a model trained for multi-label classification of scientific papers in the ERC (European Research Council) context. Based on the title and abstract, the model predicts one or more categories for a paper, such as its research domain or topic.

## Model Description

The model is based on **SPECTER** (a transformer-based model pre-trained on scientific literature), fine-tuned for **multi-label classification** on a dataset of scientific papers. The model classifies papers into the **ERC categories**, predicting them from the title and abstract of each paper.

### Preprocessing

The preprocessing pipeline involves:

1. **Data Loading**: Papers are loaded from a Parquet file containing the title, abstract, and their respective categories.
2. **Label Cleaning**: Labels (categories) are processed to remove unnecessary information, such as content within parentheses.
3. **Label Encoding**: Categories are transformed into a binary matrix using the **MultiLabelBinarizer** from scikit-learn. Each category corresponds to a column whose value is `1` if the paper belongs to that category and `0` otherwise (see the sketch after this list).
4. **Statistics and Visualization**: Basic statistics and visualizations, such as label distributions, are generated to better understand the dataset.
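
A minimal sketch of steps 1-3, assuming the labels arrive as lists of strings in a `label` column; the parenthesis-stripping regex and the example category code are assumptions about the cleaning rule, not the exact pipeline:

```python
import re

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Step 1: load title, abstract, and label columns from the Parquet file
df = pd.read_parquet("train_ready.parquet")

# Step 2: drop parenthesised content from each label string,
# e.g. "PE6 (Computer Science and Informatics)" -> "PE6"
def clean_label(label: str) -> str:
    return re.sub(r"\s*\(.*?\)", "", label).strip()

df["label"] = df["label"].apply(lambda labels: [clean_label(l) for l in labels])

# Step 3: encode the label lists as a binary indicator matrix,
# one column per ERC category
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df["label"])
print(mlb.classes_)  # the category vocabulary
print(y.shape)       # (n_papers, n_categories)
```
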
### Training

The model is fine-tuned on the preprocessed dataset using the following setup (sketched in code after this list):

* **Base Model**: The `allenai/specter` transformer, used as the base model for sequence classification.
* **Optimizer**: AdamW with a learning rate of `5e-5`.
* **Loss Function**: Binary Cross-Entropy with logits (`BCEWithLogitsLoss`), as the task is multi-label classification.
* **Epochs**: The model is trained for **5 epochs** with a batch size of 4.
* **Training Data**: The model is trained on the processed dataset stored in `train_ready.parquet`.
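
A condensed sketch of this setup, continuing from the preprocessing sketch above (`df` and `y`); the dataset wiring shown here is an assumption, not the exact training script:

```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class PaperDataset(Dataset):
    """Pairs "title + abstract" strings with multi-hot label vectors."""
    def __init__(self, texts, labels):
        self.texts, self.labels = texts, labels
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, i):
        return self.texts[i], torch.tensor(self.labels[i], dtype=torch.float)

texts = (df["title"] + " " + df["abstract"]).tolist()
loader = DataLoader(PaperDataset(texts, y), batch_size=4, shuffle=True)

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/specter",
    num_labels=y.shape[1],
    problem_type="multi_label_classification",
)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss_fn = torch.nn.BCEWithLogitsLoss()

model.train()
for epoch in range(5):  # 5 epochs
    for batch_texts, batch_labels in loader:  # batch size 4
        inputs = tokenizer(list(batch_texts), return_tensors="pt",
                           padding=True, truncation=True)
        logits = model(**inputs).logits
        loss = loss_fn(logits, batch_labels)  # BCE over multi-hot targets
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
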
### Evaluation

The model is evaluated using both **single-label** and **multi-label** metrics:

#### Single-Label Evaluation

* **Accuracy**: Measured by checking how often the true label appears among the predicted labels (see the sketch after this list).
* **Precision, Recall, F1**: Calculated for each class and averaged over the entire dataset.
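
A minimal sketch of this containment-style accuracy; the label codes are illustrative, not the model's actual output:

```python
def single_label_accuracy(gold_labels, predicted_sets):
    """Fraction of papers whose single gold label appears among the
    labels the model predicted for that paper."""
    hits = sum(1 for gold, pred in zip(gold_labels, predicted_sets) if gold in pred)
    return hits / len(gold_labels)

# 2 of 3 papers have their gold label among the predictions -> 0.67
print(single_label_accuracy(
    ["PE6", "LS2", "SH1"],
    [{"PE6", "PE7"}, {"LS4"}, {"SH1"}],
))
```
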
#### Multi-Label Evaluation

* **Micro and Macro Metrics**: Precision, recall, and F1 scores are computed with both micro-averaging (pooling all label decisions) and macro-averaging (averaging per-label scores), as sketched after this list.
* **Label Frequency Plot**: A plot showing the frequency distribution of labels in the test set.
* **Top and Bottom F1 Plot**: A plot visualizing the best- and worst-performing labels by F1 score.
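
A sketch of the micro/macro computation with scikit-learn, assuming `y_true` and `y_pred` are binary indicator matrices like those produced by the MultiLabelBinarizer (the example values are made up):

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support

# Rows = papers, columns = ERC categories (illustrative values)
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

# Micro: pool all label decisions into one confusion matrix
micro = precision_recall_fscore_support(y_true, y_pred, average="micro",
                                        zero_division=0)
# Macro: score each label separately, then take the unweighted mean
macro = precision_recall_fscore_support(y_true, y_pred, average="macro",
                                        zero_division=0)
print("micro P/R/F1:", micro[:3])
print("macro P/R/F1:", macro[:3])

# Per-label F1, the quantity behind the top/bottom-F1 plot
print("per-label F1:", f1_score(y_true, y_pred, average=None, zero_division=0))
```
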
## Dataset

The dataset consists of scientific papers, each with the following columns:

* **title**: The title of the paper.
* **abstract**: The abstract of the paper.
* **label**: A list of categories (labels) assigned to the paper.

The dataset is preprocessed and stored in a `train_ready.parquet` file.
## Files

* `config.json`: Model configuration file.
* `model.safetensors`: Saved fine-tuned model weights.
* `tokenizer.json`: Tokenizer configuration for the fine-tuned model.
* `tokenizer_config.json`: Tokenizer settings.
* `special_tokens_map.json`: Special tokens used by the tokenizer.
* `vocab.txt`: Vocabulary file for the fine-tuned tokenizer.

## Usage

To use the model, follow these steps:

1. **Install Dependencies**:

```bash
pip install transformers torch datasets
```

2. **Load the Model and Tokenizer**:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "SIRIS-Lab/erc-classifiers"

# Load the fine-tuned model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

3. **Use the Model for Prediction**:

```python
import torch

# Example paper title and abstract
text = "Example title and abstract of a scientific paper."

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Make predictions
with torch.no_grad():
    logits = model(**inputs).logits

# Apply sigmoid activation to get per-label probabilities
probabilities = torch.sigmoid(logits)

# Get predicted labels (threshold at 0.5)
predicted_labels = (probabilities >= 0.5).long().cpu().numpy()
print(predicted_labels)
```
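
To turn the binary vector into category names, one option is the `id2label` mapping from the model config; this assumes the mapping was populated at fine-tuning time (if not, the MultiLabelBinarizer classes from preprocessing serve the same role):

```python
# Map predicted column indices back to ERC category names
# (assumes model.config.id2label was set during fine-tuning)
id2label = model.config.id2label
predicted_categories = [id2label[i]
                        for i, flag in enumerate(predicted_labels[0]) if flag]
print(predicted_categories)
```
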
## Conclusion

This model provides an efficient solution for classifying scientific papers into multiple categories based on their content. It uses state-of-the-art transformer-based techniques and is fine-tuned on a real-world dataset of ERC-related scientific papers.