# ERC Classifiers
This repository contains a model trained for multi-label classification of scientific papers in the context of the ERC (European Research Council). The model predicts one or more categories for each paper, such as its research domain or topic, from its title and abstract.
## Model Description
The model is based on **SPECTER**, a transformer pre-trained on scientific literature, fine-tuned for **multi-label classification** on a dataset of scientific papers. Given the title and abstract of a paper, it predicts one or more of the **ERC categories**.
### Preprocessing
The preprocessing pipeline involves:
1. **Data Loading**: Papers are loaded from a Parquet file containing the title, abstract, and their respective categories.
2. **Label Cleaning**: Labels (categories) are processed to remove any unnecessary information (like content within parentheses).
3. **Label Encoding**: Categories are transformed into a binary matrix using the **MultiLabelBinarizer** from scikit-learn. Each category corresponds to a column, and the value is `1` if the paper belongs to that category, `0` otherwise.
4. **Statistics and Visualization**: Basic statistics and visualizations, such as label distributions, are generated to help understand the dataset better.
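The label-cleaning and binarization steps (2 and 3) can be sketched as follows. The toy rows and the ERC-style label strings are illustrative, not taken from the actual dataset:

```python
import re

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the real Parquet data (columns as described in this README)
df = pd.DataFrame({
    "title": ["Paper A", "Paper B"],
    "abstract": ["Abstract A", "Abstract B"],
    "label": [["PE6 (Computer Science)", "PE1 (Mathematics)"], ["LS2 (Genetics)"]],
})

# Step 2: strip parenthesized content from each label string
def clean(label):
    return re.sub(r"\s*\(.*?\)", "", label).strip()

df["label"] = df["label"].apply(lambda labels: [clean(l) for l in labels])

# Step 3: binarize the label lists into a 0/1 matrix (one column per category)
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df["label"])
print(mlb.classes_)  # ['LS2' 'PE1' 'PE6']
print(y)
```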
### Training
The model is fine-tuned on the preprocessed dataset using the following setup:
* **Base Model**: The model uses the `allenai/specter` transformer as the base model for sequence classification.
* **Optimizer**: AdamW optimizer with a learning rate of `5e-5` is used.
* **Loss Function**: Binary Cross-Entropy with logits (`BCEWithLogitsLoss`) is employed, as the task is multi-label classification.
* **Epochs**: The model is trained for **5 epochs** with a batch size of 4.
* **Training Data**: The model is trained on a processed dataset stored in `train_ready.parquet`.
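A minimal sketch of this training setup is shown below. To keep the snippet self-contained, a small linear layer stands in for the SPECTER encoder plus classification head; in the real pipeline that stand-in would be `AutoModelForSequenceClassification.from_pretrained("allenai/specter", num_labels=..., problem_type="multi_label_classification")`, and the random tensors would be replaced by tokenized rows of `train_ready.parquet`. The number of labels is hypothetical:

```python
import torch
from torch import nn

NUM_LABELS = 27  # hypothetical count of ERC categories
HIDDEN = 768     # SPECTER's hidden size

# Tiny stand-in for the SPECTER encoder + classification head
model = nn.Linear(HIDDEN, NUM_LABELS)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss_fn = nn.BCEWithLogitsLoss()  # expects raw logits and 0/1 float targets

# Fake batch of 4 "papers" (batch size 4): random features and label vectors
features = torch.randn(4, HIDDEN)
targets = torch.randint(0, 2, (4, NUM_LABELS)).float()

model.train()
for epoch in range(5):  # 5 epochs, as in the setup above
    optimizer.zero_grad()
    logits = model(features)
    loss = loss_fn(logits, targets)
    loss.backward()
    optimizer.step()

print(float(loss))
```

Passing `problem_type="multi_label_classification"` to a `transformers` sequence-classification model makes it apply `BCEWithLogitsLoss` internally when labels are provided, so the loss does not need to be computed by hand.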
### Evaluation
The model is evaluated using both **single-label** and **multi-label** metrics:
#### Single-Label Evaluation
* **Accuracy**: The accuracy is measured by checking how often the true label appears among the predicted labels.
* **Precision, Recall, F1**: These metrics are calculated for each class and averaged over the entire dataset.
#### Multi-Label Evaluation
* **Micro and Macro Metrics**: Precision, recall, and F1 scores are computed with both micro-averaging (pooling all label decisions before scoring) and macro-averaging (unweighted mean of the per-label scores).
* **Label Frequency Plot**: A plot showing the frequency distribution of labels in the test set.
* **Top and Bottom F1 Plot**: A plot visualizing the top and bottom labels based on their F1 scores.
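The difference between micro- and macro-averaging can be illustrated with scikit-learn; the true/predicted matrices below are toy values, not model outputs:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy ground truth vs. predictions for 3 papers and 3 labels
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

# Micro: pool all label decisions, then compute one score
p_mi, r_mi, f_mi, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0)

# Macro: score each label separately, then take the unweighted mean
p_ma, r_ma, f_ma, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)

print(f"micro F1: {f_mi:.3f}, macro F1: {f_ma:.3f}")  # micro 0.750, macro 0.556
```

Micro-averaging is dominated by frequent labels, while macro-averaging weights every label equally, which is why a rare label with poor F1 drags the macro score down much more than the micro score.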
## Dataset
The dataset consists of scientific papers, each with the following columns:
* **title**: The title of the paper.
* **abstract**: The abstract of the paper.
* **label**: A list of categories (labels) assigned to the paper.
The dataset is preprocessed and stored in a `train_ready.parquet` file.
## Files
* `config.json`: Model configuration file.
* `model.safetensors`: Saved fine-tuned model weights.
* `tokenizer.json`: Tokenizer configuration for the fine-tuned model.
* `tokenizer_config.json`: Tokenizer settings.
* `special_tokens_map.json`: Special tokens used by the tokenizer.
* `vocab.txt`: Vocabulary file for the fine-tuned tokenizer.
## Usage
To use the model, follow these steps:
1. **Install Dependencies**:
```bash
pip install transformers torch datasets
```
2. **Load the Model and Tokenizer**:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = "SIRIS-Lab/erc-classifiers"
# Load fine-tuned model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
3. **Use the Model for Prediction**:
```python
import torch

# Example paper title and abstract
text = "Example title and abstract of a scientific paper."
# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
# Make predictions
with torch.no_grad():
    logits = model(**inputs).logits
# Apply sigmoid activation to get probabilities
probabilities = torch.sigmoid(logits)
# Get predicted labels (threshold at 0.5)
predicted_labels = (probabilities >= 0.5).long().cpu().numpy()
print(predicted_labels)
```
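To turn the binary prediction vector into category names, the column indices can be mapped through the model's label mapping. The mapping below is hypothetical; in practice it would come from `model.config.id2label`, assuming the fine-tuned config stores the category names:

```python
import numpy as np

# Hypothetical id2label mapping; read it from model.config.id2label in practice
id2label = {0: "LS2", 1: "PE1", 2: "PE6"}

# Stand-in for the thresholded output of the snippet above
predicted_labels = np.array([[1, 0, 1]])

predicted = [id2label[i] for i, flag in enumerate(predicted_labels[0]) if flag]
print(predicted)  # ['LS2', 'PE6']
```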
## Conclusion
This model provides an efficient solution for classifying scientific papers into multiple categories based on their content. It uses state-of-the-art transformer-based techniques and is fine-tuned on a real-world dataset of ERC-related scientific papers.