---
{}
---
# ERC Classifiers

This repository contains a model trained for multi-label classification of scientific papers in the ERC (European Research Council) context. The model predicts multiple categories for a paper, such as its research domain or topic, based on the abstract and title.

## Model Description

The model is based on **SPECTER** (a transformer-based model pre-trained on scientific literature), fine-tuned for **multi-label classification** on a dataset of scientific papers. Given the title and abstract of a paper, it predicts one or more of the **ERC categories**.

### Preprocessing

The preprocessing pipeline involves:

1. **Data Loading**: Papers are loaded from a Parquet file containing the title, abstract, and their respective categories.
2. **Label Cleaning**: Labels (categories) are processed to remove any unnecessary information (like content within parentheses).
3. **Label Encoding**: Categories are transformed into a binary matrix using the **MultiLabelBinarizer** from scikit-learn. Each category corresponds to a column, and the value is `1` if the paper belongs to that category, `0` otherwise.
4. **Statistics and Visualization**: Basic statistics and visualizations, such as label distributions, are generated to help understand the dataset better.
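
The label cleaning and encoding steps above can be sketched as follows. This is a minimal illustration, not the repository's actual pipeline: the label strings, the `clean_label` helper, and the regular expression are assumptions made for the example, and the Parquet loading step is replaced by an in-memory list.

```python
import re

from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical raw labels; the real dataset's label strings may look different.
raw_labels = [
    ["PE6 (informatics)", "LS2 (genomics)"],
    ["PE6 (informatics)"],
    ["SH1 (economics)"],
]


def clean_label(label: str) -> str:
    """Drop any parenthesised text and surrounding whitespace (assumed cleaning rule)."""
    return re.sub(r"\s*\([^)]*\)", "", label).strip()


cleaned = [[clean_label(label) for label in paper] for paper in raw_labels]

# One column per category; 1 if the paper has that category, 0 otherwise.
mlb = MultiLabelBinarizer()
label_matrix = mlb.fit_transform(cleaned)

print(list(mlb.classes_))
print(label_matrix)
```

`mlb.classes_` holds the column order of the matrix, so it is worth saving alongside the model if predictions need to be mapped back to category names later.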

### Training

The model is fine-tuned on the preprocessed dataset using the following setup:

* **Base Model**: The model uses the `allenai/specter` transformer as the base model for sequence classification.
* **Optimizer**: AdamW optimizer with a learning rate of `5e-5` is used.
* **Loss Function**: Binary Cross-Entropy with logits (`BCEWithLogitsLoss`) is employed, as the task is multi-label classification.
* **Epochs**: The model is trained for **5 epochs** with a batch size of 4.
* **Training Data**: The model is trained on a processed dataset stored in `train_ready.parquet`.
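
As a rough sketch of how these settings fit together, the loop below applies AdamW (learning rate `5e-5`) and `BCEWithLogitsLoss` to multi-hot targets for 5 epochs with a batch of 4. To stay self-contained it uses a small `nn.Linear` layer and random features as a stand-in for the SPECTER encoder and classification head; `NUM_LABELS`, `HIDDEN_SIZE`, and the data are assumptions for illustration, not the authors' training script.

```python
import torch
from torch import nn

torch.manual_seed(0)

NUM_LABELS = 3   # hypothetical number of ERC categories
HIDDEN_SIZE = 8  # stand-in for the encoder's hidden size

# Stand-in for "SPECTER encoder + classification head" (assumption).
classifier = nn.Linear(HIDDEN_SIZE, NUM_LABELS)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=5e-5)
loss_fn = nn.BCEWithLogitsLoss()  # applies a sigmoid per label internally

# One batch of 4 "papers": random features and multi-hot float targets.
features = torch.randn(4, HIDDEN_SIZE)
targets = torch.tensor(
    [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=torch.float32
)

epoch_losses = []
for epoch in range(5):
    optimizer.zero_grad()
    logits = classifier(features)    # raw scores, shape (4, NUM_LABELS)
    loss = loss_fn(logits, targets)  # mean BCE over all paper/label pairs
    loss.backward()
    optimizer.step()
    epoch_losses.append(loss.item())

print(epoch_losses)
```

The targets must be floats of the same shape as the logits, because each label is treated as an independent binary decision. In Hugging Face `transformers`, sequence-classification models can also be configured with `problem_type="multi_label_classification"`, which selects the same loss internally; whether the authors used that option is not stated here.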

### Evaluation

The model is evaluated using both **single-label** and **multi-label** metrics:

#### Single-Label Evaluation

* **Accuracy**: A prediction counts as correct when the paper's true label appears among the predicted labels; accuracy is the fraction of papers for which this holds.
* **Precision, Recall, F1**: These metrics are calculated for each class and averaged for the entire dataset.
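
The accuracy definition above can be illustrated with a few lines of plain Python. The function name `hit_accuracy` and the example labels are hypothetical; this is one reading of the description, not the repository's evaluation code.

```python
def hit_accuracy(true_labels, predicted_label_sets):
    """Fraction of papers whose single true label is among the predicted labels."""
    hits = sum(
        1 for true, predicted in zip(true_labels, predicted_label_sets)
        if true in predicted
    )
    return hits / len(true_labels)


# Hypothetical example: one true label per paper, a set of predictions per paper.
true_labels = ["PE6", "LS2", "SH1"]
predicted_label_sets = [{"PE6", "PE7"}, {"LS3"}, {"SH1"}]

accuracy = hit_accuracy(true_labels, predicted_label_sets)
print(accuracy)
```

Note that this measure does not penalise extra predicted labels, so it is best read together with precision.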

#### Multi-Label Evaluation

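* **Micro and Macro Metrics**: Precision, recall, and F1 scores are computed using both micro-averaging (true positives, false positives, and false negatives are pooled across all labels, so frequent labels dominate) and macro-averaging (the score is computed per label and then averaged with equal weight, so rare labels count as much as frequent ones).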
* **Label Frequency Plot**: A plot showing the frequency distribution of labels in the test set.
* **Top and Bottom F1 Plot**: A plot visualizing the top and bottom labels based on their F1 scores.
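
A small sketch of how these metrics could be computed with scikit-learn is shown below. The `y_true` and `y_pred` matrices are made-up multi-hot examples, and the choice of `zero_division=0` is an assumption; the per-label F1 array is the kind of data a top/bottom F1 plot would be drawn from.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical multi-hot ground truth and predictions (4 papers, 3 labels).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

scores = {}
for average in ("micro", "macro"):
    scores[average] = {
        "precision": precision_score(y_true, y_pred, average=average, zero_division=0),
        "recall": recall_score(y_true, y_pred, average=average, zero_division=0),
        "f1": f1_score(y_true, y_pred, average=average, zero_division=0),
    }

# One F1 score per label, in column order.
per_label_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)

print(scores)
print(per_label_f1)
```

In this toy example the micro F1 is higher than the macro F1 only slightly, but on imbalanced label sets the gap can be large, which is why both are reported.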

## Dataset

The dataset consists of scientific papers, each with the following columns:

* **title**: The title of the paper.
* **abstract**: The abstract of the paper.
* **label**: A list of categories (labels) assigned to the paper.

The dataset is preprocessed and stored in a `train_ready.parquet` file.

## Files

* `config.json`: Model configuration file.
* `model.safetensors`: Saved fine-tuned model weights.
* `tokenizer.json`: Serialized tokenizer used with the fine-tuned model.
* `tokenizer_config.json`: Tokenizer settings.
* `special_tokens_map.json`: Special tokens used by the tokenizer.
* `vocab.txt`: Vocabulary file used by the tokenizer.

## Usage

To use the model, follow these steps:

1. **Install Dependencies**:

   ```bash
   pip install transformers torch datasets
   ```

2. **Load the Model and Tokenizer**:

   ```python
   from transformers import AutoModelForSequenceClassification, AutoTokenizer

   model_name = "SIRIS-Lab/erc-classifiers"

   # Load fine-tuned model and tokenizer
   model = AutoModelForSequenceClassification.from_pretrained(model_name)
   tokenizer = AutoTokenizer.from_pretrained(model_name)
   ```

3. **Use the Model for Prediction**:

   ```python
   import torch

   # Example paper title and abstract
   text = "Example title and abstract of a scientific paper."

   # Tokenize the input text
   inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

   # Make predictions
   with torch.no_grad():
       logits = model(**inputs).logits

   # Apply sigmoid activation to get probabilities
   probabilities = torch.sigmoid(logits)

   # Get predicted labels (threshold at 0.5)
   predicted_labels = (probabilities >= 0.5).long().cpu().numpy()
   print(predicted_labels)
   ```
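
The prediction step yields a 0/1 matrix of label indices rather than category names. A minimal sketch of mapping probabilities back to names is shown below; the `decode_predictions` helper and the `id2label` mapping are hypothetical, and the example probabilities are made up.

```python
def decode_predictions(probabilities, id2label, threshold=0.5):
    """Return, for each input row, the names of labels whose probability meets the threshold."""
    decoded = []
    for row in probabilities:
        decoded.append(
            [id2label[index] for index, p in enumerate(row) if p >= threshold]
        )
    return decoded


# Hypothetical mapping and probabilities for two papers and three labels.
id2label = {0: "LS2", 1: "PE6", 2: "SH1"}
example_probabilities = [[0.91, 0.12, 0.50], [0.05, 0.40, 0.03]]

decoded = decode_predictions(example_probabilities, id2label)
print(decoded)
```

With the tensors from the steps above, this could be called as `decode_predictions(probabilities.tolist(), model.config.id2label)`. This assumes the published `config.json` stores the category names in `id2label`; if it does not, the mapping holds generic names such as `LABEL_0`, and the label order from training would be needed instead.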

## Conclusion

This model classifies scientific papers into multiple ERC categories based on their title and abstract. It is a SPECTER-based transformer fine-tuned on a real-world dataset of ERC-related scientific papers.