SIRIS-Lab/erc-classifiers

Safetensors

bert

madisonchester committed on Oct 23, 2025

Commit 981d3ba (verified) · 1 Parent(s): e5d07c8

Update README.md

Files changed (1)

README.md +70 -162

@@ -1,199 +1,107 @@
----
-library_name: transformers
-tags: []
----
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

+# ERC Classifiers
+This repository contains a model trained for multi-label classification of scientific papers in the ERC (European Research Council) context. The model predicts multiple categories for a paper, such as its research domain or topic, based on the abstract and title.
+## Model Description
+The model is based on **SPECTER** (a transformer-based model pre-trained on scientific literature), fine-tuned for **multi-label classification** on a dataset of scientific papers. Given the title and abstract of a paper, it predicts one or more of the **ERC categories**.
+### Preprocessing
+The preprocessing pipeline involves:
+1. **Data Loading**: Papers are loaded from a Parquet file containing the title, abstract, and their respective categories.
+2. **Label Cleaning**: Labels (categories) are processed to remove any unnecessary information (like content within parentheses).
+3. **Label Encoding**: Categories are transformed into a binary matrix using the **MultiLabelBinarizer** from scikit-learn. Each category corresponds to a column, and the value is `1` if the paper belongs to that category, `0` otherwise.
+4. **Statistics and Visualization**: Basic statistics and visualizations, such as label distributions, are generated to help understand the dataset better.
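Steps 2 and 3 above can be sketched in plain Python. This is an illustration only, not the repository's preprocessing code: the label strings are hypothetical, the regular expression is one assumed way to strip parenthesised text, and `binarize` is a hand-written stand-in that mirrors what scikit-learn's `MultiLabelBinarizer` produces (one column per class, classes in sorted order).

```python
import re


def clean_label(label: str) -> str:
    """Remove any parenthesised text and surrounding whitespace."""
    return re.sub(r"\s*\([^)]*\)", "", label).strip()


def binarize(label_lists):
    """Return (classes, matrix) with one 0/1 column per sorted class."""
    classes = sorted({lab for labs in label_lists for lab in labs})
    index = {name: i for i, name in enumerate(classes)}
    matrix = []
    for labs in label_lists:
        row = [0] * len(classes)
        for lab in labs:
            row[index[lab]] = 1
        matrix.append(row)
    return classes, matrix


# Hypothetical raw labels; the real label format is an assumption.
raw = [
    ["PE6 (Computer Science and Informatics)", "LS2 (Genetics)"],
    ["PE6 (Computer Science and Informatics)"],
]
cleaned = [[clean_label(lab) for lab in labs] for labs in raw]
classes, matrix = binarize(cleaned)

print(classes)  # ['LS2', 'PE6']
print(matrix)   # [[1, 1], [0, 1]]
```

Each row of `matrix` is the multi-hot target vector for one paper, which is the shape a multi-label loss expects.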
+### Training
+The model is fine-tuned on the preprocessed dataset using the following setup:
+* **Base Model**: The model uses the `allenai/specter` transformer as the base model for sequence classification.
+* **Optimizer**: AdamW optimizer with a learning rate of `5e-5` is used.
+* **Loss Function**: Binary Cross-Entropy with logits (`BCEWithLogitsLoss`) is employed, as the task is multi-label classification.
+* **Epochs**: The model is trained for **5 epochs** with a batch size of 4.
+* **Training Data**: The model is trained on a processed dataset stored in `train_ready.parquet`.
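To make the loss in the list above concrete, here is a plain-Python restatement of what binary cross-entropy with logits computes for one paper, averaged over labels (the default `mean` reduction of PyTorch's `BCEWithLogitsLoss`). It is a worked formula, not the training code; the logits and targets are made-up values.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the numerically stable form
        max(x, 0) - x * y + log(1 + exp(-|x|))
    which equals -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))].
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# Hypothetical logits for three labels and their multi-hot targets.
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 1.0]
loss = bce_with_logits(logits, targets)
print(round(loss, 4))
```

Because each label gets its own independent sigmoid, a paper can be assigned several categories at once, which is why this loss is used rather than a softmax cross-entropy.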
+### Evaluation
+The model is evaluated using both **single-label** and **multi-label** metrics:
+#### Single-Label Evaluation
+* **Accuracy**: The accuracy is measured by checking how often the true label appears in the predicted labels.
+* **Precision, Recall, F1**: These metrics are calculated for each class and averaged for the entire dataset.
+#### Multi-Label Evaluation
+* **Micro and Macro Metrics**: Precision, recall, and F1 scores are computed with both micro-averaging (pooling every paper-label decision, so frequent labels dominate) and macro-averaging (the unweighted mean of per-label scores, so rare labels count equally).
+* **Label Frequency Plot**: A plot showing the frequency distribution of labels in the test set.
+* **Top and Bottom F1 Plot**: A plot visualizing the top and bottom labels based on their F1 scores.
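The difference between the two averaging modes is easiest to see on a toy example. The sketch below computes micro and macro F1 by hand on made-up predictions; it does not reproduce the repository's evaluation script, and it defines F1 as 0.0 when a label has no true or predicted positives.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def label_counts(y_true, y_pred, j):
    """True positives, false positives, false negatives for label column j."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
    return tp, fp, fn


# Toy data: label 0 is frequent and always right, label 1 is rare and always wrong.
y_true = [[1, 0], [1, 0], [1, 1], [1, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [1, 1]]

per_label = [label_counts(y_true, y_pred, j) for j in range(2)]

# Macro: average the per-label F1 scores.
macro_f1 = sum(f1(*c) for c in per_label) / len(per_label)

# Micro: pool the counts across labels, then compute one F1.
micro_f1 = f1(*(sum(col) for col in zip(*per_label)))

print(macro_f1)  # 0.5
print(micro_f1)  # 0.8
```

The frequent label pulls the micro score up to 0.8, while the macro score of 0.5 exposes the label the model never gets right.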
+## Dataset
+The dataset consists of scientific papers, each with the following columns:
+* **title**: The title of the paper.
+* **abstract**: The abstract of the paper.
+* **label**: A list of categories (labels) assigned to the paper.
+The dataset is preprocessed and stored in a `train_ready.parquet` file.
+## Files
+* `config.json`: Model configuration file.
+* `model.safetensors`: Saved fine-tuned model weights.
+* `tokenizer.json`: Serialized tokenizer (vocabulary and tokenization rules) for the fine-tuned model.
+* `tokenizer_config.json`: Tokenizer settings.
+* `special_tokens_map.json`: Special tokens used by the tokenizer.
+* `vocab.txt`: Vocabulary file for the fine-tuned tokenizer.
+## Usage
+To use the model, follow these steps:
+1. **Install Dependencies**:
+   ```bash
+   pip install transformers torch datasets
+   ```
+2. **Load the Model and Tokenizer**:
+   ```python
+   import torch
+   from transformers import AutoModelForSequenceClassification, AutoTokenizer
+   model_name = "SIRIS-Lab/erc-classifiers"
+   # Load fine-tuned model and tokenizer
+   model = AutoModelForSequenceClassification.from_pretrained(model_name)
+   tokenizer = AutoTokenizer.from_pretrained(model_name)
+   ```
+3. **Use the Model for Prediction**:
+   ```python
+   # Example paper title and abstract
+   text = "Example title and abstract of a scientific paper."
+   # Tokenize the input text
+   inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
+   # Make predictions
+   with torch.no_grad():
+       logits = model(**inputs).logits
+   # Apply sigmoid activation to get probabilities
+   probabilities = torch.sigmoid(logits)
+   # Get predicted labels (threshold at 0.5)
+   predicted_labels = (probabilities >= 0.5).long().cpu().numpy()
+   print(predicted_labels)
+   ```
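The prediction step above prints a 0/1 vector. A possible follow-up is to turn the logits into category names. The sketch below shows the idea in plain Python under two assumptions: the logits are available as a list of floats (for example via `logits[0].tolist()`), and an index-to-name mapping exists. The `id2label` dictionary here is hypothetical; whether the model's `config.json` carries a populated mapping is not confirmed by the README.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode(logits, id2label, threshold=0.5):
    """Return (label, probability) pairs whose probability meets the threshold."""
    probs = [sigmoid(x) for x in logits]
    return [(id2label[i], p) for i, p in enumerate(probs) if p >= threshold]


# Hypothetical mapping and logits for a three-label model.
id2label = {0: "LS2", 1: "PE6", 2: "SH1"}
logits = [-2.0, 1.5, 0.0]

predicted = decode(logits, id2label)
print([name for name, _ in predicted])  # ['PE6', 'SH1']
```

A logit of exactly 0.0 maps to a probability of 0.5 and is kept, matching the `>= 0.5` comparison in the usage snippet. The threshold can be tuned per label if precision and recall need to be traded off.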
+## Conclusion
+This model classifies scientific papers into multiple ERC categories based on their title and abstract. It builds on the SPECTER transformer and is fine-tuned on a dataset of ERC-related scientific papers.