---
license: apache-2.0
datasets:
- synapti/nci-propaganda-production
base_model: answerdotai/ModernBERT-base
tags:
- transformers
- modernbert
- text-classification
- propaganda-detection
- binary-classification
- nci-protocol
library_name: transformers
pipeline_tag: text-classification
---

# NCI Binary Detector

Fast binary classifier that detects whether text contains propaganda techniques.

## Model Description

This model is **Stage 1** of the NCI (Narrative Credibility Index) two-stage propaganda detection pipeline:

- **Stage 1 (this model)**: Fast binary detection - "Does this text contain propaganda?"
- **Stage 2**: Multi-label technique classification - "Which specific techniques are used?"

The binary detector serves as a fast filter with high recall, passing flagged content to the more detailed technique classifier.

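
The filtering role described above can be sketched as a small helper. This is a minimal sketch, not the project's actual pipeline code: `filter_flagged` and its `threshold` parameter are hypothetical names, and the detector is passed in as a plain callable so the function works with a `transformers` pipeline or with a stub.

```python
from typing import Callable, Dict, List


def filter_flagged(
    texts: List[str],
    detector: Callable[[List[str]], List[Dict]],
    threshold: float = 0.5,  # assumed cut-off, not a documented value
) -> List[str]:
    """Return only the texts that Stage 1 labels as `has_propaganda`."""
    # The detector is expected to return one {"label", "score"} dict per text.
    predictions = detector(texts)
    return [
        text
        for text, pred in zip(texts, predictions)
        if pred["label"] == "has_propaganda" and pred["score"] >= threshold
    ]
```

With the real model, `detector` would be `pipeline("text-classification", model="synapti/nci-binary-detector")`, and the returned list is what would be handed to Stage 2. Raising `threshold` passes fewer texts on, trading recall for precision.
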
## Labels

| Label | Description |
|-------|-------------|
| `no_propaganda` | Text does not contain propaganda techniques |
| `has_propaganda` | Text contains one or more propaganda techniques |

## Performance

**Test Set Results:**

| Metric | Score |
|--------|-------|
| Accuracy | 99.5% |
| F1 Score | 99.6% |
| Precision | 99.2% |
| Recall | 100.0% |
| ROC AUC | 99.9% |

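
As a consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 score can be reproduced from the precision and recall figures. The snippet below is a worked calculation on the rounded values only, not a re-evaluation of the model; `f1_score` is a helper defined here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision, recall = 0.992, 1.000  # rounded values reported for the test set
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.1%}")  # F1 = 99.6%
```
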
## Usage

### Basic Usage

```python
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="synapti/nci-binary-detector"
)

text = "The radical left is DESTROYING our country!"
result = detector(text)[0]

print(f"Label: {result['label']}")  # 'has_propaganda' or 'no_propaganda'
print(f"Confidence: {result['score']:.2%}")
```

### Two-Stage Pipeline

For best results, use with the technique classifier:

```python
from transformers import pipeline

# Stage 1: Binary detection
detector = pipeline("text-classification", model="synapti/nci-binary-detector")

# Stage 2: Technique classification (only if propaganda detected)
# top_k=None returns a score for every technique label
classifier = pipeline("text-classification", model="synapti/nci-technique-classifier", top_k=None)

text = "Your text to analyze..."

# Quick check first
detection = detector(text)[0]
if detection["label"] == "has_propaganda" and detection["score"] > 0.5:
    # Detailed technique analysis
    techniques = classifier(text)[0]
    detected = [t for t in techniques if t["score"] > 0.3]
    for t in detected:
        print(f"{t['label']}: {t['score']:.2%}")
else:
    print("No propaganda detected")
```

## Training Data

Trained on [synapti/nci-propaganda-production](https://huggingface.co/datasets/synapti/nci-propaganda-production):

- **23,000+ examples** from multiple sources
- **Positive examples**: Text with 1+ propaganda techniques (from SemEval-2020, augmented data)
- **Hard negatives**: Factual content from LIAR2, QBias datasets
- **Class-weighted Focal Loss** to handle imbalance (gamma=2.0)

## Model Architecture

- **Base Model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)
- **Parameters**: 149.6M
- **Max Sequence Length**: 512 tokens
- **Output**: 2 labels (binary classification)

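
Because the model lists a maximum sequence length of 512 tokens, longer inputs are probably best truncated to that length before classification. A minimal sketch, assuming the detector is a `transformers` text-classification pipeline, which forwards tokenizer arguments such as `truncation` and `max_length` when they are given at call time; `classify_truncated` and `MAX_TOKENS` are hypothetical names.

```python
from typing import Any, Callable, Dict

MAX_TOKENS = 512  # max sequence length listed for this model


def classify_truncated(detector: Callable[..., Any], text: str) -> Dict[str, Any]:
    """Classify `text`, asking the tokenizer to cut it to MAX_TOKENS tokens."""
    # truncation/max_length are passed through the pipeline to the tokenizer.
    return detector(text, truncation=True, max_length=MAX_TOKENS)[0]
```

Truncation keeps only the beginning of a long document, so propaganda that appears later in the text would go unseen; splitting the text into chunks and classifying each one is an alternative this sketch does not cover.
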
## Training Details

- **Loss Function**: Focal Loss (gamma=2.0, alpha=0.25)
- **Optimizer**: AdamW
- **Learning Rate**: 2e-5
- **Batch Size**: 16 (effective 32 with gradient accumulation)
- **Epochs**: 5 with early stopping (patience=3)
- **Hardware**: NVIDIA A10G GPU

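
For reference, focal loss scales the usual cross-entropy term by `(1 - p_t) ** gamma`, which shrinks the contribution of examples the model already classifies confidently. The sketch below is a pure-Python illustration of the standard binary formulation using the listed `gamma=2.0` and `alpha=0.25`. How the actual training code applied `alpha` and class weights is not stated in this card, so treating `alpha` as the positive-class weight is an assumption, and `binary_focal_loss` is a hypothetical name.

```python
import math


def binary_focal_loss(p_pos: float, target: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Focal loss for a single example.

    p_pos:  predicted probability of the positive class (has_propaganda)
    target: 1 for has_propaganda, 0 for no_propaganda
    """
    p_t = p_pos if target == 1 else 1.0 - p_pos      # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha  # assumed weighting convention
    p_t = min(max(p_t, 1e-12), 1.0)                  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction contributes far less than an uncertain one.
easy = binary_focal_loss(0.95, target=1)
hard = binary_focal_loss(0.55, target=1)
print(f"easy={easy:.6f} hard={hard:.6f}")
```

With `gamma=0` and `alpha=1` the expression reduces to ordinary cross-entropy for a positive example, which is a convenient sanity check.
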
## Limitations

- Trained primarily on English text
- Works best on content similar to the training distribution (news articles, social media posts)
- May miss subtle or novel propaganda techniques that are absent from the training data
- Should be used alongside human review for high-stakes applications

## Related Models

- [synapti/nci-technique-classifier](https://huggingface.co/synapti/nci-technique-classifier) - Stage 2 multi-label technique classifier

## Citation

```bibtex
@inproceedings{da-san-martino-etal-2020-semeval,
  title = "{S}em{E}val-2020 Task 11: Detection of Propaganda Techniques in News Articles",
  author = "Da San Martino, Giovanni and others",
  booktitle = "Proceedings of SemEval-2020",
  year = "2020",
}

@misc{nci-binary-detector,
  author = {NCI Protocol Team},
  title = {NCI Binary Detector},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/synapti/nci-binary-detector}
}
```

## License

Apache 2.0