Horizon 1

A larger and more modern variant of Constellation-One for Cockatoo, fine-tuned from answerdotai/ModernBERT-large.

This model is licensed under the Apache-2.0 license.

Note:

lmsys/toxic-chat is licensed under CC-BY-NC-4.0, so this model cannot legally be used for commercial purposes.

Hardware:

This model was fine-tuned on two NVIDIA A40s with a per-device batch size of 32 and gradient accumulation of 2, for an effective batch size of 32 × 2 × 2 = 128.

Fine-tuned on a dataset of 232k entries aggregated from:

- ealvaradob/phishing-dataset
- ucberkeley-dlab/measuring-hate-speech
- cardiffnlp/tweet_eval
- lmsys/toxic-chat
- tasksource/jigsaw_toxicity

Software:

Training was executed on the Cockatoo_ML_Training server. Metrics are publicly visible at Cockatoo.dev.

Techniques: label merging (merge_labels on conflict). There was no manual intervention in data sanitization before or after merging.

Asymmetric losses:

γ- = 3.5
γ+ = 0.5
clipping = 0.05
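These parameters correspond to the asymmetric loss of Ridnik et al. (2021). A minimal NumPy sketch with the values above (the function name and array shapes are illustrative, not the actual training code):

```python
import numpy as np

def asymmetric_loss(probs, targets, gamma_neg=3.5, gamma_pos=0.5, clip=0.05, eps=1e-8):
    """Asymmetric multi-label loss: focuses the negative term harder than
    the positive one and hard-discards very easy negatives via
    probability shifting (the `clip` margin)."""
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    probs_neg = np.clip(probs - clip, 0.0, 1.0)  # shift negatives down by `clip`
    loss_pos = targets * (1.0 - probs) ** gamma_pos * np.log(probs + eps)
    loss_neg = (1.0 - targets) * probs_neg ** gamma_neg * np.log(1.0 - probs_neg + eps)
    return float(-(loss_pos + loss_neg).mean())
```

With γ⁻ ≫ γ⁺, easy negatives contribute almost nothing to the gradient, which counteracts the heavy positive/negative imbalance typical of moderation labels.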

Optimizer:

AdamW

betas = (0.9, 0.999)
eps = 1e-8

(AdamW has no separate momentum parameter; the first beta, 0.9, plays that role.)

LLRD:

decay_factor = 0.98

Hyperparameters:

epoch = 3

batch_size = 32
gradient_accumulation = 2

learning_rate = 5e-5
weight_decay = 0.1
warmup_ratio = 0.1

fp16 = false
bf16 = true
tf32 = true

gradient_checkpointing = false
gradient_clipping = true
gradient_clipping_val = 1.0

attention_implementation = "flash_attention_2"

Available Labels:

"id2label": {
  "0": "scam",
  "1": "violence",
  "2": "harassment",
  "3": "hate_speech",
  "4": "toxicity",
  "5": "obscenity",
  "6": "genocide" # genocide is a new addition compared to Constellation
}
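A minimal sketch of turning the model's sigmoid outputs into these labels; the 0.5 threshold matches the default table below, and the example logits are made up:

```python
import math

ID2LABEL = {0: "scam", 1: "violence", 2: "harassment", 3: "hate_speech",
            4: "toxicity", 5: "obscenity", 6: "genocide"}

def decode(logits, threshold=0.5):
    """Multi-label decoding: sigmoid each logit independently and
    keep every label whose probability clears the threshold."""
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    return [ID2LABEL[i] for i, p in enumerate(probs) if p >= threshold]

# Example: only the first logit is confidently positive.
decode([3.0, -3.0, -3.0, -3.0, -3.0, -3.0, -3.0])  # → ["scam"]
```

Because each label is thresholded independently, a single message can carry several labels at once (e.g. both harassment and obscenity).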

Performance

All evaluation metrics use macro averaging and may deviate slightly from other reported figures due to discrepancies between evaluation runs. Metrics are computed on a zero-shot evaluation split (not present in the training data).

Horizon 1 achieves very high recall out of the box (0.94 raw) with precision comparable to Constellation (0.566 raw vs. 0.605).

However, this model really shines once the trigger thresholds are tuned:

Default:

| Category | Threshold | F1-Score |
|---|---|---|
| scam | 0.5 | 0.8758 |
| violence | 0.5 | 0.6891 |
| harassment | 0.5 | 0.8279 |
| hate_speech | 0.5 | 0.6581 |
| toxicity | 0.5 | 0.6430 |
| obscenity | 0.5 | 0.6428 |
| genocide | 0.5 | 0.5630 |
| Average | - | 0.7000 |

Tuned:

| Category | Threshold | F1-Score | Delta (vs. default) |
|---|---|---|---|
| scam | 0.7129 | 0.9131 | +0.0373 |
| violence | 0.6238 | 0.7252 | +0.0361 |
| harassment | 0.6535 | 0.8712 | +0.0433 |
| hate_speech | 0.6040 | 0.7082 | +0.0501 |
| toxicity | 0.6238 | 0.7371 | +0.0941 |
| obscenity | 0.6238 | 0.7309 | +0.0881 |
| genocide | 0.6337 | 0.5929 | +0.0299 |
| Average | - | 0.7541 | +0.0541 |
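Per-label thresholds like those above can be found by a simple grid search that maximizes F1 on a held-out split. A sketch (the data and grid here are illustrative, not the actual tuning code):

```python
import numpy as np

def tune_threshold(probs, labels, grid=None):
    """Grid-search the decision threshold that maximizes F1 for one label."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 91)
    probs, labels = np.asarray(probs), np.asarray(labels)
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        preds = probs >= t
        tp = int(np.sum(preds & (labels == 1)))
        fp = int(np.sum(preds & (labels == 0)))
        fn = int(np.sum(~preds & (labels == 1)))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1
```

Running this once per label (against that label's column of predicted probabilities) yields a threshold vector like the one in the tuned table.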

Comparison with Constellation One (tuned):

| Metric | Constellation One | Horizon 1 | Delta (H1 - C1) |
|---|---|---|---|
| Loss | 0.1603 | 0.0245 | -0.1358 |
| Overall Precision | 0.6940 | 0.6809 | -0.0131 |
| Overall Recall | 0.8151 | 0.8554 | +0.0403 |
| Overall F1 | 0.7475 | 0.7448 | -0.0027 |
| Scam Precision | 0.9255 | 0.9330 | +0.0075 |
| Scam Recall | 0.9467 | 0.9009 | -0.0459 |
| Scam F1 | 0.9360 | 0.9167 | -0.0194 |
| Violence Precision | 0.5141 | 0.6293 | +0.1152 |
| Violence Recall | 0.7191 | 0.8828 | +0.1637 |
| Violence F1 | 0.5995 | 0.7348 | +0.1353 |
| Harassment Precision | 0.8238 | 0.8329 | +0.0091 |
| Harassment Recall | 0.8830 | 0.9240 | +0.0410 |
| Harassment F1 | 0.8524 | 0.8761 | +0.0237 |
| Hate Speech Precision | 0.5607 | 0.5965 | +0.0358 |
| Hate Speech Recall | 0.6960 | 0.8652 | +0.1692 |
| Hate Speech F1 | 0.6211 | 0.7061 | +0.0850 |
| Toxicity Precision | 0.6891 | 0.6946 | +0.0056 |
| Toxicity Recall | 0.8025 | 0.7481 | -0.0544 |
| Toxicity F1 | 0.7415 | 0.7204 | -0.0211 |
| Obscenity Precision | 0.6507 | 0.6828 | +0.0321 |
| Obscenity Recall | 0.8431 | 0.7160 | -0.1271 |
| Obscenity F1 | 0.7345 | 0.6990 | -0.0355 |
| Genocide Precision | N/A | 0.3972 | N/A |
| Genocide Recall | N/A | 0.9511 | N/A |
| Genocide F1 | N/A | 0.5604 | N/A |

This model is more "trigger-happy" than Constellation One, though this can be mitigated in production by raising the thresholds (the current values are optimized for macro F1).

A newer version is planned to mitigate this behavior.

Resources:

Training/Inferencing server: https://github.com/DominicTWHV/Cockatoo_ML_Training/

Training Metrics: https://cockatoo.dev/ml-training.html

Datasets Used | Citations

| Dataset | License | Link |
|---|---|---|
| Phishing Dataset | MIT | Hugging Face |
| Measuring Hate Speech | CC-BY-4.0 | Hugging Face |
| Tweet Eval (SemEval-2019) | [See Citation]* | Hugging Face |
| Toxic Chat | CC-BY-NC-4.0 | Hugging Face |
| Jigsaw Toxicity | Apache-2.0 | Hugging Face |

Citation: ucberkeley-dlab/measuring-hate-speech

@article{kennedy2020constructing,
  title={Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application},
  author={Kennedy, Chris J and Bacon, Geoff and Sahn, Alexander and von Vacano, Claudia},
  journal={arXiv preprint arXiv:2009.10277},
  year={2020}
}

Citation: cardiffnlp/tweet_eval

@inproceedings{basile-etal-2019-semeval,
    title = "{S}em{E}val-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in {T}witter",
    author = "Basile, Valerio and Bosco, Cristina and Fersini, Elisabetta and Nozza, Debora and Patti, Viviana and Rangel Pardo, Francisco Manuel and Rosso, Paolo and Sanguinetti, Manuela",
    booktitle = "Proceedings of the 13th International Workshop on Semantic Evaluation",
    year = "2019",
    address = "Minneapolis, Minnesota, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/S19-2007",
    doi = "10.18653/v1/S19-2007",
    pages = "54--63"
}

Citation: lmsys/toxic-chat

@misc{lin2023toxicchat,
      title={ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation}, 
      author={Zi Lin and Zihan Wang and Yongqi Tong and Yangkun Wang and Yuxin Guo and Yujia Wang and Jingbo Shang},
      year={2023},
      eprint={2310.17389},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}