AnnyNguyen committed on
Commit
981aabb
verified
1 Parent(s): 92c0af7

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +103 -31
README.md CHANGED
@@ -1,58 +1,130 @@
  ---
- language: vi
  tags:
- - spam-detection
  - vietnamese
- - bartpho
- license: apache-2.0
  datasets:
- - visolex/ViSpamReviews
  metrics:
  - accuracy
- - f1
  model-index:
  - name: bartpho-spam-binary
    results:
    - task:
        type: text-classification
-       name: Spam Detection (Binary)
      dataset:
        name: ViSpamReviews
-       type: custom
      metrics:
-     - name: Accuracy
-       type: accuracy
-       value: <INSERT_ACCURACY>
-     - name: F1 Score
-       type: f1
-       value: <INSERT_F1_SCORE>
- base_model:
- - vinai/bartpho-syllable
- pipeline_tag: text-classification
  ---

- # BARTPho-Spam-Binary

- Fine-tuned from [`vinai/bartpho-syllable`](https://huggingface.co/vinai/bartpho-syllable) on **ViSpamReviews** (binary).

- * **Task**: Binary classification
- * **Dataset**: [ViSpamReviews](https://huggingface.co/datasets/visolex/ViSpamReviews)
- * **Hyperparameters**

- * Batch size: 32
- * LR: 3e-5
- * Epochs: 100
- * Max seq len: 256
  ## Usage

  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification

- tokenizer = AutoTokenizer.from_pretrained("visolex/bartpho-spam-binary")
- model = AutoModelForSequenceClassification.from_pretrained("visolex/bartpho-spam-binary")

- text = "Review này không có thật."
  inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
- pred = model(**inputs).logits.argmax(dim=-1).item()
- print("Spam" if pred==1 else "Non-spam")
  ```

  ---
+ license: apache-2.0
+ base_model: vinai/bartpho-syllable
  tags:
  - vietnamese
+ - spam-detection
+ - text-classification
+ - e-commerce
  datasets:
+ - ViSpamReviews
  metrics:
  - accuracy
+ - macro-f1
+ - macro-precision
+ - macro-recall
  model-index:
  - name: bartpho-spam-binary
    results:
    - task:
        type: text-classification
+       name: Spam Review Detection
      dataset:
        name: ViSpamReviews
+       type: ViSpamReviews
      metrics:
+     - type: accuracy
+       value: N/A
+     - type: macro-f1
+       value: N/A
  ---
+ # bartpho-spam-binary: Spam Review Detection for Vietnamese Text
+
+ This model fine-tunes [vinai/bartpho-syllable](https://huggingface.co/vinai/bartpho-syllable) on the **ViSpamReviews** dataset to detect spam in Vietnamese e-commerce reviews.
+
+ ## Model Details
+
+ * **Base Model**: `vinai/bartpho-syllable`
+ * **Description**: BARTpho, a pre-trained sequence-to-sequence (BART) model for Vietnamese
+ * **Dataset**: ViSpamReviews (Vietnamese Spam Review Dataset)
+ * **Fine-tuning Framework**: Hugging Face Transformers
+ * **Task**: Spam Review Detection (binary)
+ * **Number of Classes**: 2
+
+ ### Hyperparameters
+
+ * Max sequence length: `256`
+ * Learning rate: `5e-5`
+ * Batch size: `32`
+ * Epochs: up to `100`
+ * Early stopping patience: `5`
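
With a patience of 5, training will usually end well before the 100-epoch ceiling. The sketch below illustrates the general patience rule only. The card does not say which metric was monitored or how the check was implemented, so the function name `epochs_run`, the "higher is better" score, and the validation curve are all assumptions made for illustration. (In Transformers, the comparable knob is the `early_stopping_patience` argument of `EarlyStoppingCallback`.)

```python
def epochs_run(val_scores, patience=5, max_epochs=100):
    """Return the epoch at which patience-based early stopping ends training.

    val_scores: one validation score per epoch, higher is better
    (hypothetical values; the monitored metric is an assumption).
    """
    best = float("-inf")
    stale = 0  # consecutive epochs without a new best score
    for epoch, score in enumerate(val_scores[:max_epochs], start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch  # stop: no improvement for `patience` epochs
    return min(len(val_scores), max_epochs)


# Made-up curve: three improving epochs, then a plateau below the best score.
scores = [0.70, 0.78, 0.81] + [0.80] * 20
stopped_at = epochs_run(scores)
print(f"Training stopped after epoch {stopped_at}")
```

Under this rule the best checkpoint is the one from five epochs before the stop, which is why early stopping is normally paired with "keep the best model" logic.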
+
+ ## Dataset
+
+ The model was trained on the **ViSpamReviews** dataset, which contains 19,860 Vietnamese e-commerce reviews, split as follows:
+
+ * **Train set**: 14,299 samples (72%)
+ * **Validation set**: 1,590 samples (8%)
+ * **Test set**: 3,971 samples (20%)
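
As a quick consistency check on the split sizes listed above, the counts can be summed and converted to rounded percentages. The variable names are illustrative; the numbers are the ones stated in the card.

```python
# Split sizes as stated in the model card.
splits = {"train": 14_299, "validation": 1_590, "test": 3_971}

total = sum(splits.values())
# Share of each split, rounded to the nearest whole percent.
shares = {name: round(100 * n / total) for name, n in splits.items()}

print(total)
print(shares)
```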
+
+ ### Labels
+
+
+ * **Non-spam** (0): Genuine product reviews
+ * **Spam** (1): Fake or promotional reviews

+ ## Results

+ The model was evaluated on the test set with the following metrics:

+ * Results: <INSERT_METRICS>
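
The front matter lists accuracy plus macro-averaged precision, recall, and F1, but the card reports no values yet. For readers unfamiliar with macro averaging, here is a minimal pure-Python sketch of how those metrics are defined for the two labels. The helper `macro_prf` and the toy labels are hypothetical and are not the evaluation code or results behind this model.

```python
def macro_prf(y_true, y_pred, labels=(0, 1)):
    """Macro-averaged (precision, recall, F1): per-class scores, then a plain mean."""
    per_class = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class.append((prec, rec, f1))
    return tuple(sum(m[i] for m in per_class) / len(labels) for i in range(3))


# Toy labels for illustration only: 0 = Non-spam, 1 = Spam.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = macro_prf(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro-P={precision:.3f} macro-R={recall:.3f} macro-F1={f1:.3f}")
```

Because each class counts equally in the mean, macro-F1 falls below accuracy when the minority class is predicted poorly, which makes it a useful companion metric if the two labels are imbalanced.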
 
 

  ## Usage

+ The example below classifies a single Vietnamese review and prints the predicted label with its softmax confidence:
+
  ```python
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load model and tokenizer
+ model_name = "visolex/bartpho-spam-binary"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)

+ # Example review text ("This product is very good, the shop delivers fast!")
+ text = "Sản phẩm này rất tốt, shop giao hàng nhanh!"

+ # Tokenize
  inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
+
+ # Predict
+ with torch.no_grad():
+     outputs = model(**inputs)
+     predicted_class = outputs.logits.argmax(dim=-1).item()
+     probabilities = torch.softmax(outputs.logits, dim=-1)
+
+
+ # Map to label
+ label_map = {0: "Non-spam", 1: "Spam"}
+ predicted_label = label_map[predicted_class]
+ confidence = probabilities[0][predicted_class].item()
+
+ print(f"Text: {text}")
+ print(f"Predicted: {predicted_label} (confidence: {confidence:.2%})")
+
  ```
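
To make the last step of the usage example concrete without downloading the model, the sketch below reproduces the logits-to-label mapping in plain Python. The label map comes from the card. The helper name `decode` and the logit values are made up for illustration and are not output from this model.

```python
import math

LABEL_MAP = {0: "Non-spam", 1: "Spam"}  # label ids as documented in the card


def decode(logits):
    """Turn one row of class logits into (label, confidence) via softmax + argmax."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_MAP[best], probs[best]


label, confidence = decode([-1.2, 2.3])  # hypothetical logits
print(f"Predicted: {label} (confidence: {confidence:.2%})")
```

The confidence here is simply the softmax probability of the winning class. It is not a calibrated estimate, so treat it as a ranking signal unless calibration has been checked on held-out data.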
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{bartpho_spam_detection,
+   title={bartpho-spam-binary: Spam Review Detection for Vietnamese Text},
+   author={ViSoLex Team},
+   year={2025},
+   howpublished={\url{https://huggingface.co/visolex/bartpho-spam-binary}}
+ }
+ ```
+
+ ## License
+
+ This model is released under the Apache-2.0 license.
+
+ ## Acknowledgments
+
+ * Base model: [vinai/bartpho-syllable](https://huggingface.co/vinai/bartpho-syllable)
+ * Dataset: ViSpamReviews (Vietnamese Spam Review Dataset)
+ * ViSoLex Toolkit for Vietnamese NLP