---
license: cc-by-4.0
datasets:
- DSL-13-SRMAP/Telugu-Dataset
language:
- te
tags:
- sentiment-analysis
- text-classification
- telugu
- multilingual
- xlm-roberta
- baseline
base_model: xlm-roberta-base
pipeline_tag: text-classification
metrics:
- accuracy
- f1
- auroc
---

# XLM-R_WOR

## Model Description

**XLM-R_WOR** is a Telugu sentiment classification model built on **XLM-RoBERTa (XLM-R)** (Conneau et al., 2019), a large-scale multilingual Transformer developed by Facebook AI. XLM-R is designed to strengthen cross-lingual understanding by pretraining on a substantially larger and more diverse corpus than mBERT.

The base model is pretrained on approximately **2.5 TB of filtered Common Crawl data** covering **100 languages**, including Telugu. Unlike mBERT, XLM-R is trained **exclusively with the Masked Language Modeling (MLM) objective** and drops the Next Sentence Prediction (NSP) task, a design choice that yields stronger contextual representations and improved transfer learning.

The suffix **WOR** denotes **Without Rationale supervision**: the model is fine-tuned on sentiment labels alone, with no human-annotated rationales, and serves as a **label-only baseline**.

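For quick experimentation, a minimal inference sketch with the Hugging Face `transformers` text-classification pipeline is shown below. The repository id `Raj411/XLM-R_WOR` and the example sentence are placeholders the card does not specify; substitute the actual checkpoint path and inspect `model.config.id2label` for the label names.

```python
from transformers import pipeline

# Placeholder repository id -- substitute the actual checkpoint path.
MODEL_ID = "Raj411/XLM-R_WOR"

# Loads the fine-tuned tokenizer and classification head together.
classifier = pipeline("text-classification", model=MODEL_ID)

# Illustrative Telugu input ("This movie is very good").
print(classifier("ఈ సినిమా చాలా బాగుంది"))
# -> [{'label': ..., 'score': ...}] with labels from the checkpoint's id2label map
```
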
---

## Pretraining Details

- **Pretraining corpus:** Filtered Common Crawl (≈2.5 TB, 100 languages)
- **Training objective:** Masked Language Modeling (MLM) only; a short illustration follows this list
- **Next Sentence Prediction:** Not used
- **Language coverage:** Telugu included, but not exclusively targeted

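To make the MLM objective concrete, here is a small fill-mask sketch with the base `xlm-roberta-base` checkpoint (before any sentiment fine-tuning); the example sentence is illustrative only.

```python
from transformers import pipeline

# The base model, before sentiment fine-tuning; XLM-R uses the literal
# RoBERTa-style mask token "<mask>".
unmasker = pipeline("fill-mask", model="xlm-roberta-base")

# MLM asks the model to predict a distribution over the vocabulary
# for the masked slot -- the only objective used during pretraining.
for pred in unmasker("Hyderabad is the capital city of <mask>.")[:3]:
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```
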
---

## Training Data

- **Fine-tuning dataset:** DSL-13-SRMAP/Telugu-Dataset
- **Task:** Sentiment classification
- **Supervision type:** Label-only (no rationale supervision); a fine-tuning sketch follows this list

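A minimal sketch of this label-only setup with the `datasets`/`transformers` Trainer API. The column names (`text`, `label`), the three-class label space, and all hyperparameters are assumptions for illustration; the card does not specify them.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Dataset id from the card's metadata; "text"/"label" column names are assumed.
ds = load_dataset("DSL-13-SRMAP/Telugu-Dataset")

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize(batch):
    # Only the sentence and its sentiment label are used -- no rationales.
    return tokenizer(batch["text"], truncation=True, max_length=128)

ds = ds.map(tokenize, batched=True)

# num_labels=3 (e.g. negative/neutral/positive) is an assumption.
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3
)

args = TrainingArguments(
    output_dir="xlmr-wor-telugu",
    learning_rate=2e-5,            # typical XLM-R fine-tuning value, not from the card
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds["train"],
    eval_dataset=ds.get("validation"),
    tokenizer=tokenizer,           # enables default dynamic padding
)
trainer.train()
```
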
---

## Intended Use

This model is intended for:

- Telugu sentiment classification
- Cross-lingual and multilingual NLP benchmarking
- Baseline comparisons for explainability and rationale-supervision studies
- Low-resource Telugu NLP research

Due to its large-scale multilingual pretraining, XLM-R_WOR is particularly effective for transfer-learning scenarios where Telugu-specific labeled data is limited.

---

## Performance Characteristics

XLM-R generally provides stronger contextual modeling and better downstream performance than mBERT, owing to its larger, more diverse pretraining corpus and its exclusive focus on the MLM objective.

### Strengths

- Strong cross-lingual transfer learning
- Improved contextual representations over mBERT
- Reliable baseline for multilingual sentiment analysis

### Limitations

- Not explicitly optimized for Telugu morphology or syntax
- May underperform Telugu-specialized models such as MuRIL or L3Cube-Telugu-BERT
- Limited ability to capture fine-grained cultural and regional linguistic nuances

---

## Use as a Baseline

**XLM-R_WOR** serves as a robust and widely accepted baseline for:

- Comparing multilingual models against Telugu-specialized architectures
- Evaluating the impact of rationale supervision (WOR vs. WR)
- Benchmarking sentiment classification performance in low-resource Telugu settings (a metric-computation sketch follows this list)

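The metadata lists accuracy, F1, and AUROC as evaluation metrics. Below is a small scikit-learn sketch of how they are typically computed; the three-class setup and the arrays are dummy data, not results from the card.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Illustrative predictions for an assumed 3-class setup (neg/neu/pos).
y_true = np.array([0, 2, 1, 2, 0])
y_prob = np.array([              # per-class softmax probabilities
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.6, 0.2],
    [0.2, 0.2, 0.6],
    [0.5, 0.3, 0.2],
])
y_pred = y_prob.argmax(axis=1)   # hard labels for accuracy and F1

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
# Multiclass AUROC needs the probabilities, here with one-vs-rest averaging.
print("AUROC:", roc_auc_score(y_true, y_prob, multi_class="ovr"))
```
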
---

## References

- Conneau et al., 2019
- Hedderich et al., 2021
- Kulkarni et al., 2021
- Joshi, 2022
- Das et al., 2022
- Rajalakshmi et al., 2023