---
tags:
- sentence-transformers
- sentence-similarity
- dataset_size:120000
- multilingual
base_model: Alibaba-NLP/gte-multilingual-base
widget:
- source_sentence: Who is filming along?
sentences:
- Wién filmt mat?
- >-
Weider huet den Tatarescu drop higewisen, datt Rumänien durch seng
krichsbedélegong op de 6eite vun den allie'erten 110.000 mann verluer hätt.
- Brambilla 130.08.03 St.
- source_sentence: 'Four potential scenarios could still play out: Jean Asselborn.'
sentences:
- >-
Dann ass nach eng Antenne hei um Kierchbierg virgesi Richtung RTL Gebai, do
gëtt jo een ganz neie Wunnquartier gebaut.
- >-
D'bedélegong un de wählen wir ganz stärk gewiéscht a munche ge'genden wor re
eso'gucr me' we' 90 prozent.
- Jean Asselborn gesäit 4 Méiglechkeeten, wéi et kéint virugoen.
- source_sentence: >-
Non-profit organisation Passerell, which provides legal counsel to refugees
in Luxembourg, announced that it has to make four employees redundant in
August due to a lack of funding.
sentences:
- Oetringen nach Remich....8.20» 215»
- >-
D'ASBL Passerell, déi sech ëm d'Berodung vu Refugiéeën a Saache Rechtsfroe
këmmert, wäert am August mussen hir véier fix Salariéen entloossen.
- D'Regierung huet allerdéngs "just" 180.041 Doudeger verzeechent.
- source_sentence: This regulation was temporarily lifted during the Covid pandemic.
sentences:
- Six Jours vu New-York si fir d’équipe Girgetti — Debacco
- Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.
- ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert.
- source_sentence: The cross-border workers should also receive more wages.
sentences:
- D'grenzarbechetr missten och me' lo'n kre'en.
- >-
De Néckel: Firun! Dât ass jo ailes, wèll 't get dach neischt un der Bréck
gemâcht!
- >-
D'Grande-Duchesse Josephine Charlotte an hir Ministeren hunn d'Land
verlooss, et war den Optakt vun der Zäit am Exil.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
model-index:
- name: >-
SentenceTransformer based on
Alibaba-NLP/gte-multilingual-base
results:
- task:
type: contemporary-lb
name: Contemporary-lb
dataset:
name: Contemporary-lb
type: contemporary-lb
metrics:
- type: accuracy
value: 0.6216
name: SIB-200(LB) accuracy
- type: accuracy
value: 0.6282
name: ParaLUX accuracy
- task:
type: bitext-mining
name: LBHistoricalBitextMining
dataset:
name: LBHistoricalBitextMining
type: lb-en
metrics:
- type: accuracy
value: 0.9683
name: LB<->FR accuracy
- type: accuracy
value: 0.9715
name: LB<->EN accuracy
- type: accuracy
value: 0.9793
name: LB<->DE accuracy
license: agpl-3.0
datasets:
- impresso-project/HistLuxAlign
- fredxlpy/LuxAlign
language:
- lb
---
# THIS IS A PREVIEW MODEL for the IMPRESSO HALLOWEEN WORKSHOP
This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) and further adapted to support Historical and Contemporary Luxembourgish. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for (cross-lingual) semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Model Details
This model is specialised for cross-lingual semantic search to and from Historical/Contemporary Luxembourgish. It is particularly useful for libraries and archives that want to perform semantic search and longitudinal studies within their collections.
This is an [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) model that was further adapted by Michail et al. (2025).
## Limitations
We also release a model that performs better (by 18 percentage points) on ParaLUX. If finding monolingual exact matches within adversarial collections is of utmost importance, please use [histlux-paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/impresso-project/histlux-paraphrase-multilingual-mpnet-base-v2).
### Model Description
- **Model Type:** GTE-Multilingual-Base
- **Base model:** [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base)
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:** See below
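These figures can be checked directly on a loaded model; a minimal sketch (expected values follow the specification above):
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "impresso-project/halloween_workshop_ocr_robust_with_lux_preview",
    trust_remote_code=True,
)

# Both values should match the specification listed above.
print(model.max_seq_length)                      # expected: 8192
print(model.get_sentence_embedding_dimension())  # expected: 768
```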
## Usage (Sentence-Transformers)
Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
```
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# trust_remote_code is required because the GTE architecture ships custom model code
model = SentenceTransformer('impresso-project/halloween_workshop_ocr_robust_with_lux_preview', trust_remote_code=True)

# One 768-dimensional embedding per input sentence
embeddings = model.encode(sentences)
print(embeddings)
```
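Because the embedding space is shared across languages, an English query can be scored directly against Luxembourgish candidates. A minimal sketch using `sentence_transformers.util.cos_sim`, with sentences taken from the examples above:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer(
    "impresso-project/halloween_workshop_ocr_robust_with_lux_preview",
    trust_remote_code=True,
)

query = "This regulation was temporarily lifted during the Covid pandemic."
candidates = [
    "Dës Reegelung gouf wärend der Covid-Pandemie ausgesat.",
    "ING-Marathon ouni gréisser Tëschefäll ofgelaf - 18 Leit hospitaliséiert.",
]

# Cosine similarity between the query and each candidate;
# the parallel Luxembourgish sentence should score highest.
scores = util.cos_sim(model.encode(query), model.encode(candidates))
print(scores)
```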
## Training Details
### Training Dataset
The parallel sentence data mix is the following:

[impresso-project/HistLuxAlign](https://huggingface.co/datasets/impresso-project/HistLuxAlign):
- LB-FR (20,000 pairs)
- LB-EN (20,000 pairs)
- LB-DE (20,000 pairs)

[fredxlpy/LuxAlign](https://huggingface.co/datasets/fredxlpy/LuxAlign):
- LB-FR (40,000 pairs)
- LB-EN (20,000 pairs)

Total: 120,000 sentence pairs, trained in mixed batches of size 8.
### Contrastive Training
The model was trained with the parameters:
**Loss**:

`sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
```
{'scale': 20.0, 'similarity_fct': 'cos_sim'}
```

Parameters of the fit()-Method:
```
{
    "epochs": 1,
    "evaluation_steps": 520,
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear"
}
```
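For reference, a minimal reproduction sketch of this setup with the legacy `fit()` API. The data loading is illustrative only (the pair shown is one of the widget examples above); the real training mix is the 120,000 pairs described in the Training Dataset section:
```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative parallel pair; the actual data is the
# HistLuxAlign / LuxAlign mix described above.
train_examples = [
    InputExample(texts=[
        "The cross-border workers should also receive more wages.",
        "D'grenzarbechetr missten och me' lo'n kre'en.",
    ]),
]

model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    evaluation_steps=520,
    max_grad_norm=1,
    optimizer_params={"lr": 2e-05},
    scheduler="WarmupLinear",
)
```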
## Citation
### BibTeX
#### Adapting Multilingual Embedding Models to Historical Luxembourgish (introducing paper)
```bibtex
@inproceedings{michail-etal-2025-adapting,
title = "Adapting Multilingual Embedding Models to Historical {L}uxembourgish",
author = "Michail, Andrianos and
Racl{\'e}, Corina and
Opitz, Juri and
Clematide, Simon",
editor = "Kazantseva, Anna and
Szpakowicz, Stan and
Degaetano-Ortlieb, Stefania and
Bizzoni, Yuri and
Pagel, Janis",
booktitle = "Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)",
month = may,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.latechclfl-1.26/",
doi = "10.18653/v1/2025.latechclfl-1.26",
pages = "291--298",
ISBN = "979-8-89176-241-1"
}
```
#### Original Multilingual GTE Model
```bibtex
@inproceedings{zhang2024mgte,
title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
pages={1393--1412},
year={2024}
}
```
## About Impresso
### Impresso project
[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.
### Copyright
Copyright (C) 2025 The Impresso team.
### License
This program is provided as open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.
---
<p align="center">
<img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p> |