A Hybrid Protocol for Large-Scale Semantic Dataset Generation in Low-Resource Languages: The Turkish Semantic Relations Corpus
Abstract
A hybrid methodology combining FastText embeddings, clustering, and AI classification generates a large-scale Turkish semantic relations dataset with high accuracy validation.
We present a hybrid methodology for generating large-scale semantic relationship datasets in low-resource languages, demonstrated through a comprehensive Turkish semantic relations corpus. Our approach integrates three phases: (1) FastText embeddings with Agglomerative Clustering to identify semantic clusters, (2) Gemini 2.5-Flash for automated semantic relationship classification, and (3) integration with curated dictionary sources. The resulting dataset comprises 843,000 unique Turkish semantic pairs across three relationship types (synonyms, antonyms, co-hyponyms) representing a 10x scale increase over existing resources at minimal cost ($65). We validate the dataset through two downstream tasks: an embedding model achieving 90% top-1 retrieval accuracy and a classification model attaining 90% F1-macro. Our scalable protocol addresses critical data scarcity in Turkish NLP and demonstrates applicability to other low-resource languages. We publicly release the dataset and models.
Community
Addressing data scarcity in low-resource languages, this paper introduces a cost-effective ($65) pipeline for generating large-scale semantic datasets. By integrating FastText clustering, Gemini 2.5-Flash labeling, and dictionary curation, the authors release a Turkish corpus of 843,000 pairs (synonyms, antonyms, co-hyponyms), achieving 90% F1-macro on classification benchmarks.
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper