πŸ‡ΈπŸ‡΄ gpt2-large-somali-summarization-model

This is a Somali abstractive summarization model fine-tuned from the Limeso/GPT-2 Large Somali base model.

The model is designed to read a long Somali article (Qoraalka) and generate a concise, coherent summary (Soo koobid). This model utilizes a custom instruction-following format with specific control tokens to guide the generation process.
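To make the instruction format concrete, here is a minimal sketch of how a training example is laid out with the control tokens described on this card (the helper name format_training_example and the sample strings are illustrative, not part of the released code):

```python
# Control tokens as documented on this card
SEPARATOR_TOKEN = "<|sep|>"
EOSS_TOKEN = "<|eoss|>"

def format_training_example(article: str, summary: str) -> str:
    """Build one training string in the card's instruction format:
    'Qoraalka soo koob: <article><|sep|>Soo koobid: <summary><|eoss|>'"""
    return (
        f"Qoraalka soo koob: {article.strip()}"
        f"{SEPARATOR_TOKEN}Soo koobid: {summary.strip()}{EOSS_TOKEN}"
    )

example = format_training_example("Qoraal dheer oo Soomaali ah...", "Soo koobid gaaban.")
print(example)
```

At inference time the same prompt is used up to and including "Soo koobid: ", and generation is stopped at the `<|eoss|>` token.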

πŸš€ Model Usage

You can easily load and use this model for inference using the Hugging Face transformers library.

Installation

pip install torch transformers


Inference
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Configuration
HUB_MODEL_ID = "sahal1/gpt2-large-somali-summarization-model" 
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Special tokens MUST match the format used during fine-tuning
SEPARATOR_TOKEN = "<|sep|>"
EOSS_TOKEN = "<|eoss|>"

# 2. Load Model and Tokenizer
try:
    tokenizer = AutoTokenizer.from_pretrained(HUB_MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(HUB_MODEL_ID).to(DEVICE)
    model.eval()
    
    # Define token IDs
    eos_token_id = tokenizer.convert_tokens_to_ids(EOSS_TOKEN)
    # Use EOS token as PAD token if PAD token is not defined
    pad_token_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else eos_token_id 

except Exception as e:
    print(f"Error loading model: {e}")
    exit()

# 3. Generation Function
def summarize_somali(article_text, max_new_tokens=150, num_beams=4):
    # Construct the prompt exactly as used during training
    prompt = f"Qoraalka soo koob: {article_text.strip()}{SEPARATOR_TOKEN}Soo koobid: "
    
    inputs = tokenizer(
        prompt, 
        return_tensors="pt", 
        return_attention_mask=True,
        truncation=True,
        max_length=1024 
    ).to(DEVICE)
    
    with torch.no_grad():
        output_sequences = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=max_new_tokens,
            eos_token_id=eos_token_id,
            pad_token_id=pad_token_id,
            num_beams=num_beams, 
            early_stopping=True
        )
    
    # Decode only the generated part
    generated_token_ids = output_sequences[0][inputs["input_ids"].shape[-1]:]
    summary = tokenizer.decode(generated_token_ids, skip_special_tokens=True)
    
    return summary.strip()

# 4. Example Usage
article_text = """
Dad ka badan 20 qof oo iskugu jira rag iyo dumar ayaa ku sumoobay degmada Xarshin ee gobolka Faafan ee Dowlad Deegaanka Soomaalida Itoobiya kadib markii ay cuneen cunto sumeysan
Dadkaa ayaa oo ka shaqeynayey Berkad halkaas ku taal oo ay raggu qodayeen halka duamrkuna ay raashin u karinayeen kuwaas oo ku sumoobay raashin ay karsadeen sida lagasoo xigtay xafiiska caafimaadka ee degmda Xarshin ee gobolka Faafan
Cabdiraxman Jamaal Cabdi Madaxa Xafiiska Caafimaadka ee degmada Xarshin ee gobolka Faafan oo BBcda la hadlay ayaa sheegay in dadkaasi ay ku sumoobeen Bariis ay karsadeen oo caano lagu daray laakin waxa uu sheegay in waxa saxda ah ee ay ku sumoobeen aan la garaneynin in yihiin bariiksa ama caanaha .
Mar la weeydiiyay Cabdiraxmaan Jamaal sida ay ku xaqiijiyeen in dadku ay cuntadaas ku sumoobeen iyo in ay wax kale ku sumoobeen, waxa uu sheegay, in wixii ugu danbeeyay ee ay cunaan ay ahayd bariisksaasi caanaha lagu daray
Dadkaasi oo ka koobnaa 20 rag ah iyo afar dumara,waxay ahaayeen shaqaalle ka shaqeynayey berkad halkaasi laga dhisayey.
Cabdiraxman Jamaal Cabdi Madaxa Xafiiska Caafimaadka ee degmada Xarshin ayaa sintaasi ku daray in bariisku uu ahaa mid ay maalin walba karsanayeen taariikh ahaanna uusan ahayn mid dhacsan balse ay ka shakisanyihiin in canaaha ay ku darsadeen bariiska ay wax ka qaldanaayeen oo ay dadkaasi ku sumoobeen.
"""
# Strip periods, presumably to match the text normalization applied during fine-tuning
article_text = article_text.replace(".", "")

summary = summarize_somali(article_text)

print("\n--- Original Text Snippet ---")
print(article_text.strip()[:100] + "...")
print("\n--- Generated Summary ---")
print(summary)
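Note that summarize_somali truncates its input at 1024 tokens, so the tail of a very long article is silently dropped. One common mitigation is to summarize overlapping chunks and then summarize the concatenated chunk summaries. A minimal word-based chunker is sketched below (chunk_text, max_words, and overlap are illustrative names, not part of this model's API; the word counts are rough proxies for token counts):

```python
def chunk_text(text: str, max_words: int = 600, overlap: int = 50) -> list[str]:
    """Split a long article into overlapping windows of whitespace-delimited words.

    Overlap between consecutive chunks helps preserve context that would
    otherwise be cut at a chunk boundary.
    """
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)]
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk can then be passed to summarize_somali individually, and the resulting partial summaries joined and summarized once more.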
Model size: 0.8B parameters (F32, safetensors format).