Overview
You can continue training Farshid/roberta-large-financial-phrasebank-allagree1 on your own POSITIVE/NEGATIVE/NEUTRAL labels by treating it as a standard sequence‐classification fine-tuning task. The high-level steps are:
- Prepare your labeled data in a Hugging Face Dataset or as a CSV/JSONL with columns text and label.
- Load the pretrained model with num_labels=3 and its tokenizer.
- Tokenize your data via dataset.map().
- Configure TrainingArguments (learning rate, batch size, epochs, output directory).
- Instantiate a Trainer with the model, tokenized datasets, and metric callbacks.
- Call trainer.train() to fine-tune, then trainer.save_model() to persist your new checkpoint.
Optionally, for faster or memory-efficient training, you can use LoRA (PEFT) to update only low-rank adapter weights instead of the full model.
1. Prepare Your Human-Labeled Dataset
Your labels must map to integers 0, 1, 2 (e.g. NEGATIVE=0, NEUTRAL=1, POSITIVE=2). You can:
Use the datasets library to load a CSV/JSONL:

from datasets import load_dataset

dataset = load_dataset("csv", data_files="my_finance_news.csv")
# expected columns: ["text", "label"]

Or convert an existing pandas DataFrame:

import pandas as pd
from datasets import Dataset

df = pd.read_csv("my_finance_news.csv")
dataset = Dataset.from_pandas(df)

(Fine-tuning a model with the Trainer API - Hugging Face NLP Course)
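If your CSV stores labels as strings rather than the integers above, map them to ids and carve out a validation split before tokenizing, since the Trainer setup below expects a "validation" split. A minimal sketch, assuming the label strings NEGATIVE/NEUTRAL/POSITIVE and an arbitrary 90/10 split:

from datasets import load_dataset, DatasetDict

label2id = {"NEGATIVE": 0, "NEUTRAL": 1, "POSITIVE": 2}  # assumed label strings

dataset = load_dataset("csv", data_files="my_finance_news.csv")

# Convert string labels to the integer ids the model expects.
dataset = dataset.map(lambda ex: {"label": label2id[ex["label"]]})

# Hold out 10% of the training data as a validation split.
split = dataset["train"].train_test_split(test_size=0.1, seed=42)
dataset = DatasetDict({"train": split["train"], "validation": split["test"]})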
2. Load the Pretrained Model & Tokenizer
Leverage the Hugging Face Transformers API:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "Farshid/roberta-large-financial-phrasebank-allagree1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3,
)
This ensures the classification head is set up for three classes. (Text classification - Hugging Face)
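The checkpoint already carries a three-class head, but it is worth confirming that its label order matches your own encoding before training. A quick check; the explicit mapping passed below is an assumption you should align with your data:

print(model.config.id2label)  # inspect the label order the checkpoint ships with

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3,
    id2label={0: "NEGATIVE", 1: "NEUTRAL", 2: "POSITIVE"},
    label2id={"NEGATIVE": 0, "NEUTRAL": 1, "POSITIVE": 2},
)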
3. Tokenize the Dataset
Use the tokenizer to preprocess the text (padding/truncation):
def preprocess(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

tokenized = dataset.map(preprocess, batched=True)
This applies tokenization across splits efficiently. (Fine-tuning - Hugging Face)
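Instead of padding every example to max_length, you can also pad dynamically per batch with a data collator, which is usually faster for short texts. A sketch, assuming you then pass data_collator=data_collator to the Trainer:

from transformers import DataCollatorWithPadding

# Tokenize without padding; the collator pads each batch to its longest sequence.
def preprocess(examples):
    return tokenizer(examples["text"], truncation=True, max_length=128)

tokenized = dataset.map(preprocess, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)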
4. Configure TrainingArguments
Decide hyperparameters and output paths:
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./finetuned-roberta",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    logging_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
You can adjust the number of epochs, batch size, and learning rate based on your dataset size. Note that in recent transformers releases the evaluation_strategy argument has been renamed to eval_strategy. (Text classification - Hugging Face)
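roberta-large is a fairly heavy model, so if you run out of GPU memory you can trade batch size for gradient accumulation and enable mixed precision. A sketch under those assumptions; the exact values are placeholders to tune for your hardware:

training_args = TrainingArguments(
    output_dir="./finetuned-roberta",
    num_train_epochs=3,
    per_device_train_batch_size=8,   # smaller physical batch
    gradient_accumulation_steps=2,   # effective batch size of 16
    learning_rate=2e-5,
    fp16=True,                       # mixed precision on supported GPUs
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    logging_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)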
5. Define Metrics & Instantiate Trainer
Compute accuracy on the validation set. The standalone evaluate library replaces the now-deprecated datasets.load_metric:

import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=preds, references=labels)
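Accuracy alone can be misleading if your classes are imbalanced, so you may also want to track macro-F1. A sketch that extends the metric function above; returning both values in one dict is just one option:

import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1_macro": f1.compute(predictions=preds, references=labels, average="macro")["f1"],
    }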
Create the Trainer:
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
The Trainer handles the training loop, evaluation, saving, and logging. (Fine-tuning a model with the Trainer API - Hugging Face NLP Course)
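Since load_best_model_at_end and metric_for_best_model are already configured, you can also stop training early once the validation metric stops improving. A sketch; the patience of 3 evaluations is an assumption:

from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)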
6. Fine-Tune and Save Your Model
Start training:
trainer.train()
trainer.save_model("./finetuned-roberta")
After this, you’ll have a checkpoint tuned on your human labels. (Fine-tune Hugging Face models for a single GPU)
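Once saved, you can sanity-check the new checkpoint with a quick inference pass. A sketch using the text-classification pipeline; the example sentence and the printed output are illustrative only:

from transformers import pipeline

clf = pipeline("text-classification", model="./finetuned-roberta")
print(clf("Quarterly revenue rose 12% year over year, beating analyst estimates."))
# e.g. [{'label': 'POSITIVE', 'score': 0.97}]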
7. (Optional) Parameter-Efficient Fine-Tuning with LoRA
If GPU memory or training time is constrained, use the PEFT library’s LoRA adapters to update only low-rank matrices:
from peft import LoraConfig, TaskType, get_peft_model

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,        # sequence classification; keeps the classifier head trainable
    r=16,                              # rank of the low-rank update matrices
    lora_alpha=32,                     # scaling factor for the LoRA updates
    target_modules=["query", "value"], # attention projections to adapt in RoBERTa
    lora_dropout=0.05,
    bias="none",
)

peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()  # shows how few parameters are actually trained
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
LoRA reduces the number of trainable parameters drastically while maintaining comparable performance. (Efficient Large Language Model training with LoRA and Hugging Face)
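After LoRA training you can either save just the small adapter and reattach it to the base model later, or merge the adapter weights into the base model for standalone deployment. A sketch; the paths are placeholders:

from peft import PeftModel

# Save only the adapter weights (a few MB instead of the full model).
peft_model.save_pretrained("./finetuned-roberta-lora")

# Later: reload the base model and attach the saved adapter.
base = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
peft_model = PeftModel.from_pretrained(base, "./finetuned-roberta-lora")

# Or merge the adapter into the base weights and save a standalone checkpoint.
merged = peft_model.merge_and_unload()
merged.save_pretrained("./finetuned-roberta-merged")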
References
- Fine-tuning guide, Hugging Face Transformers docs (Fine-tuning - Hugging Face)
- Text classification task, Transformers docs (Text classification - Hugging Face)
- Hugging Face NLP Course: Trainer, Hugging Face (Fine-tuning a model with the Trainer API - Hugging Face NLP Course)
- Sequence classification, Transformers task recipes (Text classification - Hugging Face)
- Databricks fine-tuning notebook, Databricks docs (Fine-tune Hugging Face models for a single GPU)
- LoRA conceptual guide, Hugging Face PEFT docs (LoRA - Hugging Face)
- PEFT/LoRA implementation, Phil Schmid blog (Efficient Large Language Model training with LoRA and Hugging Face)
- Fine-tune LLMs blog, HF blog (dvgodoy) (Fine-Tuning Your First Large Language Model (LLM) with PyTorch ...)
- Fine-tuning overview, Wikipedia (Fine-tuning (deep learning))
- Medium tutorial, Sandeep.ai (Text Classification Using Hugging Face (Fine-Tuning) - Medium)