Overview
You can continue training Farshid/roberta-large-financial-phrasebank-allagree1 on your own POSITIVE/NEGATIVE/NEUTRAL labels by treating it as a standard sequence‐classification fine-tuning task. The high-level steps are:
- Prepare your labeled data in a Hugging Face Dataset or as a CSV/JSONL with columns text and label.
- Load the pretrained model with num_labels=3 and its tokenizer.
- Tokenize your data via dataset.map().
- Configure TrainingArguments (learning rate, batch size, epochs, output directory).
- Instantiate a Trainer with the model, tokenized datasets, and metric callbacks.
- Call trainer.train() to fine-tune, then trainer.save_model() to persist your new checkpoint.
Optionally, for faster or memory-efficient training, you can use LoRA (PEFT) to update only low-rank adapter weights instead of the full model.
1. Prepare Your Human-Labeled Dataset
Your labels must map to integers 0, 1, 2 (e.g. NEGATIVE=0, NEUTRAL=1, POSITIVE=2). You can:
Use the datasets library to load a CSV/JSONL:

from datasets import load_dataset

dataset = load_dataset("csv", data_files="my_finance_news.csv")
# expected columns: ["text", "label"]

Or convert an existing pandas DataFrame:

import pandas as pd
from datasets import Dataset

df = pd.read_csv("my_finance_news.csv")
dataset = Dataset.from_pandas(df)

(Fine-tuning a model with the Trainer API - Hugging Face NLP Course)
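If your CSV stores labels as strings rather than the integers above, map them to ids and carve out a validation split before tokenizing, since the Trainer setup below expects a "validation" split. A minimal sketch, assuming the label strings NEGATIVE/NEUTRAL/POSITIVE and an arbitrary 90/10 split:

from datasets import load_dataset, DatasetDict

label2id = {"NEGATIVE": 0, "NEUTRAL": 1, "POSITIVE": 2}  # assumed label strings

dataset = load_dataset("csv", data_files="my_finance_news.csv")

# Convert string labels to the integer ids the model expects.
dataset = dataset.map(lambda ex: {"label": label2id[ex["label"]]})

# Hold out 10% of the training data as a validation split.
split = dataset["train"].train_test_split(test_size=0.1, seed=42)
dataset = DatasetDict({"train": split["train"], "validation": split["test"]})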
2. Load the Pretrained Model & Tokenizer
Leverage the Hugging Face Transformers API:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "Farshid/roberta-large-financial-phrasebank-allagree1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3,
)
This ensures the classification head is set up for three classes. (Text classification - Hugging Face)
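The checkpoint already carries a three-class head, but it is worth confirming that its label order matches your own encoding before training. A quick check; the explicit mapping passed below is an assumption you should align with your data:

print(model.config.id2label)  # inspect the label order the checkpoint ships with

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=3,
    id2label={0: "NEGATIVE", 1: "NEUTRAL", 2: "POSITIVE"},
    label2id={"NEGATIVE": 0, "NEUTRAL": 1, "POSITIVE": 2},
)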
3. Tokenize the Dataset
Use the tokenizer to preprocess the text (padding/truncation):
def preprocess(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

tokenized = dataset.map(preprocess, batched=True)
This applies tokenization across splits efficiently. (Fine-tuning - Hugging Face)
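Instead of padding every example to max_length, you can also pad dynamically per batch with a data collator, which is usually faster for short texts. A sketch, assuming you then pass data_collator=data_collator to the Trainer:

from transformers import DataCollatorWithPadding

# Tokenize without padding; the collator pads each batch to its longest sequence.
def preprocess(examples):
    return tokenizer(examples["text"], truncation=True, max_length=128)

tokenized = dataset.map(preprocess, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)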
4. Configure TrainingArguments
Decide hyperparameters and output paths:
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./finetuned-roberta",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    logging_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)
You can adjust the number of epochs, batch size, and learning rate based on your dataset size. Note that in recent transformers releases the evaluation_strategy argument has been renamed to eval_strategy. (Text classification - Hugging Face)
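roberta-large is a fairly heavy model, so if you run out of GPU memory you can trade batch size for gradient accumulation and enable mixed precision. A sketch under those assumptions; the exact values are placeholders to tune for your hardware:

training_args = TrainingArguments(
    output_dir="./finetuned-roberta",
    num_train_epochs=3,
    per_device_train_batch_size=8,   # smaller physical batch
    gradient_accumulation_steps=2,   # effective batch size of 16
    learning_rate=2e-5,
    fp16=True,                       # mixed precision on supported GPUs
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=1000,
    logging_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)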
5. Define Metrics & Instantiate Trainer
Compute accuracy on the validation set. The standalone evaluate library replaces the now-deprecated datasets.load_metric:

import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=preds, references=labels)
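Accuracy alone can be misleading if your classes are imbalanced, so you may also want to track macro-F1. A sketch that extends the metric function above; returning both values in one dict is just one option:

import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy.compute(predictions=preds, references=labels)["accuracy"],
        "f1_macro": f1.compute(predictions=preds, references=labels, average="macro")["f1"],
    }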
Create the Trainer:
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
The Trainer handles the training loop, evaluation, saving, and logging. (Fine-tuning a model with the Trainer API - Hugging Face NLP Course)
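Since load_best_model_at_end and metric_for_best_model are already configured, you can also stop training early once the validation metric stops improving. A sketch; the patience of 3 evaluations is an assumption:

from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)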
6. Fine-Tune and Save Your Model
Start training:
trainer.train()
trainer.save_model("./finetuned-roberta")
After this, you’ll have a checkpoint tuned on your human labels. (Fine-tune Hugging Face models for a single GPU)
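Once saved, you can sanity-check the new checkpoint with a quick inference pass. A sketch using the text-classification pipeline; the example sentence and the printed output are illustrative only:

from transformers import pipeline

clf = pipeline("text-classification", model="./finetuned-roberta")
print(clf("Quarterly revenue rose 12% year over year, beating analyst estimates."))
# e.g. [{'label': 'POSITIVE', 'score': 0.97}]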
7. (Optional) Parameter-Efficient Fine-Tuning with LoRA
If GPU memory or training time is constrained, use the PEFT library’s LoRA adapters to update only low-rank matrices:
from peft import LoraConfig, TaskType, get_peft_model

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,        # sequence classification; keeps the classifier head trainable
    r=16,                              # rank of the low-rank update matrices
    lora_alpha=32,                     # scaling factor for the LoRA updates
    target_modules=["query", "value"], # attention projections to adapt in RoBERTa
    lora_dropout=0.05,
    bias="none",
)

peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()  # shows how few parameters are actually trained
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
LoRA reduces the number of trainable parameters drastically while maintaining comparable performance. (Efficient Large Language Model training with LoRA and Hugging Face)
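After LoRA training you can either save just the small adapter and reattach it to the base model later, or merge the adapter weights into the base model for standalone deployment. A sketch; the paths are placeholders:

from peft import PeftModel

# Save only the adapter weights (a few MB instead of the full model).
peft_model.save_pretrained("./finetuned-roberta-lora")

# Later: reload the base model and attach the saved adapter.
base = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
peft_model = PeftModel.from_pretrained(base, "./finetuned-roberta-lora")

# Or merge the adapter into the base weights and save a standalone checkpoint.
merged = peft_model.merge_and_unload()
merged.save_pretrained("./finetuned-roberta-merged")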
References
- Fine-tuning guide, Hugging Face Transformers docs (Fine-tuning - Hugging Face)
- Text classification task, Transformers docs (Text classification - Hugging Face)
- Hugging Face NLP Course: Trainer, Hugging Face (Fine-tuning a model with the Trainer API - Hugging Face NLP Course)
- Sequence classification, Transformers task recipes (Text classification - Hugging Face)
- Databricks fine-tuning notebook, Databricks docs (Fine-tune Hugging Face models for a single GPU)
- LoRA conceptual guide, Hugging Face PEFT docs (LoRA - Hugging Face)
- PEFT/LoRA implementation, Phil Schmid blog (Efficient Large Language Model training with LoRA and Hugging Face)
- Fine-tune LLMs blog, HF blog (dvgodoy) (Fine-Tuning Your First Large Language Model (LLM) with PyTorch ...)
- Fine-tuning overview, Wikipedia (Fine-tuning (deep learning))
- Medium tutorial, Sandeep.ai (Text Classification Using Hugging Face (Fine-Tuning) - Medium)