Qwen3-VL-8B-Instruct Fine-tuned Model
This model is a fine-tuned version of Qwen/Qwen3-VL-8B-Instruct on a custom vision-language dataset.
Model Details
Model Description
- Model Type: Vision-Language Model
- Base Model: Qwen3-VL-8B-Instruct
- Fine-tuning Method: LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning (PEFT) method
- Training Framework: TRL (Transformer Reinforcement Learning) with SFTTrainer
- Language(s): English
- License: Apache 2.0
Training Details
Training Data
The model was fine-tuned on a custom dataset containing vision-language pairs designed for specific downstream tasks.
Training Procedure
- Training Framework: Hugging Face TRL (Transformer Reinforcement Learning)
- Fine-tuning Method: Supervised Fine-Tuning (SFT) with LoRA adapters
- Optimizer: AdamW
- Precision: Mixed precision training (bfloat16)
- Hardware: NVIDIA GPUs with distributed training support
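As a rough guide to what bfloat16 means for memory, the sketch below estimates the space needed for the weights alone. It assumes a nominal 8 billion parameters, read off the "8B" in the model name rather than measured from the checkpoint, and it ignores activations, the KV cache, and optimizer state. The function name `weight_memory_gib` is illustrative.

```python
# Bytes used to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params: int, dtype: str = "bfloat16") -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Assumption: nominal parameter count taken from the "8B" in the model name.
NOMINAL_PARAMS = 8_000_000_000

for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: {weight_memory_gib(NOMINAL_PARAMS, dtype):.1f} GiB")
```

Under this assumption, bfloat16 halves the weight footprint relative to float32, which is often what decides whether the model fits on a single GPU.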
Hyperparameters
The model was trained using LoRA with the following configuration (exact values are not recorded in this card):
- LoRA rank (r): Configured for efficient parameter updates
- LoRA alpha: Scaled for optimal learning
- Target modules: Attention and feed-forward layers
- Dropout: Applied for regularization
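To make the roles of rank and alpha concrete, here is a minimal sketch of how they affect a single adapted weight matrix in standard LoRA. All numbers are hypothetical and are not the values used to train this model; the helper names `lora_trainable_params` and `lora_scaling` are illustrative.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


# Hypothetical values, NOT taken from this model's training run.
r, alpha = 16, 32
d_in = d_out = 4096

full = d_in * d_out                             # parameters in the frozen weight
added = lora_trainable_params(d_in, d_out, r)   # parameters that are trained

print(f"frozen: {full}, trainable: {added}, ratio: {added / full:.4%}")
print(f"update scaling (alpha / r): {lora_scaling(alpha, r)}")
```

A larger rank adds trainable capacity linearly, while alpha only rescales the update, so the two are usually tuned together.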
Usage
Direct Use
This model can be used for vision-language understanding tasks, including:
- Image captioning
- Visual question answering
- Image-text retrieval
- Multimodal conversation
Example Code
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

MODEL_ID = "enpeizhao/qwen3-8b-instruct-trl-sft-20-bf16-merged"

# Load model and processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Load the image that the prompt refers to
image = Image.open("path/to/image.jpg")

# Prepare input
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate, then drop the prompt tokens so only the new text is decoded
generated_ids = model.generate(**inputs, max_new_tokens=512)
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
output = processor.batch_decode(new_tokens, skip_special_tokens=True)
print(output[0])
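The same message structure can serve the other tasks listed under Direct Use by changing only the text prompt. The helper below is a small sketch of that idea; `build_messages` is a hypothetical name, not part of `transformers`, and the question text is made up for illustration.

```python
def build_messages(image_path: str, prompt: str) -> list[dict]:
    """Build a single-turn chat request pairing one image with one text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Image captioning: ask for a free-form description.
caption_request = build_messages("path/to/image.jpg", "Describe this image in detail.")

# Visual question answering: ask a specific question (hypothetical example).
vqa_request = build_messages("path/to/image.jpg", "How many people are in this image?")

print(caption_request[0]["content"][1]["text"])
print(vqa_request[0]["content"][1]["text"])
```

Either list can be passed to `processor.apply_chat_template` in place of `messages` in the example above.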
Limitations and Bias
- This model inherits limitations from the base Qwen3-VL-8B-Instruct model
- Performance may vary depending on the domain and task
- The model's outputs should be evaluated for potential biases present in training data
Citation
If you use this model, please cite the original Qwen3-VL paper and acknowledge this fine-tuned version:
@article{qwen3vl2025,
  title={Qwen3-VL: A Versatile Vision-Language Model},
  author={Qwen Team},
  year={2025}
}
Model Card Authors
This model card was created as part of a fine-tuning project.
Model Card Contact
For questions and feedback about this fine-tuned model, please open an issue in the model repository.