Qwen3-VL-8B-Instruct Fine-tuned Model

This model is a fine-tuned version of Qwen/Qwen3-VL-8B-Instruct on a custom vision-language dataset.

Model Details

Model Description

Model Type: Vision-Language Model
Base Model: Qwen3-VL-8B-Instruct
Fine-tuning Method: LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning (PEFT) method
Training Framework: TRL (Transformer Reinforcement Learning) with SFTTrainer
Language(s): English
License: Apache 2.0

Training Details

Training Data

The model was fine-tuned on a custom dataset containing vision-language pairs designed for specific downstream tasks.

Training Procedure

Training Framework: Hugging Face TRL (Transformer Reinforcement Learning)
Fine-tuning Method: Supervised Fine-Tuning (SFT) with LoRA adapters
Optimizer: AdamW
Precision: Mixed precision training (bfloat16)
Hardware: NVIDIA GPUs with distributed training support
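The card names the optimizer and precision but not the remaining training arguments. As a rough sketch only, the settings above could be collected into keyword arguments for TRL's SFTConfig. Every value marked as an assumption below is hypothetical and is not taken from this model's actual training run.

```python
# Keyword arguments in the shape accepted by trl.SFTConfig (a subclass of
# transformers.TrainingArguments). Only "optim" and "bf16" reflect the card;
# everything else is an illustrative assumption.
sft_kwargs = {
    "output_dir": "qwen3-vl-sft-output",   # hypothetical path
    "optim": "adamw_torch",                # AdamW, as stated in the card
    "bf16": True,                          # bfloat16 mixed precision, as stated
    "learning_rate": 2e-4,                 # assumption, not from the card
    "per_device_train_batch_size": 1,      # assumption
    "gradient_accumulation_steps": 8,      # assumption
    "num_train_epochs": 1,                 # assumption
}


def effective_batch_size(kwargs, num_gpus):
    """Samples consumed per optimizer step across all data-parallel GPUs."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_gpus
    )


# With TRL installed, the dict could be passed on as:
#   from trl import SFTConfig
#   args = SFTConfig(**sft_kwargs)
print(effective_batch_size(sft_kwargs, num_gpus=4))
```

The helper makes explicit how the per-device batch size, gradient accumulation, and GPU count combine in distributed training. The GPU count of 4 is likewise only an example.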

Hyperparameters

The model was trained using LoRA with the following configuration:

LoRA rank (r): Configured for efficient parameter updates
LoRA alpha: Scaled for optimal learning
Target modules: Attention and feed-forward layers
Dropout: Applied for regularization
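The card does not publish concrete LoRA values, so the following is a minimal sketch with hypothetical numbers, shown only to make the four settings above concrete. The key names match the arguments of peft.LoraConfig. The module names are an assumption based on the projection names commonly found in Qwen-style decoder blocks and should be checked against model.named_modules().

```python
# Hypothetical LoRA settings; none of these values come from the model card.
lora_kwargs = {
    "r": 16,                # LoRA rank (assumption)
    "lora_alpha": 32,       # LoRA alpha (assumption)
    "lora_dropout": 0.05,   # dropout on the adapter input (assumption)
    "target_modules": [     # attention and feed-forward projections (assumed names)
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}


def lora_added_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in linear layer.

    LoRA learns A (r x d_in) and B (d_out x r); the base weight W stays frozen.
    """
    return r * (d_in + d_out)


def lora_scaling(lora_alpha, r):
    """Standard LoRA scaling: the update applied to W is (lora_alpha / r) * B @ A."""
    return lora_alpha / r


# With PEFT installed, the dict could be passed on as:
#   from peft import LoraConfig
#   config = LoraConfig(task_type="CAUSAL_LM", **lora_kwargs)
print(lora_added_params(4096, 4096, lora_kwargs["r"]))
print(lora_scaling(lora_kwargs["lora_alpha"], lora_kwargs["r"]))
```

For a square 4096 x 4096 projection, a rank of 16 adds 131,072 trainable parameters against roughly 16.8 million frozen ones, which is why LoRA counts as parameter-efficient. The 4096 dimension is used here purely for illustration.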

Usage

Direct Use

This model can be used for vision-language understanding tasks, including:

Image captioning
Visual question answering
Image-text retrieval
Multimodal conversation

Example Code

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model_id = "enpeizhao/qwen3-8b-instruct-trl-sft-20-bf16-merged"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load the image from disk (replace with your own path)
image_path = "path/to/image.jpg"
image = Image.open(image_path).convert("RGB")

# Prepare input
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Render the chat template to a prompt string, then encode text and image together
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens so that only the newly generated answer is decoded
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
output = processor.batch_decode(new_tokens, skip_special_tokens=True)
print(output[0])
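The other tasks listed under Direct Use, such as visual question answering, use the same pipeline with a different prompt. As a small sketch, a helper like the one below could build the message list for a single image and question. The name build_vqa_messages is hypothetical and is not part of transformers.

```python
def build_vqa_messages(image_path, question):
    """Build a single-turn chat message list pairing one image with one question.

    Hypothetical helper: it only assembles the message structure that is
    later passed to processor.apply_chat_template.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


messages = build_vqa_messages("path/to/image.jpg", "How many people are in the picture?")
print(messages[0]["content"][1]["text"])
```

The returned list can replace the hand-written messages variable in the example, with the rest of the generation code left unchanged.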

Limitations and Bias

This model inherits the limitations of the base Qwen3-VL-8B-Instruct model.
Performance may vary depending on the domain and task.
The model's outputs should be checked for biases that may be present in the training data.

Citation

If you use this model, please cite the original Qwen3-VL paper and acknowledge this fine-tuned version:

@article{qwen3vl2025,
  title={Qwen3-VL Technical Report},
  author={Qwen Team},
  year={2025}
}

Model Card Authors

This model card was created as part of a fine-tuning project.

Model Card Contact

For questions and feedback about this fine-tuned model, please open an issue in the model repository.
