Qwen3-VL-8B-Instruct Fine-tuned Model

This model is a fine-tuned version of Qwen/Qwen3-VL-8B-Instruct on a custom vision-language dataset.

Model Details

Model Description

  • Model Type: Vision-Language Model
  • Base Model: Qwen3-VL-8B-Instruct
  • Fine-tuning Method: LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning (PEFT) method
  • Training Framework: TRL (Transformer Reinforcement Learning) with SFTTrainer
  • Language(s): English
  • License: Apache 2.0

Training Details

Training Data

The model was fine-tuned on a custom dataset containing vision-language pairs designed for specific downstream tasks.

Training Procedure

  • Training Framework: Hugging Face TRL (Transformer Reinforcement Learning)
  • Fine-tuning Method: Supervised Fine-Tuning (SFT) with LoRA adapters
  • Optimizer: AdamW
  • Precision: Mixed precision training (bfloat16)
  • Hardware: NVIDIA GPUs with distributed training support
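
The settings above map roughly onto a TRL configuration like the sketch below. It is illustrative only: the output directory, batch size, gradient accumulation, learning rate, and epoch count are placeholder assumptions, not the values used to train this model, which this card does not list.

```python
from trl import SFTConfig

# Hypothetical values throughout; only the optimizer family (AdamW) and
# bfloat16 mixed precision are taken from this model card.
training_args = SFTConfig(
    output_dir="qwen3-vl-8b-sft-lora",  # placeholder path (assumed)
    optim="adamw_torch",                # AdamW optimizer
    bf16=True,                          # bfloat16 mixed precision
    per_device_train_batch_size=1,      # assumed
    gradient_accumulation_steps=8,      # assumed
    learning_rate=1e-4,                 # assumed
    num_train_epochs=1,                 # assumed
)
```

An object like this would typically be passed as `args` to `SFTTrainer`, together with a PEFT `LoraConfig` as `peft_config`.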

Hyperparameters

The model was trained using LoRA with the following configuration:

  • LoRA rank (r): Configured for efficient parameter updates
  • LoRA alpha: Scaled for optimal learning
  • Target modules: Attention and feed-forward layers
  • Dropout: Applied for regularization
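
To make the role of the rank and alpha concrete, the sketch below computes how many trainable parameters LoRA adds to one weight matrix and the scaling factor applied to the low-rank update. All numbers (rank 16, alpha 32, dropout 0.05, a 4096 x 4096 projection) and the module names are illustrative assumptions, not this model's actual configuration. The helper names are hypothetical as well.

```python
# Hypothetical LoRA settings, keyed like the corresponding peft.LoraConfig fields.
lora_settings = {
    "r": 16,               # assumed rank
    "lora_alpha": 32,      # assumed alpha
    "lora_dropout": 0.05,  # assumed dropout
    # Assumed names for attention and feed-forward projections
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
}


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns B (d_out x r) and A (r x d_in) and leaves W frozen.
    """
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Factor applied to the update: W' = W + (lora_alpha / r) * B @ A."""
    return lora_alpha / r


d_in = d_out = 4096  # assumed projection size, for illustration only
added = lora_param_count(d_in, d_out, lora_settings["r"])
full = d_in * d_out
print(f"LoRA adds {added:,} parameters vs {full:,} in the frozen matrix "
      f"({100 * added / full:.2f}%)")
print("scaling:", lora_scaling(lora_settings["lora_alpha"], lora_settings["r"]))
```

Doubling the rank doubles the number of added parameters, while alpha only changes how strongly the learned update is weighted relative to the frozen weights.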

Usage

Direct Use

This model can be used for vision-language understanding tasks, including:

  • Image captioning
  • Visual question answering
  • Image-text retrieval
  • Multimodal conversation

Example Code

import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_id = "enpeizhao/qwen3-8b-instruct-trl-sft-20-bf16-merged"

# Load the merged model and its processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Prepare input: one image (local path or URL) plus a text prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Apply the chat template, load the image, and tokenize in one call
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate, then drop the prompt tokens so only the new text is decoded
generated_ids = model.generate(**inputs, max_new_tokens=512)
new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
output = processor.batch_decode(new_tokens, skip_special_tokens=True)
print(output[0])
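
The same message structure can be reused for the other tasks listed under Direct Use by changing only the text prompt. The helper below is a minimal sketch of that idea. `build_messages` is a hypothetical name, not part of any library, and the example prompts and file paths are assumptions.

```python
def build_messages(image_path: str, prompt: str) -> list[dict]:
    """Build a single-turn user message holding one image and one text prompt.

    The layout mirrors the `messages` list in the example above.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Hypothetical prompts for two of the tasks listed under Direct Use
captioning = build_messages("path/to/image.jpg", "Describe this image in detail.")
vqa = build_messages("path/to/image.jpg", "How many people are in this picture?")
```

Either list can be passed to `processor.apply_chat_template` in place of `messages`.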

Limitations and Bias

  • This model inherits limitations from the base Qwen3-VL-8B-Instruct model
  • Performance may vary depending on the domain and task
  • The model's outputs should be evaluated for potential biases present in training data

Citation

If you use this model, please cite the original Qwen3-VL paper and acknowledge this fine-tuned version:

@article{qwen3vl2025,
  title={Qwen3-VL: A Versatile Vision-Language Model},
  author={Qwen Team},
  year={2025}
}

Model Card Authors

This model card was created as part of a fine-tuning project.

Model Card Contact

For questions and feedback about this fine-tuned model, please open an issue in the model repository.
