Merged LLaMA 3.1 Vision + KoEn NLP Model

This repository contains a merged model of qresearch/llama-3.1-8B-vision-378 (a vision-enhanced LLaMA model) and muzerai/Deep-Llama-3.1-KoEn-8B-SiSai (a Korean-English NLP model). The goal of this merge is to enhance the vision model with improved natural language understanding and generation capabilities using a robust multilingual NLP model.

🚀 Why This Merge?

The original LLaMA 3.1 Vision model excels at image understanding but lacks strong text generation capabilities in Korean and English.
Meanwhile, Deep-Llama-3.1-KoEn-8B-SiSai is optimized for Korean and English NLP tasks but lacks multimodal capabilities.

By merging these models:

  • We retain the powerful vision capabilities of the Vision model.
  • We enhance text generation and reasoning using the NLP model's pre-trained weights.
  • The language model component (text_model) is now optimized for Korean-English tasks, improving multilingual support.

📌 Model Details

  • Base Vision Model: qresearch/llama-3.1-8B-vision-378
  • Base NLP Model: muzerai/Deep-Llama-3.1-KoEn-8B-SiSai
  • Merged Components:
    • Vision processing layers are retained from the original Vision model.
    • text_model weights are replaced with those from the NLP model to improve text understanding.
  • File Format: .safetensors (optimized for fast and secure model loading)
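
The component swap listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure actually used for this repository: it assumes the vision checkpoint stores the language model under a `text_model.` key prefix, and that stripping this prefix yields the matching key in the NLP checkpoint. The function name `swap_text_weights` is hypothetical.

```python
from typing import Any, Dict, List, Tuple


def _shape(value: Any):
    """Return a comparable shape for tensor-like objects, else None."""
    shape = getattr(value, "shape", None)
    return tuple(shape) if shape is not None else None


def swap_text_weights(
    vision_sd: Dict[str, Any],
    nlp_sd: Dict[str, Any],
    prefix: str = "text_model.",  # assumed key prefix of the language model
) -> Tuple[Dict[str, Any], List[str]]:
    """Copy NLP weights into the `prefix` sub-tree of the vision state dict.

    Returns the merged dict plus the keys left untouched because the NLP
    checkpoint had no entry with the same name and shape.
    """
    merged = dict(vision_sd)  # shallow copy; the input dict is not modified
    skipped: List[str] = []
    for key, old in vision_sd.items():
        if not key.startswith(prefix):
            continue  # vision layers are retained as-is
        new = nlp_sd.get(key[len(prefix):])
        if new is None or _shape(new) != _shape(old):
            skipped.append(key)  # keep the original weight for this key
            continue
        merged[key] = new
    return merged, skipped
```

With real checkpoints, the two dictionaries could be loaded with `safetensors.torch.load_file` and the result written back with `safetensors.torch.save_file`. Inspecting the returned `skipped` list is a cheap way to spot keys whose names or shapes do not line up between the two models.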

Test (Mac M1)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import requests
from io import BytesIO

# ✅ Download the image
url = "https://huggingface.co/qresearch/llama-3-vision-alpha-hf/resolve/main/assets/demo-2.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content))

# ✅ Check for MPS support, then set the device
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using device: {device}")

# ✅ Load the model
model = AutoModelForCausalLM.from_pretrained(
    "muzerai/Deep-Llama-3.1-KoEn-8B-SiSai-Vision",
    trust_remote_code=True,
    torch_dtype=torch.float16,
).to(device)

# ✅ Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("muzerai/Deep-Llama-3.1-KoEn-8B-SiSai-Vision", use_fast=True)

# ✅ Ask a question in Korean ("Please describe this image in Korean.")
# For an English answer, use instead: question = "Briefly describe the image"
question = "이 이미지를 한국어로 설명해주세요."

# ✅ Run the model
output = model.answer_question(
    image, question, tokenizer, max_new_tokens=128, do_sample=True, temperature=0.3
)

print(output)

Output for the Korean question, followed by the output for the English question:

Using device: mps
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  3.42it/s]
이 이미지는 일본의 만화나 애니메이션에서 자주 등장하는 여주인 캐릭터입니다. 여주인 캐릭터는 머리와 옷이 흰색인 것을 볼 수 있습니다. 여주인 캐릭터는 손에 빡을 들고 있는 것을 볼 수 있습니다.
(Translation: "This image shows a female lead character of the kind often seen in Japanese manga and anime. Her hair and clothes are white. She is holding 빡 [sic, possibly 빵, "bread"] in her hand.")
The image is of a young woman with a kind face, dressed in a medieval-inspired outfit. She is holding a large hamburger in her hand and has a happy expression on her face. The background is a warm, cozy room with a wooden table and chairs.
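
When trying your own prompts, one quick check is whether a reply to a Korean question actually came back in Korean. The sketch below measures the share of Hangul syllables (Unicode range U+AC00 to U+D7A3) among the alphabetic characters of a string. The helper name `hangul_ratio` is hypothetical, and any threshold you apply to its result is a judgment call rather than something this model card prescribes.

```python
def hangul_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are Hangul syllables."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0  # no letters at all (empty string, digits, punctuation)
    hangul = sum(1 for ch in letters if "\uac00" <= ch <= "\ud7a3")
    return hangul / len(letters)


if __name__ == "__main__":
    # Mixed replies score between 0 and 1; pure English scores 0.
    print(hangul_ratio("이 이미지는 캐릭터입니다."))
    print(hangul_ratio("The image is of a young woman."))
```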

Comments

Performance depends on your use case, so test it on your own tasks ^^

Use

Research & Educational Purposes: AI research, academic use, and educational content creation.

For questions about licensing, please contact my channel.

Safetensors: model size 8B params, tensor type F16
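
The size and tensor type reported here can be checked locally without loading any weights, because a .safetensors file begins with an 8-byte little-endian header length followed by a JSON header that lists every tensor's dtype and shape. The following is a minimal sketch using only the standard library. The function names are hypothetical, and the per-component grouping assumes tensor names are dot-separated with the component (for example `text_model`) first.

```python
import json
import struct
from collections import Counter
from typing import Dict


def read_safetensors_header(path: str) -> Dict[str, dict]:
    """Read only the JSON header of a .safetensors file (no tensor data)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def summarize(header: Dict[str, dict]) -> Dict[str, Counter]:
    """Count tensors per top-level component and per dtype."""
    components = Counter(name.split(".", 1)[0] for name in header)
    dtypes = Counter(entry["dtype"] for entry in header.values())
    return {"components": components, "dtypes": dtypes}
```

Calling `summarize(read_safetensors_header(path))` on one of the downloaded shard files should show which components the shard contains and whether its tensors are stored as `F16`.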