---
license: mit
datasets:
- atasoglu/flickr8k-turkish-detailed-captions
language:
- tr
base_model:
- ytu-ce-cosmos/Turkish-LLaVA-v0.1
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- Turkish
- turkish
- LLaVA
- conversational
---

# Turkish-LLaVA-v0.1-detailed-captions

This model is a fine-tuned version of [ytu-ce-cosmos/Turkish-LLaVA-v0.1](https://huggingface.co/ytu-ce-cosmos/Turkish-LLaVA-v0.1), trained on the [atasoglu/flickr8k-turkish-detailed-captions](https://huggingface.co/datasets/atasoglu/flickr8k-turkish-detailed-captions) dataset to generate detailed, comprehensive Turkish captions for an input image.

[Open in Colab](https://colab.research.google.com/drive/1rgdK6-HVHYapmlBw04Lf66kgwhl7VSb1?usp=sharing) \
You can also check the repository [here](https://github.com/atasoglu/turkish-llava-notebooks/tree/main).

## Usage

You can use the model with the [llava](https://github.com/haotian-liu/LLaVA) package as follows.

Load the model first:

```python
import torch
from transformers import BitsAndBytesConfig
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init

model_path = "atasoglu/Turkish-LLaVA-v0.1-detailed-captions"

# apply 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

disable_torch_init()  # for inference
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path,
    None,  # model_base
    "llava_llama",  # model_name
    quantization_config=quantization_config,
)
```

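To get a feel for why the loading step uses 4-bit quantization, here is a back-of-the-envelope sketch of the memory taken by the weights alone. The helper `estimate_weight_memory_gb` and the parameter count `assumed_params` are hypothetical and chosen only for illustration; the estimate ignores quantization constants, any layers left unquantized, activations, and the KV cache, so real usage will be higher.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


# Hypothetical parameter count, used purely for illustration.
assumed_params = 8e9

fp16_gb = estimate_weight_memory_gb(assumed_params, 16)
nf4_gb = estimate_weight_memory_gb(assumed_params, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{nf4_gb:.1f} GB")
```
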
Run inference with a text prompt and an image:

```python
import requests
from PIL import Image
from io import BytesIO
from llava.mm_utils import process_images, tokenizer_image_token

# download an example image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-im-captioning.jpg"
image = Image.open(BytesIO(requests.get(url).content))

# create a prompt with system and user messages
system_prompt = "Sen yardımsever bir asistansın."  # "You are a helpful assistant."
user_prompt = "Görüntüyü detaylı olarak açıkla."  # "Describe the image in detail."
prompt = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    f"{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    f"<image>\n{user_prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)

# create prompt and image tokens
input_ids = (
    tokenizer_image_token(
        prompt,
        tokenizer,
        return_tensors="pt",
    )
    .unsqueeze(0)
    .cuda()
)
image_tensor = process_images([image], image_processor, model.config).to(
    model.device,
    dtype=torch.float16,
)

# start generation
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=[image.size],
        do_sample=False,
        max_new_tokens=256,
    )
output = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(output)
```

Output:

```text
Güneşli bir günde, rengarenk çiçeklerle dolu bir bahçede, sarı tüyleriyle dikkat çeken bir köpek yavrusu, çiçeklerin arasında saklanmış halde. Köpeğin büyük, meraklı gözleri, çiçeklerin arasında hafifçe açıkta kalmış. Çiçekler, sarı ve beyaz tonlarında açmış, etrafa neşe saçıyor. Köpeğin etrafındaki çiçekler, papatyalar ve daisy gibi çeşitli türlerden oluşuyor. Arka planda, bahçenin doğal dokusu, yeşil yapraklar ve ağaç gövdesi ile zenginleşiyor. Köpeğin vücut dili, merak ve neşe dolu; sanki çevresini keşfetmek için sabırsızlanıyor. Bahçenin huzurlu atmosferi, köpeğin sevimli görünümüyle birleşerek, sıcak ve samimi bir görüntü oluşturuyor.
```
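The prompt in the inference example is assembled by hand from header and end-of-turn tokens. If you plan to caption many images or vary the instruction, a small helper can keep the template in one place. The function name `build_prompt` is hypothetical and not part of the llava package; this is only a sketch that reproduces the same string layout as the example above.

```python
def build_prompt(system_prompt: str, user_prompt: str) -> str:
    """Return a single-turn prompt with the <image> placeholder in the user message."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"<image>\n{user_prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )


# Same system and user messages as in the inference example.
prompt = build_prompt(
    "Sen yardımsever bir asistansın.",
    "Görüntüyü detaylı olarak açıkla.",
)
print(prompt)
```
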