Update README.md
README.md
CHANGED
@@ -8,10 +8,10 @@ This project adapts general Multimodal Large Language Models (MLLMs) to specific
 
 ### 1. Data Synthesis
 - We create a **generate-then-filter pipeline** using open-source models to make diverse visual tasks from domain-specific image-caption pairs.
-- This data works better than data made by hand or closed-source models (e.g., GPT-4V).
+- This data works better than data made by hand or closed-source models (e.g., GPT-4V/o).
 
 ### 2. Training Pipeline
-- Instead of the usual two-step training (image-caption pairs first, then visual tasks), we use a **single-
+- Instead of the usual two-step training (image-caption pairs first, then visual tasks), we use a **single-stage training** to handle more tasks for specific domains.
 
 ### 3. Task Evaluation
 - We test our method in important fields like **biomedicine, food, and remote sensing**.
@@ -24,15 +24,20 @@ This project adapts general Multimodal Large Language Models (MLLMs) to specific
 | Model | Repo ID in HF 🤗 | Domain | Base Model | Training Data | Evaluation Benchmark |
 |:----------------------------------------------------------------------------|:--------------------------------------------|:--------------|:-------------------------|:------------------------------------------------------------------------------------------------|-----------------------|
 | [Visual Instruction Synthesizer](https://huggingface.co/AdaptLLM/visual-instruction-synthesizer) | AdaptLLM/visual-instruction-synthesizer | - | open-llava-next-llama3-8b | VisionFLAN and ALLaVA | - |
+| [AdaMLLM-med-1B](https://huggingface.co/AdaptLLM/biomed-InternVL3-1B) | AdaptLLM/biomed-InternVL3-1B | Biomedicine | InternVL3-1B | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark) |
+| [AdaMLLM-med-4B](https://huggingface.co/AdaptLLM/biomed-gemma-3-4b-it) | AdaptLLM/biomed-gemma-3-4b-it | Biomedicine | gemma-3-4b-it | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark) |
+| [AdaMLLM-med-3B](https://huggingface.co/AdaptLLM/biomed-Qwen2.5-VL-3B-Instruct) | AdaptLLM/biomed-Qwen2.5-VL-3B-Instruct | Biomedicine | Qwen2.5-VL-3B-Instruct | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark) |
+| [AdaMLLM-food-3B](https://huggingface.co/AdaptLLM/food-Qwen2.5-VL-3B-Instruct) | AdaptLLM/food-Qwen2.5-VL-3B-Instruct | Food | Qwen2.5-VL-3B-Instruct | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) | [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark) |
+| [AdaMLLM-remote-sensing-3B](https://huggingface.co/AdaptLLM/remote-sensing-Qwen2.5-VL-3B-Instruct) | AdaptLLM/remote-sensing-Qwen2.5-VL-3B-Instruct | Remote Sensing | Qwen2.5-VL-3B-Instruct | [remote-sensing-visual-instructions](https://huggingface.co/datasets/AdaptLLM/remote-sensing-visual-instructions) | [remote-sensing-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/remote-sensing-VQA-benchmark) |
 | [AdaMLLM-med-2B](https://huggingface.co/AdaptLLM/biomed-Qwen2-VL-2B-Instruct) | AdaptLLM/biomed-Qwen2-VL-2B-Instruct | Biomedicine | Qwen2-VL-2B-Instruct | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark) |
 | [AdaMLLM-food-2B](https://huggingface.co/AdaptLLM/food-Qwen2-VL-2B-Instruct) | AdaptLLM/food-Qwen2-VL-2B-Instruct | Food | Qwen2-VL-2B-Instruct | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) | [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark) |
-| [AdaMLLM-remote-sensing-2B](https://huggingface.co/AdaptLLM/
+| [AdaMLLM-remote-sensing-2B](https://huggingface.co/AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct) | AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct | Remote Sensing | Qwen2-VL-2B-Instruct | [remote-sensing-visual-instructions](https://huggingface.co/datasets/AdaptLLM/remote-sensing-visual-instructions) | [remote-sensing-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/remote-sensing-VQA-benchmark) |
 | [AdaMLLM-med-8B](https://huggingface.co/AdaptLLM/biomed-LLaVA-NeXT-Llama3-8B) | AdaptLLM/biomed-LLaVA-NeXT-Llama3-8B | Biomedicine | open-llava-next-llama3-8b | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark) |
 | [AdaMLLM-food-8B](https://huggingface.co/AdaptLLM/food-LLaVA-NeXT-Llama3-8B) |AdaptLLM/food-LLaVA-NeXT-Llama3-8B | Food | open-llava-next-llama3-8b | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) | [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark) |
-| [AdaMLLM-remote-sensing-8B](https://huggingface.co/AdaptLLM/
+| [AdaMLLM-remote-sensing-8B](https://huggingface.co/AdaptLLM/remote-sensing-LLaVA-NeXT-Llama3-8B) |AdaptLLM/remote-sensing-LLaVA-NeXT-Llama3-8B | Remote Sensing | open-llava-next-llama3-8b | [remote-sensing-visual-instructions](https://huggingface.co/datasets/AdaptLLM/remote-sensing-visual-instructions) | [remote-sensing-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/remote-sensing-VQA-benchmark) |
 | [AdaMLLM-med-11B](https://huggingface.co/AdaptLLM/biomed-Llama-3.2-11B-Vision-Instruct) | AdaptLLM/biomed-Llama-3.2-11B-Vision-Instruct | Biomedicine | Llama-3.2-11B-Vision-Instruct | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark) |
 | [AdaMLLM-food-11B](https://huggingface.co/AdaptLLM/food-Llama-3.2-11B-Vision-Instruct) | AdaptLLM/food-Llama-3.2-11B-Vision-Instruct | Food | Llama-3.2-11B-Vision-Instruct | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) | [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark) |
-| [AdaMLLM-remote-sensing-11B](https://huggingface.co/AdaptLLM/
+| [AdaMLLM-remote-sensing-11B](https://huggingface.co/AdaptLLM/remote-sensing-Llama-3.2-11B-Vision-Instruct) | AdaptLLM/remote-sensing-Llama-3.2-11B-Vision-Instruct | Remote Sensing | Llama-3.2-11B-Vision-Instruct | [remote-sensing-visual-instructions](https://huggingface.co/datasets/AdaptLLM/remote-sensing-visual-instructions) | [remote-sensing-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/remote-sensing-VQA-benchmark) |
 
 **Code**: [https://github.com/bigai-ai/QA-Synthesizer](https://github.com/bigai-ai/QA-Synthesizer)
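
The repo IDs in the table can be loaded directly with 🤗 Transformers. The snippet below is a minimal sketch of querying the Qwen2-VL-based biomedical checkpoint; it assumes the released repo keeps the processor and chat template of its Qwen2-VL-2B-Instruct base model, and the image path and question are placeholders rather than anything shipped with this release.

```python
# Minimal sketch: query an adapted checkpoint from the table above.
# Assumes a recent 🤗 Transformers release with Qwen2-VL support; the image
# path and question are placeholders.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from PIL import Image

model_id = "AdaptLLM/biomed-Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example_scan.png")  # placeholder domain image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe the key finding in this image."},
        ],
    }
]

# Build the chat prompt, preprocess image + text, and generate an answer.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

The other checkpoints follow the same pattern with the loader class matching their base model (e.g., LLaVA-NeXT or Llama 3.2 Vision classes instead of Qwen2-VL).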
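For readers who want a feel for the generate-then-filter synthesis described in the diff above, here is a structural sketch. The `toy_generate` and `keep` helpers are hypothetical stand-ins, not the project's actual synthesizer or filter; the real implementation is in the QA-Synthesizer repository linked above.

```python
# Structural sketch of a generate-then-filter loop for visual instruction synthesis.
# `toy_generate` stands in for the visual-instruction-synthesizer model and `keep`
# is a toy consistency filter; neither reproduces the official pipeline.
from dataclasses import dataclass

@dataclass
class VisualTask:
    image_path: str
    instruction: str
    response: str

def toy_generate(image_path: str, caption: str) -> list[VisualTask]:
    # Real pipeline: prompt the synthesizer MLLM with the image and its caption,
    # then parse the output into instruction-response pairs. Toy stand-in below.
    return [
        VisualTask(image_path, "Describe what is shown in the image.", caption),
        VisualTask(image_path, "What domain does this image belong to?", "unknown"),
    ]

def keep(task: VisualTask, caption: str, min_overlap: float = 0.3) -> bool:
    # Toy filter: keep a task only if its response shares enough words with the
    # caption. The real filter would use model-based consistency checks instead.
    cap, resp = set(caption.lower().split()), set(task.response.lower().split())
    return bool(resp) and len(cap & resp) / len(resp) >= min_overlap

def synthesize(pairs: list[tuple[str, str]]) -> list[VisualTask]:
    kept = []
    for image_path, caption in pairs:                    # domain image-caption pairs
        for task in toy_generate(image_path, caption):   # generate
            if keep(task, caption):                      # then filter
                kept.append(task)
    return kept

print(synthesize([("slide_001.png", "Stained tissue section with dense lymphocyte infiltration")]))
```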