pipeline_tag: object-detection
tags:
- object-detection
- document-layout
- yolov11
- ultralytics
- document-layout-analysis
- document-ai
---

# YOLOv11 for Advanced Document Layout Analysis

<p align="center">
    <img src="images/logo.png" alt="Logo" width="100%"/>
</p>

This repository hosts three YOLOv11 models (**nano, small, and medium**) fine-tuned for high-performance **Document Layout Analysis** on the challenging [DocLayNet dataset](https://huggingface.co/datasets/ds4sd/DocLayNet).

The goal is to accurately detect and classify key layout elements in a document, such as text, tables, figures, and titles. This is a fundamental task for document understanding and information extraction pipelines.

### ✨ Model Highlights

* **🚀 Three Powerful Variants:** Choose between `nano`, `small`, and `medium` models to fit your performance needs.
* **🎯 High Accuracy:** Trained on the comprehensive DocLayNet dataset to recognize 11 distinct layout types.
* **⚡ Optimized for Efficiency:** The recommended **`yolo11n` (nano) model** offers an exceptional balance of speed and accuracy, making it ideal for production environments.

---

## 🚀 Get Started

Get up and running with just a few lines of code.

### 1. Installation

First, install the necessary libraries.

```bash
pip install ultralytics huggingface_hub
```

### 2. Inference Example

This Python snippet shows how to download a model from the Hub and run inference on a local document image.

```python
from pathlib import Path
from huggingface_hub import hf_hub_download
from ultralytics import YOLO

# Define the local directory to save models
DOWNLOAD_PATH = Path("./models")
DOWNLOAD_PATH.mkdir(exist_ok=True)

# Choose which model to use
# 0: nano, 1: small, 2: medium
model_files = [
    "yolo11n_doc_layout.pt",
    "yolo11s_doc_layout.pt",
    "yolo11m_doc_layout.pt",
]
selected_model_file = model_files[0]  # Using the recommended nano model

# Download the model from the Hugging Face Hub
model_path = hf_hub_download(
    repo_id="Armaggheddon/yolo11-document-layout",
    filename=selected_model_file,
    repo_type="model",
    local_dir=DOWNLOAD_PATH,
)

# Initialize the YOLO model
model = YOLO(model_path)

# Run inference on an image
# Replace 'path/to/your/document.jpg' with your file
results = model('path/to/your/document.jpg')

# Process and display results
print(results[0].verbose())  # Print a summary of the detections
results[0].show()            # Display the image with bounding boxes
```
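
If you need structured output rather than a rendered image, the returned `Results` object exposes the detections directly. A minimal sketch continuing from the snippet above (these are standard `ultralytics` accessors, not code from this model card):

```python
# Iterate over the detected layout elements from the inference above
for box in results[0].boxes:
    class_id = int(box.cls[0])             # predicted class index
    label = results[0].names[class_id]     # class name, e.g. a table or title
    confidence = float(box.conf[0])        # detection confidence score
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # bounding box in pixel coordinates
    print(f"{label} ({confidence:.2f}): ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```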

---

## 📊 Model Performance & Evaluation

We fine-tuned three YOLOv11 variants, allowing you to choose the best model for your use case.

* **`yolo11n_doc_layout.pt` (train4)**: **Recommended.** The nano model offers the best trade-off between speed and accuracy.
* **`yolo11s_doc_layout.pt` (train5)**: A larger, slightly more accurate model.
* **`yolo11m_doc_layout.pt` (train6)**: The largest model, providing the highest accuracy with a corresponding increase in computational cost.

As shown in the analysis below, performance gains are marginal when moving from the `small` to the `medium` model, making the `nano` and `small` variants the most practical choices.

### Nano vs. Small vs. Medium Comparison

Here's how the three models stack up across key metrics. The plots compare their performance for each document layout label.

| **mAP@50-95** (Strict IoU) | **mAP@50** (Standard IoU) |
| :---: | :---: |
| <img src="images/nsm_map50_95_per_label.png" alt="mAP@50-95" width="400"> | <img src="images/nsm_map50_per_label.png" alt="mAP@50" width="400"> |

| **Precision** (Box Quality) | **Recall** (Detection Coverage) |
| :---: | :---: |
| <img src="images/nsm_box_precision_per_label.png" alt="Precision" width="400"> | <img src="images/nsm_recall_per_label.png" alt="Recall" width="400"> |
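
To reproduce metrics like these on your own validation split, `ultralytics` ships a built-in evaluation helper. A minimal sketch, assuming you have a YOLO-format dataset config; the `doclaynet.yaml` path is a placeholder, not a file shipped with this card:

```python
from ultralytics import YOLO

model = YOLO("models/yolo11n_doc_layout.pt")  # downloaded in the example above

# Evaluate at the 1280x1280 training resolution
metrics = model.val(data="doclaynet.yaml", imgsz=1280)  # placeholder config
print(metrics.box.map)    # mAP@50-95
print(metrics.box.map50)  # mAP@50
print(metrics.box.maps)   # per-class mAP@50-95, matching the per-label plots
```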

<details>
<summary><b>Click to see detailed Training Metrics & Confusion Matrices</b></summary>

The plots below illustrate the core convergence metrics (precision, recall, and mAP) as the models learned, while the normalized confusion matrices break down how accurately each model distinguishes the layout elements; a strong diagonal indicates robust classification.

| Model | Training Metrics | Normalized Confusion Matrix |
| :---: | :---: | :---: |
| **`yolo11n`** (train4) | <img src="images/t4_results.png" alt="train4 results" height="200"> | <img src="images/t4_confusion_mat_normalized.png" alt="train4 confusion matrix" height="200"> |
| **`yolo11s`** (train5) | <img src="images/t5_results.png" alt="train5 results" height="200"> | <img src="images/t5_confusion_mat_normalized.png" alt="train5 confusion matrix" height="200"> |
| **`yolo11m`** (train6) | <img src="images/t6_results.png" alt="train6 results" height="200"> | <img src="images/t6_confusion_mat_normalized.png" alt="train6 confusion matrix" height="200"> |

</details>

### 🏆 The Champion: Why `train4` (Nano) is the Best Choice

While all nano-family models performed well, a deeper analysis revealed that **`train4`** stands out for its superior **localization quality**.

We compared it against `train9` (another strong nano contender), which achieved slightly higher recall by sacrificing bounding-box precision. For applications where data integrity and accurate object boundaries are critical, `train4` is the clear winner.

**Key Advantages of `train4`:**
1. **Superior Box Precision:** It delivered significantly more accurate bounding boxes, with a **+9.0%** precision improvement for the `title` class and strong gains for `section-header` and `table`.
2. **Higher-Quality Detections:** It achieved a **+2.4%** mAP50 and a **+2.05%** mAP50-95 improvement for the difficult `footnote` class, proving its ability to meet stricter IoU thresholds.

| Box Precision Improvement | mAP50 Improvement | mAP50-95 Improvement |
| :---: | :---: | :---: |
| <img src="images/nbest_box_precision_percentage_improvement_per_label.png" alt="Box Precision Improvement"> | <img src="images/nbest_map50_percentage_improvement_per_label.png" alt="mAP50 Improvement"> | <img src="images/nbest_map50_95_percentage_improvement_per_label.png" alt="mAP50-95 Improvement"> |

In short, `train4` prioritizes **quality over quantity**, making it the most reliable choice for production systems.

---

## 📚 About the Dataset: DocLayNet

The models were trained on the [DocLayNet dataset](https://huggingface.co/datasets/ds4sd/DocLayNet), which provides a rich and diverse collection of document images annotated with 11 layout categories:

* **Text**, **Title**, **Section-header**
* **Table**, **Picture**, **Caption**
* **List-item**, **Formula**
* **Page-header**, **Page-footer**, **Footnote**

Detailed definitions for each label are available in the [DocLayNet Labeling Guide](https://raw.githubusercontent.com/DS4SD/DocLayNet/main/assets/DocLayNet_Labeling_Guide_Public.pdf).

**Training Resolution:** The dataset images are 1250x1250 pixels, so all models were trained at **1280x1280** resolution. Initial tests at the default 640x640 resulted in a significant performance drop, especially for smaller elements like `footnote` and `caption`.
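
A minimal fine-tuning sketch at this resolution, assuming a YOLO-format DocLayNet config (`doclaynet.yaml` and the epoch count are placeholders; the actual training scripts live in the GitHub repository linked below):

```python
from ultralytics import YOLO

# Start from a pretrained YOLOv11 nano checkpoint
model = YOLO("yolo11n.pt")

# Fine-tune at 1280x1280 to preserve small elements like footnotes and captions
model.train(
    data="doclaynet.yaml",  # placeholder: YOLO-format DocLayNet dataset config
    imgsz=1280,             # matches the training resolution discussed above
    epochs=100,             # assumption; the real schedule is in the GitHub repo
)
```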

The dataset class distribution is visualized below.

<img src="images/class_distribution.jpg" alt="Class distribution" width="500px"/>

---

## 💻 Code & Training Details

This model card focuses on results and usage. For the complete end-to-end pipeline, including training scripts, dataset conversion utilities (such as `doclaynet_to_yolo.py`, which maps DocLayNet labels to the YOLO format), and detailed examples, please visit the main GitHub repository:

➡️ **[GitHub Repo: yolo11_doc_layout](https://github.com/Armaggheddon/yolo11_doc_layout)**