SpencerCPurdy committed on
Commit 3836fb0 · verified · 1 Parent(s): 7cab859

Update README.md

Files changed (1)
  1. README.md +225 -1
README.md CHANGED
@@ -11,4 +11,228 @@ license: mit
11
  short_description: Multi-agent ensemble system for document classification
12
  ---
13
 
14
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
14
+ # Multi-Agent AI Collaboration System for Document Classification
15
+
16
+ A machine learning system that implements genuine multi-agent collaboration for document classification. Three specialized ML models (agents) with different architectures work together through ensemble methods to classify documents into 20 categories from the newsgroups dataset.
17
+
18
+ ## About
19
+
20
+ This portfolio project demonstrates multi-agent machine learning by training three distinct models that collaborate to achieve better classification performance than individual models alone. Each agent specializes in different aspects of text analysis, and their predictions are combined through ensemble methods.
21
+
22
+ **Author:** Spencer Purdy
23
+ **Development Environment:** Google Colab Pro (A100 GPU, High RAM)
24
+
25
+ ## Features
26
+
27
+ - **Three Specialized Agents**:
28
+ - TF-IDF Agent: Uses statistical text features with Logistic Regression
29
+ - Embedding Agent: Leverages semantic embeddings with a neural network
30
+ - XGBoost Agent: Handles mixed features with gradient boosting
31
+ - **Ensemble Coordination**: Weighted voting and stacking meta-learner
32
+ - **Agent Voting System**: Shows individual agent predictions and consensus
33
+ - **Interactive Interface**: Gradio web application with real-time classification
34
+ - **Comprehensive Evaluation**: Performance metrics, confusion matrix, and agent comparison
35
+ - **Visualization**: Confidence scores and prediction distributions
36
+
37
+ ## Dataset
38
+
39
+ - **Source:** 20 Newsgroups Dataset (via scikit-learn)
40
+ - **License:** Public domain
41
+ - **Total Documents:** ~18,000 newsgroup posts
42
+ - **Categories:** 20 (technology, sports, politics, religion, science, etc.)
43
+ - **Task:** Multi-class text classification
44
+ - **Preprocessing:** Removal of headers, footers, and quotes
45
+
46
+ ## System Performance
47
+
48
+ Performance on held-out test set (3,770 documents):
49
+
50
+ | Model | Accuracy | F1-Score (Weighted) |
51
+ |-------|----------|---------------------|
52
+ | TF-IDF Agent | 66.18% | 0.6589 |
53
+ | Embedding Agent | 72.45% | 0.7224 |
54
+ | XGBoost Agent | 61.44% | 0.6147 |
55
+ | Weighted Voting | ~71% | ~0.70 |
56
+ | Stacking Ensemble | **73%** | **0.73** |
57
+
58
+ **Best Performance:** The stacking ensemble reaches about 73% accuracy, slightly above the best single agent (Embedding Agent, 72.45%), by learning how to weight the agents' predictions
59
+
60
+ **Training Set Size:** 12,060 documents
61
+ **Validation Set Size:** 3,016 documents
62
+ **Test Set Size:** 3,770 documents
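The three split sizes add up to 18,846 documents and are consistent with two successive 80/20 hold-out splits, with the held-out size rounded up each time. That is an inference from the numbers, not something this README states. A minimal arithmetic check in plain Python (the function name is illustrative):

```python
import math

# Split sizes reported in this README.
TRAIN, VAL, TEST = 12_060, 3_016, 3_770
TOTAL = TRAIN + VAL + TEST  # 18,846 documents


def two_stage_split(n_docs, holdout_fraction=0.2):
    """Return (train, val, test) sizes for two successive hold-out splits.

    Assumption: the held-out size is rounded up at each stage.
    """
    n_test = math.ceil(n_docs * holdout_fraction)
    n_remaining = n_docs - n_test
    n_val = math.ceil(n_remaining * holdout_fraction)
    return n_remaining - n_val, n_val, n_test


sizes = two_stage_split(TOTAL)
print(sizes)  # (12060, 3016, 3770)
```

Under that assumption the test set is 20% of all documents and the validation set is 20% of what remains, which leaves 64% for training.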
63
+
64
+ ## Agent Architectures
65
+
66
+ ### TF-IDF Agent
67
+ - **Feature Extraction:** 5,000 TF-IDF features with bigrams
68
+ - **Model:** Logistic Regression with L2 regularization
69
+ - **Training Time:** ~16.53 seconds
70
+ - **Strengths:** Fast, interpretable, keyword-based classification
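As a rough illustration of the configuration listed above, the sketch below wires a TF-IDF vectorizer (unigrams and bigrams, capped at 5,000 features) into a logistic regression with scikit-learn. The helper name `build_tfidf_agent`, the `max_iter` value, and the toy corpus are assumptions for illustration; the notebook's actual settings may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_tfidf_agent(seed=42):
    """TF-IDF (unigrams + bigrams, at most 5,000 features) -> logistic regression."""
    return make_pipeline(
        TfidfVectorizer(max_features=5000, ngram_range=(1, 2)),
        # L2 regularization is scikit-learn's default for LogisticRegression.
        LogisticRegression(max_iter=1000, random_state=seed),
    )


# Toy corpus standing in for the newsgroup posts (illustrative only).
texts = [
    "the goalie stopped every shot in the hockey game",
    "the team scored twice on the power play",
    "the cipher uses a 128 bit encryption key",
    "public key cryptography protects the message",
]
labels = ["rec.sport.hockey", "rec.sport.hockey", "sci.crypt", "sci.crypt"]

agent = build_tfidf_agent().fit(texts, labels)
probs = agent.predict_proba(["a new encryption key for the cipher"])
print(dict(zip(agent.classes_, probs[0].round(3))))
```

`predict_proba` returns one probability per class, which is the form an ensemble step needs from each agent.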
71
+
72
+ ### Embedding Agent
73
+ - **Feature Extraction:** 384-dimensional sentence embeddings (all-MiniLM-L6-v2)
74
+ - **Model:** Neural network with two hidden layers (384 → 256 → 128 → 20)
75
+ - **Training Time:** ~7.74 seconds
76
+ - **Strengths:** Captures semantic similarity, handles paraphrasing
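To give a sense of the classifier head's size, the snippet below counts trainable parameters for the layer widths 384, 256, 128, and 20. It assumes plain fully connected layers with biases and no normalization layers, which this README does not confirm, and it excludes the sentence encoder itself.

```python
# Layer widths from the README: 384-d input, two hidden layers, 20 classes.
LAYER_SIZES = [384, 256, 128, 20]


def count_dense_parameters(sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


total = count_dense_parameters(LAYER_SIZES)
print(total)  # 134036
```

At roughly 134,000 parameters the head is small, which fits the short training time reported for this agent.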
77
+
78
+ ### XGBoost Agent
79
+ - **Features:** Combined TF-IDF + embeddings + metadata
80
+ - **Model:** Gradient boosting (200 estimators, max depth 6)
81
+ - **Training Time:** ~632.16 seconds
82
+ - **Strengths:** Robust with mixed features, handles complex patterns
83
+
84
+ ### Meta-Learner (Stacking)
85
+ - **Input:** Predictions from all three agents
86
+ - **Model:** Logistic Regression
87
+ - **Purpose:** Learns optimal combination of agent predictions
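One common way to build such a meta-learner is to place each agent's class-probability vector side by side and fit a logistic regression on the result. The sketch below does this on synthetic probabilities; the data, the helper `fake_agent_probs`, and the noise levels are invented for illustration and are not the notebook's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n_docs, n_classes = 200, 4

# Synthetic labels standing in for a validation set.
y_val = rng.integers(0, n_classes, size=n_docs)


def fake_agent_probs(labels, noise):
    """Noisy one-hot scores renormalized into probabilities (illustrative only)."""
    scores = np.eye(n_classes)[labels] + noise * rng.random((len(labels), n_classes))
    return scores / scores.sum(axis=1, keepdims=True)


# Three agents with different noise levels.
agent_probs = [fake_agent_probs(y_val, noise) for noise in (0.8, 1.0, 1.5)]

# Meta-features: the agents' probability vectors concatenated per document.
meta_X = np.hstack(agent_probs)  # shape (n_docs, 3 * n_classes)

meta_learner = LogisticRegression(max_iter=1000).fit(meta_X, y_val)
stacked_pred = meta_learner.predict(meta_X)
print(meta_X.shape, stacked_pred.shape)
```

In practice the meta-learner should be fit on predictions the agents made for documents they were not trained on (for example, out-of-fold predictions), otherwise it learns to trust overfit agents.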
88
+
89
+ ## Technical Stack
90
+
91
+ - **ML Frameworks:** scikit-learn, PyTorch, XGBoost
92
+ - **NLP:** sentence-transformers, nltk
93
+ - **Data Processing:** pandas, numpy
94
+ - **Class Balancing:** imbalanced-learn (SMOTE)
95
+ - **UI Framework:** Gradio
96
+ - **Visualization:** matplotlib, seaborn, plotly
97
+ - **Development:** Google Colab Pro with A100 GPU
98
+
99
+ ## Setup and Usage
100
+
101
+ ### Running in Google Colab
102
+
103
+ 1. Clone this repository or download the notebook file
104
+ 2. Upload `Multi-Agent AI Collaboration System for Document Classification.ipynb` to Google Colab
105
+ 3. Select Runtime > Change runtime type > A100 GPU (or T4 GPU for free tier)
106
+ 4. Run all cells sequentially
107
+
108
+ The notebook will automatically:
109
+ - Install required dependencies
110
+ - Download the 20 Newsgroups dataset
111
+ - Train all three agents
112
+ - Train ensemble methods
113
+ - Evaluate on test set
114
+ - Launch a Gradio interface with a shareable link
115
+
116
+ ### Running Locally
117
+
118
+ ```bash
119
+ # Clone the repository
120
+ git clone https://github.com/SpencerCPurdy/Multi-Agent_AI_Collaboration_System_for_Document_Classification.git
121
+ cd Multi-Agent_AI_Collaboration_System_for_Document_Classification
122
+
123
+ # Install dependencies
124
+ pip install scikit-learn==1.3.0 numpy==1.24.3 pandas==2.0.3 torch==2.1.0 transformers==4.35.0 gradio==4.7.1 sentence-transformers==2.2.2 imbalanced-learn==0.11.0 xgboost==2.0.1 plotly==5.18.0 seaborn==0.13.0 nltk==3.8.1 notebook
125
+
126
+ # Run the notebook
127
+ jupyter notebook "Multi-Agent AI Collaboration System for Document Classification.ipynb"
128
+ ```
129
+
130
+ **Note:** Training takes approximately 10-15 minutes depending on hardware.
131
+
132
+ ## Project Structure
133
+
134
+ ```
135
+ ├── Multi-Agent AI Collaboration System for Document Classification.ipynb
135
+ ├── README.md
136
+ ├── LICENSE
138
+ └── .gitignore
139
+ ```
140
+
141
+ The notebook contains the following components:
142
+
143
+ 1. **Configuration & Setup**: System parameters and reproducibility settings
144
+ 2. **Data Loading**: 20 Newsgroups dataset with preprocessing
145
+ 3. **Feature Engineering**: TF-IDF, embeddings, and metadata features
146
+ 4. **Agent Training**: Three specialized models trained independently
147
+ 5. **Ensemble Methods**: Voting and stacking implementation
148
+ 6. **Evaluation**: Comprehensive metrics and visualizations
149
+ 7. **Gradio Interface**: Interactive web application
150
+
151
+ ## Key Implementation Details
152
+
153
+ - **Reproducibility:** All random seeds set to 42 for deterministic results
154
+ - **Cross-Validation:** 5-fold stratified cross-validation for model selection
155
+ - **Feature Engineering:** Combined TF-IDF (5,000 features), sentence embeddings (384-d), and document metadata
156
+ - **Class Balancing:** SMOTE applied to handle class imbalance
157
+ - **Neural Network:** Dropout (0.3) and early stopping (patience: 3 epochs) to prevent overfitting
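Early stopping with a patience of 3 can be sketched in plain Python as follows. The function name and the loss values are made up for illustration; the notebook's training loop is not shown in this README.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; if that never happens, the index of
    the last epoch is returned.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Hypothetical validation losses: the best value is at epoch 2.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
stop = early_stopping_epoch(losses)
print(stop)  # 5
```

Here the loss fails to improve at epochs 3, 4, and 5, so training stops at epoch 5 and the later improvement at epoch 6 is never reached. A typical loop would also restore the weights saved at the best epoch.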
158
+
159
+ ## Performance by Category
160
+
161
+ The system achieves varying performance across categories:
162
+
163
+ **Strong Performance (>85% precision):**
164
+ - rec.sport.hockey: 94% precision
165
+ - rec.sport.baseball: 89% precision
166
+ - comp.windows.x: 87% precision
167
+
168
+ **Moderate Performance (70-85% precision):**
169
+ - sci.crypt: 84% precision
170
+ - sci.med: 83% precision
171
+ - comp.graphics: 70% precision
172
+
173
+ **Challenging Categories (<60% precision):**
174
+ - talk.religion.misc: 38% precision
175
+ - comp.os.ms-windows.misc: Lower performance due to overlapping topics
176
+
177
+ ## Limitations
178
+
179
+ ### Domain Specificity
180
+ - Trained on newsgroup data; may not generalize well to significantly different domains (e.g., legal documents, medical reports)
181
+
182
+ ### Performance Constraints
183
+ - 73% accuracy is solid but not state-of-the-art for text classification
184
+ - Performance degrades on very short documents (<50 words)
185
+ - Ambiguous documents covering multiple topics may be misclassified
186
+
187
+ ### Known Issues
188
+ - Training data bias reflected in model predictions
189
+ - English text only
190
+ - Very long documents (>10,000 words) may lose context
191
+ - Sarcasm and irony not reliably detected
192
+
193
+ ### Uncertainty Indicators
194
+ - Confidence <50%: Highly uncertain prediction, consider human review
195
+ - Close top-2 predictions: Document may belong to multiple categories
196
+ - Agent disagreement: Complex or ambiguous document
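The three indicators above can be turned into a simple review check. This is a sketch under assumptions: the function name, the flag names, and the 0.10 margin for "close" top-2 predictions are hypothetical, since the README gives only the 50% confidence threshold.

```python
def uncertainty_flags(ensemble_probs, agent_labels, low_conf=0.50, top2_margin=0.10):
    """Return a list of review flags for one document.

    ensemble_probs: dict mapping category name -> ensemble probability.
    agent_labels: the label each individual agent predicted.
    top2_margin is a hypothetical threshold; the README gives no number.
    """
    ranked = sorted(ensemble_probs.values(), reverse=True)
    flags = []
    if ranked[0] < low_conf:
        flags.append("low_confidence")
    if len(ranked) > 1 and ranked[0] - ranked[1] < top2_margin:
        flags.append("close_top2")
    if len(set(agent_labels)) > 1:
        flags.append("agent_disagreement")
    return flags


flags = uncertainty_flags(
    {"sci.med": 0.42, "sci.space": 0.38, "sci.crypt": 0.20},
    ["sci.med", "sci.space", "sci.med"],
)
print(flags)  # ['low_confidence', 'close_top2', 'agent_disagreement']
```

A document that raises any flag could be routed to human review instead of being classified automatically.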
197
+
198
+ ## Ensemble Strategy
199
+
200
+ The system uses two ensemble approaches:
201
+
202
+ 1. **Weighted Voting**: Combines predictions based on validation performance
203
+ - Simple and interpretable
204
+ - Each agent weighted by validation accuracy
205
+
206
+ 2. **Stacking**: Meta-learner optimally combines agent predictions
207
+ - Learns complex agent interaction patterns
208
+ - Achieves best performance (~73% accuracy)
209
+ - Meta-learner uses Logistic Regression with 5-fold cross-validation
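Weighted voting can be sketched in a few lines of plain Python: each agent's probability vector is scaled by its share of the total weight and the results are summed. The weights below reuse the test accuracies from the performance table as stand-ins, because the README does not list the validation accuracies; the probability vectors are invented.

```python
def weighted_vote(agent_probs, agent_weights):
    """Combine per-agent class probabilities into one weighted distribution.

    agent_probs: dict agent name -> list of class probabilities.
    agent_weights: dict agent name -> non-negative weight.
    """
    total_weight = sum(agent_weights.values())
    n_classes = len(next(iter(agent_probs.values())))
    combined = [0.0] * n_classes
    for name, probs in agent_probs.items():
        share = agent_weights[name] / total_weight
        for i, p in enumerate(probs):
            combined[i] += share * p
    return combined


# Stand-in weights (test accuracies from the table, not validation accuracies).
weights = {"tfidf": 0.6618, "embedding": 0.7245, "xgboost": 0.6144}
# Invented probabilities for one document over three classes.
probs = {
    "tfidf": [0.6, 0.3, 0.1],
    "embedding": [0.2, 0.7, 0.1],
    "xgboost": [0.5, 0.4, 0.1],
}

combined = weighted_vote(probs, weights)
winner = max(range(len(combined)), key=combined.__getitem__)
print(winner, [round(p, 3) for p in combined])
```

In this example two of the three agents prefer class 0, yet class 1 wins because the embedding agent is both more confident and more heavily weighted.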
210
+
211
+ ## Use Cases
212
+
213
+ This multi-agent approach is applicable to:
214
+ - Customer support ticket routing
215
+ - Email categorization
216
+ - Content moderation
217
+ - Document management systems
218
+ - News article classification
219
+
220
+ ## License
221
+
222
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
223
+
224
+ ## Acknowledgments
225
+
226
+ - 20 Newsgroups dataset creators
227
+ - scikit-learn team for dataset hosting
228
+ - Hugging Face for sentence-transformers
229
+ - Open-source ML community
230
+
231
+ ## Contact
232
+
233
+ **Spencer Purdy**
234
+ GitHub: [@SpencerCPurdy](https://github.com/SpencerCPurdy)
235
+
236
+ ---
237
+
238
+ *This is a portfolio project developed to demonstrate multi-agent machine learning and ensemble methods. The system is intended for educational and demonstration purposes. Performance metrics reflect results on the specific dataset used.*