# Intent Classification Improvements

## Overview

This document describes the improvements made to intent classification in Plan 5.

## Problem Identified

The query "Cảnh báo lừa đảo giả danh công an" ("Warning about scams impersonating the police") was being classified as `search_office` instead of `search_advisory`.

### Root Cause

1. **Keyword Conflict**: The keyword "công an" ("police") is a `search_office` keyword, but it also appears in queries that belong to `search_advisory`.
2. **Order of Checks**: The code checked `has_office_keywords` before `has_advisory_keywords`, so office keywords matched first.
3. **Limited Training Data**: The `search_advisory` intent had only 7 examples, fewer than the other intents.

## Solutions Implemented

### 1. Improved Keyword Matching Logic

**File**: `backend/hue_portal/chatbot/chatbot.py`

- Changed order: check `has_advisory_keywords` **before** `has_office_keywords`.
- Added more advisory keywords: "mạo danh", "thủ đoạn", "cảnh giác".
- As a result, queries that contain both advisory and office keywords are matched as advisory first.

### 2. Enhanced Training Data

**File**: `backend/hue_portal/chatbot/training/intent_dataset.json`

- Expanded `search_advisory` from 7 to 23 examples.
- Added specific examples:
  - "cảnh báo lừa đảo giả danh công an"
  - "mạo danh cán bộ công an"
  - "lừa đảo mạo danh"
  - And 15 more variations

### 3. Retrained Model

- Retrained the intent classification model with the improved training data
- Model accuracy improved
- Better handling of edge cases

## Results

### Before Improvements

- Query "Cảnh báo lừa đảo giả danh công an" → `search_office` (incorrect)
- Limited training examples for `search_advisory`

### After Improvements

- Query "Cảnh báo lừa đảo giả danh công an" → `search_advisory` (correct)
- More balanced training data across all intents
- Better keyword matching logic

## Testing

Test queries that now work correctly:

- "Cảnh báo lừa đảo giả danh công an" → `search_advisory`
- "Lừa đảo mạo danh cán bộ" → `search_advisory`
- "Mạo danh cán bộ công an" → `search_advisory`

## 2025-11-14 Update: Serialization & API Regression

- Added `_serialize_document` in `backend/hue_portal/chatbot/chatbot.py` so RAG responses return JSON-safe payloads (no more `TypeError: Object of type type is not JSON serializable` when embeddings include model instances).
- Re-tested intents end-to-end via `scripts/test_api_endpoint.py` (6 queries spanning all intents):
  - **Result:** 6/6 passed, 100% intent accuracy.
  - **Latency:** avg ~3.7 s (note: the first call warms up `keepitreal/vietnamese-sbert-v2`; subsequent calls take ≤1.8 s).
- Health checklist before testing:
  1. `POSTGRES_HOST=localhost POSTGRES_PORT=5433 ../../.venv/bin/python manage.py runserver 0.0.0.0:8090`
  2. `API_BASE_URL=http://localhost:8090 python scripts/test_api_endpoint.py`
  3. Watch server logs for any serialization warnings (none observed after the fix).

## Files Modified

1. `backend/hue_portal/chatbot/training/intent_dataset.json` - Enhanced training data
2. `backend/hue_portal/chatbot/chatbot.py` - Improved keyword matching logic
3. `backend/hue_portal/chatbot/training/artifacts/intent_model.joblib` - Retrained model

## Future Improvements

- Continue to add training examples as edge cases are discovered
- Consider using more sophisticated ML models (e.g., transformer-based)
- Implement active learning to automatically improve from user feedback
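
## Example: Keyword Check Order

The reordering described under "Improved Keyword Matching Logic" can be illustrated with a minimal sketch. This is not the code in `chatbot.py`: the function name `classify_by_keywords` and both keyword tuples are hypothetical. Only "công an", "mạo danh", "thủ đoạn", and "cảnh giác" come from this document; the remaining keywords are assumed for illustration. The flag names `has_advisory_keywords` and `has_office_keywords` follow the ones mentioned above.

```python
import unicodedata
from typing import Optional

# Hypothetical keyword lists; the real lists in chatbot.py may differ.
ADVISORY_KEYWORDS = (
    "cảnh báo", "lừa đảo", "giả danh",     # assumed for illustration
    "mạo danh", "thủ đoạn", "cảnh giác",   # keywords named in this document
)
OFFICE_KEYWORDS = (
    "công an",                 # the conflicting keyword named in this document
    "trụ sở", "địa chỉ",       # assumed for illustration
)


def _normalize(text: str) -> str:
    """Lowercase and NFC-normalize so Vietnamese diacritics compare reliably."""
    return unicodedata.normalize("NFC", text).lower()


def classify_by_keywords(query: str) -> Optional[str]:
    """Return an intent from keyword matching, or None if nothing matches."""
    text = _normalize(query)
    has_advisory_keywords = any(_normalize(k) in text for k in ADVISORY_KEYWORDS)
    has_office_keywords = any(_normalize(k) in text for k in OFFICE_KEYWORDS)

    # Advisory is checked first, so a query that mentions "công an" inside
    # a scam warning is not captured by the office branch.
    if has_advisory_keywords:
        return "search_advisory"
    if has_office_keywords:
        return "search_office"
    return None


if __name__ == "__main__":
    for q in ("Cảnh báo lừa đảo giả danh công an", "Địa chỉ công an phường"):
        print(q, "->", classify_by_keywords(q))
```

Because the advisory branch runs first, the ordering encodes a priority: a query with both kinds of keywords resolves to `search_advisory`, while a query with only office keywords still resolves to `search_office`. If more intents start to share keywords, an explicit priority list or a per-intent score may be easier to maintain than a chain of `if` statements.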