Intent Classification Improvements
Overview
This document describes the improvements made to intent classification in Plan 5.
Problem Identified
The query "Cảnh báo lừa đảo giả danh công an" ("Warning about scams impersonating the police") was classified as search_office instead of search_advisory.
Root Cause
- Keyword Conflict: The keyword "công an" appears in both search_office queries and search_advisory queries
- Order of Checks: The code checked has_office_keywords before has_advisory_keywords, causing office keywords to match first
- Limited Training Data: The search_advisory intent had only 7 examples, fewer than the other intents
Solutions Implemented
1. Improved Keyword Matching Logic
File: backend/hue_portal/chatbot/chatbot.py
- Changed order: Check has_advisory_keywords before has_office_keywords
- Added more keywords for advisory: "mạo danh", "thủ đoạn", "cảnh giác"
- This ensures advisory queries are matched first when they contain both advisory and office keywords
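The effect of the reordering can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in chatbot.py: the keyword tuples, the standalone helper functions, the classify_by_keywords name, and the None fallback are all hypothetical. Only "công an" and the three added advisory keywords come from this document.

```python
from typing import Optional

# Hypothetical keyword lists. Only "công an", "mạo danh", "thủ đoạn" and
# "cảnh giác" are taken from this document; the rest are illustrative.
ADVISORY_KEYWORDS = ("cảnh báo", "lừa đảo", "mạo danh", "thủ đoạn", "cảnh giác")
OFFICE_KEYWORDS = ("công an", "trụ sở", "địa chỉ")


def has_advisory_keywords(query: str) -> bool:
    """True if the lowercased query contains any advisory keyword."""
    text = query.lower()
    return any(keyword in text for keyword in ADVISORY_KEYWORDS)


def has_office_keywords(query: str) -> bool:
    """True if the lowercased query contains any office keyword."""
    text = query.lower()
    return any(keyword in text for keyword in OFFICE_KEYWORDS)


def classify_by_keywords(query: str) -> Optional[str]:
    """Return an intent from keyword rules, or None if no rule matches."""
    # Advisory is checked first, so a query that contains both advisory and
    # office keywords resolves to search_advisory.
    if has_advisory_keywords(query):
        return "search_advisory"
    if has_office_keywords(query):
        return "search_office"
    return None
```

With the checks in this order, a query that mentions "công an" only falls through to search_office when it carries no advisory keyword.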
2. Enhanced Training Data
File: backend/hue_portal/chatbot/training/intent_dataset.json
- Expanded search_advisory examples from 7 to 23 examples
- Added specific examples:
  - "cảnh báo lừa đảo giả danh công an"
  - "mạo danh cán bộ công an"
  - "lừa đảo mạo danh"
  - And 15 more variations
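A quick way to spot an under-represented intent like this is to count examples per intent. The sketch below assumes each record is an object with "text" and "intent" fields; the actual layout of intent_dataset.json is not shown in this document and may differ. The sample data and both function names are hypothetical.

```python
import json
from collections import Counter

# Hypothetical sample in an assumed record format; the real
# intent_dataset.json may be structured differently.
SAMPLE_DATASET = json.loads("""
[
  {"text": "cảnh báo lừa đảo giả danh công an", "intent": "search_advisory"},
  {"text": "mạo danh cán bộ công an", "intent": "search_advisory"},
  {"text": "lừa đảo mạo danh", "intent": "search_advisory"},
  {"text": "địa chỉ công an phường", "intent": "search_office"}
]
""")


def count_examples_per_intent(records):
    """Map each intent label to its number of training examples."""
    return Counter(record["intent"] for record in records)


def underrepresented_intents(records, min_examples):
    """Return, sorted, the intents with fewer than min_examples examples."""
    counts = count_examples_per_intent(records)
    return sorted(intent for intent, n in counts.items() if n < min_examples)
```

Running a check like underrepresented_intents on the dataset before retraining would flag intents that need more examples.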
3. Retrained Model
- Retrained intent classification model with improved training data
- Model accuracy improved
- Better handling of edge cases
Results
Before Improvements
- Query "Cảnh báo lừa đảo giả danh công an" → search_office (incorrect)
- Limited training examples for search_advisory
After Improvements
- Query "Cảnh báo lừa đảo giả danh công an" → search_advisory (correct)
- More balanced training data across all intents
- Better keyword matching logic
Testing
Test queries that now work correctly:
- "Cảnh báo lừa đảo giả danh công an" → search_advisory
- "Lừa đảo mạo danh cán bộ" → search_advisory
- "Mạo danh cán bộ công an" → search_advisory
2025-11-14 Update — Serialization & API Regression
- Added _serialize_document in backend/hue_portal/chatbot/chatbot.py so RAG responses return JSON-safe payloads (no more TypeError: Object of type type is not JSON serializable when embeddings include model instances).
- Re-tested intents end-to-end via scripts/test_api_endpoint.py (6 queries spanning all intents):
  - Result: 6/6 passed, 100% intent accuracy.
  - Latency: avg ~3.7 s (note: first call warms up keepitreal/vietnamese-sbert-v2, subsequent calls ≤1.8 s).
- Health checklist before testing:
  - POSTGRES_HOST=localhost POSTGRES_PORT=5433 ../../.venv/bin/python manage.py runserver 0.0.0.0:8090
  - API_BASE_URL=http://localhost:8090 python scripts/test_api_endpoint.py
  - Watch server logs for any serialization warnings (none observed after fix).
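The kind of conversion that makes a payload JSON-safe can be sketched as below. This is not the actual _serialize_document, whose body is not shown here. The to_json_safe name and the specific type handling are assumptions, chosen to show how a class object (the cause of "Object of type type is not JSON serializable") can be reduced to a plain value.

```python
import json
from datetime import date, datetime
from decimal import Decimal


def to_json_safe(value):
    """Recursively convert a value into something json.dumps accepts.

    Hypothetical helper for illustration; not the project's implementation.
    """
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, Decimal):
        return float(value)
    if isinstance(value, dict):
        return {str(key): to_json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(item) for item in value]
    if isinstance(value, type):
        # A class object is reduced to its name instead of raising TypeError.
        return value.__name__
    # Fallback for model instances and other objects: use their string form.
    return str(value)
```

Passing a response through such a helper before json.dumps trades fidelity for robustness: unknown objects become strings rather than raising at serialization time.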
Files Modified
- backend/hue_portal/chatbot/training/intent_dataset.json - Enhanced training data
- backend/hue_portal/chatbot/chatbot.py - Improved keyword matching logic
- backend/hue_portal/chatbot/training/artifacts/intent_model.joblib - Retrained model
Future Improvements
- Continue to add more training examples as edge cases are discovered
- Consider using more sophisticated ML models (e.g., transformer-based)
- Implement active learning to automatically improve from user feedback