Intent Classification Improvements

Overview

This document describes the improvements made to intent classification in Plan 5.

Problem Identified

The query "Cảnh báo lừa đảo giả danh công an" ("Warning about scams impersonating the police") was classified as search_office instead of search_advisory.

Root Cause

  1. Keyword Conflict: The keyword "công an" ("police") is a search_office keyword, but it also occurs in queries that belong to search_advisory
  2. Order of Checks: The code checked has_office_keywords before has_advisory_keywords, causing office keywords to match first
  3. Limited Training Data: The search_advisory intent had only 7 examples, fewer than the other intents

Solutions Implemented

1. Improved Keyword Matching Logic

File: backend/hue_portal/chatbot/chatbot.py

  • Changed order: Check has_advisory_keywords before has_office_keywords
  • Added more keywords for advisory: "mạo danh" ("impersonation"), "thủ đoạn" ("scheme, trick"), "cảnh giác" ("vigilance")
  • This ensures advisory queries are matched first when they contain both advisory and office keywords
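The reordering described above can be sketched as follows. This is a minimal illustration, not the code in chatbot.py: the function name classify_by_keywords and both keyword lists are assumptions (only "công an", "mạo danh", "thủ đoạn" and "cảnh giác" are taken from this document), and the real module may structure the checks differently.

```python
import unicodedata
from typing import Optional

# Illustrative keyword lists (assumption): only "công an", "mạo danh",
# "thủ đoạn" and "cảnh giác" come from this document.
ADVISORY_KEYWORDS = ("cảnh báo", "lừa đảo", "giả danh", "mạo danh", "thủ đoạn", "cảnh giác")
OFFICE_KEYWORDS = ("công an", "trụ sở")


def _normalize(text: str) -> str:
    """Lowercase and NFC-normalize so composed and decomposed diacritics compare equal."""
    return unicodedata.normalize("NFC", text).lower()


def classify_by_keywords(query: str) -> Optional[str]:
    """Return an intent name from keyword rules, or None if no rule matches."""
    q = _normalize(query)
    has_advisory_keywords = any(_normalize(k) in q for k in ADVISORY_KEYWORDS)
    has_office_keywords = any(_normalize(k) in q for k in OFFICE_KEYWORDS)

    # Advisory is checked first, so a query containing both kinds of
    # keyword (e.g. "giả danh công an") resolves to search_advisory.
    if has_advisory_keywords:
        return "search_advisory"
    if has_office_keywords:
        return "search_office"
    return None
```

With this ordering, a query that mentions "công an" only falls through to search_office when it contains no advisory keyword at all.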

2. Enhanced Training Data

File: backend/hue_portal/chatbot/training/intent_dataset.json

  • Expanded search_advisory examples from 7 to 23 examples
  • Added specific examples:
    • "cảnh báo lừa đảo giả danh công an"
    • "mạo danh cán bộ công an"
    • "lừa đảo mạo danh"
    • And 15 more variations
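A quick way to spot thin intents like the original search_advisory is to count examples per intent. The sketch below assumes the dataset has been parsed into a mapping from intent name to a list of example strings. That shape is a guess for illustration, and the actual schema of intent_dataset.json may differ. The helper names are hypothetical.

```python
from typing import Dict, List


def count_examples(dataset: Dict[str, List[str]]) -> Dict[str, int]:
    """Map each intent name to its number of training examples.

    Assumes `dataset` maps intent name -> list of example strings
    (hypothetical shape, not confirmed by the project).
    """
    return {intent: len(examples) for intent, examples in dataset.items()}


def underrepresented(dataset: Dict[str, List[str]], min_examples: int) -> List[str]:
    """Return the intents with fewer than `min_examples` examples, sorted by name."""
    counts = count_examples(dataset)
    return sorted(intent for intent, n in counts.items() if n < min_examples)
```

Running underrepresented on the parsed dataset before retraining gives a short list of intents that may need more examples.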

3. Retrained Model

  • Retrained the intent classification model on the expanded training data (artifact: backend/hue_portal/chatbot/training/artifacts/intent_model.joblib)
  • Model accuracy improved
  • Better handling of edge cases

Results

Before Improvements

  • Query "Cảnh báo lừa đảo giả danh công an" → search_office (incorrect)
  • Limited training examples for search_advisory

After Improvements

  • Query "Cảnh báo lừa đảo giả danh công an" → search_advisory (correct)
  • More balanced training data across all intents
  • Better keyword matching logic

Testing

Test queries that now work correctly:

  • "Cảnh báo lừa đảo giả danh công an" → search_advisory
  • "Lừa đảo mạo danh cán bộ" → search_advisory
  • "Mạo danh cán bộ công an" → search_advisory
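The queries above can be kept as a small regression table so the misclassification does not return unnoticed. In this sketch, find_misclassified is a hypothetical helper and classify stands for whatever callable wraps the project's classifier. Neither name comes from the codebase.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# (query, expected intent) pairs taken from the test list in this document.
ADVISORY_CASES: List[Tuple[str, str]] = [
    ("Cảnh báo lừa đảo giả danh công an", "search_advisory"),
    ("Lừa đảo mạo danh cán bộ", "search_advisory"),
    ("Mạo danh cán bộ công an", "search_advisory"),
]


def find_misclassified(
    classify: Callable[[str], Optional[str]],
    cases: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, Optional[str]]]:
    """Return (query, expected, actual) for every case the classifier gets wrong."""
    failures = []
    for query, expected in cases:
        actual = classify(query)
        if actual != expected:
            failures.append((query, expected, actual))
    return failures
```

An empty result means every listed query still maps to its expected intent. New edge cases can be appended to the table as they are discovered.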

2025-11-14 Update — Serialization & API Regression

  • Added _serialize_document in backend/hue_portal/chatbot/chatbot.py so that RAG responses return JSON-safe payloads. This removes the TypeError: Object of type type is not JSON serializable that was raised when embeddings include model instances.
  • Re-tested intents end-to-end via scripts/test_api_endpoint.py (6 queries spanning all intents):
    • Result: 6/6 passed, 100% intent accuracy.
    • Latency: ~3.7 s on average. The first call warms up keepitreal/vietnamese-sbert-v2, and subsequent calls take ≤1.8 s.
  • Health checklist before testing:
    1. POSTGRES_HOST=localhost POSTGRES_PORT=5433 ../../.venv/bin/python manage.py runserver 0.0.0.0:8090
    2. API_BASE_URL=http://localhost:8090 python scripts/test_api_endpoint.py
    3. Watch server logs for any serialization warnings (none observed after fix).
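The general idea behind a JSON-safe serializer can be sketched as below. This is not the project's _serialize_document: to_json_safe is a hypothetical stand-in, and falling back to a string for unknown objects is one possible policy among several.

```python
import json
from datetime import date, datetime
from typing import Any


def to_json_safe(value: Any) -> Any:
    """Recursively convert a value into something json.dumps accepts."""
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, dict):
        return {str(k): to_json_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set, frozenset)):
        return [to_json_safe(v) for v in value]
    if isinstance(value, type):
        # A class object is what triggers "Object of type type is not
        # JSON serializable"; represent it by its dotted name instead.
        return f"{value.__module__}.{value.__qualname__}"
    # Fallback (assumption): stringify anything else rather than fail.
    return str(value)


if __name__ == "__main__":
    payload = {"title": "doc", "model": dict, "tags": ("a", "b")}
    print(json.dumps(to_json_safe(payload)))
```

Stringifying unknown objects keeps the API response from failing, at the cost of hiding values that perhaps should have been dropped or mapped explicitly.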

Files Modified

  1. backend/hue_portal/chatbot/training/intent_dataset.json - Enhanced training data
  2. backend/hue_portal/chatbot/chatbot.py - Improved keyword matching logic
  3. backend/hue_portal/chatbot/training/artifacts/intent_model.joblib - Retrained model

Future Improvements

  • Continue to add more training examples as edge cases are discovered
  • Consider using more sophisticated ML models (e.g., transformer-based)
  • Implement active learning to automatically improve from user feedback