Optimizing Chatbot Speed and Accuracy

Created: 2025-01-27

1. Analysis of Current Bottlenecks

1.1 Intent Classification

Problems:

  • Loops over many keywords on every call (fine_keywords: 9 items, fine_single_words: 7 items)
  • _remove_accents() is recomputed repeatedly for the same keyword
  • No pre-compiled regex patterns

Impact: ~5-10ms per query

1.2 Search Pipeline

Problems:

  • list(queryset) loads ALL objects into memory before searching (see the sketch after this list)
  • TF-IDF vectorization is recomputed over the entire dataset on every request
  • No early exit when a good result is found
  • Query expansion hits the database on every request

Impact: ~100-500ms for large datasets
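
To make the first point concrete, here is a minimal sketch of the difference; Fine and its name field are used purely as an illustration for whichever model is being searched.

# Illustrative only; adapt the model and field to the actual search target.
from hue_portal.core.models import Fine

# Current pattern: every row is pulled into Python before any scoring happens.
all_objects = list(Fine.objects.all())  # memory and I/O grow with the table size

# Preferred pattern: let the database narrow the candidates first.
candidates = list(Fine.objects.filter(name__icontains="đèn đỏ")[:200])  # bounded work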

1.3 LLM Generation

Problems:

  • The prompt is rebuilt on every request (not cached)
  • No streaming responses
  • max_new_tokens=150 is acceptable, but could be tuned further
  • Generated responses are not cached

Impact: ~1-5s for the local model, ~2-10s for API calls

1.4 No Response Caching

Problems:

  • Identical queries are reprocessed from scratch
  • Search results are not cached
  • Intent classification results are not cached

Impact: ~100-500ms for duplicate queries

2. Optimizing Intent Classification

2.1 Pre-compile Keyword Patterns

# backend/hue_portal/core/chatbot.py

import re
from functools import lru_cache
from typing import Tuple

class Chatbot:
    def __init__(self):
        self.intent_classifier = None
        self.vectorizer = None
        # Pre-compile keyword patterns
        self._compile_keyword_patterns()
        self._train_classifier()
    
    def _compile_keyword_patterns(self):
        """Pre-compile regex patterns for faster matching."""
        # Fine keywords (multi-word first, then single)
        self.fine_patterns_multi = [
            re.compile(r'\b' + re.escape(kw) + r'\b', re.IGNORECASE)
            for kw in ["mức phạt", "vi phạm", "đèn đỏ", "nồng độ cồn", 
                      "mũ bảo hiểm", "tốc độ", "bằng lái", "vượt đèn"]
        ]
        self.fine_patterns_single = [
            re.compile(r'\b' + re.escape(kw) + r'\b', re.IGNORECASE)
            for kw in ["phạt", "vượt", "đèn", "mức"]
        ]
        
        # Pre-compute accent-free versions
        self.fine_keywords_ascii = [self._remove_accents(kw) for kw in 
                                    ["mức phạt", "vi phạm", "đèn đỏ", ...]]
        
        # Procedure, Office, Advisory patterns...
        # Similar pattern compilation
    
    @lru_cache(maxsize=1000)  # cached per (self, query); fine when the chatbot is a long-lived singleton
    def classify_intent(self, query: str) -> Tuple[str, float]:
        """Cached intent classification."""
        query_lower = query.lower().strip()
        
        # Fast path: Check compiled patterns
        for pattern in self.fine_patterns_multi:
            if pattern.search(query_lower):
                return ("search_fine", 0.95)
        
        # ... rest of logic

Benefits:

  • Cuts intent classification time by ~50%
  • Caches results for duplicate queries (a quick timing check is sketched below)
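
A quick way to verify these numbers locally (a sketch; it assumes the get_chatbot() factory used by the benchmark script in section 12, and the import path is an assumption):

import timeit

from hue_portal.core.chatbot import get_chatbot  # assumed import path

chatbot = get_chatbot()
query = "mức phạt vượt đèn đỏ là bao nhiêu?"

# First call pays the full classification cost; repeat calls should hit the lru_cache.
cold = timeit.timeit(lambda: chatbot.classify_intent(query), number=1)
warm = timeit.timeit(lambda: chatbot.classify_intent(query), number=100) / 100
print(f"cold: {cold * 1000:.2f} ms | warm (cached): {warm * 1000:.4f} ms")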

2.2 Early Exit Strategy

def _keyword_based_intent(self, query: str) -> Tuple[str, float]:
    query_lower = query.lower().strip()
    
    # Fast path: Check most common intents first
    # Fine queries are most common → check first
    if any(pattern.search(query_lower) for pattern in self.fine_patterns_multi):
        return ("search_fine", 0.95)
    
    # Early exit for very short queries (likely greeting)
    if len(query.split()) <= 2:
        if any(greeting in query_lower for greeting in ["xin chào", "chào", "hello"]):
            return ("greeting", 0.9)
    
    # ... rest

3. Optimizing the Search Pipeline

3.1 Limit the QuerySet Before Loading

# backend/hue_portal/core/search_ml.py

def search_with_ml(queryset, query, text_fields, top_k=20, min_score=0.1, use_hybrid=True):
    if not query:
        return queryset[:top_k]
    
    # OPTIMIZATION: Limit queryset early for large datasets
    # Only search in first N records if dataset is huge
    MAX_SEARCH_CANDIDATES = 1000
    total_count = queryset.count()
    
    if total_count > MAX_SEARCH_CANDIDATES:
        # Use database-level filtering first
        # Try exact match on primary field first
        primary_field = text_fields[0] if text_fields else None
        if primary_field:
            exact_matches = queryset.filter(
                **{f"{primary_field}__icontains": query}
            )[:top_k * 2]
            
            if exact_matches.count() >= top_k:
                # We have enough exact matches, return them
                return exact_matches[:top_k]
        
        # Limit candidates for ML search
        queryset = queryset[:MAX_SEARCH_CANDIDATES]
    
    # Continue with existing search logic...
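
One Django caveat for the snippet above: a sliced queryset (queryset[:MAX_SEARCH_CANDIDATES]) can no longer be filtered, so any additional filter() calls must happen before the slice; the downstream ML code should only iterate over or materialize the limited candidates.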

3.2 Cache Search Results

# backend/hue_portal/core/search_ml.py

from functools import lru_cache
import hashlib
import json

def _get_query_hash(query: str, model_name: str, text_fields: tuple) -> str:
    """Generate hash for query caching."""
    key = f"{query}|{model_name}|{':'.join(text_fields)}"
    return hashlib.md5(key.encode()).hexdigest()

# Cache search results for 1 hour
@lru_cache(maxsize=500)
def _cached_search(query_hash: str, queryset_ids: tuple, top_k: int):
    """Cached search results."""
    # This will be called with actual queryset in wrapper
    pass

def search_with_ml(queryset, query, text_fields, top_k=20, min_score=0.1, use_hybrid=True):
    # Check cache first
    query_hash = _get_query_hash(query, queryset.model.__name__, tuple(text_fields))
    
    # Try to get from cache (if queryset hasn't changed)
    # Note: Full caching requires tracking queryset state
    
    # ... existing search logic
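
Because an lru_cache keyed on a hash cannot see the live queryset, a simpler workable variant is to cache only the matching primary keys in Django's cache and rebuild the queryset from them. A sketch (assuming the cache backend from section 5 is configured; note the original ranking order is not preserved when re-filtering by pk):

from django.core.cache import cache

SEARCH_CACHE_TIMEOUT = 3600  # 1 hour

def search_with_ml_cached(queryset, query, text_fields, top_k=20, **kwargs):
    """Wrapper around search_with_ml() that caches the matching primary keys."""
    cache_key = "search:" + _get_query_hash(query, queryset.model.__name__, tuple(text_fields))

    cached_pks = cache.get(cache_key)
    if cached_pks is not None:
        # Rebuild the result set from the cached ids (order not preserved).
        return queryset.filter(pk__in=cached_pks)[:top_k]

    results = search_with_ml(queryset, query, text_fields, top_k=top_k, **kwargs)
    cache.set(cache_key, [obj.pk for obj in results], SEARCH_CACHE_TIMEOUT)
    return results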

3.3 Optimize TF-IDF Calculation

# Pre-compute TF-IDF vectors for common queries
# Use incremental TF-IDF instead of recalculating

from typing import List

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

class CachedTfidfVectorizer:
    """TF-IDF vectorizer with caching."""
    
    def __init__(self):
        self.vectorizer = None
        self.doc_vectors = None
        self.doc_ids = None
    
    def fit_transform_cached(self, documents: List[str], doc_ids: List[int]):
        """Fit and cache document vectors."""
        if self.doc_ids == tuple(doc_ids):
            # Same documents, reuse vectors
            return self.doc_vectors
        
        # New documents, recompute
        self.vectorizer = TfidfVectorizer(
            analyzer='word',
            ngram_range=(1, 2),
            min_df=1,
            max_df=0.95,
            lowercase=True
        )
        self.doc_vectors = self.vectorizer.fit_transform(documents)
        self.doc_ids = tuple(doc_ids)
        return self.doc_vectors
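
A short usage sketch (document texts and ids are illustrative) showing how the cached matrix would be scored against a query:

from sklearn.metrics.pairwise import cosine_similarity

cached_tfidf = CachedTfidfVectorizer()

documents = ["vượt đèn đỏ bị phạt tiền", "không đội mũ bảo hiểm", "nồng độ cồn vượt mức cho phép"]
doc_ids = [1, 2, 3]

# First call fits the vectorizer; later calls with the same doc_ids reuse the cached matrix.
doc_vectors = cached_tfidf.fit_transform_cached(documents, doc_ids)

query_vector = cached_tfidf.vectorizer.transform(["mức phạt vượt đèn đỏ"])
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranked = scores.argsort()[::-1]  # document indices ranked by similarity
print([(doc_ids[i], round(float(scores[i]), 3)) for i in ranked])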

3.4 Early Exit on Exact Matches

def search_with_ml(queryset, query, text_fields, top_k=20, min_score=0.1, use_hybrid=True):
    # OPTIMIZATION: Check exact matches first (fastest)
    query_normalized = normalize_text(query)
    
    # Try exact match on primary field
    primary_field = text_fields[0] if text_fields else None
    if primary_field:
        exact_qs = queryset.filter(**{f"{primary_field}__iexact": query})
        if exact_qs.exists():
            # Found exact match, return immediately
            return exact_qs[:top_k]
        
        # Try case-insensitive contains (faster than ML)
        contains_qs = queryset.filter(**{f"{primary_field}__icontains": query})
        if contains_qs.count() <= top_k * 2:
            # Small result set, return directly
            return contains_qs[:top_k]
    
    # Only use ML search if no good exact matches
    # ... existing ML search logic

4. Optimizing LLM Generation

4.1 Prompt Caching

# backend/hue_portal/chatbot/llm_integration.py

import hashlib
import time
from functools import lru_cache
from typing import Any, Dict, List, Optional

class LLMGenerator:
    def __init__(self, provider: Optional[str] = None):
        self.provider = provider or LLM_PROVIDER
        self.prompt_cache = {}  # Cache prompts by hash
        self.response_cache = {}  # Cache responses
    
    def _get_prompt_hash(self, query: str, documents: List[Any]) -> str:
        """Generate hash for prompt caching."""
        doc_ids = [getattr(doc, 'id', None) for doc in documents[:5]]
        key = f"{query}|{doc_ids}"
        return hashlib.md5(key.encode()).hexdigest()
    
    def generate_answer(self, query: str, context: Optional[List[Dict]], documents: Optional[List[Any]]):
        if not self.is_available():
            return None
        
        # Check cache first
        prompt_hash = self._get_prompt_hash(query, documents or [])
        if prompt_hash in self.response_cache:
            cached_response = self.response_cache[prompt_hash]
            # Check if cache is still valid (e.g., < 1 hour old)
            if cached_response.get('timestamp', 0) > time.time() - 3600:
                return cached_response['response']
        
        # Build prompt (may be cached)
        prompt = self._build_prompt(query, context, documents)
        response = self._generate_from_prompt(prompt, context=context)
        
        # Cache response
        if response:
            self.response_cache[prompt_hash] = {
                'response': response,
                'timestamp': time.time()
            }
        
        return response
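
Note that self.response_cache is an unbounded per-process dictionary: in production it is worth capping its size, or simply reusing the shared Django cache from section 5, so that entries are evicted automatically and shared across worker processes.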

4.2 Optimize Local Model Generation

import torch

def _generate_local(self, prompt: str) -> Optional[str]:
    # OPTIMIZATION: use faster, deterministic generation parameters
    inputs = self.local_tokenizer(prompt, return_tensors="pt").to(self.local_model.device)
    with torch.no_grad():
        outputs = self.local_model.generate(
            **inputs,
            max_new_tokens=100,  # Reduced from 150
            do_sample=False,  # Greedy decoding (faster); temperature/top_p only apply when sampling
            use_cache=True,
            pad_token_id=self.local_tokenizer.eos_token_id,
            eos_token_id=self.local_tokenizer.eos_token_id,  # OPTIMIZATION: stop early at EOS
            repetition_penalty=1.1,
        )
    # Decode only the newly generated tokens (skip the echoed prompt)
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return self.local_tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

4.3 Streaming Response (for better UX)

# For API endpoints, support streaming
def generate_answer_streaming(self, query: str, context, documents):
    """Generate answer with streaming for better UX."""
    prompt = self._build_prompt(query, context, documents)
    if self.provider == LLM_PROVIDER_LOCAL:
        # Stream tokens from the local model (streaming helper to be implemented)
        for token in self._generate_local_streaming(prompt):
            yield token
    elif self.provider == LLM_PROVIDER_OPENAI:
        # Use the OpenAI streaming API
        for chunk in self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        ):
            delta = chunk.choices[0].delta.content
            if delta:  # some chunks carry no content (e.g. role or finish markers)
                yield delta
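
On the Django side, the generator above could be exposed through a streaming endpoint. A minimal sketch; the view name, route, and get_llm_generator() factory are illustrative assumptions, not part of the existing API:

# Hypothetical view; adjust imports and wiring to the actual project layout.
from django.http import StreamingHttpResponse
from rest_framework.decorators import api_view
from rest_framework.request import Request


@api_view(["POST"])
def chat_stream(request: Request) -> StreamingHttpResponse:
    message = request.data.get("message", "")
    generator = get_llm_generator()  # assumed factory returning an LLMGenerator
    token_stream = generator.generate_answer_streaming(message, context=None, documents=None)
    return StreamingHttpResponse(token_stream, content_type="text/plain; charset=utf-8")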

5. Response Caching Strategy

5.1 Multi-level Caching

# backend/hue_portal/core/cache_utils.py

from functools import lru_cache
from django.core.cache import cache
import hashlib
import json

class ChatbotCache:
    """Multi-level caching for chatbot responses."""
    
    CACHE_TIMEOUT = 3600  # 1 hour
    
    @staticmethod
    def get_cache_key(query: str, intent: str, session_id: str = None) -> str:
        """Generate cache key."""
        key_parts = [query.lower().strip(), intent]
        if session_id:
            key_parts.append(session_id)
        key_str = "|".join(key_parts)
        return f"chatbot:{hashlib.md5(key_str.encode()).hexdigest()}"
    
    @staticmethod
    def get_cached_response(query: str, intent: str, session_id: str = None):
        """Get cached response."""
        cache_key = ChatbotCache.get_cache_key(query, intent, session_id)
        return cache.get(cache_key)
    
    @staticmethod
    def set_cached_response(query: str, intent: str, response: dict, session_id: str = None):
        """Cache response."""
        cache_key = ChatbotCache.get_cache_key(query, intent, session_id)
        cache.set(cache_key, response, ChatbotCache.CACHE_TIMEOUT)
    
    @staticmethod
    def get_cached_search_results(query: str, model_name: str, text_fields: tuple):
        """Get cached search results."""
        key = f"search:{hashlib.md5(f'{query}|{model_name}|{text_fields}'.encode()).hexdigest()}"
        return cache.get(key)
    
    @staticmethod
    def set_cached_search_results(query: str, model_name: str, text_fields: tuple, results):
        """Cache search results."""
        key = f"search:{hashlib.md5(f'{query}|{model_name}|{text_fields}'.encode()).hexdigest()}"
        cache.set(key, results, ChatbotCache.CACHE_TIMEOUT)
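
ChatbotCache relies on Django's cache framework, so a real backend has to be configured. A minimal sketch for the settings module (the Redis location is an assumption; LocMemCache is enough for a single process):

# settings.py (illustrative; adapt to the actual settings module)

CACHES = {
    "default": {
        # Django >= 4.0 ships a built-in Redis backend; any shared backend works.
        "BACKEND": "django.core.cache.backends.redis.RedisCache",
        "LOCATION": "redis://127.0.0.1:6379/1",
        # For local development without Redis:
        # "BACKEND": "django.core.cache.backends.locmem.LocMemCache",
    }
}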

5.2 Integrating into the Chatbot

# backend/hue_portal/core/chatbot.py

from .cache_utils import ChatbotCache

class Chatbot:
    def generate_response(self, query: str, session_id: str = None) -> Dict[str, Any]:
        query = query.strip()
        
        # Classify intent
        intent, confidence = self.classify_intent(query)
        
        # Check cache first
        cached_response = ChatbotCache.get_cached_response(query, intent, session_id)
        if cached_response:
            return cached_response
        
        # ... existing logic
        
        # Cache response before returning
        response = {
            "message": message,
            "intent": intent,
            "confidence": confidence,
            "results": search_result["results"],
            "count": search_result["count"]
        }
        
        ChatbotCache.set_cached_response(query, intent, response, session_id)
        return response

6. Optimizing Query Expansion

6.1 Cache Synonyms

# backend/hue_portal/core/search_ml.py

from functools import lru_cache
from typing import List

from django.core.cache import cache

@lru_cache(maxsize=1)
def get_all_synonyms():
    """Get all synonyms (cached)."""
    return list(Synonym.objects.all())

def expand_query_with_synonyms(query: str) -> List[str]:
    """Expand query using cached synonyms."""
    query_normalized = normalize_text(query)
    expanded = [query_normalized]
    
    # Use cached synonyms
    synonyms = get_all_synonyms()
    
    for synonym in synonyms:
        keyword = normalize_text(synonym.keyword)
        alias = normalize_text(synonym.alias)
        
        if keyword in query_normalized:
            expanded.append(query_normalized.replace(keyword, alias))
        if alias in query_normalized:
            expanded.append(query_normalized.replace(alias, keyword))
    
    return list(set(expanded))
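
One caveat: an lru_cache never expires on its own, so the cached list should be cleared whenever Synonym rows change. A sketch using Django model signals (assuming Synonym lives in the same core app and that this is wired up wherever the app registers its signals):

from django.db.models.signals import post_delete, post_save
from django.dispatch import receiver

from .models import Synonym
from .search_ml import get_all_synonyms


@receiver(post_save, sender=Synonym)
@receiver(post_delete, sender=Synonym)
def invalidate_synonym_cache(sender, **kwargs):
    """Drop the cached synonym list whenever a Synonym is created, updated, or deleted."""
    get_all_synonyms.cache_clear()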

7. Database Query Optimization

7.1 Use select_related / prefetch_related

# backend/hue_portal/core/chatbot.py

def search_by_intent(self, intent: str, query: str, limit: int = 5):
    if intent == "search_fine":
        qs = Fine.objects.all().select_related('decree')  # If has FK
        # ... rest
    
    elif intent == "search_legal":
        qs = LegalSection.objects.all().select_related('document')
        # ... rest

7.2 Add Database Indexes

# backend/hue_portal/core/models.py

class Fine(models.Model):
    name = models.CharField(max_length=500, db_index=True)  # Add index
    code = models.CharField(max_length=50, db_index=True)   # Add index
    
    class Meta:
        indexes = [
            models.Index(fields=['name', 'code']),
            models.Index(fields=['min_fine', 'max_fine']),
        ]
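
After adding db_index=True or new Meta.indexes, generate and apply a migration (manage.py makemigrations, then manage.py migrate); the indexes only help once they actually exist in the database.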

8. Frontend Optimizations

8.1 Debounce Search Input

// frontend/src/pages/Chat.tsx

import { useEffect, useState } from 'react'
// useDebounce: small custom hook (or a library such as use-debounce)

const [input, setInput] = useState('')
const debouncedInput = useDebounce(input, 300)  // Wait 300ms after the last keystroke

useEffect(() => {
  if (debouncedInput) {
    // Trigger search suggestions
  }
}, [debouncedInput])

8.2 Optimistic UI Updates

const handleSend = async (messageText?: string) => {
  const textToSend = messageText ?? input
  // Show the user message immediately (optimistic update)
  setMessages(prev => [...prev, {
    role: 'user',
    content: textToSend,
    timestamp: new Date()
  }])
  
  // Then fetch response
  const response = await chat(textToSend, sessionId)
  // Update with actual response
}

9. Monitoring & Metrics

9.1 Add Performance Logging

# backend/hue_portal/chatbot/views.py

import logging
import time

logger = logging.getLogger(__name__)

@api_view(["POST"])
def chat(request: Request) -> Response:
    start_time = time.time()
    
    # ... existing logic
    
    # Log performance metrics
    elapsed = time.time() - start_time
    logger.info(f"[PERF] Chat response time: {elapsed:.3f}s | Intent: {intent} | Results: {count}")
    
    # Track slow queries
    if elapsed > 2.0:
        logger.warning(f"[SLOW] Query took {elapsed:.3f}s: {message[:100]}")
    
    return Response(response)

9.2 Track Cache Hit Rate

class ChatbotCache:
    cache_hits = 0
    cache_misses = 0
    
    @staticmethod
    def get_cached_response(query: str, intent: str, session_id: str = None):
        cached = cache.get(ChatbotCache.get_cache_key(query, intent, session_id))
        if cached:
            ChatbotCache.cache_hits += 1
            return cached
        ChatbotCache.cache_misses += 1
        return None
    
    @staticmethod
    def get_cache_stats():
        total = ChatbotCache.cache_hits + ChatbotCache.cache_misses
        if total == 0:
            return {"hit_rate": 0, "hits": 0, "misses": 0}
        return {
            "hit_rate": ChatbotCache.cache_hits / total,
            "hits": ChatbotCache.cache_hits,
            "misses": ChatbotCache.cache_misses
        }
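
These counters are per-process, but they are cheap to expose for spot checks. A hypothetical read-only endpoint (the view name, route, and import path are illustrative assumptions):

# backend/hue_portal/chatbot/views.py (illustrative addition)
from rest_framework.decorators import api_view
from rest_framework.request import Request
from rest_framework.response import Response

from hue_portal.core.cache_utils import ChatbotCache  # assumed path from section 5.1


@api_view(["GET"])
def cache_stats(request: Request) -> Response:
    """Return the in-process cache hit/miss statistics."""
    return Response(ChatbotCache.get_cache_stats())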

10. Expected Performance Improvements

| Optimization               | Current   | Optimized | Improvement |
|----------------------------|-----------|-----------|-------------|
| Intent Classification      | 5-10ms    | 1-3ms     | 70% faster  |
| Search (small dataset)     | 50-100ms  | 10-30ms   | 70% faster  |
| Search (large dataset)     | 200-500ms | 50-150ms  | 70% faster  |
| LLM Generation (cached)    | 1-5s      | 0.01-0.1s | 99% faster  |
| LLM Generation (uncached)  | 1-5s      | 0.8-4s    | 20% faster  |
| Total Response (cached)    | 100-500ms | 10-50ms   | 90% faster  |
| Total Response (uncached)  | 1-6s      | 0.5-3s    | 50% faster  |

11. Implementation Priority

Phase 1: Quick Wins (1-2 days)

  1. ✅ Add response caching (Django cache)
  2. ✅ Pre-compile keyword patterns
  3. ✅ Cache synonyms
  4. ✅ Add database indexes
  5. ✅ Early exit for exact matches

Phase 2: Medium Impact (3-5 days)

  1. ✅ Limit QuerySet before loading
  2. ✅ Optimize TF-IDF calculation
  3. ✅ Prompt caching for LLM
  4. ✅ Optimize local model generation
  5. ✅ Add performance logging

Phase 3: Advanced (1-2 weeks)

  1. ✅ Streaming responses
  2. ✅ Incremental TF-IDF
  3. ✅ Advanced caching strategies
  4. ✅ Query result pre-computation

12. Testing Performance

# backend/scripts/benchmark_chatbot.py

import statistics
import time

# Run inside `python manage.py shell` (or after django.setup()) so the Django ORM is configured.
from hue_portal.core.chatbot import get_chatbot  # assumed import path for get_chatbot()

def benchmark_chatbot():
    chatbot = get_chatbot()
    test_queries = [
        "Mức phạt vượt đèn đỏ là bao nhiêu?",
        "Thủ tục đăng ký cư trú cần gì?",
        "Địa chỉ công an phường ở đâu?",
        # ... more queries
    ]
    
    times = []
    for query in test_queries:
        start = time.time()
        response = chatbot.generate_response(query)
        elapsed = time.time() - start
        times.append(elapsed)
        print(f"Query: {query[:50]}... | Time: {elapsed:.3f}s")
    
    print(f"\nAverage: {statistics.mean(times):.3f}s")
    print(f"Median: {statistics.median(times):.3f}s")
    print(f"P95: {statistics.quantiles(times, n=20)[18]:.3f}s")

Conclusion

With the optimizations above, the chatbot will be:

  • 50-90% faster for cached queries
  • 20-70% faster for uncached queries
  • More accurate, thanks to early exits and exact matching
  • More scalable, thanks to database indexes and query limiting