Implementation Plan: Maternal Health RAG Chatbot v3.0
1. Project Goal
To significantly enhance the quality, accuracy, and naturalness of the RAG chatbot by implementing a state-of-the-art document processing and retrieval pipeline. This version will address the shortcomings of v2, specifically the poor handling of complex document elements (tables, diagrams) and the rigid, templated nature of the LLM responses.
2. Core Problems to Solve
- Poor Data Quality: The current `pdfplumber`-based processor loses critical information from tables, flowcharts, and diagrams, leading to low-quality, out-of-context chunks in the vector store.
- Inaccurate Retrieval: Because of the poor data quality, the retrieval system often fails to find the most relevant context, even when the information exists in the source PDFs.
- Robotic LLM Responses: The current system prompt is too restrictive, forcing the LLM into a fixed template and preventing natural, conversational answers.
3. The "Version 3.0" Plan
This plan is divided into four main phases, designed to be implemented sequentially.
Phase 1: Advanced Document Processing (Completed)
We have replaced our entire PDF processing pipeline with a modern, machine-learning-based tool to handle complex documents. Note: The AMA citation generation feature is deferred to focus on core functionality first.
- Technology: We are using the `unstructured.io` library for parsing. It is a robust, industry-standard tool for extracting text, tables, and other elements from complex PDFs.
- Why `unstructured.io`? After failed attempts with other libraries (`mineru`, `nougat-ocr`) due to performance and dependency issues, `unstructured.io` proved to be the most reliable and effective solution. It uses models like Detectron2 under the hood (via ONNX, simplifying installation) and provides the high-resolution extraction needed for quality results.
- Implementation Steps (Completed; a processing sketch follows this list):
  - Create `src/enhanced_pdf_processor.py`: A new script built on the `unstructured.io` library. It processes a directory of PDFs and outputs structured Markdown files.
  - Use High-Resolution Strategy: The script leverages the `hi_res` strategy in `unstructured` to accurately parse document layouts, convert tables to HTML, and extract images.
  - Update Dependencies: Replaced all previous PDF processing dependencies with `unstructured[local-inference]` in `requirements.txt`.
  - Re-process All Documents: Ran the new script on all PDFs in the `Obs` directory, storing the resulting `.md` files and associated images in the `src/processed_markdown/` directory.
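A minimal sketch of the processing loop, using the paths named in this plan (the helper structure is an assumption about a typical `unstructured` setup, and image extraction is omitted for brevity; `partition_pdf` with `strategy="hi_res"` and `infer_table_structure=True` is the library's documented PDF entry point):

```python
from pathlib import Path

from unstructured.partition.pdf import partition_pdf

INPUT_DIR = Path("Obs")                      # source PDFs, per this plan
OUTPUT_DIR = Path("src/processed_markdown")  # destination for .md files

def pdf_to_markdown(pdf_path: Path) -> str:
    """Parse one PDF with the hi_res strategy and flatten it to Markdown-ish text."""
    elements = partition_pdf(
        filename=str(pdf_path),
        strategy="hi_res",           # layout-aware parsing via a detection model
        infer_table_structure=True,  # keeps tables as HTML in element metadata
    )
    parts = []
    for el in elements:
        if el.category == "Table" and el.metadata.text_as_html:
            parts.append(el.metadata.text_as_html)  # preserve table structure
        else:
            parts.append(el.text)
    return "\n\n".join(parts)

if __name__ == "__main__":
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(INPUT_DIR.glob("*.pdf")):
        (OUTPUT_DIR / f"{pdf.stem}.md").write_text(pdf_to_markdown(pdf), encoding="utf-8")
```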
Phase 2: High-Precision Retrieval with Re-ranking (In Progress)
Once we have high-quality Markdown, we need to ensure our retrieval system can leverage it effectively.
- Technology: We will implement a Cross-Encoder Re-ranking strategy using the `sentence-transformers` library.
- Why Re-ranking? A simple vector search (like our current FAISS implementation) is fast but not always precise: it can retrieve documents that are semantically nearby but not the most relevant. A re-ranker adds a second, more powerful validation step that dramatically increases precision.
- Implementation Steps (a retrieval sketch follows this list):
  - Update Chunking Strategy (Completed): In `src/groq_medical_rag.py`, document loading was changed to read the new `.md` files using `UnstructuredMarkdownLoader`. We now use a `RecursiveCharacterTextSplitter` to create semantically aware chunks.
  - Implement 2-Stage Retrieval (Completed):
    - Stage 1 (Recall): Use the existing FAISS vector store to retrieve a large number of candidate documents (e.g., the top 20).
    - Stage 2 (Precision): Use a Cross-Encoder model (`cross-encoder/ms-marco-MiniLM-L-6-v2`) from the `sentence-transformers` library to score the relevance of these candidates against the user's query, then select the top 5 highest-scoring documents to pass to the LLM.
  - Update the RAG System (Completed): The core logic in `src/groq_medical_rag.py` has been refactored to accommodate this new two-stage process. The confidence calculation has also been updated to use the re-ranked scores.
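A minimal sketch of this pipeline, assuming LangChain's loader, splitter, and FAISS wrapper (the chunk sizes, file path, and function shape are illustrative assumptions; `CrossEncoder` and the model name come straight from `sentence-transformers`):

```python
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import CrossEncoder

# Chunking: load a processed Markdown file and split it into semantically aware chunks.
docs = UnstructuredMarkdownLoader("src/processed_markdown/guideline.md").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Stage 2 model: scores (query, passage) pairs for relevance.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_with_rerank(query, vector_store, k_recall=20, k_final=5):
    """Stage 1: broad FAISS recall; Stage 2: cross-encoder precision."""
    candidates = vector_store.similarity_search(query, k=k_recall)
    pairs = [(query, doc.page_content) for doc in candidates]
    scores = reranker.predict(pairs)  # one relevance score per candidate
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:k_final]  # (document, score) pairs feed the LLM and confidence
```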
Phase 3: Dynamic and Natural LLM Interaction
With high-quality context, we can "unleash" the LLM to provide more human-like responses.
- Technology: Advanced Prompt Engineering.
- Why a new prompt? To move the LLM from a "template filler" to a "reasoning engine." We will give it a persona and a goal, rather than a rigid set of formatting rules.
- Implementation Steps:
- Rewrite the System Prompt: The `SYSTEM_PROMPT` in `src/groq_medical_rag.py` will be replaced with a new version (a wiring sketch follows the draft below).
- Draft of New Prompt:
"You are a world-class medical expert and a compassionate assistant for healthcare professionals in Sri Lanka. Your primary goal is to provide accurate, evidence-based clinical information based only on the provided context from Sri Lankan maternal health guidelines. Your tone should be professional, clear, and supportive. While accuracy is paramount, present the information in a natural, easy-to-understand manner. Feel free to use lists, bullet points, or paragraphs to best structure the answer for clarity. After providing the answer, you must cite the source using the AMA-formatted citation provided with the context. At the end of every response, include the following disclaimer: 'This information is for clinical reference based on Sri Lankan guidelines and does not replace professional medical judgment.'"
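For orientation, a minimal sketch of how the new prompt would be wired into a Groq chat call (the model name, temperature, and helper shape are assumptions; `chat.completions.create` is the `groq` SDK's OpenAI-compatible method):

```python
from groq import Groq

SYSTEM_PROMPT = "..."  # the draft prompt above

client = Groq()  # reads GROQ_API_KEY from the environment

def answer(query: str, context: str) -> str:
    """Ask the LLM to answer from the retrieved, re-ranked context only."""
    response = client.chat.completions.create(
        model="llama3-70b-8192",  # assumed model; use whatever the app configures
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        temperature=0.3,  # modest creativity: natural phrasing, grounded content
    )
    return response.choices[0].message.content
```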
Phase 4: Standardized Citation Formatting
This phase will address the user's feedback on improving citation quality. The current citations are too long and not in a standard scientific format.
- Goal: To format the source citations in a consistent, professional, and standardized scientific style (e.g., AMA or Vancouver).
- Problem: The current `source` metadata is just a file path, which is not user-friendly. The LLM needs structured metadata to create proper citations.
- Implementation Steps (a metadata sketch follows this list):
  - Extract Citation Metadata: Modify the document processing script (`src/enhanced_pdf_processor.py`) to extract structured citation information (e.g., authors, title, journal, year, page numbers) from each document. This could involve looking for patterns or specific text on the first few pages of each PDF; if nothing is found, we will fall back to the filename.
  - Store Metadata: Add the extracted metadata to the `metadata` field of each document chunk created in `src/groq_medical_rag.py`.
  - Create Citation Formatting Prompt: Develop a new system prompt, or enhance the existing one in `src/groq_medical_rag.py`, to instruct the LLM to format the citation in a standard style such as AMA using the provided metadata.
  - Testing and Refinement: Test the new citation generation with various documents and queries to ensure it robustly and consistently produces well-formatted citations.
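A hypothetical shape for the extraction step (the regex, title heuristic, and field names are illustrative assumptions, not the project's actual logic):

```python
import re
from pathlib import Path

def extract_citation_metadata(first_pages_text: str, pdf_path: Path) -> dict:
    """Best-effort citation fields scraped from a document's opening pages."""
    year_match = re.search(r"\b(19|20)\d{2}\b", first_pages_text)
    # Heuristic: treat the first reasonably long line as the document title.
    title_line = next(
        (line.strip() for line in first_pages_text.splitlines() if len(line.strip()) > 20),
        "",
    )
    return {
        "title": title_line or pdf_path.stem,  # filename as the fallback
        "year": year_match.group(0) if year_match else "n.d.",
        "source": pdf_path.name,
    }
```

Each chunk's `metadata` dict would then be updated with this result before indexing, so the LLM receives structured fields rather than a bare file path.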
4. Expected Outcome
By the end of this implementation, the chatbot should be able to:
- Answer questions that require information from complex tables and flowcharts.
- Provide more accurate and relevant answers due to the high-precision retrieval pipeline.
- Include proper AMA-style citations for all retrieved information, enhancing trustworthiness.
- Interact with users in a more natural, helpful, and less robotic tone.
- Have a robust, state-of-the-art foundation for any future enhancements.
5. Project Status Board
- Phase 3: Dynamic and Natural LLM Interaction
  - Rewrite the `SYSTEM_PROMPT` in `src/groq_medical_rag.py`.
- Phase 4: Standardized Citation Formatting
  - Modify `src/enhanced_pdf_processor.py` to extract structured citation metadata.
  - Update `src/groq_medical_rag.py` to store this metadata in document chunks.
  - Enhance the system prompt in `src/groq_medical_rag.py` to instruct the LLM on citation formatting.
  - Test and refine the citation generation.
6. Executor's Feedback or Assistance Requests
No feedback at this time.
7. Branch Name
`feature/standard-citations`
Current Status / Progress Tracking
- Initial deployment to Hugging Face Spaces
- Fixed import issues
- Added environment variables
- Implemented memory optimization
- Fix Gradio version compatibility issue
Key Challenges and Analysis
Gradio Version Compatibility Issue (2024-03-21)
- The current deployment uses Gradio 4.0.0, which has known security vulnerabilities.
- The error occurs after embedding creation succeeds (825 embeddings).
- Gradio needs to be updated to the latest 4.x release to resolve the security issues.
- Proposed solution:
  - Update `requirements.txt` to pin the latest Gradio version (e.g., `gradio>=4.44.0`).
  - Set the Gradio version explicitly in the Hugging Face Space configuration (the `sdk_version` field in the Space's README front matter).
  - Clear the Space cache and redeploy.
Vector Store Persistence Issue (2024-03-21)
- The current implementation recreates the vector store on every Space startup.
- We already have processed and embedded documents locally.
- Needed changes (a load-or-rebuild sketch follows this list):
  - Package the pre-computed vector store with the deployment.
  - Modify initialization to use the pre-computed store.
  - Only recreate the store if it is missing or corrupted.
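A minimal sketch of the intended load-or-rebuild behavior, assuming LangChain's FAISS wrapper (the store path and `build_fn` callback are assumptions; `allow_dangerous_deserialization=True` is required by recent LangChain releases when loading a locally pickled index):

```python
from pathlib import Path

from langchain_community.vectorstores import FAISS

VECTOR_STORE_DIR = "vector_store"  # assumed path committed with the Space

def load_or_build_store(embeddings, build_fn):
    """Prefer the pre-computed FAISS index; rebuild only if missing or unreadable."""
    if Path(VECTOR_STORE_DIR).exists():
        try:
            return FAISS.load_local(
                VECTOR_STORE_DIR,
                embeddings,
                allow_dangerous_deserialization=True,  # we trust our own committed index
            )
        except Exception:
            pass  # corrupted store: fall through and rebuild
    store = build_fn()  # e.g., FAISS.from_documents(chunks, embeddings)
    store.save_local(VECTOR_STORE_DIR)
    return store
```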
High-level Task Breakdown
Update Gradio Version
- Update requirements.txt to use latest Gradio version
- Success Criteria: requirements.txt shows updated version
Update Hugging Face Space Configuration
- Add explicit Gradio version in Space configuration
- Success Criteria: Space configuration shows correct version
Clear Cache and Redeploy
- Clear Hugging Face Space cache
- Redeploy application
- Success Criteria: No Gradio version warning in logs
Package Pre-computed Vector Store
- Create vector store locally
- Add vector store files to git
- Success Criteria: Vector store files committed to repository
Modify Vector Store Initialization
- Update `_initialize_system()` to prioritize the pre-computed store.
- Add error handling for corrupted stores.
- Success Criteria: System uses the pre-computed store without recreating it.
Update Deployment Process
- Include vector store in Space deployment
- Add vector store path to Space configuration
- Success Criteria: Space starts without recreating embeddings