# VedaMD Project Structure

**Clean, organized codebase for production deployment**

Last updated: October 23, 2025

---

## Directory Structure

```
SL Clinical Assistant/
├── app.py                              # Gradio interface (HF Spaces entry point)
├── requirements.txt                    # Python dependencies
├── .env.example                        # Environment variable template
├── .gitignore                          # Git ignore rules
│
├── src/                                # Core application code
│   ├── __init__.py
│   ├── enhanced_groq_medical_rag.py    # Main RAG system (Cerebras-powered)
│   ├── enhanced_backend_api.py         # FastAPI backend for frontend
│   ├── simple_vector_store.py          # Vector store loader
│   ├── vector_store_compatibility.py   # Compatibility wrapper (temporary)
│   ├── enhanced_medical_context.py     # Medical context enhancement
│   └── medical_response_verifier.py    # Response verification & safety
│
├── scripts/                            # Automation scripts
│   ├── build_vector_store.py           # Build complete vector store from PDFs
│   └── add_document.py                 # Add single document incrementally
│
├── frontend/                           # Next.js frontend (separate deployment)
│   ├── src/
│   │   ├── app/
│   │   ├── components/
│   │   └── lib/
│   │       └── api.ts                  # API client (FastAPI + Gradio support)
│   ├── public/
│   ├── package.json
│   └── .env.local.example
│
├── data/                               # Data files (local only, not in git)
│   ├── guidelines/                     # Source PDF files (moved from Obs/)
│   ├── vector_store/                   # Built vector store (FAISS + metadata)
│   │   ├── faiss_index.bin
│   │   ├── documents.json
│   │   ├── metadata.json
│   │   ├── config.json
│   │   └── backups/                    # Automatic backups
│   └── processed/                      # Processed documents (optional)
│
├── docs/                               # Documentation index
│   └── README.md                       # Documentation directory index
│
├── archive/                            # Old/deprecated files (not in git)
│   ├── old_scripts/                    # batch_ocr_pipeline.py, convert_pdf.py
│   └── old_docs/                       # output.md, cleanup_plan.md, etc.
│
├── test_pdfs/                          # Test files (not in git)
├── test_vector_store/                  # Test vector store (not in git)
│
└── Documentation Files                 # Root-level docs
    ├── README.md                       # Main project README
    ├── PIPELINE_GUIDE.md               # Document pipeline usage guide
    ├── LOCAL_TESTING_GUIDE.md          # Local development guide
    ├── IMPROVEMENT_PLAN.md             # Project roadmap
    ├── DEPLOYMENT.md                   # Deployment instructions
    ├── SECURITY_SETUP.md               # Security configuration
    ├── CEREBRAS_MIGRATION_GUIDE.md     # Cerebras migration details
    ├── QUICK_START_CEREBRAS.md         # Cerebras quickstart
    ├── PRODUCTION_READINESS_REPORT.md  # Production assessment
    ├── CHANGES_SUMMARY.md              # Summary of changes
    └── CEREBRAS_SUMMARY.md             # Cerebras integration summary
```

---

## Core Files

### Application Entry Points

| File | Purpose | Deployment |
|------|---------|------------|
| `app.py` | Gradio interface | Hugging Face Spaces |
| `src/enhanced_backend_api.py` | FastAPI REST API | Hugging Face Spaces (port 7862) |
| `frontend/` | Next.js frontend | Netlify / Vercel |

### RAG System

| File | Purpose | Key Features |
|------|---------|--------------|
| `src/enhanced_groq_medical_rag.py` | Main RAG orchestrator | Cerebras integration, multi-stage retrieval, medical safety |
| `src/simple_vector_store.py` | Vector store loader | HF Hub download, FAISS search |
| `src/enhanced_medical_context.py` | Medical context enhancement | Entity extraction, relevance scoring |
| `src/medical_response_verifier.py` | Response verification | Claim validation, source traceability |

### Automation Scripts

| Script | Purpose | Usage |
|--------|---------|-------|
| `scripts/build_vector_store.py` | Build complete vector store | `python scripts/build_vector_store.py --input-dir ./data/guidelines --output-dir ./data/vector_store --upload` |
| `scripts/add_document.py` | Add single document | `python scripts/add_document.py --file new.pdf --vector-store-dir ./data/vector_store --upload` |
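
For readers who have not opened the script, the flags used with `add_document.py` in this document suggest a command-line interface roughly like the sketch below. This is an illustration only, not the script's actual parser: the function name `build_parser`, the defaults, and the help strings are all assumptions, and `--repo-id` is shown here only because it appears alongside `--upload` in the `build_vector_store.py` examples.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the add_document.py flags shown in this document."""
    parser = argparse.ArgumentParser(description="Add one PDF to an existing vector store")
    parser.add_argument("--file", required=True, help="Path to the PDF to add")
    parser.add_argument(
        "--vector-store-dir",
        default="./data/vector_store",  # assumed default; the real script may require it
        help="Directory holding the FAISS index and JSON files",
    )
    parser.add_argument("--citation", default=None, help="Citation string stored in metadata")
    parser.add_argument(
        "--upload",
        action="store_true",
        help="Upload the updated store to the HF Hub afterwards",
    )
    parser.add_argument("--repo-id", default=None, help="HF Hub repo to upload to")
    return parser
```

Calling `build_parser().parse_args(["--file", "new.pdf", "--upload"])` yields a namespace with `file="new.pdf"` and `upload=True`, which is enough to see how the documented commands map onto options.
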

### Startup Scripts

| Script | Purpose |
|--------|---------|
| `run_backend.sh` | Start FastAPI backend (port 7862) |
| `run_frontend.sh` | Start Next.js frontend (port 3000) |
| `kill_backend.sh` | Stop backend processes |

---

## Data Files

### Vector Store Files (`data/vector_store/`)

Generated by `build_vector_store.py`:

| File | Purpose | Format |
|------|---------|--------|
| `faiss_index.bin` | FAISS vector index | Binary |
| `documents.json` | Document chunks | JSON array of strings |
| `metadata.json` | Document metadata | JSON array of objects |
| `config.json` | Build configuration | JSON object |
| `build_log.json` | Build information | JSON object |

**Metadata Structure:**

```json
{
  "source": "guideline.pdf",
  "section": "Management",
  "chunk_id": 0,
  "chunk_size": 1000,
  "file_hash": "a3f2c9d8...",
  "extraction_method": "pymupdf",
  "total_pages": 15,
  "citation": "SLCOG Guidelines 2025",
  "category": "Obstetrics",
  "processed_at": "2025-10-23T15:08:30.273544"
}
```
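
As a minimal sketch of how these files might be consumed, the snippet below loads `documents.json` and `metadata.json` and pairs each chunk with its metadata. It assumes the two arrays are index-aligned (entry *i* of the metadata describes chunk *i*), which this document does not state explicitly. The function name `load_chunks` and the `REQUIRED_KEYS` set are illustrative, not part of the project.

```python
import json
from pathlib import Path

# Assumed minimum set of keys to check; adjust to the real schema.
REQUIRED_KEYS = {"source", "chunk_id", "file_hash"}


def load_chunks(store_dir):
    """Load documents.json and metadata.json and pair them by index.

    Assumes metadata[i] describes documents[i]; raises ValueError if the
    arrays differ in length or an entry lacks one of REQUIRED_KEYS.
    """
    store = Path(store_dir)
    documents = json.loads((store / "documents.json").read_text(encoding="utf-8"))
    metadata = json.loads((store / "metadata.json").read_text(encoding="utf-8"))

    if len(documents) != len(metadata):
        raise ValueError(f"{len(documents)} chunks but {len(metadata)} metadata entries")

    for i, meta in enumerate(metadata):
        missing = REQUIRED_KEYS - meta.keys()
        if missing:
            raise ValueError(f"metadata[{i}] is missing keys: {sorted(missing)}")

    return list(zip(documents, metadata))
```
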

---

## Configuration Files

### Environment Variables

**`.env`** (local development):

```bash
CEREBRAS_API_KEY=csk_your_key_here
HF_TOKEN=hf_your_token_here  # For uploading vector store
```

**Hugging Face Spaces Secrets:**

```
CEREBRAS_API_KEY   # Required
HF_TOKEN           # Optional (for vector store upload)
ALLOWED_ORIGINS    # Optional (CORS, comma-separated)
```
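
One way to read these variables at startup is sketched below: fail fast when the required key is absent and split `ALLOWED_ORIGINS` on commas. The `Settings` dataclass and `load_settings` function are hypothetical names used for illustration; the project's own code may handle configuration differently.

```python
import os
from dataclasses import dataclass, field
from typing import List, Mapping, Optional


@dataclass
class Settings:
    cerebras_api_key: str
    hf_token: Optional[str] = None
    allowed_origins: List[str] = field(default_factory=list)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the three variables listed above; raise if the required one is missing."""
    api_key = env.get("CEREBRAS_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("CEREBRAS_API_KEY is required but not set")

    # ALLOWED_ORIGINS is comma-separated; drop blanks and surrounding whitespace.
    origins = [o.strip() for o in env.get("ALLOWED_ORIGINS", "").split(",") if o.strip()]

    return Settings(
        cerebras_api_key=api_key,
        hf_token=env.get("HF_TOKEN") or None,
        allowed_origins=origins,
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests without touching the real environment.
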

### Requirements

**`requirements.txt`** - Python dependencies:

- `cerebras-cloud-sdk` - Cerebras API client
- `gradio` - Web interface
- `fastapi` - REST API
- `sentence-transformers` - Embeddings
- `faiss-cpu` - Vector search
- `huggingface-hub` - Model/data hosting
- `PyMuPDF`, `pdfplumber` - PDF extraction

---

## Git Ignore Strategy

### Ignored (Local Only)

- `data/guidelines/` - Source PDFs
- `data/vector_store/` - Built vector store
- `archive/` - Old files
- `test_pdfs/`, `test_vector_store/` - Test files
- `frontend/` - Separate deployment
- `.env` - Local environment variables
- `*.log` - Log files

### Committed (Version Control)

- `src/` - Application code
- `scripts/` - Automation scripts
- `app.py` - Gradio entry point
- `requirements.txt` - Dependencies
- `.env.example` - Environment template
- `*.md` - Documentation

---

## Workflow

### Development Workflow

1. **Add new guideline:**

   ```bash
   cp ~/Downloads/new_guideline.pdf data/guidelines/
   ```

2. **Update vector store:**

   ```bash
   python scripts/add_document.py \
     --file data/guidelines/new_guideline.pdf \
     --citation "SLCOG Guidelines 2025" \
     --vector-store-dir ./data/vector_store
   ```

3. **Test locally:**

   ```bash
   # Terminal 1: Start backend
   ./run_backend.sh

   # Terminal 2: Start frontend
   ./run_frontend.sh

   # Or just test Gradio
   python app.py
   ```

4. **Deploy to production:**

   ```bash
   # Upload vector store to HF Hub
   python scripts/build_vector_store.py \
     --input-dir ./data/guidelines \
     --output-dir ./data/vector_store \
     --upload --repo-id sniro23/VedaMD-Vector-Store

   # Push code to HF Spaces
   git add src/ app.py requirements.txt
   git commit -m "Update: Add new guidelines"
   git push origin main
   ```
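
Because each metadata entry carries a `file_hash`, an incremental update (step 2) can in principle skip a PDF that is already indexed. The sketch below shows one way to do that check. It assumes `file_hash` is a SHA-256 hex digest of the PDF bytes, which is a guess: this document shows only a truncated example and does not name the algorithm. The function names are illustrative.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, block_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def already_indexed(pdf_path, store_dir):
    """True if any metadata entry carries the same file_hash as pdf_path.

    Assumes file_hash is a SHA-256 hex digest of the PDF bytes; the real
    pipeline may use a different algorithm or hash different input.
    """
    metadata_file = Path(store_dir) / "metadata.json"
    if not metadata_file.exists():
        return False  # no store yet, so nothing can be indexed
    metadata = json.loads(metadata_file.read_text(encoding="utf-8"))
    known = {entry.get("file_hash") for entry in metadata}
    return file_sha256(pdf_path) in known
```

A check like this would run before extraction and embedding, so re-adding an unchanged guideline costs one file read rather than a full re-index.
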

### Production Deployment

**Backend (Hugging Face Spaces):**

- Gradio interface: Automatic from `app.py`
- FastAPI API: Runs on port 7862
- Vector store: Downloaded from HF Hub on startup
- Secrets: Set in HF Spaces settings

**Frontend (Netlify):**

- Build: `cd frontend && npm run build`
- Deploy: Automatic from GitHub
- Environment: `NEXT_PUBLIC_API_URL=https://sniro23-vedamd-enhanced.hf.space`
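
The backend's "downloaded from HF Hub on startup" step could look roughly like the sketch below, which fetches the store only when local files are missing. It uses `snapshot_download` from `huggingface_hub`. The function names are hypothetical, and the repo type is left as a parameter because this document does not say whether `sniro23/VedaMD-Vector-Store` is a model or a dataset repo. The actual loader lives in `src/simple_vector_store.py` and may work differently.

```python
from pathlib import Path

# File names taken from the Vector Store Files table in this document.
REQUIRED_FILES = ("faiss_index.bin", "documents.json", "metadata.json", "config.json")


def missing_files(store_dir):
    """Return the required vector store files that are absent from store_dir."""
    store = Path(store_dir)
    return [name for name in REQUIRED_FILES if not (store / name).is_file()]


def ensure_vector_store(store_dir="./data/vector_store",
                        repo_id="sniro23/VedaMD-Vector-Store",
                        repo_type=None,  # None = Hub default ("model"); pass "dataset" if needed
                        token=None):
    """Download the store from the HF Hub only if local files are missing."""
    if missing_files(store_dir):
        # Imported lazily so the local check works without huggingface_hub installed.
        from huggingface_hub import snapshot_download
        snapshot_download(repo_id=repo_id, repo_type=repo_type,
                          local_dir=str(store_dir), token=token)

    still_missing = missing_files(store_dir)
    if still_missing:
        raise FileNotFoundError(f"Vector store incomplete, missing: {still_missing}")
    return Path(store_dir)
```
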

---

## Migration Notes

### From Old Structure

**Moved:**

- `Obs/*.pdf` → `data/guidelines/*.pdf`
- Vector store logic remains in `src/`

**Archived:**

- `batch_ocr_pipeline.py` → `archive/old_scripts/`
- `convert_pdf.py` → `archive/old_scripts/`
- `output*.md` → `archive/old_docs/`
- `cleanup_plan.md` → `archive/old_docs/`

**Created New:**

- `scripts/` - Automation scripts
- `data/` - Data directory structure
- `docs/` - Documentation index
- `archive/` - Old files

---

## Key Improvements

### Before Cleanup

```
SL Clinical Assistant/
├── app.py
├── src/
├── Obs/                     # Unclear name
├── batch_ocr_pipeline.py    # Old script at root
├── convert_pdf.py           # Old script at root
├── output.md                # Temporary file
├── output_new.md            # Temporary file
└── 15+ .md files at root    # Disorganized docs
```

### After Cleanup

```
SL Clinical Assistant/
├── app.py                   # Clear entry point
├── src/                     # Core code
├── scripts/                 # Automation scripts
├── data/                    # Data files
│   ├── guidelines/          # Clear purpose
│   └── vector_store/        # Clear purpose
├── docs/                    # Documentation index
├── archive/                 # Old files preserved
└── Documentation files      # Organized at root
```

---

## Best Practices

### Code Organization

1. **Core Logic**: Keep in `src/`
2. **Automation**: Keep in `scripts/`
3. **Data**: Keep in `data/` (gitignored)
4. **Tests**: Keep in `tests/` (if created)

### Documentation

1. **User Guides**: Root level (`PIPELINE_GUIDE.md`, etc.)
2. **Technical Docs**: Root level (`DEPLOYMENT.md`, etc.)
3. **Code Docs**: Inline docstrings in Python files
4. **Index**: `docs/README.md` for navigation

### Data Management

1. **Source Data**: `data/guidelines/`
2. **Processed Data**: `data/vector_store/`
3. **Backups**: Automatic in `data/vector_store/backups/`
4. **Test Data**: `test_pdfs/`, `test_vector_store/`
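
To make the backup convention concrete, here is a minimal sketch of a timestamped backup into `data/vector_store/backups/`. The timestamped subdirectory layout and the function name `backup_vector_store` are assumptions for illustration; this document only states that backups are created automatically in that directory, not how they are named.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

# File names taken from the Vector Store Files table in this document.
STORE_FILES = ("faiss_index.bin", "documents.json", "metadata.json", "config.json")


def backup_vector_store(store_dir, now=None):
    """Copy the current store files into backups/<UTC timestamp>/ and return that path.

    `now`, if given, is expected to be a timezone-aware UTC datetime.
    """
    store = Path(store_dir)
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%SZ")
    target = store / "backups" / stamp
    target.mkdir(parents=True, exist_ok=False)  # refuse to overwrite an existing backup
    for name in STORE_FILES:
        src = store / name
        if src.is_file():  # skip files that a partial build has not produced yet
            shutil.copy2(src, target / name)
    return target
```

Taking the backup before a script such as `add_document.py` modifies the index means a failed update can be rolled back by copying the files out of the newest backup directory.
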

### Version Control

1. **Commit Code**: `src/`, `scripts/`, `app.py`
2. **Ignore Data**: `data/`, `archive/`, `test_*/`
3. **Commit Docs**: All `.md` files
4. **Templates**: `.env.example`, not `.env`

---

## Quick Reference

### Common Commands

```bash
# Build vector store from scratch
python scripts/build_vector_store.py --input-dir ./data/guidelines --output-dir ./data/vector_store

# Add single document
python scripts/add_document.py --file new.pdf --vector-store-dir ./data/vector_store

# Start backend
./run_backend.sh

# Start frontend
./run_frontend.sh

# Test Gradio interface
python app.py

# Upload to HF Hub
python scripts/build_vector_store.py ... --upload --repo-id sniro23/VedaMD-Vector-Store
```

### Important Paths

- **PDFs**: `data/guidelines/`
- **Vector Store**: `data/vector_store/`
- **RAG System**: `src/enhanced_groq_medical_rag.py`
- **API**: `src/enhanced_backend_api.py`
- **Scripts**: `scripts/`
- **Docs**: Root level + `docs/README.md`

---

**Clean codebase = Maintainable codebase = Production-ready codebase**