# VedaMD Project Structure
**Clean, organized codebase for production deployment**
Last updated: October 23, 2025
---
## Directory Structure
```
SL Clinical Assistant/
├── app.py                              # Gradio interface (HF Spaces entry point)
├── requirements.txt                    # Python dependencies
├── .env.example                        # Environment variable template
├── .gitignore                          # Git ignore rules
│
├── src/                                # Core application code
│   ├── __init__.py
│   ├── enhanced_groq_medical_rag.py    # Main RAG system (Cerebras-powered)
│   ├── enhanced_backend_api.py         # FastAPI backend for frontend
│   ├── simple_vector_store.py          # Vector store loader
│   ├── vector_store_compatibility.py   # Compatibility wrapper (temporary)
│   ├── enhanced_medical_context.py     # Medical context enhancement
│   └── medical_response_verifier.py    # Response verification & safety
│
├── scripts/                            # Automation scripts
│   ├── build_vector_store.py           # Build complete vector store from PDFs
│   └── add_document.py                 # Add single document incrementally
│
├── frontend/                           # Next.js frontend (separate deployment)
│   ├── src/
│   │   ├── app/
│   │   ├── components/
│   │   └── lib/
│   │       └── api.ts                  # API client (FastAPI + Gradio support)
│   ├── public/
│   ├── package.json
│   └── .env.local.example
│
├── data/                               # Data files (local only, not in git)
│   ├── guidelines/                     # Source PDF files (moved from Obs/)
│   ├── vector_store/                   # Built vector store (FAISS + metadata)
│   │   ├── faiss_index.bin
│   │   ├── documents.json
│   │   ├── metadata.json
│   │   ├── config.json
│   │   └── backups/                    # Automatic backups
│   └── processed/                      # Processed documents (optional)
│
├── docs/                               # Documentation index
│   └── README.md                       # Documentation directory index
│
├── archive/                            # Old/deprecated files (not in git)
│   ├── old_scripts/                    # batch_ocr_pipeline.py, convert_pdf.py
│   └── old_docs/                       # output.md, cleanup_plan.md, etc.
│
├── test_pdfs/                          # Test files (not in git)
├── test_vector_store/                  # Test vector store (not in git)
│
└── Documentation Files                 # Root-level docs
    ├── README.md                       # Main project README
    ├── PIPELINE_GUIDE.md               # Document pipeline usage guide
    ├── LOCAL_TESTING_GUIDE.md          # Local development guide
    ├── IMPROVEMENT_PLAN.md             # Project roadmap
    ├── DEPLOYMENT.md                   # Deployment instructions
    ├── SECURITY_SETUP.md               # Security configuration
    ├── CEREBRAS_MIGRATION_GUIDE.md     # Cerebras migration details
    ├── QUICK_START_CEREBRAS.md         # Cerebras quickstart
    ├── PRODUCTION_READINESS_REPORT.md  # Production assessment
    ├── CHANGES_SUMMARY.md              # Summary of changes
    └── CEREBRAS_SUMMARY.md             # Cerebras integration summary
```
---
## Core Files
### Application Entry Points
| File | Purpose | Deployment |
|------|---------|------------|
| `app.py` | Gradio interface | Hugging Face Spaces |
| `src/enhanced_backend_api.py` | FastAPI REST API | Hugging Face Spaces (port 7862) |
| `frontend/` | Next.js frontend | Netlify / Vercel |
### RAG System
| File | Purpose | Key Features |
|------|---------|--------------|
| `src/enhanced_groq_medical_rag.py` | Main RAG orchestrator | Cerebras integration, multi-stage retrieval, medical safety |
| `src/simple_vector_store.py` | Vector store loader | HF Hub download, FAISS search |
| `src/enhanced_medical_context.py` | Medical context enhancement | Entity extraction, relevance scoring |
| `src/medical_response_verifier.py` | Response verification | Claim validation, source traceability |
### Automation Scripts
| Script | Purpose | Usage |
|--------|---------|-------|
| `scripts/build_vector_store.py` | Build complete vector store | `python scripts/build_vector_store.py --input-dir ./data/guidelines --output-dir ./data/vector_store --upload` |
| `scripts/add_document.py` | Add single document | `python scripts/add_document.py --file new.pdf --vector-store-dir ./data/vector_store --upload` |
### Startup Scripts
| Script | Purpose |
|--------|---------|
| `run_backend.sh` | Start FastAPI backend (port 7862) |
| `run_frontend.sh` | Start Next.js frontend (port 3000) |
| `kill_backend.sh` | Stop backend processes |
---
## Data Files
### Vector Store Files (`data/vector_store/`)
Generated by `scripts/build_vector_store.py`:
| File | Purpose | Format |
|------|---------|--------|
| `faiss_index.bin` | FAISS vector index | Binary |
| `documents.json` | Document chunks | JSON array of strings |
| `metadata.json` | Document metadata | JSON array of objects |
| `config.json` | Build configuration | JSON object |
| `build_log.json` | Build information | JSON object |
**Metadata Structure** (one object in the `metadata.json` array):
```json
{
"source": "guideline.pdf",
"section": "Management",
"chunk_id": 0,
"chunk_size": 1000,
"file_hash": "a3f2c9d8...",
"extraction_method": "pymupdf",
"total_pages": 15,
"citation": "SLCOG Guidelines 2025",
"category": "Obstetrics",
"processed_at": "2025-10-23T15:08:30.273544"
}
```
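To make the file formats above concrete, here is a minimal sketch of loading `documents.json` and `metadata.json` together. It assumes the two arrays are index-aligned (entry `i` of the metadata describes chunk `i`) and that every entry carries the keys shown in the sample; neither assumption is confirmed by the build script, and `load_chunks` is a hypothetical helper, not part of the codebase.

```python
import json
from pathlib import Path

# Keys copied from the sample metadata entry; treating all of them as
# mandatory is an assumption made for this sketch.
EXPECTED_KEYS = {
    "source", "section", "chunk_id", "chunk_size", "file_hash",
    "extraction_method", "total_pages", "citation", "category", "processed_at",
}


def load_chunks(store_dir):
    """Return (text, metadata) pairs from a vector store directory.

    Hypothetical helper: assumes documents.json and metadata.json are
    index-aligned arrays of equal length.
    """
    store = Path(store_dir)
    documents = json.loads((store / "documents.json").read_text(encoding="utf-8"))
    metadata = json.loads((store / "metadata.json").read_text(encoding="utf-8"))

    if len(documents) != len(metadata):
        raise ValueError(
            f"{len(documents)} chunks but {len(metadata)} metadata entries"
        )

    for position, entry in enumerate(metadata):
        missing = EXPECTED_KEYS - set(entry)
        if missing:
            raise ValueError(f"metadata[{position}] is missing {sorted(missing)}")

    return list(zip(documents, metadata))
```

A check like this could run after a build to catch a store whose text and metadata have drifted apart before it is uploaded.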
---
## Configuration Files
### Environment Variables
**.env** (local development):
```bash
CEREBRAS_API_KEY=csk_your_key_here
HF_TOKEN=hf_your_token_here # For uploading vector store
```
**Hugging Face Spaces Secrets:**
```
CEREBRAS_API_KEY # Required
HF_TOKEN # Optional (for vector store upload)
ALLOWED_ORIGINS # Optional (CORS, comma-separated)
```
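The variables above could be read at startup roughly as follows. This is a sketch only: `Settings` and `load_settings` are hypothetical names, and the actual backend may parse these values differently. It treats `CEREBRAS_API_KEY` as required, `HF_TOKEN` as optional, and splits `ALLOWED_ORIGINS` on commas as described.

```python
import os
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Settings:
    """Hypothetical container for the environment variables listed above."""
    cerebras_api_key: str
    hf_token: Optional[str] = None
    allowed_origins: List[str] = field(default_factory=list)


def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    api_key = env.get("CEREBRAS_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("CEREBRAS_API_KEY is required")

    # ALLOWED_ORIGINS is comma-separated; drop blanks and surrounding spaces.
    raw_origins = env.get("ALLOWED_ORIGINS", "")
    origins = [item.strip() for item in raw_origins.split(",") if item.strip()]

    return Settings(
        cerebras_api_key=api_key,
        hf_token=env.get("HF_TOKEN") or None,
        allowed_origins=origins,
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test without touching the real environment.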
### Requirements
**requirements.txt** - Python dependencies:
- cerebras-cloud-sdk - Cerebras API client
- gradio - Web interface
- fastapi - REST API
- sentence-transformers - Embeddings
- faiss-cpu - Vector search
- huggingface-hub - Model/data hosting
- PyMuPDF, pdfplumber - PDF extraction
---
## Git Ignore Strategy
### Ignored (Local Only)
- `data/guidelines/` - Source PDFs
- `data/vector_store/` - Built vector store
- `archive/` - Old files
- `test_pdfs/`, `test_vector_store/` - Test files
- `frontend/` - Separate deployment
- `.env` - Local environment variables
- `*.log` - Log files
### Committed (Version Control)
- `src/` - Application code
- `scripts/` - Automation scripts
- `app.py` - Gradio entry point
- `requirements.txt` - Dependencies
- `.env.example` - Environment template
- `*.md` - Documentation
---
## Workflow
### Development Workflow
1. **Add new guideline:**
```bash
cp ~/Downloads/new_guideline.pdf data/guidelines/
```
2. **Update vector store:**
```bash
python scripts/add_document.py \
--file data/guidelines/new_guideline.pdf \
--citation "SLCOG Guidelines 2025" \
--vector-store-dir ./data/vector_store
```
3. **Test locally:**
```bash
# Terminal 1: Start backend
./run_backend.sh
# Terminal 2: Start frontend
./run_frontend.sh
# Or just test Gradio
python app.py
```
4. **Deploy to production:**
```bash
# Upload vector store to HF Hub
python scripts/build_vector_store.py \
--input-dir ./data/guidelines \
--output-dir ./data/vector_store \
--upload --repo-id sniro23/VedaMD-Vector-Store
# Push code to HF Spaces
git add src/ app.py requirements.txt
git commit -m "Update: Add new guidelines"
git push origin main
```
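Because each metadata entry records a `file_hash`, an incremental add (step 2) could skip a PDF that is already in the store. The sketch below illustrates that idea under two assumptions that the source does not confirm: that `file_hash` is a SHA-256 hex digest of the PDF bytes, and that a matching hash means the document was fully indexed. `file_sha256` and `already_indexed` are hypothetical helpers, not functions from `scripts/add_document.py`.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB blocks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def already_indexed(pdf_path, store_dir):
    """Return True if metadata.json already holds an entry with this file's hash.

    Assumes file_hash stores a SHA-256 hex digest (not confirmed by the source).
    """
    metadata_path = Path(store_dir) / "metadata.json"
    if not metadata_path.exists():
        return False
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    known_hashes = {entry.get("file_hash") for entry in metadata}
    return file_sha256(pdf_path) in known_hashes
```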
### Production Deployment
**Backend (Hugging Face Spaces):**
- Gradio interface: Automatic from `app.py`
- FastAPI API: Runs on port 7862
- Vector store: Downloaded from HF Hub on startup
- Secrets: Set in HF Spaces settings
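
As an illustration of the "downloaded from HF Hub on startup" step, the following sketch fetches the store only when local files are missing. It is not the code in `src/simple_vector_store.py`. The repo type (`"dataset"`), the list of required files, and the function names are assumptions; only `snapshot_download` from `huggingface_hub` is a real library call.

```python
from pathlib import Path

# Files shown under data/vector_store/ in the directory tree; treating
# exactly these four as required is an assumption.
REQUIRED_FILES = ("faiss_index.bin", "documents.json", "metadata.json", "config.json")


def missing_files(store_dir):
    """List required vector store files that are not present locally."""
    store = Path(store_dir)
    return [name for name in REQUIRED_FILES if not (store / name).is_file()]


def ensure_vector_store(
    store_dir,
    repo_id="sniro23/VedaMD-Vector-Store",
    repo_type="dataset",  # assumption: the store may be hosted as another repo type
    token=None,
):
    """Download the vector store from the Hub if any required file is missing."""
    if missing_files(store_dir):
        # Imported lazily so the check above works without the dependency.
        from huggingface_hub import snapshot_download

        snapshot_download(
            repo_id=repo_id,
            repo_type=repo_type,
            local_dir=str(store_dir),
            allow_patterns=list(REQUIRED_FILES),
            token=token,
        )
        still_missing = missing_files(store_dir)
        if still_missing:
            raise FileNotFoundError(f"Vector store incomplete: {still_missing}")
    return Path(store_dir)
```

Checking for local files first means a restart with an intact store does not hit the network.
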
**Frontend (Netlify):**
- Build: `cd frontend && npm run build`
- Deploy: Automatic from GitHub
- Environment: `NEXT_PUBLIC_API_URL=https://sniro23-vedamd-enhanced.hf.space`
---
## Migration Notes
### From Old Structure
**Moved:**
- `Obs/*.pdf` → `data/guidelines/*.pdf`
- Vector store logic remains in `src/`
**Archived:**
- `batch_ocr_pipeline.py` → `archive/old_scripts/`
- `convert_pdf.py` → `archive/old_scripts/`
- `output*.md` → `archive/old_docs/`
- `cleanup_plan.md` → `archive/old_docs/`
**Created New:**
- `scripts/` - Automation scripts
- `data/` - Data directory structure
- `docs/` - Documentation index
- `archive/` - Old files
---
## Key Improvements
### Before Cleanup
```
SL Clinical Assistant/
├── app.py
├── src/
├── Obs/                    # Unclear name
├── batch_ocr_pipeline.py   # Old script at root
├── convert_pdf.py          # Old script at root
├── output.md               # Temporary file
├── output_new.md           # Temporary file
└── 15+ .md files at root   # Disorganized docs
```
### After Cleanup
```
SL Clinical Assistant/
├── app.py                  # Clear entry point
├── src/                    # Core code
├── scripts/                # Automation scripts
├── data/                   # Data files
│   ├── guidelines/         # Clear purpose
│   └── vector_store/       # Clear purpose
├── docs/                   # Documentation index
├── archive/                # Old files preserved
└── Documentation files     # Organized at root
```
---
## Best Practices
### Code Organization
1. **Core Logic**: Keep in `src/`
2. **Automation**: Keep in `scripts/`
3. **Data**: Keep in `data/` (gitignored)
4. **Tests**: Keep in `tests/` (if created)
### Documentation
1. **User Guides**: Root level (PIPELINE_GUIDE.md, etc.)
2. **Technical Docs**: Root level (DEPLOYMENT.md, etc.)
3. **Code Docs**: Inline docstrings in Python files
4. **Index**: `docs/README.md` for navigation
### Data Management
1. **Source Data**: `data/guidelines/`
2. **Processed Data**: `data/vector_store/`
3. **Backups**: Automatic in `data/vector_store/backups/`
4. **Test Data**: `test_pdfs/`, `test_vector_store/`
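
The automatic backups in `data/vector_store/backups/` could work along these lines. This is a hypothetical sketch: the timestamped folder naming and the set of files copied are assumptions, and the real scripts may lay out backups differently.

```python
import shutil
from datetime import datetime
from pathlib import Path

# Assumption: these are the files worth snapshotting before a rebuild.
STORE_FILES = ("faiss_index.bin", "documents.json", "metadata.json", "config.json")


def backup_vector_store(store_dir, now=None):
    """Copy the current store files into backups/<YYYYMMDD_HHMMSS>/.

    Hypothetical helper; returns the path of the new backup folder.
    """
    store = Path(store_dir)
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    target = store / "backups" / stamp
    # exist_ok=False so two backups in the same second fail loudly
    # instead of silently overwriting each other.
    target.mkdir(parents=True, exist_ok=False)

    for name in STORE_FILES:
        source = store / name
        if source.is_file():
            shutil.copy2(source, target / name)  # copy2 keeps file timestamps
    return target
```
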
### Version Control
1. **Commit Code**: `src/`, `scripts/`, `app.py`
2. **Ignore Data**: `data/`, `archive/`, `test_*/`
3. **Commit Docs**: All `.md` files
4. **Templates**: `.env.example`, not `.env`
---
## Quick Reference
### Common Commands
```bash
# Build vector store from scratch
python scripts/build_vector_store.py --input-dir ./data/guidelines --output-dir ./data/vector_store
# Add single document
python scripts/add_document.py --file new.pdf --vector-store-dir ./data/vector_store
# Start backend
./run_backend.sh
# Start frontend
./run_frontend.sh
# Test Gradio interface
python app.py
# Upload to HF Hub
python scripts/build_vector_store.py ... --upload --repo-id sniro23/VedaMD-Vector-Store
```
### Important Paths
- **PDFs**: `data/guidelines/`
- **Vector Store**: `data/vector_store/`
- **RAG System**: `src/enhanced_groq_medical_rag.py`
- **API**: `src/enhanced_backend_api.py`
- **Scripts**: `scripts/`
- **Docs**: Root level + `docs/README.md`
---
**Clean codebase = Maintainable codebase = Production-ready codebase**