	# VedaMD Project Structure

	Clean, organized codebase for production deployment

	Last updated: October 23, 2025

	---

	## Directory Structure

	```
	SL Clinical Assistant/
	├── app.py  # Gradio interface (HF Spaces entry point)
	├── requirements.txt  # Python dependencies
	├── .env.example  # Environment variable template
	├── .gitignore  # Git ignore rules
	│
	├── src/  # Core application code
	│   ├── __init__.py
	│   ├── enhanced_groq_medical_rag.py  # Main RAG system (Cerebras-powered)
	│   ├── enhanced_backend_api.py  # FastAPI backend for frontend
	│   ├── simple_vector_store.py  # Vector store loader
	│   ├── vector_store_compatibility.py  # Compatibility wrapper (temporary)
	│   ├── enhanced_medical_context.py  # Medical context enhancement
	│   └── medical_response_verifier.py  # Response verification & safety
	│
	├── scripts/  # Automation scripts
	│   ├── build_vector_store.py  # Build complete vector store from PDFs
	│   └── add_document.py  # Add single document incrementally
	│
	├── frontend/  # Next.js frontend (separate deployment)
	│   ├── src/
	│   │   ├── app/
	│   │   ├── components/
	│   │   └── lib/
	│   │       └── api.ts  # API client (FastAPI + Gradio support)
	│   ├── public/
	│   ├── package.json
	│   └── .env.local.example
	│
	├── data/  # Data files (local only, not in git)
	│   ├── guidelines/  # Source PDF files (moved from Obs/)
	│   ├── vector_store/  # Built vector store (FAISS + metadata)
	│   │   ├── faiss_index.bin
	│   │   ├── documents.json
	│   │   ├── metadata.json
	│   │   ├── config.json
	│   │   └── backups/  # Automatic backups
	│   └── processed/  # Processed documents (optional)
	│
	├── docs/  # Documentation index
	│   └── README.md  # Documentation directory index
	│
	├── archive/  # Old/deprecated files (not in git)
	│   ├── old_scripts/  # batch_ocr_pipeline.py, convert_pdf.py
	│   └── old_docs/  # output.md, cleanup_plan.md, etc.
	│
	├── test_pdfs/  # Test files (not in git)
	├── test_vector_store/  # Test vector store (not in git)
	│
	└── Documentation Files  # Root-level docs
	    ├── README.md  # Main project README
	    ├── PIPELINE_GUIDE.md  # Document pipeline usage guide
	    ├── LOCAL_TESTING_GUIDE.md  # Local development guide
	    ├── IMPROVEMENT_PLAN.md  # Project roadmap
	    ├── DEPLOYMENT.md  # Deployment instructions
	    ├── SECURITY_SETUP.md  # Security configuration
	    ├── CEREBRAS_MIGRATION_GUIDE.md  # Cerebras migration details
	    ├── QUICK_START_CEREBRAS.md  # Cerebras quickstart
	    ├── PRODUCTION_READINESS_REPORT.md  # Production assessment
	    ├── CHANGES_SUMMARY.md  # Summary of changes
	    └── CEREBRAS_SUMMARY.md  # Cerebras integration summary
	```

	---

	## Core Files

	### Application Entry Points

	| File | Purpose | Deployment |
	|------|---------|------------|
	| `app.py` | Gradio interface | Hugging Face Spaces |
	| `src/enhanced_backend_api.py` | FastAPI REST API | Hugging Face Spaces (port 7862) |
	| `frontend/` | Next.js frontend | Netlify / Vercel |

	### RAG System

	| File | Purpose | Key Features |
	|------|---------|--------------|
	| `src/enhanced_groq_medical_rag.py` | Main RAG orchestrator | Cerebras integration, multi-stage retrieval, medical safety |
	| `src/simple_vector_store.py` | Vector store loader | HF Hub download, FAISS search |
	| `src/enhanced_medical_context.py` | Medical context enhancement | Entity extraction, relevance scoring |
	| `src/medical_response_verifier.py` | Response verification | Claim validation, source traceability |

	### Automation Scripts

	| Script | Purpose | Usage |
	|--------|---------|-------|
	| `scripts/build_vector_store.py` | Build complete vector store | `python scripts/build_vector_store.py --input-dir ./data/guidelines --output-dir ./data/vector_store --upload` |
	| `scripts/add_document.py` | Add single document | `python scripts/add_document.py --file new.pdf --vector-store-dir ./data/vector_store --upload` |

	### Startup Scripts

	| Script | Purpose |
	|--------|---------|
	| `run_backend.sh` | Start FastAPI backend (port 7862) |
	| `run_frontend.sh` | Start Next.js frontend (port 3000) |
	| `kill_backend.sh` | Stop backend processes |

	---

	## Data Files

	### Vector Store Files (data/vector_store/)

	Generated by `build_vector_store.py`:

	| File | Purpose | Format |
	|------|---------|--------|
	| `faiss_index.bin` | FAISS vector index | Binary |
	| `documents.json` | Document chunks | JSON array of strings |
	| `metadata.json` | Document metadata | JSON array of objects |
	| `config.json` | Build configuration | JSON object |
	| `build_log.json` | Build information | JSON object |

	Each object in `metadata.json` has the following structure:
	```json
	{
	  "source": "guideline.pdf",
	  "section": "Management",
	  "chunk_id": 0,
	  "chunk_size": 1000,
	  "file_hash": "a3f2c9d8...",
	  "extraction_method": "pymupdf",
	  "total_pages": 15,
	  "citation": "SLCOG Guidelines 2025",
	  "category": "Obstetrics",
	  "processed_at": "2025-10-23T15:08:30.273544"
	}
	```
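Because `documents.json` and `metadata.json` are both JSON arrays, a loader can sanity-check them before use. The sketch below is illustrative only: it assumes the two arrays are parallel (chunk `i` is described by metadata entry `i`), which this document does not state, and the function name `load_chunks_with_metadata` and the `REQUIRED_KEYS` subset are hypothetical rather than taken from `src/simple_vector_store.py`.

```python
import json
from pathlib import Path

# Assumed minimal subset of the metadata fields shown above.
REQUIRED_KEYS = {"source", "chunk_id", "citation"}


def load_chunks_with_metadata(store_dir):
    """Load chunk texts and metadata, checking that they line up one-to-one."""
    store = Path(store_dir)
    documents = json.loads((store / "documents.json").read_text(encoding="utf-8"))
    metadata = json.loads((store / "metadata.json").read_text(encoding="utf-8"))

    # Assumption: entry i of metadata.json describes chunk i of documents.json.
    if len(documents) != len(metadata):
        raise ValueError(
            f"{len(documents)} chunks but {len(metadata)} metadata entries"
        )
    for index, entry in enumerate(metadata):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"metadata[{index}] is missing {sorted(missing)}")
    return documents, metadata
```

A check like this could run right after a build or an incremental add, so that a mismatched store is caught locally instead of after upload.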

	---

	## Configuration Files

	### Environment Variables

	.env (local development):
	```bash
	CEREBRAS_API_KEY=csk_your_key_here
	# For uploading vector store
	HF_TOKEN=hf_your_token_here
	```

	Hugging Face Spaces Secrets:
	```
	CEREBRAS_API_KEY # Required
	HF_TOKEN # Optional (for vector store upload)
	ALLOWED_ORIGINS # Optional (CORS, comma-separated)
	```
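As a rough illustration of how these variables might be consumed at startup, the sketch below reads them with the standard library only. The helper name `load_settings` and the returned dictionary keys are hypothetical; the actual handling in `app.py` and `src/enhanced_backend_api.py` may differ.

```python
import os


def load_settings(env=None):
    """Read the variables listed above from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    # CEREBRAS_API_KEY is the only required value.
    api_key = env.get("CEREBRAS_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("CEREBRAS_API_KEY is required but not set")

    # ALLOWED_ORIGINS is comma-separated; blank entries are dropped.
    raw_origins = env.get("ALLOWED_ORIGINS", "")
    origins = [item.strip() for item in raw_origins.split(",") if item.strip()]

    return {
        "cerebras_api_key": api_key,
        "hf_token": env.get("HF_TOKEN") or None,  # optional
        "allowed_origins": origins,
    }
```

Failing fast on a missing `CEREBRAS_API_KEY` surfaces a misconfigured Space at startup rather than on the first query.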

	### Requirements

	requirements.txt - Python dependencies:
	- cerebras-cloud-sdk - Cerebras API client
	- gradio - Web interface
	- fastapi - REST API
	- sentence-transformers - Embeddings
	- faiss-cpu - Vector search
	- huggingface-hub - Model/data hosting
	- PyMuPDF, pdfplumber - PDF extraction

	---

	## Git Ignore Strategy

	### Ignored (Local Only)

	- `data/guidelines/` - Source PDFs
	- `data/vector_store/` - Built vector store
	- `archive/` - Old files
	- `test_pdfs/`, `test_vector_store/` - Test files
	- `frontend/` - Separate deployment
	- `.env` - Local environment variables
	- `*.log` - Log files

	### Committed (Version Control)

	- `src/` - Application code
	- `scripts/` - Automation scripts
	- `app.py` - Gradio entry point
	- `requirements.txt` - Dependencies
	- `.env.example` - Environment template
	- `*.md` - Documentation

	---

	## Workflow

	### Development Workflow

	1. Add new guideline:
	```bash
	cp ~/Downloads/new_guideline.pdf data/guidelines/
	```

	2. Update vector store:
	```bash
	python scripts/add_document.py \
	  --file data/guidelines/new_guideline.pdf \
	  --citation "SLCOG Guidelines 2025" \
	  --vector-store-dir ./data/vector_store
	```

	3. Test locally:
	```bash
	# Terminal 1: Start backend
	./run_backend.sh

	# Terminal 2: Start frontend
	./run_frontend.sh

	# Or just test Gradio
	python app.py
	```

	4. Deploy to production:
	```bash
	# Upload vector store to HF Hub
	python scripts/build_vector_store.py \
	  --input-dir ./data/guidelines \
	  --output-dir ./data/vector_store \
	  --upload --repo-id sniro23/VedaMD-Vector-Store

	# Push code to HF Spaces
	git add src/ app.py requirements.txt
	git commit -m "Update: Add new guidelines"
	git push origin main
	```
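The `file_hash` field recorded in `metadata.json` suggests one way to check, before step 2, whether a guideline has already been indexed. The sketch below is a guess at that idea, not a description of `scripts/add_document.py`: it assumes the hash is a full SHA-256 hex digest of the PDF bytes (the algorithm is not stated in this document), and the helper names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def already_indexed(pdf_path, store_dir):
    """True if some metadata entry carries the same file hash as pdf_path."""
    metadata_file = Path(store_dir) / "metadata.json"
    if not metadata_file.exists():
        return False  # no store yet, so nothing can be indexed
    metadata = json.loads(metadata_file.read_text(encoding="utf-8"))
    known_hashes = {entry.get("file_hash") for entry in metadata}
    return file_sha256(pdf_path) in known_hashes
```

Comparing content hashes rather than file names would also catch a guideline that was renamed but not changed.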

	### Production Deployment

	Backend (Hugging Face Spaces):
	- Gradio interface: Automatic from `app.py`
	- FastAPI API: Runs on port 7862
	- Vector store: Downloaded from HF Hub on startup
	- Secrets: Set in HF Spaces settings
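The startup download of the vector store could look roughly like the sketch below, which uses `hf_hub_download` from `huggingface-hub` (listed in `requirements.txt`). Several details are assumptions: the function name `fetch_vector_store` is hypothetical, the repository is assumed to be a dataset repo (`repo_type="dataset"`), and the four file names are taken from the directory tree above. The `download` parameter exists only so the logic can be exercised without network access.

```python
from pathlib import Path

# File names taken from data/vector_store/ in the directory tree.
STORE_FILES = ("faiss_index.bin", "documents.json", "metadata.json", "config.json")


def fetch_vector_store(repo_id, token=None, repo_type="dataset", download=None):
    """Download each vector store file and return a {name: local Path} mapping."""
    if download is None:
        # Imported lazily so this module loads even without huggingface-hub.
        from huggingface_hub import hf_hub_download as download

    return {
        name: Path(
            download(repo_id=repo_id, filename=name, repo_type=repo_type, token=token)
        )
        for name in STORE_FILES
    }
```

With the repository named in the deployment step, a call might be `fetch_vector_store("sniro23/VedaMD-Vector-Store")`; pass `token` only if the repository is private.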

	Frontend (Netlify):
	- Build: `cd frontend && npm run build`
	- Deploy: Automatic from GitHub
	- Environment: `NEXT_PUBLIC_API_URL=https://sniro23-vedamd-enhanced.hf.space`

	---

	## Migration Notes

	### From Old Structure

	Moved:
	- `Obs/*.pdf` → `data/guidelines/*.pdf`
	- Vector store logic remains in `src/`

	Archived:
	- `batch_ocr_pipeline.py` → `archive/old_scripts/`
	- `convert_pdf.py` → `archive/old_scripts/`
	- `output*.md` → `archive/old_docs/`
	- `cleanup_plan.md` → `archive/old_docs/`

	Created New:
	- `scripts/` - Automation scripts
	- `data/` - Data directory structure
	- `docs/` - Documentation index
	- `archive/` - Old files

	---

	## Key Improvements

	### Before Cleanup
	```
	SL Clinical Assistant/
	├── app.py
	├── src/
	├── Obs/ # Unclear name
	├── batch_ocr_pipeline.py # Old script at root
	├── convert_pdf.py # Old script at root
	├── output.md # Temporary file
	├── output_new.md # Temporary file
	└── 15+ .md files at root # Disorganized docs
	```

	### After Cleanup
	```
	SL Clinical Assistant/
	├── app.py # Clear entry point
	├── src/ # Core code
	├── scripts/ # Automation scripts
	├── data/ # Data files
	│ ├── guidelines/ # Clear purpose
	│ └── vector_store/ # Clear purpose
	├── docs/ # Documentation index
	├── archive/ # Old files preserved
	└── Documentation files # Organized at root
	```

	---

	## Best Practices

	### Code Organization

	1. Core Logic: Keep in `src/`
	2. Automation: Keep in `scripts/`
	3. Data: Keep in `data/` (gitignored)
	4. Tests: Keep in `tests/` (if created)

	### Documentation

	1. User Guides: Root level (PIPELINE_GUIDE.md, etc.)
	2. Technical Docs: Root level (DEPLOYMENT.md, etc.)
	3. Code Docs: Inline docstrings in Python files
	4. Index: `docs/README.md` for navigation

	### Data Management

	1. Source Data: `data/guidelines/`
	2. Processed Data: `data/vector_store/`
	3. Backups: Automatic in `data/vector_store/backups/`
	4. Test Data: `test_pdfs/`, `test_vector_store/`
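How the scripts create the automatic backups is not described here. One plausible approach, sketched below under that caveat, copies the current store files into a timestamped folder under `backups/` before they are overwritten. The function name `backup_vector_store` and the `YYYYMMDD_HHMMSS` folder naming are assumptions.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_vector_store(store_dir, now=None):
    """Copy the top-level files of store_dir into backups/<timestamp>/."""
    store = Path(store_dir)
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    target = store / "backups" / stamp
    target.mkdir(parents=True, exist_ok=False)

    # Only regular files are copied, so the backups/ directory itself is skipped.
    for item in store.iterdir():
        if item.is_file():
            shutil.copy2(item, target / item.name)
    return target
```

Calling `backup_vector_store("./data/vector_store")` before an incremental add would leave the previous index recoverable if the update goes wrong.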

	### Version Control

	1. Commit Code: `src/`, `scripts/`, `app.py`
	2. Ignore Data: `data/`, `archive/`, `test_*/`
	3. Commit Docs: All `.md` files
	4. Templates: `.env.example`, not `.env`

	---

	## Quick Reference

	### Common Commands

	```bash
	# Build vector store from scratch
	python scripts/build_vector_store.py --input-dir ./data/guidelines --output-dir ./data/vector_store

	# Add single document
	python scripts/add_document.py --file new.pdf --vector-store-dir ./data/vector_store

	# Start backend
	./run_backend.sh

	# Start frontend
	./run_frontend.sh

	# Test Gradio interface
	python app.py

	# Upload to HF Hub
	python scripts/build_vector_store.py ... --upload --repo-id sniro23/VedaMD-Vector-Store
	```

	### Important Paths

	- PDFs: `data/guidelines/`
	- Vector Store: `data/vector_store/`
	- RAG System: `src/enhanced_groq_medical_rag.py`
	- API: `src/enhanced_backend_api.py`
	- Scripts: `scripts/`
	- Docs: Root level + `docs/README.md`

	---

	Clean codebase = Maintainable codebase = Production-ready codebase