# Testing Strategy

## Ensuring DeepCritical Is Ironclad

---

## Overview

Our testing strategy follows a strict **Pyramid of Reliability**:

1. **Unit Tests**: Fast, isolated logic checks (60% of tests)
2. **Integration Tests**: Tool interactions & Agent loops (30% of tests)
3. **E2E / Regression Tests**: Full research workflows (10% of tests)

**Goal**: Ship a research agent that doesn't hallucinate, crash on API limits, or burn $100 in tokens by accident.

---

## 1. Unit Tests (Fast & Cheap)

**Location**: `tests/unit/`

Focus on individual components without external network calls. Mock everything.

### Key Test Cases

#### Agent Logic

- **Initialization**: Verify the default config loads correctly.
- **State Updates**: Ensure `ResearchState` updates correctly (e.g., token counts increment).
- **Budget Checks**: Test that `should_continue()` returns `False` when the budget is exceeded.
- **Error Handling**: Test partial-failure recovery (e.g., one tool fails, the agent continues).

#### Tools (Mocked)

- **Parser Logic**: Feed raw XML/JSON to tool parsers and verify the resulting `Evidence` objects.
- **Validation**: Ensure tools reject invalid queries (empty strings, etc.).

#### Judge Prompts

- **Schema Compliance**: Verify prompt templates generate valid JSON structure instructions.
- **Variable Injection**: Ensure `{question}` and `{context}` are injected correctly into prompts.

```python
# Example: Testing State Logic
def test_budget_stop():
    state = ResearchState(tokens_used=50001, max_tokens=50000)
    assert should_continue(state) is False
```

---

## 2. Integration Tests (Realistic & Mocked I/O)

**Location**: `tests/integration/`

Focus on the interaction between the Orchestrator, Tools, and LLM Judge. Use **VCR.py** or a similar record/replay pattern for API calls to save money and time.

### Key Test Cases

#### Search Loop

- **Iteration Flow**: Verify the agent performs the Search -> Judge -> Search loop.
- **Tool Selection**: Verify the correct tools are called based on judge output (mocked judge).
- **Context Accumulation**: Ensure findings from Iteration 1 are passed to Iteration 2.

#### MCP Server Integration

- **Server Startup**: Verify the MCP server starts and exposes tools.
- **Client Connection**: Verify the agent can call tools via the MCP protocol.

```python
# Example: Testing Search Loop with Mocked Tools
async def test_search_loop_flow():
    agent = ResearchAgent(tools=[MockPubMed(), MockWeb()])
    report = await agent.run("test query")
    assert agent.state.iterations > 0
    assert len(report.sources) > 0
```

---

## 3. End-to-End (E2E) Tests (The "Real Deal")

**Location**: `tests/e2e/`

Run against **real APIs** (with strict rate limits) to verify system integrity. Run these **on demand** or **nightly**, not on every commit.

### Key Test Cases

#### The "Golden Query"

Run the primary demo query: *"What existing drugs might help treat long COVID fatigue?"* (sketched as a marked E2E test under Section 4 below)

- **Success Criteria**:
  - Returns at least 2 valid drug candidates (e.g., CoQ10, LDN).
  - Includes citations from PubMed.
  - Completes within 3 iterations.
  - JSON output matches the schema.

#### Deployment Smoke Test

- **Gradio UI**: Verify the UI launches and accepts input.
- **Streaming**: Verify the generator yields chunks (first chunk within 2 seconds).

---

## 4. Tools & Config

### Pytest Configuration

```toml
# pyproject.toml
[tool.pytest.ini_options]
markers = [
    "unit: fast, isolated tests",
    "integration: mocked network tests",
    "e2e: real network tests (slow, expensive)"
]
asyncio_mode = "auto"
```
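With these markers registered, the "Golden Query" from Section 3 becomes an opt-in test that the per-commit pipeline skips. The sketch below is illustrative only: the `deepcritical.agent` import path, the no-argument `ResearchAgent()` constructor, and the `report.candidates` attribute are assumptions; `agent.state.iterations` and `report.sources` mirror the integration example above.

```python
# tests/e2e/test_golden_query.py -- illustrative sketch; names are assumed
import pytest

from deepcritical.agent import ResearchAgent  # assumed import path

GOLDEN_QUERY = "What existing drugs might help treat long COVID fatigue?"


@pytest.mark.e2e  # excluded from the per-commit run; triggered manually or nightly
async def test_golden_query():
    agent = ResearchAgent()  # assumes defaults wire up the real tools
    report = await agent.run(GOLDEN_QUERY)

    # Success criteria from Section 3
    assert len(report.candidates) >= 2               # assumed attribute
    assert any("PMID" in s for s in report.sources)  # PubMed citations present
    assert agent.state.iterations <= 3
```

Select it explicitly with `pytest -m e2e`; the per-tier commands in the pipeline below (or `pytest -m "not e2e"`) never pick it up.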
### CI/CD Pipeline (GitHub Actions)

1. **Lint**: `ruff check .`
2. **Type Check**: `mypy .`
3. **Unit**: `pytest -m unit`
4. **Integration**: `pytest -m integration`
5. **E2E**: (Manual trigger only)

---

## 5. Anti-Hallucination Validation

How do we test whether the agent is lying?

1. **Citation Check**:
   - Use a regex to verify that every `[PMID: 12345]` cited in the report exists in the `Evidence` list (see the sketch after the checklist below).
   - Fail if a citation is "orphaned" (a hallucinated ID).
2. **Negative Constraints**:
   - Query for fake diseases ("Ligma syndrome") -> the agent should return "No evidence found".

---

## Checklist for Implementation

- [ ] Set up the `tests/` directory structure
- [ ] Configure `pytest` and `vcrpy`
- [ ] Create `tests/fixtures/` for mock data (PubMed XMLs)
- [ ] Write the first unit test for `ResearchState`
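The citation check from Section 5 can be sketched as a small helper plus a unit test. This assumes `Evidence` objects expose a `pmid` attribute and that reports cite sources in the `[PMID: 12345]` format; both are assumptions about the data model, not confirmed APIs.

```python
# Citation-check sketch -- Evidence.pmid and the [PMID: NNNNN] format are assumptions.
import re
from types import SimpleNamespace

PMID_PATTERN = re.compile(r"\[PMID:\s*(\d+)\]")


def find_orphaned_citations(report_text: str, evidence: list) -> set[str]:
    """Return PMIDs cited in the report that have no backing Evidence entry."""
    cited = set(PMID_PATTERN.findall(report_text))
    known = {str(e.pmid) for e in evidence}
    return cited - known


def test_report_flags_orphaned_citations():
    report_text = "CoQ10 improved fatigue scores [PMID: 12345], unlike [PMID: 99999]."
    evidence = [SimpleNamespace(pmid=12345)]  # stand-in for a real Evidence object
    assert find_orphaned_citations(report_text, evidence) == {"99999"}
```

The same helper could also run as a runtime guard before a report is returned to the user.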