# Phase 3 Implementation Spec: Judge Vertical Slice **Goal**: Implement the "Brain" of the agent — evaluating evidence quality. **Philosophy**: "Structured Output or Bust." **Prerequisite**: Phase 2 complete (all search tests passing) --- ## 1. The Slice Definition This slice covers: 1. **Input**: A user question + a list of `Evidence` (from Phase 2). 2. **Process**: - Construct a prompt with the evidence. - Call LLM (PydanticAI / OpenAI / Anthropic). - Force JSON structured output. 3. **Output**: A `JudgeAssessment` object. **Files to Create**: - `src/utils/models.py` - Add JudgeAssessment models (extend from Phase 2) - `src/prompts/judge.py` - Judge prompt templates - `src/agent_factory/judges.py` - JudgeHandler with PydanticAI - `tests/unit/agent_factory/test_judges.py` - Unit tests --- ## 2. Models (Add to `src/utils/models.py`) The output schema must be strict for reliable structured output. ```python """Add these models to src/utils/models.py (after Evidence models from Phase 2).""" from pydantic import BaseModel, Field from typing import List, Literal class AssessmentDetails(BaseModel): """Detailed assessment of evidence quality.""" mechanism_score: int = Field( ..., ge=0, le=10, description="How well does the evidence explain the mechanism? 0-10" ) mechanism_reasoning: str = Field( ..., min_length=10, description="Explanation of mechanism score" ) clinical_evidence_score: int = Field( ..., ge=0, le=10, description="Strength of clinical/preclinical evidence. 0-10" ) clinical_reasoning: str = Field( ..., min_length=10, description="Explanation of clinical evidence score" ) drug_candidates: List[str] = Field( default_factory=list, description="List of specific drug candidates mentioned" ) key_findings: List[str] = Field( default_factory=list, description="Key findings from the evidence" ) class JudgeAssessment(BaseModel): """Complete assessment from the Judge.""" details: AssessmentDetails sufficient: bool = Field( ..., description="Is evidence sufficient to provide a recommendation?" ) confidence: float = Field( ..., ge=0.0, le=1.0, description="Confidence in the assessment (0-1)" ) recommendation: Literal["continue", "synthesize"] = Field( ..., description="continue = need more evidence, synthesize = ready to answer" ) next_search_queries: List[str] = Field( default_factory=list, description="If continue, what queries to search next" ) reasoning: str = Field( ..., min_length=20, description="Overall reasoning for the recommendation" ) ``` --- ## 3. Prompt Engineering (`src/prompts/judge.py`) We treat prompts as code. They should be versioned and clean. ```python """Judge prompts for evidence assessment.""" from typing import List from src.utils.models import Evidence SYSTEM_PROMPT = """You are an expert drug repurposing research judge. Your task is to evaluate evidence from biomedical literature and determine if it's sufficient to recommend drug candidates for a given condition. ## Evaluation Criteria 1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism? - 0-3: No clear mechanism, speculative - 4-6: Some mechanistic insight, but gaps exist - 7-10: Clear, well-supported mechanism of action 2. **Clinical Evidence Score (0-10)**: Strength of clinical/preclinical support? - 0-3: No clinical data, only theoretical - 4-6: Preclinical or early clinical data - 7-10: Strong clinical evidence (trials, meta-analyses) 3. **Sufficiency**: Evidence is sufficient when: - Combined scores >= 12 AND - At least one specific drug candidate identified AND - Clear mechanistic rationale exists ## Output Rules - Always output valid JSON matching the schema - Be conservative: only recommend "synthesize" when truly confident - If continuing, suggest specific, actionable search queries - Never hallucinate drug names or findings not in the evidence """ def format_user_prompt(question: str, evidence: List[Evidence]) -> str: """ Format the user prompt with question and evidence. Args: question: The user's research question evidence: List of Evidence objects from search Returns: Formatted prompt string """ evidence_text = "\n\n".join([ f"### Evidence {i+1}\n" f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n" f"**URL**: {e.citation.url}\n" f"**Date**: {e.citation.date}\n" f"**Content**:\n{e.content[:1500]}..." if len(e.content) > 1500 else f"### Evidence {i+1}\n" f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n" f"**URL**: {e.citation.url}\n" f"**Date**: {e.citation.date}\n" f"**Content**:\n{e.content}" for i, e in enumerate(evidence) ]) return f"""## Research Question {question} ## Available Evidence ({len(evidence)} sources) {evidence_text} ## Your Task Evaluate this evidence and determine if it's sufficient to recommend drug repurposing candidates. Respond with a JSON object matching the JudgeAssessment schema. """ def format_empty_evidence_prompt(question: str) -> str: """ Format prompt when no evidence was found. Args: question: The user's research question Returns: Formatted prompt string """ return f"""## Research Question {question} ## Available Evidence No evidence was found from the search. ## Your Task Since no evidence was found, recommend search queries that might yield better results. Set sufficient=False and recommendation="continue". Suggest 3-5 specific search queries. """ ``` --- ## 4. JudgeHandler Implementation (`src/agent_factory/judges.py`) Using PydanticAI for structured output with retry logic. ```python """Judge handler for evidence assessment using PydanticAI.""" import os from typing import List import structlog from pydantic_ai import Agent from pydantic_ai.models.openai import OpenAIModel from pydantic_ai.models.anthropic import AnthropicModel from src.utils.models import Evidence, JudgeAssessment, AssessmentDetails from src.utils.config import settings from src.prompts.judge import SYSTEM_PROMPT, format_user_prompt, format_empty_evidence_prompt logger = structlog.get_logger() def get_model(): """Get the LLM model based on configuration.""" provider = getattr(settings, "llm_provider", "openai") if provider == "anthropic": return AnthropicModel( model_name=getattr(settings, "anthropic_model", "claude-3-5-sonnet-20241022"), api_key=os.getenv("ANTHROPIC_API_KEY"), ) else: return OpenAIModel( model_name=getattr(settings, "openai_model", "gpt-4o"), api_key=os.getenv("OPENAI_API_KEY"), ) class JudgeHandler: """ Handles evidence assessment using an LLM with structured output. Uses PydanticAI to ensure responses match the JudgeAssessment schema. """ def __init__(self, model=None): """ Initialize the JudgeHandler. Args: model: Optional PydanticAI model. If None, uses config default. """ self.model = model or get_model() self.agent = Agent( model=self.model, result_type=JudgeAssessment, system_prompt=SYSTEM_PROMPT, retries=3, ) async def assess( self, question: str, evidence: List[Evidence], ) -> JudgeAssessment: """ Assess evidence and determine if it's sufficient. Args: question: The user's research question evidence: List of Evidence objects from search Returns: JudgeAssessment with evaluation results Raises: JudgeError: If assessment fails after retries """ logger.info( "Starting evidence assessment", question=question[:100], evidence_count=len(evidence), ) # Format the prompt based on whether we have evidence if evidence: user_prompt = format_user_prompt(question, evidence) else: user_prompt = format_empty_evidence_prompt(question) try: # Run the agent with structured output result = await self.agent.run(user_prompt) assessment = result.data logger.info( "Assessment complete", sufficient=assessment.sufficient, recommendation=assessment.recommendation, confidence=assessment.confidence, ) return assessment except Exception as e: logger.error("Assessment failed", error=str(e)) # Return a safe default assessment on failure return self._create_fallback_assessment(question, str(e)) def _create_fallback_assessment( self, question: str, error: str, ) -> JudgeAssessment: """ Create a fallback assessment when LLM fails. Args: question: The original question error: The error message Returns: Safe fallback JudgeAssessment """ return JudgeAssessment( details=AssessmentDetails( mechanism_score=0, mechanism_reasoning="Assessment failed due to LLM error", clinical_evidence_score=0, clinical_reasoning="Assessment failed due to LLM error", drug_candidates=[], key_findings=[], ), sufficient=False, confidence=0.0, recommendation="continue", next_search_queries=[ f"{question} mechanism", f"{question} clinical trials", f"{question} drug candidates", ], reasoning=f"Assessment failed: {error}. Recommend retrying with refined queries.", ) class HFInferenceJudgeHandler: """ JudgeHandler using HuggingFace Inference API for FREE LLM calls. This is the DEFAULT for demo mode - provides real AI analysis without requiring users to have OpenAI/Anthropic API keys. Model Fallback Chain (handles gated models and rate limits): 1. meta-llama/Llama-3.1-8B-Instruct (best quality, requires HF_TOKEN) 2. mistralai/Mistral-7B-Instruct-v0.3 (good quality, may require token) 3. HuggingFaceH4/zephyr-7b-beta (ungated, always works) Rate Limit Handling: - Exponential backoff with 3 retries - Falls back to next model on persistent 429/503 errors """ # Model fallback chain: gated (best) → ungated (fallback) FALLBACK_MODELS = [ "meta-llama/Llama-3.1-8B-Instruct", # Best quality (gated) "mistralai/Mistral-7B-Instruct-v0.3", # Good quality "HuggingFaceH4/zephyr-7b-beta", # Ungated fallback ] def __init__(self, model_id: str | None = None) -> None: """ Initialize with HF Inference client. Args: model_id: Optional specific model ID. If None, uses FALLBACK_MODELS chain. """ self.model_id = model_id # Will automatically use HF_TOKEN from env if available self.client = InferenceClient() self.call_count = 0 self.last_question: str | None = None self.last_evidence: list[Evidence] | None = None def _extract_json(self, text: str) -> dict[str, Any] | None: """ Robust JSON extraction that handles markdown blocks and nested braces. """ text = text.strip() # Remove markdown code blocks if present (with bounds checking) if "```json" in text: parts = text.split("```json", 1) if len(parts) > 1: inner_parts = parts[1].split("```", 1) text = inner_parts[0] elif "```" in text: parts = text.split("```", 1) if len(parts) > 1: inner_parts = parts[1].split("```", 1) text = inner_parts[0] text = text.strip() # Find first '{' start_idx = text.find("{") if start_idx == -1: return None # Stack-based parsing ignoring chars in strings count = 0 in_string = False escape = False for i, char in enumerate(text[start_idx:], start=start_idx): if in_string: if escape: escape = False elif char == "\\": escape = True elif char == '"': in_string = False elif char == '"': in_string = True elif char == "{": count += 1 elif char == "}": count -= 1 if count == 0: try: result = json.loads(text[start_idx : i + 1]) if isinstance(result, dict): return result return None except json.JSONDecodeError: return None return None @retry( stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=4), retry=retry_if_exception_type(Exception), reraise=True, ) async def _call_with_retry(self, model: str, prompt: str, question: str) -> JudgeAssessment: """Make API call with retry logic using chat_completion.""" loop = asyncio.get_running_loop() # Build messages for chat_completion (model-agnostic) messages = [ { "role": "system", "content": f"""{SYSTEM_PROMPT} IMPORTANT: Respond with ONLY valid JSON matching this schema: {{ "details": {{ "mechanism_score": , "mechanism_reasoning": "", "clinical_evidence_score": , "clinical_reasoning": "", "drug_candidates": ["", ...], "key_findings": ["", ...] }}, "sufficient": , "confidence": , "recommendation": "continue" | "synthesize", "next_search_queries": ["", ...], "reasoning": "" }}""", }, {"role": "user", "content": prompt}, ] # Use chat_completion (conversational task - supported by all models) response = await loop.run_in_executor( None, lambda: self.client.chat_completion( messages=messages, model=model, max_tokens=1024, temperature=0.1, ), ) # Extract content from response content = response.choices[0].message.content if not content: raise ValueError("Empty response from model") # Extract and parse JSON json_data = self._extract_json(content) if not json_data: raise ValueError("No valid JSON found in response") return JudgeAssessment(**json_data) async def assess( self, question: str, evidence: list[Evidence], ) -> JudgeAssessment: """ Assess evidence using HuggingFace Inference API. Attempts models in order until one succeeds. """ self.call_count += 1 self.last_question = question self.last_evidence = evidence # Format the user prompt if evidence: user_prompt = format_user_prompt(question, evidence) else: user_prompt = format_empty_evidence_prompt(question) models_to_try: list[str] = [self.model_id] if self.model_id else self.FALLBACK_MODELS last_error: Exception | None = None for model in models_to_try: try: return await self._call_with_retry(model, user_prompt, question) except Exception as e: logger.warning("Model failed", model=model, error=str(e)) last_error = e continue # All models failed logger.error("All HF models failed", error=str(last_error)) return self._create_fallback_assessment(question, str(last_error)) def _create_fallback_assessment( self, question: str, error: str, ) -> JudgeAssessment: """Create a fallback assessment when inference fails.""" return JudgeAssessment( details=AssessmentDetails( mechanism_score=0, mechanism_reasoning=f"Assessment failed: {error}", clinical_evidence_score=0, clinical_reasoning=f"Assessment failed: {error}", drug_candidates=[], key_findings=[], ), sufficient=False, confidence=0.0, recommendation="continue", next_search_queries=[ f"{question} mechanism", f"{question} clinical trials", f"{question} drug candidates", ], reasoning=f"HF Inference failed: {error}. Recommend retrying.", ) class MockJudgeHandler: """ Mock JudgeHandler for UNIT TESTING ONLY. NOT for production use. Use HFInferenceJudgeHandler for demo mode. """ def __init__(self, mock_response: JudgeAssessment | None = None): """Initialize with optional mock response for testing.""" self.mock_response = mock_response self.call_count = 0 self.last_question = None self.last_evidence = None async def assess( self, question: str, evidence: List[Evidence], ) -> JudgeAssessment: """Return the mock response (for testing only).""" self.call_count += 1 self.last_question = question self.last_evidence = evidence if self.mock_response: return self.mock_response # Default mock response for tests return JudgeAssessment( details=AssessmentDetails( mechanism_score=7, mechanism_reasoning="Mock assessment for testing", clinical_evidence_score=6, clinical_reasoning="Mock assessment for testing", drug_candidates=["TestDrug"], key_findings=["Test finding"], ), sufficient=len(evidence) >= 3, confidence=0.75, recommendation="synthesize" if len(evidence) >= 3 else "continue", next_search_queries=["query 1", "query 2"] if len(evidence) < 3 else [], reasoning="Mock assessment for unit testing only", ) ``` --- ## 5. TDD Workflow ### Test File: `tests/unit/agent_factory/test_judges.py` ```python """Unit tests for JudgeHandler.""" import pytest from unittest.mock import AsyncMock, MagicMock, patch from src.utils.models import ( Evidence, Citation, JudgeAssessment, AssessmentDetails, ) class TestJudgeHandler: """Tests for JudgeHandler.""" @pytest.mark.asyncio async def test_assess_returns_assessment(self): """JudgeHandler should return JudgeAssessment from LLM.""" from src.agent_factory.judges import JudgeHandler # Create mock assessment mock_assessment = JudgeAssessment( details=AssessmentDetails( mechanism_score=8, mechanism_reasoning="Strong mechanistic evidence", clinical_evidence_score=7, clinical_reasoning="Good clinical support", drug_candidates=["Metformin"], key_findings=["Neuroprotective effects"], ), sufficient=True, confidence=0.85, recommendation="synthesize", next_search_queries=[], reasoning="Evidence is sufficient for synthesis", ) # Mock the PydanticAI agent mock_result = MagicMock() mock_result.data = mock_assessment with patch("src.agent_factory.judges.Agent") as mock_agent_class: mock_agent = AsyncMock() mock_agent.run = AsyncMock(return_value=mock_result) mock_agent_class.return_value = mock_agent handler = JudgeHandler() # Replace the agent with our mock handler.agent = mock_agent evidence = [ Evidence( content="Metformin shows neuroprotective properties...", citation=Citation( source="pubmed", title="Metformin in AD", url="https://pubmed.ncbi.nlm.nih.gov/12345/", date="2024-01-01", ), ) ] result = await handler.assess("metformin alzheimer", evidence) assert result.sufficient is True assert result.recommendation == "synthesize" assert result.confidence == 0.85 assert "Metformin" in result.details.drug_candidates @pytest.mark.asyncio async def test_assess_empty_evidence(self): """JudgeHandler should handle empty evidence gracefully.""" from src.agent_factory.judges import JudgeHandler mock_assessment = JudgeAssessment( details=AssessmentDetails( mechanism_score=0, mechanism_reasoning="No evidence to assess", clinical_evidence_score=0, clinical_reasoning="No evidence to assess", drug_candidates=[], key_findings=[], ), sufficient=False, confidence=0.0, recommendation="continue", next_search_queries=["metformin alzheimer mechanism"], reasoning="No evidence found, need to search more", ) mock_result = MagicMock() mock_result.data = mock_assessment with patch("src.agent_factory.judges.Agent") as mock_agent_class: mock_agent = AsyncMock() mock_agent.run = AsyncMock(return_value=mock_result) mock_agent_class.return_value = mock_agent handler = JudgeHandler() handler.agent = mock_agent result = await handler.assess("metformin alzheimer", []) assert result.sufficient is False assert result.recommendation == "continue" assert len(result.next_search_queries) > 0 @pytest.mark.asyncio async def test_assess_handles_llm_failure(self): """JudgeHandler should return fallback on LLM failure.""" from src.agent_factory.judges import JudgeHandler with patch("src.agent_factory.judges.Agent") as mock_agent_class: mock_agent = AsyncMock() mock_agent.run = AsyncMock(side_effect=Exception("API Error")) mock_agent_class.return_value = mock_agent handler = JudgeHandler() handler.agent = mock_agent evidence = [ Evidence( content="Some content", citation=Citation( source="pubmed", title="Title", url="url", date="2024", ), ) ] result = await handler.assess("test question", evidence) # Should return fallback, not raise assert result.sufficient is False assert result.recommendation == "continue" assert "failed" in result.reasoning.lower() class TestHFInferenceJudgeHandler: """Tests for HFInferenceJudgeHandler.""" @pytest.mark.asyncio async def test_extract_json_raw(self): """Should extract raw JSON.""" from src.agent_factory.judges import HFInferenceJudgeHandler handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler) # Bypass __init__ for unit testing extraction result = handler._extract_json('{"key": "value"}') assert result == {"key": "value"} @pytest.mark.asyncio async def test_extract_json_markdown_block(self): """Should extract JSON from markdown code block.""" from src.agent_factory.judges import HFInferenceJudgeHandler handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler) response = '''Here is the assessment: ```json {"key": "value", "nested": {"inner": 1}} ``` ''' result = handler._extract_json(response) assert result == {"key": "value", "nested": {"inner": 1}} @pytest.mark.asyncio async def test_extract_json_with_preamble(self): """Should extract JSON with preamble text.""" from src.agent_factory.judges import HFInferenceJudgeHandler handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler) response = 'Here is your JSON response:\n{"sufficient": true, "confidence": 0.85}' result = handler._extract_json(response) assert result == {"sufficient": True, "confidence": 0.85} @pytest.mark.asyncio async def test_extract_json_nested_braces(self): """Should handle nested braces correctly.""" from src.agent_factory.judges import HFInferenceJudgeHandler handler = HFInferenceJudgeHandler.__new__(HFInferenceJudgeHandler) response = '{"details": {"mechanism_score": 8}, "reasoning": "test"}' result = handler._extract_json(response) assert result["details"]["mechanism_score"] == 8 @pytest.mark.asyncio async def test_hf_handler_uses_fallback_models(self): """HFInferenceJudgeHandler should have fallback model chain.""" from src.agent_factory.judges import HFInferenceJudgeHandler # Check class has fallback models defined assert len(HFInferenceJudgeHandler.FALLBACK_MODELS) >= 3 assert "zephyr-7b-beta" in HFInferenceJudgeHandler.FALLBACK_MODELS[-1] @pytest.mark.asyncio async def test_hf_handler_fallback_on_auth_error(self): """Should fall back to ungated model on auth error.""" from src.agent_factory.judges import HFInferenceJudgeHandler from unittest.mock import MagicMock, patch with patch("src.agent_factory.judges.InferenceClient") as mock_client_class: # First call raises 403, second succeeds mock_client = MagicMock() mock_client.chat_completion.side_effect = [ Exception("403 Forbidden: gated model"), MagicMock(choices=[MagicMock(message=MagicMock(content='{"sufficient": false}'))]) ] mock_client_class.return_value = mock_client handler = HFInferenceJudgeHandler() # Manually trigger fallback test assert handler._try_fallback_model() is True assert handler.model_id != "meta-llama/Llama-3.1-8B-Instruct" class TestMockJudgeHandler: """Tests for MockJudgeHandler (UNIT TESTING ONLY).""" @pytest.mark.asyncio async def test_mock_handler_returns_default(self): """MockJudgeHandler should return default assessment.""" from src.agent_factory.judges import MockJudgeHandler handler = MockJudgeHandler() evidence = [ Evidence( content="Content 1", citation=Citation(source="pubmed", title="T1", url="u1", date="2024"), ), Evidence( content="Content 2", citation=Citation(source="web", title="T2", url="u2", date="2024"), ), ] result = await handler.assess("test", evidence) assert handler.call_count == 1 assert handler.last_question == "test" assert len(handler.last_evidence) == 2 assert result.details.mechanism_score == 7 @pytest.mark.asyncio async def test_mock_handler_custom_response(self): """MockJudgeHandler should return custom response when provided.""" from src.agent_factory.judges import MockJudgeHandler custom_assessment = JudgeAssessment( details=AssessmentDetails( mechanism_score=10, mechanism_reasoning="Custom reasoning", clinical_evidence_score=10, clinical_reasoning="Custom clinical", drug_candidates=["CustomDrug"], key_findings=["Custom finding"], ), sufficient=True, confidence=1.0, recommendation="synthesize", next_search_queries=[], reasoning="Custom assessment", ) handler = MockJudgeHandler(mock_response=custom_assessment) result = await handler.assess("test", []) assert result.details.mechanism_score == 10 assert result.details.drug_candidates == ["CustomDrug"] @pytest.mark.asyncio async def test_mock_handler_insufficient_with_few_evidence(self): """MockJudgeHandler should recommend continue with < 3 evidence.""" from src.agent_factory.judges import MockJudgeHandler handler = MockJudgeHandler() # Only 2 pieces of evidence evidence = [ Evidence( content="Content", citation=Citation(source="pubmed", title="T", url="u", date="2024"), ), Evidence( content="Content 2", citation=Citation(source="web", title="T2", url="u2", date="2024"), ), ] result = await handler.assess("test", evidence) assert result.sufficient is False assert result.recommendation == "continue" assert len(result.next_search_queries) > 0 ``` --- ## 6. Dependencies Add to `pyproject.toml`: ```toml [project] dependencies = [ # ... existing deps ... "pydantic-ai>=0.0.16", "openai>=1.0.0", "anthropic>=0.18.0", "huggingface-hub>=0.20.0", # For HFInferenceJudgeHandler (FREE LLM) ] ``` **Note**: `huggingface-hub` is required for the free tier to work. It: - Provides `InferenceClient` for API calls - Auto-reads `HF_TOKEN` from environment (optional, for gated models) - Works without any token for ungated models like `zephyr-7b-beta` --- ## 7. Configuration (`src/utils/config.py`) Add LLM configuration: ```python """Add to src/utils/config.py.""" from pydantic_settings import BaseSettings from typing import Literal class Settings(BaseSettings): """Application settings.""" # LLM Configuration llm_provider: Literal["openai", "anthropic"] = "openai" openai_model: str = "gpt-4o" anthropic_model: str = "claude-3-5-sonnet-20241022" # API Keys (loaded from environment) openai_api_key: str | None = None anthropic_api_key: str | None = None ncbi_api_key: str | None = None class Config: env_file = ".env" env_file_encoding = "utf-8" settings = Settings() ``` --- ## 8. Implementation Checklist - [ ] Add `AssessmentDetails` and `JudgeAssessment` models to `src/utils/models.py` - [ ] Create `src/prompts/__init__.py` (empty, for package) - [ ] Create `src/prompts/judge.py` with prompt templates - [ ] Create `src/agent_factory/__init__.py` with exports - [ ] Implement `src/agent_factory/judges.py` with JudgeHandler - [ ] Update `src/utils/config.py` with LLM settings - [ ] Create `tests/unit/agent_factory/__init__.py` - [ ] Write tests in `tests/unit/agent_factory/test_judges.py` - [ ] Run `uv run pytest tests/unit/agent_factory/ -v` — **ALL TESTS MUST PASS** - [ ] Commit: `git commit -m "feat: phase 3 judge slice complete"` --- ## 9. Definition of Done Phase 3 is **COMPLETE** when: 1. All unit tests pass: `uv run pytest tests/unit/agent_factory/ -v` 2. `JudgeHandler` can assess evidence and return structured output 3. Graceful degradation: if LLM fails, returns safe fallback 4. MockJudgeHandler works for testing without API calls 5. Can run this in Python REPL: ```python import asyncio import os from src.utils.models import Evidence, Citation from src.agent_factory.judges import JudgeHandler, MockJudgeHandler # Test with mock (no API key needed) async def test_mock(): handler = MockJudgeHandler() evidence = [ Evidence( content="Metformin shows neuroprotective effects in AD models", citation=Citation( source="pubmed", title="Metformin and Alzheimer's", url="https://pubmed.ncbi.nlm.nih.gov/12345/", date="2024-01-01", ), ), ] result = await handler.assess("metformin alzheimer", evidence) print(f"Sufficient: {result.sufficient}") print(f"Recommendation: {result.recommendation}") print(f"Drug candidates: {result.details.drug_candidates}") asyncio.run(test_mock()) # Test with real LLM (requires API key) async def test_real(): os.environ["OPENAI_API_KEY"] = "your-key-here" # Or set in .env handler = JudgeHandler() evidence = [ Evidence( content="Metformin shows neuroprotective effects in AD models...", citation=Citation( source="pubmed", title="Metformin and Alzheimer's", url="https://pubmed.ncbi.nlm.nih.gov/12345/", date="2024-01-01", ), ), ] result = await handler.assess("metformin alzheimer", evidence) print(f"Sufficient: {result.sufficient}") print(f"Confidence: {result.confidence}") print(f"Reasoning: {result.reasoning}") # asyncio.run(test_real()) # Uncomment with valid API key ``` **Proceed to Phase 4 ONLY after all checkboxes are complete.**