Tom Claude committed on
Commit c5df650 · 1 Parent(s): c7dcc92

Switch to Llama-3.1-8B-Instruct via HF Inference Providers


Major changes:
- Replace local Phi-3 model with Llama-3.1-8B-Instruct via the Inference API (see the sketch below)
- Remove GPU dependencies (torch, transformers, accelerate, spaces)
- Use HuggingFace Inference Providers (Novita, etc.) for model hosting
- Enhance system prompt with explicit date format and enum value rules
- Reduce monthly cost from $40 to ~$12 (Team → PRO plan)
- Keep usage tracker (50 req/day per user) and MCP integration

Benefits:
- No more "model_pending_deploy" errors
- Native tool calling support via Llama-3.1
- Predictable costs with Inference Provider pay-per-use
- No ZeroGPU or Team plan required
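
The new call path in one glance - a minimal sketch that mirrors the app.py diff below (model name, max_tokens, temperature, and the HF_TOKEN variable are taken from that diff; the one-line system prompt here is a stand-in for the full SYSTEM_PROMPT):

import os
from huggingface_hub import InferenceClient

# Token must grant Inference Provider access (set in .env or Space secrets)
client = InferenceClient(token=os.getenv("HF_TOKEN"))

response = client.chat_completion(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Respond with valid JSON only."},  # stand-in prompt
        {"role": "user", "content": "Language: en\nQuestion: Find recent votes on climate policy."},
    ],
    max_tokens=500,
    temperature=0.3,
)
print(response.choices[0].message.content)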

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

Files changed (3)
  1. README.md +19 -18
  2. app.py +50 -92
  3. requirements.txt +0 -6
README.md CHANGED
@@ -7,33 +7,31 @@ sdk: gradio
 sdk_version: 5.49.1
 app_file: app.py
 pinned: false
-short_description: Swiss Parliamentary Data Chatbot with Phi-3-mini
+short_description: Swiss Parliamentary Data Chatbot with Llama-3.1-8B
 ---
 
 # 🏛️ CoJournalist Data
 
-A Swiss Parliamentary Data Chatbot powered by Phi-3-mini and the OpenParlData MCP server.
+A Swiss Parliamentary Data Chatbot powered by Llama-3.1-8B-Instruct and the OpenParlData MCP server.
 
 ## Features
 
-- 🤖 **Phi-3-mini-4k-instruct** - Efficient 3.8B parameter model running on ZeroGPU
+- 🤖 **Llama-3.1-8B-Instruct** - Meta's 8B parameter model with native tool calling support
 - 🌍 **Multilingual** - Support for English, German, French, and Italian
 - 🛠️ **Tool Calling** - Intelligent query routing to parliamentary data APIs
 - 🔒 **Rate Limited** - 50 requests per day per user for cost control
-- ⚡ **ZeroGPU** - FREE GPU inference for PRO users
+- ⚡ **HF Inference Providers** - Fast inference via Novita and other providers
 
 ## Space Settings Required
 
-**IMPORTANT:** To run this Space, you need to configure the following in your HuggingFace Space settings:
+**IMPORTANT:** To run this Space, configure the following:
 
-### 1. Hardware Selection
-- Go to **Settings** → **Hardware**
-- Select **ZeroGPU** (FREE for PRO users)
-- Save changes
+### Environment Variables
+- **Required:** `HF_TOKEN` - Your HuggingFace token with Inference Provider access
+- Add this in Space Settings → Repository secrets
 
-### 2. Environment Variables (Optional)
-If you want to use the OpenParlData API when it's available:
-- Add `HF_TOKEN` with your HuggingFace token
+### Hardware
+- **CPU Basic** (Free) - Sufficient since inference happens via API
 
 ## Usage
 
@@ -44,15 +42,18 @@ Simply ask questions about Swiss parliamentary data in natural language:
 
 ## Architecture
 
-- **Model:** microsoft/Phi-3-mini-4k-instruct (3.8B params)
-- **GPU:** ZeroGPU (H200) with dynamic allocation
-- **Framework:** Gradio + Transformers + PyTorch
+- **Model:** meta-llama/Llama-3.1-8B-Instruct (8B params)
+- **Inference:** HuggingFace Inference Providers (Novita, etc.)
+- **Framework:** Gradio + HuggingFace Hub
 - **MCP Integration:** OpenParlData server for parliamentary data
 
 ## Cost
 
-- **HF PRO:** $9/month (required for ZeroGPU)
-- **Inference:** FREE (included with PRO subscription)
-- **Total:** $9/month for unlimited usage within ZeroGPU quotas
+- **HF PRO:** $9/month (recommended)
+- **Inference:** $2/month included credits + pay-per-use
+- **Estimated Total:** ~$12/month for typical usage (1,500 requests/month)
+- **Space Hardware:** FREE (CPU Basic)
+
+With 50 requests/day limit, costs stay predictable and affordable.
 
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
app.py CHANGED
@@ -1,63 +1,30 @@
 """
 CoJournalist Data - Swiss Parliamentary Data Chatbot
-Powered by Phi-3-mini and OpenParlData MCP
+Powered by Llama-3.1-8B-Instruct and OpenParlData MCP
 """
 
 import os
 import json
 import gradio as gr
+from huggingface_hub import InferenceClient
 from dotenv import load_dotenv
 from mcp_integration import execute_mcp_query, OpenParlDataClient
 import asyncio
 from usage_tracker import UsageTracker
-import torch
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-# Import spaces only if available (for HuggingFace Spaces)
-try:
-    import spaces
-    SPACES_AVAILABLE = True
-except ImportError:
-    SPACES_AVAILABLE = False
-    print("Running locally without ZeroGPU support")
 
 # Load environment variables
 load_dotenv()
 
+# Initialize Hugging Face Inference Client
+HF_TOKEN = os.getenv("HF_TOKEN")
+if not HF_TOKEN:
+    print("Warning: HF_TOKEN not found. Please set it in .env file or Hugging Face Space secrets.")
+
+client = InferenceClient(token=HF_TOKEN)
+
 # Initialize usage tracker with 50 requests per day limit
 tracker = UsageTracker(daily_limit=50)
 
-# Initialize model and tokenizer
-MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
-print(f"Loading model: {MODEL_NAME}")
-tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
-
-# Detect device (MPS for Mac, CUDA for GPU, CPU fallback)
-if torch.cuda.is_available():
-    device = "cuda"
-    dtype = torch.float16
-elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
-    device = "mps"
-    dtype = torch.float16
-else:
-    device = "cpu"
-    dtype = torch.float32
-
-print(f"Using device: {device}")
-
-model = AutoModelForCausalLM.from_pretrained(
-    MODEL_NAME,
-    torch_dtype=dtype,
-    device_map=device if device != "mps" else None,
-    trust_remote_code=True
-)
-
-# Move to MPS if needed
-if device == "mps":
-    model = model.to(device)
-
-print(f"Model loaded successfully on {device}!")
-
 # Available languages
 LANGUAGES = {
     "English": "en",
@@ -66,7 +33,7 @@ LANGUAGES = {
     "Italiano": "it"
 }
 
-# System prompt optimized for Phi-3-mini-4k-instruct
+# System prompt for Llama-3.1-8B-Instruct
 SYSTEM_PROMPT = """You are a helpful assistant that helps users query Swiss parliamentary data.
 
 You have access to the following tools from the OpenParlData MCP server:
@@ -78,18 +45,29 @@ You have access to the following tools from the OpenParlData MCP server:
    Parameters: person_id, include_votes, include_motions, language
 
 3. **openparldata_search_votes** - Search parliamentary votes
-   Parameters: query (title/description), date_from (YYYY-MM-DD), date_to, vote_type, language, limit
+   Parameters:
+   - query (title/description)
+   - date_from (YYYY-MM-DD format, e.g., "2024-01-01")
+   - date_to (YYYY-MM-DD format, e.g., "2024-12-31" - NEVER use "now", always use actual date)
+   - vote_type (must be "final", "detail", or "overall")
+   - language, limit
 
 4. **openparldata_get_vote_details** - Get detailed vote information
    Parameters: vote_id, include_individual_votes, language
 
 5. **openparldata_search_motions** - Search motions and proposals
-   Parameters: query, status, date_from, date_to, submitter_id, language, limit
+   Parameters: query, status, date_from (YYYY-MM-DD), date_to (YYYY-MM-DD), submitter_id, language, limit
 
 6. **openparldata_search_debates** - Search debate transcripts
-   Parameters: query, date_from, date_to, speaker_id, language, limit
+   Parameters: query, date_from (YYYY-MM-DD), date_to (YYYY-MM-DD), speaker_id, language, limit
 
-IMPORTANT: Your response MUST be valid JSON only. Do not include any explanatory text before or after the JSON. Do not wrap your response in code blocks or markdown formatting.
+CRITICAL RULES:
+- All dates MUST be in YYYY-MM-DD format (e.g., "2024-12-31")
+- NEVER use "now", "today", or relative dates - always use actual YYYY-MM-DD dates
+- For "latest" queries, use date_from with a recent date like "2024-01-01" and NO date_to parameter
+- vote_type must ONLY be "final", "detail", or "overall" - no other values
+- Your response MUST be valid JSON only
+- Do NOT include explanatory text or markdown formatting
 
 When a user asks a question about Swiss parliamentary data:
 1. Analyze what information they need
@@ -155,55 +133,37 @@ EXAMPLES = {
 }
 
 
-def query_model_impl(message: str, language: str = "en") -> dict:
-    """Query Phi-3-mini model to interpret user intent and determine tool calls."""
+async def query_model_async(message: str, language: str = "en") -> dict:
+    """Query Llama-3.1-8B model via Inference Providers to interpret user intent and determine tool calls."""
 
     try:
-        # Format prompt for Phi-3
-        prompt = f"""<|system|>
-{SYSTEM_PROMPT}<|end|>
-<|user|>
-Language: {language}
-Question: {message}<|end|>
-<|assistant|>
-"""
-
-        # Tokenize and generate
-        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=3072)
-        inputs = {k: v.to(model.device) for k, v in inputs.items()}
-
-        with torch.no_grad():
-            outputs = model.generate(
-                **inputs,
-                max_new_tokens=500,
-                temperature=0.3,
-                do_sample=True,
-                pad_token_id=tokenizer.eos_token_id
-            )
-
-        # Decode response
-        full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
-
-        # Extract only the assistant's response (after the last <|assistant|>)
-        if "<|assistant|>" in full_response:
-            assistant_message = full_response.split("<|assistant|>")[-1].strip()
-        else:
-            assistant_message = full_response.strip()
+        # Create messages for chat completion
+        messages = [
+            {"role": "system", "content": SYSTEM_PROMPT},
+            {"role": "user", "content": f"Language: {language}\nQuestion: {message}"}
+        ]
+
+        # Call Llama-3.1-8B via HuggingFace Inference Providers
+        response = client.chat_completion(
+            model="meta-llama/Llama-3.1-8B-Instruct",
+            messages=messages,
+            max_tokens=500,
+            temperature=0.3
+        )
+
+        # Extract response
+        assistant_message = response.choices[0].message.content
 
         # Try to parse as JSON
        try:
-            # Clean up response - enhanced for Phi-3 model
+            # Clean up response (sometimes models add markdown code blocks)
            clean_response = assistant_message.strip()
-
-            # Remove markdown code blocks
            if clean_response.startswith("```json"):
                clean_response = clean_response[7:]
-            elif clean_response.startswith("```"):
+            if clean_response.startswith("```"):
                clean_response = clean_response[3:]
-
            if clean_response.endswith("```"):
                clean_response = clean_response[:-3]
-
            clean_response = clean_response.strip()

            # Find first { or [ (start of JSON) to handle explanatory text
@@ -223,11 +183,9 @@ Question: {message}<|end|>
        return {"error": f"Error querying model: {str(e)}"}
 
 
-# Apply ZeroGPU decorator only when running on HuggingFace Spaces
-if SPACES_AVAILABLE:
-    query_model = spaces.GPU(duration=60)(query_model_impl)
-else:
-    query_model = query_model_impl
+def query_model(message: str, language: str = "en") -> dict:
+    """Synchronous wrapper for async model query."""
+    return asyncio.run(query_model_async(message, language))
 
 
 async def execute_tool_async(tool_name: str, arguments: dict, show_debug: bool) -> tuple:
@@ -417,9 +375,9 @@ with gr.Blocks(css=custom_css, title="CoJournalist Data") as demo:
    **Note:** This app uses the OpenParlData MCP server to access Swiss parliamentary data.
    Currently returning mock data while the OpenParlData API is in development.
 
-    **Rate Limit:** 50 requests per day per user to keep the service free and accessible.
+    **Rate Limit:** 50 requests per day per user to keep the service affordable and accessible.
 
-    Powered by [Phi-3-mini](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) on ZeroGPU and [Model Context Protocol (MCP)](https://modelcontextprotocol.io/)
+    Powered by [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) via HF Inference Providers and [Model Context Protocol (MCP)](https://modelcontextprotocol.io/)
    """
    )
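
For illustration, here is what a routed query could look like end to end under the new JSON-only contract. The "tool" and "arguments" key names are hypothetical (the real response schema lives in app.py's EXAMPLES block, which this commit does not touch); the date format and vote_type value follow the CRITICAL RULES added above.

# Hypothetical usage sketch - key names are illustrative, not confirmed by this diff
result = query_model("Which final votes on climate policy happened since January 2024?", language="en")
# Plausible parsed shape, per the CRITICAL RULES (date_from in YYYY-MM-DD, no date_to for "latest"):
# {"tool": "openparldata_search_votes",
#  "arguments": {"query": "climate policy", "date_from": "2024-01-01",
#                "vote_type": "final", "language": "en", "limit": 10}}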
 
requirements.txt CHANGED
@@ -6,12 +6,6 @@ gradio>=5.49.1
 
 # Hugging Face
 huggingface-hub>=0.22.0
-transformers>=4.40.0
-torch>=2.0.0
-accelerate>=0.20.0
-
-# ZeroGPU support (required for HuggingFace Spaces deployment)
-spaces>=0.28.0
 
 # MCP Support
 mcp>=0.1.0