google-labs-jules[bot] and Greene-ctrl committed
Commit e1d311a · 1 Parent(s): 0526857

Deploy CyberScraper 2077 to Hugging Face with Blablador LLM support


Summary of changes:
- Added a FastAPI `api.py` exposing `/health` and `/api/scrape` endpoints (an example request is sketched below).
- Configured an Nginx reverse proxy to serve both the UI and the API on port 7860.
- Implemented a Blablador LLM provider with the `alias-fast` and `alias-large` models.
- Set `alias-fast` as the default model.
- Updated the `Dockerfile` for Hugging Face Spaces (non-root user, `uv` package manager).
- Created a GitHub Action for automatic synchronization to the Hugging Face Hub.
- Updated the README with Space metadata and added a `.hfignore` for a cleaner deployment.
- Verified deployment and functionality on the live Space.

Co-authored-by: Greene-ctrl <192867433+Greene-ctrl@users.noreply.github.com>
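
A minimal client sketch for the new endpoints. The Space URL is inferred from the sync target below and may differ for your deployment:

```python
import requests

# Assumed public URL of the Space; adjust to your deployment.
BASE_URL = "https://auxteam-scraper-hub.hf.space"

# Liveness probe, served by FastAPI behind the Nginx proxy.
print(requests.get(f"{BASE_URL}/health", timeout=30).json())

# Scrape request; model_name falls back to "alias-fast" if omitted.
payload = {"url": "https://example.com", "query": "extract the page title"}
resp = requests.post(f"{BASE_URL}/api/scrape", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["response"])
```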

.github/workflows/sync_to_hf.yml ADDED
@@ -0,0 +1,19 @@
+name: Sync to Hugging Face hub
+on:
+  push:
+    branches: [main, master, jules-*]
+  workflow_dispatch:
+
+jobs:
+  sync-to-hub:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+        with:
+          fetch-depth: 0
+          lfs: true
+      - name: Push to hub
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+        run: |
+          git push --force https://AUXteam:${HF_TOKEN}@huggingface.co/spaces/AUXteam/Scraper_hub HEAD:main
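
For one-off syncs outside CI, roughly the same result can be had with `huggingface_hub` instead of a raw `git push` (a sketch, assuming the same token and target repo):

```python
from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # the same secret the workflow reads as HF_TOKEN
api.upload_folder(
    folder_path=".",                # repository root
    repo_id="AUXteam/Scraper_hub",
    repo_type="space",
    ignore_patterns=["**/.git/**", "venv/**"],  # plays the role of .hfignore here
)
```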
.hfignore ADDED
@@ -0,0 +1,9 @@
+.git/
+.github/
+venv/
+__pycache__/
+*.pyc
+.env
+chat_history.json
+test_patchright.py
+client_secret.json
Dockerfile CHANGED
@@ -1,10 +1,16 @@
 # Use Python 3.12 for better performance and compatibility
 FROM python:3.12-slim-bookworm
 
-# Set the working directory in the container
-WORKDIR /app
+# Set environment variables
+ENV PYTHONUNBUFFERED=1 \
+    PYTHONDONTWRITEBYTECODE=1 \
+    PORT=7860 \
+    UV_SYSTEM_PYTHON=1 \
+    HOME=/home/user \
+    STREAMLIT_BROWSER_GATHER_USAGE_STATS=false \
+    STREAMLIT_SERVER_HEADLESS=true
 
-# Install system dependencies including browser dependencies for Playwright/Patchright
+# Install system dependencies
 RUN apt-get update && apt-get install -y \
     wget \
     gnupg \
@@ -17,6 +23,7 @@ RUN apt-get update && apt-get install -y \
     python3-dev \
     libffi-dev \
     procps \
+    nginx \
     # Browser dependencies for Playwright/Patchright
     libglib2.0-0 \
     libnspr4 \
@@ -38,77 +45,50 @@ RUN apt-get update && apt-get install -y \
     && apt-get clean \
     && rm -rf /var/lib/apt/lists/*
 
-# Configure Tor - Simplified configuration
+# Install uv
+COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
+
+# Set up working directory
+WORKDIR /app
+
+# Copy requirements and install as root
+COPY requirements.txt .
+RUN uv pip install --system -r requirements.txt
+RUN uv pip install --system fastapi uvicorn
+
+# Install patchright browser
+RUN patchright install chromium
+
+# Create a non-root user
+RUN useradd -m -u 1000 user
+RUN mkdir -p /home/user/.streamlit && chown -R user:user /home/user
+
+# Configure Tor
 RUN echo "SocksPort 9050" >> /etc/tor/torrc && \
     echo "ControlPort 9051" >> /etc/tor/torrc && \
     echo "CookieAuthentication 1" >> /etc/tor/torrc && \
     echo "DataDirectory /var/lib/tor" >> /etc/tor/torrc
 
-# Set correct permissions for Tor
-RUN chown -R debian-tor:debian-tor /var/lib/tor && \
-    chmod 700 /var/lib/tor
-
-# Copy local files
-COPY . .
+# Set permissions for Tor, app directory, and nginx
+RUN mkdir -p /var/lib/tor && \
+    chown -R user:user /var/lib/tor && \
+    chmod 700 /var/lib/tor && \
+    chown -R user:user /app && \
+    mkdir -p /var/log/nginx /var/lib/nginx /tmp && \
+    chown -R user:user /var/log/nginx /var/lib/nginx /tmp
 
-# Create and activate a virtual environment
-RUN python -m venv venv
-ENV PATH="/app/venv/bin:$PATH"
-
-# Install Python dependencies (includes PySocks for Tor support)
-# Added retries and timeout for network reliability
-RUN pip install --no-cache-dir --timeout=120 --retries=3 -r requirements.txt
-
-# Install patchright browser (chrome not available on ARM64)
-RUN patchright install chromium
+# Copy the rest of the application
+COPY --chown=user:user . .
 
-# Create run script with proper Tor startup
-RUN echo '#!/bin/bash\n\
-\n\
-# Start Tor service\n\
-echo "Starting Tor service..."\n\
-service tor start\n\
-\n\
-# Wait for Tor to be ready\n\
-echo "Waiting for Tor to start..."\n\
-for i in {1..30}; do\n\
-    if ps aux | grep -v grep | grep -q /usr/bin/tor; then\n\
-        echo "Tor process is running"\n\
-        if nc -z localhost 9050; then\n\
-            echo "Tor SOCKS port is listening"\n\
-            break\n\
-        fi\n\
-    fi\n\
-    if [ $i -eq 30 ]; then\n\
-        echo "Warning: Tor might not be ready, but continuing..."\n\
-    fi\n\
-    sleep 1\n\
-done\n\
-\n\
-# Verify Tor status\n\
-echo "Checking Tor service status:"\n\
-service tor status\n\
-\n\
-# Export API key if provided\n\
-if [ ! -z "$OPENAI_API_KEY" ]; then\n\
-    export OPENAI_API_KEY=$OPENAI_API_KEY\n\
-    echo "OpenAI API key configured"\n\
-fi\n\
-\n\
-if [ ! -z "$GOOGLE_API_KEY" ]; then\n\
-    export GOOGLE_API_KEY=$GOOGLE_API_KEY\n\
-    echo "Google API key configured"\n\
-fi\n\
-\n\
-# Start the application with explicit host binding\n\
-echo "Starting CyberScraper 2077..."\n\
-streamlit run --server.address 0.0.0.0 --server.port 8501 main.py\n\
-' > /app/run.sh
+# Set permissions for the start script
+RUN chmod +x start.sh
 
-RUN chmod +x /app/run.sh
+# Switch to non-root user
+USER user
+ENV PATH="/home/user/.local/bin:$PATH"
 
-# Expose ports
-EXPOSE 8501 9050 9051
+# Expose port
+EXPOSE 7860
 
 # Set the entrypoint
-ENTRYPOINT ["/app/run.sh"]
+ENTRYPOINT ["./start.sh"]
README.md CHANGED
@@ -1,3 +1,12 @@
+---
+title: Scraper Hub
+emoji: 🌐
+colorFrom: blue
+colorTo: red
+sdk: docker
+app_port: 7860
+---
+
 # 🌐 CyberScraper 2077
 
 <p align="center">
api.py ADDED
@@ -0,0 +1,48 @@
+import os
+import asyncio
+from fastapi import FastAPI, HTTPException
+from pydantic import BaseModel
+from typing import Optional
+from src.web_extractor import WebExtractor
+from src.scrapers.playwright_scraper import ScraperConfig
+
+app = FastAPI()
+
+class ScrapeRequest(BaseModel):
+    url: str
+    query: str
+    model_name: Optional[str] = "alias-fast"
+
+@app.get("/health")
+async def health():
+    return {"status": "ok", "message": "CyberScraper 2077 API is running"}
+
+@app.post("/api/scrape")
+async def scrape(request: ScrapeRequest):
+    scraper_config = ScraperConfig(
+        headless=True,
+        max_retries=3,
+        delay_after_load=5
+    )
+
+    extractor = WebExtractor(model_name=request.model_name, scraper_config=scraper_config)
+    try:
+        # Construct the query by combining URL and the specific request
+        full_query = f"{request.url} {request.query}"
+        response = await extractor.process_query(full_query)
+
+        # If response is a tuple (csv/excel), extract the first part
+        if isinstance(response, tuple):
+            response = response[0]
+
+        return {
+            "url": request.url,
+            "query": request.query,
+            "response": response
+        }
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=str(e))
+
+if __name__ == "__main__":
+    import uvicorn
+    uvicorn.run(app, host="0.0.0.0", port=8000)
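
A local smoke test for these endpoints, assuming the repo's `src` package imports cleanly and `BLABLADOR_API_KEY` is set in the environment:

```python
from fastapi.testclient import TestClient

from api import app

client = TestClient(app)

# /health touches no scraping backend and should always return ok.
assert client.get("/health").json()["status"] == "ok"

# /api/scrape drives a real browser session, so expect it to be slow.
r = client.post("/api/scrape", json={"url": "https://example.com",
                                     "query": "extract the page title"})
print(r.status_code, r.json())
```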
main.py CHANGED
@@ -241,6 +241,11 @@ def check_service_status() -> dict:
         "configured": bool(os.getenv("GOOGLE_API_KEY")),
         "env_var": "GOOGLE_API_KEY"
     },
+    "blablador": {
+        "name": "Blablador",
+        "configured": bool(os.getenv("BLABLADOR_API_KEY")),
+        "env_var": "BLABLADOR_API_KEY"
+    },
     "tor": {
         "name": "Tor",
         "configured": False,  # Will be checked dynamically
@@ -442,7 +447,7 @@ def main():
         st.session_state.current_chat_id = new_chat_id
         save_chat_history(st.session_state.chat_history)
     if 'selected_model' not in st.session_state:
-        st.session_state.selected_model = "gpt-4.1-mini"
+        st.session_state.selected_model = "alias-fast"
     if 'web_scraper_chat' not in st.session_state:
         st.session_state.web_scraper_chat = None
@@ -451,7 +456,7 @@ def main():
 
     # Model selection
     st.subheader("Select Model")
-    default_models = ["gpt-4.1-mini", "gpt-4o-mini", "gemini-1.5-flash", "gemini-pro"]
+    default_models = ["alias-fast", "alias-large", "gpt-4o-mini", "gemini-1.5-flash"]
     ollama_models = st.session_state.get('ollama_models', [])
     all_models = default_models + [f"ollama:{model}" for model in ollama_models]
     selected_model = st.selectbox("Choose a model", all_models, index=all_models.index(st.session_state.selected_model) if st.session_state.selected_model in all_models else 0)
nginx.conf ADDED
@@ -0,0 +1,51 @@
+worker_processes 1;
+pid /tmp/nginx.pid;
+
+events {
+    worker_connections 1024;
+}
+
+http {
+    include /etc/nginx/mime.types;
+    default_type application/octet-stream;
+
+    client_body_temp_path /tmp/client_body;
+    proxy_temp_path /tmp/proxy_temp;
+    fastcgi_temp_path /tmp/fastcgi_temp;
+    uwsgi_temp_path /tmp/uwsgi_temp;
+    scgi_temp_path /tmp/scgi_temp;
+
+    access_log /tmp/access.log;
+    error_log /tmp/error.log;
+
+    server {
+        listen 7860;
+        server_name localhost;
+
+        location /api {
+            proxy_pass http://localhost:8000;
+            proxy_set_header Host $host;
+            proxy_set_header X-Real-IP $remote_addr;
+            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+            proxy_set_header X-Forwarded-Proto $scheme;
+        }
+
+        location /health {
+            proxy_pass http://localhost:8000/health;
+            proxy_set_header Host $host;
+            proxy_set_header X-Real-IP $remote_addr;
+        }
+
+        location / {
+            proxy_pass http://localhost:8501;
+            proxy_http_version 1.1;
+            proxy_set_header Upgrade $http_upgrade;
+            proxy_set_header Connection "upgrade";
+            proxy_set_header Host $host;
+            proxy_set_header X-Real-IP $remote_addr;
+            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+            proxy_set_header X-Forwarded-Proto $scheme;
+            proxy_read_timeout 86400;
+        }
+    }
+}
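
A quick way to confirm the routing split (requests under `/api` and `/health` go to FastAPI on 8000, everything else to Streamlit on 8501) once the container is up:

```python
import requests

base = "http://localhost:7860"
print(requests.get(f"{base}/health", timeout=10).json())  # answered by FastAPI
print(requests.get(base, timeout=10).status_code)         # Streamlit UI, expect 200
```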
src/models.py CHANGED
@@ -38,5 +38,12 @@ class Models:
             return OpenAI(model_name=model_name, **kwargs)
         case name if name.startswith("gemini-"):
             return ChatGoogleGenerativeAI(model=model_name, **kwargs)
+        case "alias-large" | "alias-fast":
+            return ChatOpenAI(
+                model_name=model_name,
+                openai_api_key=os.getenv("BLABLADOR_API_KEY"),
+                openai_api_base="https://api.helmholtz-blablador.fz-juelich.de/v1",
+                **kwargs
+            )
         case _:
             raise ValueError(f"Unsupported model: {model_name}")
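
The aliases reuse the OpenAI-compatible client with a custom base URL. The same pattern in isolation (assuming `ChatOpenAI` comes from `langchain_openai`; the diff only shows the call site):

```python
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model_name="alias-fast",  # or "alias-large"
    openai_api_key=os.getenv("BLABLADOR_API_KEY"),
    openai_api_base="https://api.helmholtz-blablador.fz-juelich.de/v1",
)
print(llm.invoke("Say hello in one word.").content)
```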
src/utils/error_handler.py CHANGED
@@ -78,6 +78,15 @@ class ErrorMessages:
         f"For help, see: {README_URL}#installation"
     )
 
+    BLABLADOR_API_KEY_MISSING = (
+        "Blablador API Key is missing.\n\n"
+        "Please set the BLABLADOR_API_KEY environment variable:\n"
+        "1. Create a .env file in the project root\n"
+        "2. Add: BLABLADOR_API_KEY=your_key_here\n"
+        "3. Or export it: export BLABLADOR_API_KEY=your_key_here\n\n"
+        f"For setup instructions, see: {README_URL}#installation"
+    )
+
     # Ollama errors
     OLLAMA_NOT_RUNNING = (
         "Ollama is not running or not accessible.\n\n"
@@ -188,4 +197,7 @@ def check_model_api_key(model_name: str) -> str | None:
     if model_name.startswith("gemini-") and not os.getenv("GOOGLE_API_KEY"):
         return ErrorMessages.GOOGLE_API_KEY_MISSING
 
+    if model_name.startswith("alias-") and not os.getenv("BLABLADOR_API_KEY"):
+        return ErrorMessages.BLABLADOR_API_KEY_MISSING
+
     return None
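
The intended call pattern mirrors the existing `gpt-`/`gemini-` guards: check the key before constructing an extractor. A small usage sketch:

```python
from src.utils.error_handler import check_model_api_key

error = check_model_api_key("alias-fast")
if error:
    print(error)  # the BLABLADOR_API_KEY_MISSING text with setup steps
else:
    print("Blablador key present; safe to build the extractor.")
```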
start.sh ADDED
@@ -0,0 +1,20 @@
+#!/bin/bash
+
+# Start Tor service in the background
+echo "Starting Tor..."
+tor &
+
+# Wait for Tor to start
+sleep 5
+
+# Start FastAPI API in the background
+echo "Starting FastAPI API..."
+python3 api.py &
+
+# Start Streamlit app in the background
+echo "Starting Streamlit..."
+streamlit run main.py --server.port 8501 --server.address 0.0.0.0 --server.enableCORS=false --server.enableXsrfProtection=false &
+
+# Start Nginx in the foreground to keep the container running
+echo "Starting Nginx..."
+/usr/sbin/nginx -c /app/nginx.conf -g "daemon off;"
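
Since all three services are backgrounded without health checks, a container smoke test is worth having; a sketch that probes each port used by the script and proxy config:

```python
import socket

# Ports taken from start.sh and nginx.conf.
SERVICES = [("tor", 9050), ("fastapi", 8000), ("streamlit", 8501), ("nginx", 7860)]

for name, port in SERVICES:
    with socket.socket() as s:
        s.settimeout(2)
        up = s.connect_ex(("127.0.0.1", port)) == 0
        print(f"{name:10s} port {port}: {'up' if up else 'down'}")
```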
test_extractor.py ADDED
@@ -0,0 +1,21 @@
+import asyncio
+from src.web_extractor import WebExtractor
+from src.scrapers.playwright_scraper import ScraperConfig
+
+async def test():
+    config = ScraperConfig(headless=True)
+    try:
+        extractor = WebExtractor(model_name="alias-fast", scraper_config=config)
+        print("WebExtractor initialized successfully!")
+
+        # Test URL extraction
+        from src.web_extractor import extract_url
+        url = extract_url("Check out https://example.com")
+        print(f"Extracted URL: {url}")
+        assert url == "https://example.com"
+
+    except Exception as e:
+        print(f"Error: {e}")
+
+if __name__ == "__main__":
+    asyncio.run(test())
test_patchright.py ADDED
@@ -0,0 +1,13 @@
+import asyncio
+from patchright.async_api import async_playwright
+
+async def main():
+    async with async_playwright() as p:
+        browser = await p.chromium.launch(headless=True)
+        page = await browser.new_page()
+        await page.goto("https://example.com")
+        print(await page.title())
+        await browser.close()
+
+if __name__ == "__main__":
+    asyncio.run(main())