# HuggingFace Spaces Deployment Guide
## Overview
This application is configured to run on **HuggingFace Spaces** using local model inference (no external API calls required).
---
## Quick Setup
### 1. Create a New Space
1. Go to https://huggingface.co/new-space
2. Choose **Gradio** as the SDK
3. Select **GPU** hardware (T4 or better recommended)
4. Name your Space (e.g., `transcriptor-ai`)
### 2. Upload Your Code
Upload all files from this directory to your Space, or connect a Git repository.
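If you prefer to script these steps, a minimal sketch using the official `huggingface_hub` client is shown below. The Space name `your-username/transcriptor-ai` and the `t4-small` hardware tier are placeholders; adjust them to your account and plan.
```python
# pip install huggingface_hub
from huggingface_hub import HfApi

api = HfApi()  # assumes you have run `huggingface-cli login`

# Create a Gradio Space (skip this call if you created it in the web UI)
api.create_repo(
    repo_id="your-username/transcriptor-ai",  # placeholder
    repo_type="space",
    space_sdk="gradio",
    space_hardware="t4-small",  # optional; GPU tiers require a paid plan
    exist_ok=True,
)

# Upload everything in the current directory to the Space
api.upload_folder(
    folder_path=".",
    repo_id="your-username/transcriptor-ai",
    repo_type="space",
)
```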
### 3. Configure Space Settings (Optional)
Go to **Settings β†’ Variables** in your Space and add:
| Variable | Value | Description |
|----------|-------|-------------|
| `DEBUG_MODE` | `True` or `False` | Enable detailed logging |
| `LLM_TEMPERATURE` | `0.7` | Sampling temperature (0.0-1.0; higher is more creative) |
| `LLM_TIMEOUT` | `120` | Timeout in seconds |
| `LOCAL_MODEL` | `microsoft/Phi-3-mini-4k-instruct` | Model to use |
**Note:** All settings have sensible defaults; you only need to set these if you want to customize behavior.
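Spaces Variables are exposed to the app as ordinary environment variables. A minimal sketch of reading the settings above with the documented defaults (the exact parsing in `app.py`/`llm.py` may differ):
```python
import os

# Spaces Variables arrive as strings, so cast them to the expected types
DEBUG_MODE = os.getenv("DEBUG_MODE", "False").lower() == "true"
LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0.7"))
LLM_TIMEOUT = int(os.getenv("LLM_TIMEOUT", "120"))
LOCAL_MODEL = os.getenv("LOCAL_MODEL", "microsoft/Phi-3-mini-4k-instruct")
```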
---
## Hardware Requirements
### Recommended: GPU (T4 or better)
- **Phi-3-mini-4k-instruct**: 3.8B params, ~8GB GPU RAM
- Processing speed: ~30-60 seconds per transcript chunk
- **Best for:** Production use with multiple users
### Alternative: CPU (not recommended)
- Works, but is very slow (5-10 minutes per chunk)
- Only suitable for testing
---
## Supported Models
You can change the model by setting the `LOCAL_MODEL` variable:
### Small & Fast (Recommended for Free Tier)
```
LOCAL_MODEL=microsoft/Phi-3-mini-4k-instruct   # Default - 3.8B params
```
### Medium (Better quality, needs more GPU)
```
LOCAL_MODEL=mistralai/Mistral-7B-Instruct-v0.3   # 7B params
```
### Alternatives
```
LOCAL_MODEL=HuggingFaceH4/zephyr-7b-beta         # 7B params, good instruction following
LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0   # 1.1B params, very fast but lower quality
```
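Whichever model you choose, it is loaded through the `transformers` library. A minimal sketch of a text-generation pipeline that honors `LOCAL_MODEL` (the shipped loading code lives in `llm.py` and may differ in detail; `device_map="auto"` needs the `accelerate` package):
```python
import os
from transformers import pipeline

model_id = os.getenv("LOCAL_MODEL", "microsoft/Phi-3-mini-4k-instruct")

# device_map="auto" places the weights on GPU when one is available;
# torch_dtype="auto" picks fp16/bf16 on GPU to roughly halve memory use
generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

result = generator("Summarize this transcript:", max_new_tokens=100)
print(result[0]["generated_text"])
```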
---
## Configuration Files
### ✅ Required Files
- `app.py` - Main application
- `requirements.txt` - Python dependencies
- `llm.py`, `extractors.py`, etc. - Core modules
### ⚠️ NOT Needed for Spaces
- `.env` file - Use Spaces Variables instead
- Local database files
- API keys (unless using external APIs)
---
## Environment Configuration
The app automatically detects whether it is running on HuggingFace Spaces and defaults to local model inference.
**Default Configuration (no .env needed):**
```python
USE_HF_API = False # Don't use HF Inference API
USE_LMSTUDIO = False # Don't use LM Studio
LLM_BACKEND = "local" # Use local transformers
DEBUG_MODE = False # Disable debug logs
```
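One common way to detect the Spaces runtime is the `SPACE_ID` environment variable that Spaces sets automatically. A minimal sketch of such a check (the detection logic actually used by the app may differ):
```python
import os

# SPACE_ID is set by the Spaces runtime, e.g. "username/transcriptor-ai"
RUNNING_ON_SPACES = os.getenv("SPACE_ID") is not None

if RUNNING_ON_SPACES:
    LLM_BACKEND = "local"  # default to local transformers inference
```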
**To override:** Set Spaces Variables (Settings β†’ Variables)
---
## Troubleshooting
### Issue: "Out of Memory" Error
**Solution:** Switch to a smaller model
```
LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0
```
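Alternatively, GPU memory use can often be reduced by loading the model in 4-bit precision. This is not wired into the app by default; the sketch below shows the general `bitsandbytes` pattern and assumes the `bitsandbytes` package is added to `requirements.txt`:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"

# 4-bit quantization cuts the weight footprint to roughly a quarter
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```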
### Issue: Very Slow Processing
**Solution:**
1. Make sure you selected **GPU** hardware (not CPU)
2. Check Space logs for the "Model loaded on cuda" confirmation (a quick check is sketched after this list)
3. If on CPU, upgrade to GPU tier
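A quick way to confirm which device the model will land on is to log CUDA availability at startup; a minimal sketch (not part of the shipped code):
```python
import torch

# Print once at startup so the Space logs show which device is in use
if torch.cuda.is_available():
    print(f"CUDA available: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA not available - running on CPU (expect slow inference)")
```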
### Issue: Quality Score 0.00
**Causes:**
1. Model not loaded properly (check logs for "[Local Model] Loading...")
2. GPU out of memory (model falls back to CPU)
3. Timeout too short (increase `LLM_TIMEOUT`)
**Debug Steps:**
1. Set `DEBUG_MODE=True` in Spaces Variables
2. Check logs for detailed error messages
3. Look for "[Local Model] ✅ Generated X characters"
### Issue: Model Downloads Every Time
**Solution:** HuggingFace Spaces caches models automatically after the first download; only that first load takes 2-5 minutes (an optional pre-download sketch follows the notes below).
- Subsequent starts are faster (~30 seconds)
- Don't restart Space unnecessarily
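If you want the weights fetched before the first user request, you can optionally pre-download them at startup. A minimal sketch using `huggingface_hub` (not required for caching to work):
```python
from huggingface_hub import snapshot_download

# Downloads the model weights, or reuses the cached copy if already present
snapshot_download(repo_id="microsoft/Phi-3-mini-4k-instruct")
```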
---
## Performance Optimization
### 1. Reduce Context Window
Edit `llm.py` line 399:
```python
max_length=2000 # Reduce from 3500 for faster processing
```
### 2. Lower Token Limit
Set Spaces Variable:
```
MAX_TOKENS_PER_REQUEST=800 # Default is 1500
```
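The limit is typically passed to the generation call as `max_new_tokens`. A minimal sketch of how the variable might be applied (the exact wiring in `llm.py` may differ):
```python
import os
from transformers import pipeline

max_tokens = int(os.getenv("MAX_TOKENS_PER_REQUEST", "1500"))
generator = pipeline("text-generation", model=os.getenv("LOCAL_MODEL", "microsoft/Phi-3-mini-4k-instruct"))

# Cap how many tokens each request may generate
output = generator("Summarize this transcript:", max_new_tokens=max_tokens)
```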
### 3. Use Smaller Model
```
LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0
```
### 4. Disable Debug Mode
```
DEBUG_MODE=False
```
---
## Monitoring
### View Logs
1. Go to your Space
2. Click the **Logs** tab at the top
3. Look for startup messages:
```
✅ Configuration loaded for HuggingFace Spaces
🚀 TranscriptorAI Enterprise - LLM Backend: local
[Local Model] Loading microsoft/Phi-3-mini-4k-instruct...
[Local Model] ✅ Model loaded on cuda:0
```
### Check Processing
During analysis, you should see:
```
[Local Model] Generating (1500 max tokens, temp=0.7)...
[Local Model] ✅ Generated 1247 characters
[LLM Debug] ✅ Successfully extracted JSON with 7 fields
```
---
## Cost Estimation
### Free Tier (CPU)
- ⚠️ Very slow but free
- ~5-10 minutes per transcript
### GPU (T4) - ~$0.60/hour
- ⚡ Fast processing
- ~30-60 seconds per transcript
- Space sleeps after inactivity (saves money)
### Persistent GPU (Upgraded)
- Always-on for instant access
- Higher cost but best user experience
---
## Security Notes
1. **No API Keys Needed:** Everything runs locally
2. **Private Processing:** Data never leaves your Space
3. **Secrets Management:** Use Spaces Secrets (not Variables) for sensitive data
4. **Model Access:** Phi-3 and most models don't require gated access
---
## Next Steps
1. ✅ Upload code to your Space
2. ✅ Select GPU hardware
3. ✅ Wait for first model download (~2-5 min)
4. ✅ Test with a sample transcript
5. 🎉 Share your Space URL!
---
## Support
- **HuggingFace Spaces Docs:** https://huggingface.co/docs/hub/spaces
- **Transformers Docs:** https://huggingface.co/docs/transformers
- **GPU Pricing:** https://huggingface.co/pricing
---
**Last Updated:** October 2025