# HuggingFace Spaces Deployment Guide
## Overview
This application is configured to run on **HuggingFace Spaces** using local model inference (no external API calls required).
---
## Quick Setup
### 1. Create a New Space
1. Go to https://huggingface.co/new-space
2. Choose **Gradio** as the SDK
3. Select **GPU** hardware (T4 or better recommended)
4. Name your Space (e.g., `transcriptor-ai`)
### 2. Upload Your Code
Upload all files from this directory to your Space, or connect a Git repository.
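If you prefer to script these steps, a minimal sketch using the official `huggingface_hub` client is shown below. The Space name `your-username/transcriptor-ai` and the `t4-small` hardware tier are placeholders; adjust them to your account and plan.
```python
# pip install huggingface_hub
from huggingface_hub import HfApi

api = HfApi()  # assumes you have run `huggingface-cli login`

# Create a Gradio Space (skip this call if you created it in the web UI)
api.create_repo(
    repo_id="your-username/transcriptor-ai",  # placeholder
    repo_type="space",
    space_sdk="gradio",
    space_hardware="t4-small",  # optional; GPU tiers require a paid plan
    exist_ok=True,
)

# Upload everything in the current directory to the Space
api.upload_folder(
    folder_path=".",
    repo_id="your-username/transcriptor-ai",
    repo_type="space",
)
```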
### 3. Configure Space Settings (Optional)
Go to **Settings β†’ Variables** in your Space and add:
| Variable | Value | Description |
|----------|-------|-------------|
| `DEBUG_MODE` | `True` or `False` | Enable detailed logging |
| `LLM_TEMPERATURE` | `0.7` | Sampling temperature (0.0-1.0; higher is more creative) |
| `LLM_TIMEOUT` | `120` | Timeout in seconds |
| `LOCAL_MODEL` | `microsoft/Phi-3-mini-4k-instruct` | Model to use |
**Note:** All settings have sensible defaults; you only need to set these if you want to customize behavior.
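Spaces Variables are exposed to the app as ordinary environment variables. A minimal sketch of reading the settings above with the documented defaults (the exact parsing in `app.py`/`llm.py` may differ):
```python
import os

# Spaces Variables arrive as strings, so cast them to the expected types
DEBUG_MODE = os.getenv("DEBUG_MODE", "False").lower() == "true"
LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0.7"))
LLM_TIMEOUT = int(os.getenv("LLM_TIMEOUT", "120"))
LOCAL_MODEL = os.getenv("LOCAL_MODEL", "microsoft/Phi-3-mini-4k-instruct")
```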
---
## Hardware Requirements
### Recommended: GPU (T4 or better)
- **Phi-3-mini-4k-instruct**: 3.8B params, ~8GB GPU RAM
- Processing speed: ~30-60 seconds per transcript chunk
- **Best for:** Production use with multiple users
### Alternative: CPU (not recommended)
- Works, but is very slow (5-10 minutes per chunk)
- Only suitable for testing
---
## Supported Models
You can change the model by setting the `LOCAL_MODEL` variable:
### Small & Fast (Recommended for Free Tier)
```
LOCAL_MODEL=microsoft/Phi-3-mini-4k-instruct   # Default - 3.8B params
```
### Medium (Better quality, needs more GPU)
```
LOCAL_MODEL=mistralai/Mistral-7B-Instruct-v0.3   # 7B params
```
### Alternatives
```
LOCAL_MODEL=HuggingFaceH4/zephyr-7b-beta         # 7B params, good instruction following
LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0   # 1.1B params, very fast but lower quality
```
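Whichever model you choose, it is loaded through the `transformers` library. A minimal sketch of a text-generation pipeline that honors `LOCAL_MODEL` (the shipped loading code lives in `llm.py` and may differ in detail; `device_map="auto"` needs the `accelerate` package):
```python
import os
from transformers import pipeline

model_id = os.getenv("LOCAL_MODEL", "microsoft/Phi-3-mini-4k-instruct")

# device_map="auto" places the weights on GPU when one is available;
# torch_dtype="auto" picks fp16/bf16 on GPU to roughly halve memory use
generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

result = generator("Summarize this transcript:", max_new_tokens=100)
print(result[0]["generated_text"])
```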
---
## Configuration Files
### ✅ Required Files
- `app.py` - Main application
- `requirements.txt` - Python dependencies
- `llm.py`, `extractors.py`, etc. - Core modules
### ⚠️ NOT Needed for Spaces
- `.env` file - Use Spaces Variables instead
- Local database files
- API keys (unless using external APIs)
---
## Environment Configuration
The app automatically detects whether it is running on HuggingFace Spaces and defaults to local model inference.
**Default Configuration (no .env needed):**
```python
USE_HF_API = False # Don't use HF Inference API
USE_LMSTUDIO = False # Don't use LM Studio
LLM_BACKEND = "local" # Use local transformers
DEBUG_MODE = False # Disable debug logs
```
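One common way to detect the Spaces runtime is the `SPACE_ID` environment variable that Spaces sets automatically. A minimal sketch of such a check (the detection logic actually used by the app may differ):
```python
import os

# SPACE_ID is set by the Spaces runtime, e.g. "username/transcriptor-ai"
RUNNING_ON_SPACES = os.getenv("SPACE_ID") is not None

if RUNNING_ON_SPACES:
    LLM_BACKEND = "local"  # default to local transformers inference
```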
**To override:** Set Spaces Variables (Settings β†’ Variables)
---
## Troubleshooting
### Issue: "Out of Memory" Error
**Solution:** Switch to a smaller model
```
LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0
```
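Alternatively, GPU memory use can often be reduced by loading the model in 4-bit precision. This is not wired into the app by default; the sketch below shows the general `bitsandbytes` pattern and assumes the `bitsandbytes` package is added to `requirements.txt`:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"

# 4-bit quantization cuts the weight footprint to roughly a quarter
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```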
### Issue: Very Slow Processing
**Solution:**
1. Make sure you selected **GPU** hardware (not CPU)
2. Check Space logs for the "Model loaded on cuda" confirmation (a quick check is sketched after this list)
3. If on CPU, upgrade to GPU tier
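A quick way to confirm which device the model will land on is to log CUDA availability at startup; a minimal sketch (not part of the shipped code):
```python
import torch

# Print once at startup so the Space logs show which device is in use
if torch.cuda.is_available():
    print(f"CUDA available: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA not available - running on CPU (expect slow inference)")
```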
### Issue: Quality Score 0.00
**Causes:**
1. Model not loaded properly (check logs for "[Local Model] Loading...")
2. GPU out of memory (model falls back to CPU)
3. Timeout too short (increase `LLM_TIMEOUT`)
**Debug Steps:**
1. Set `DEBUG_MODE=True` in Spaces Variables
2. Check logs for detailed error messages
3. Look for "[Local Model] ✅ Generated X characters"
### Issue: Model Downloads Every Time
**Solution:** HuggingFace Spaces caches models automatically after the first download; only that first load takes 2-5 minutes (an optional pre-download sketch follows the notes below).
- Subsequent starts are faster (~30 seconds)
- Don't restart Space unnecessarily
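If you want the weights fetched before the first user request, you can optionally pre-download them at startup. A minimal sketch using `huggingface_hub` (not required for caching to work):
```python
from huggingface_hub import snapshot_download

# Downloads the model weights, or reuses the cached copy if already present
snapshot_download(repo_id="microsoft/Phi-3-mini-4k-instruct")
```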
---
## Performance Optimization
### 1. Reduce Context Window
Edit `llm.py` line 399:
```python
max_length=2000 # Reduce from 3500 for faster processing
```
### 2. Lower Token Limit
Set Spaces Variable:
```
MAX_TOKENS_PER_REQUEST=800 # Default is 1500
```
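The limit is typically passed to the generation call as `max_new_tokens`. A minimal sketch of how the variable might be applied (the exact wiring in `llm.py` may differ):
```python
import os
from transformers import pipeline

max_tokens = int(os.getenv("MAX_TOKENS_PER_REQUEST", "1500"))
generator = pipeline("text-generation", model=os.getenv("LOCAL_MODEL", "microsoft/Phi-3-mini-4k-instruct"))

# Cap how many tokens each request may generate
output = generator("Summarize this transcript:", max_new_tokens=max_tokens)
```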
### 3. Use Smaller Model
```
LOCAL_MODEL=TinyLlama/TinyLlama-1.1B-Chat-v1.0
```
### 4. Disable Debug Mode
```
DEBUG_MODE=False
```
---
## Monitoring
### View Logs
1. Go to your Space
2. Click the **Logs** tab at the top
3. Look for startup messages:
```
✅ Configuration loaded for HuggingFace Spaces
🚀 TranscriptorAI Enterprise - LLM Backend: local
[Local Model] Loading microsoft/Phi-3-mini-4k-instruct...
[Local Model] ✅ Model loaded on cuda:0
```
### Check Processing
During analysis, you should see:
```
[Local Model] Generating (1500 max tokens, temp=0.7)...
[Local Model] ✅ Generated 1247 characters
[LLM Debug] ✅ Successfully extracted JSON with 7 fields
```
---
## Cost Estimation
### Free Tier (CPU)
- ⚠️ Very slow but free
- ~5-10 minutes per transcript
### GPU (T4) - ~$0.60/hour
- ⚡ Fast processing
- ~30-60 seconds per transcript
- Space sleeps after inactivity (saves money)
### Persistent GPU (Upgraded)
- Always-on for instant access
- Higher cost but best user experience
---
## Security Notes
1. **No API Keys Needed:** Everything runs locally
2. **Private Processing:** Data never leaves your Space
3. **Secrets Management:** Use Spaces Secrets (not Variables) for sensitive data
4. **Model Access:** Phi-3 and most models don't require gated access
---
## Next Steps
1. ✅ Upload code to your Space
2. ✅ Select GPU hardware
3. ✅ Wait for first model download (~2-5 min)
4. ✅ Test with a sample transcript
5. 🎉 Share your Space URL!
---
## Support
- **HuggingFace Spaces Docs:** https://huggingface.co/docs/hub/spaces
- **Transformers Docs:** https://huggingface.co/docs/transformers
- **GPU Pricing:** https://huggingface.co/pricing
---
**Last Updated:** October 2025