KiWA001 committed
Commit b88f56b · 1 Parent(s): 80d9d9d

Add SpeechMA TTS provider with 11Labs-compatible API


- Create speechma_tts_provider.py with Playwright automation
- Add 11Labs-compatible TTS endpoints (/v1/text-to-speech/*)
- Implement OCR-based CAPTCHA solving (Tesseract/EasyOCR)
- Support 20+ voices with Ava Multilingual as default
- Add voice effects (pitch, speed, volume) support
- Include test script and documentation

TTS_README.md ADDED
@@ -0,0 +1,322 @@
# SpeechMA TTS Provider - 11Labs Compatible API

This module adds text-to-speech capabilities to the KAI API using [SpeechMA](https://speechma.com) as the backend provider. The API is designed to be compatible with the ElevenLabs API structure.

## Features

- 🎙️ **20+ High-Quality Voices** (Ava, Andrew, Brian, Emma, and more)
- 🔐 **Automatic CAPTCHA Solving** with OCR
- 🌍 **Multilingual Support** (English, Spanish, French, German, Japanese, etc.)
- 📱 **11Labs API Compatible** - drop-in replacement for ElevenLabs
- 🎛️ **Voice Effects** (pitch, speed, volume control)

## Installation

### Required Dependencies

```bash
# Core dependencies (already in your project)
pip install fastapi playwright

# OCR dependencies (for CAPTCHA solving)
pip install pytesseract pillow

# OR use EasyOCR (alternative)
pip install easyocr

# Install Playwright browsers
playwright install chromium
```

### System Dependencies

For **pytesseract**, install Tesseract OCR:

**macOS:**
```bash
brew install tesseract
```

**Ubuntu/Debian:**
```bash
sudo apt-get install tesseract-ocr
```

**Windows:**
Download from: https://github.com/UB-Mannheim/tesseract/wiki

## API Endpoints

### 11Labs-Compatible Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/models` | GET | List available TTS models |
| `/v1/voices` | GET | List all voices |
| `/v1/voices/{voice_id}` | GET | Get voice details |
| `/v1/voices/{voice_id}/settings` | GET | Get voice settings |
| `/v1/text-to-speech/{voice_id}` | POST | Generate speech |
| `/v1/text-to-speech/{voice_id}/stream` | POST | Generate speech (streaming) |
| `/v1/user/subscription` | GET | Get subscription info |

### SpeechMA-Specific Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/tts/speechma` | POST | Direct SpeechMA TTS with custom options |
| `/v1/tts/speechma/voices` | GET | Get all SpeechMA voices |
| `/v1/tts/health` | GET | Check TTS service health |
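
Before generating audio it can help to confirm the service is reachable via `/v1/tts/health`. A minimal stdlib client sketch (the base URL and API key are placeholders; `auth_headers` and `get_json` are illustrative helpers, not part of the module):

```python
import json
import urllib.request


def auth_headers(api_key: str) -> dict:
    """Build the Bearer auth header used by every endpoint."""
    return {"Authorization": f"Bearer {api_key}"}


def get_json(base_url: str, path: str, api_key: str):
    """GET a JSON endpoint such as /v1/tts/health or /v1/voices."""
    req = urllib.request.Request(base_url + path, headers=auth_headers(api_key))
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())


# Example (requires a running server):
# get_json("http://localhost:8000", "/v1/tts/health", "your-api-key")
```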

## Usage Examples

### 1. List Available Voices

```bash
curl -X GET "http://localhost:8000/v1/voices" \
  -H "Authorization: Bearer YOUR_API_KEY"
```

**Response:**
```json
{
  "voices": [
    {
      "voice_id": "ava",
      "name": "Ava Multilingual",
      "category": "premade",
      "labels": {
        "accent": "United States",
        "description": "Female Multilingual voice",
        "gender": "female"
      }
    }
  ]
}
```

### 2. Generate Speech (11Labs Style)

```bash
curl -X POST "http://localhost:8000/v1/text-to-speech/ava" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello! This is a test of the SpeechMA TTS API.",
    "model_id": "eleven_multilingual_v2",
    "voice_settings": {
      "stability": 0.5,
      "similarity_boost": 0.75
    }
  }' \
  --output speech.mp3
```

### 3. Generate Speech (SpeechMA Direct)

```bash
curl -X POST "http://localhost:8000/v1/tts/speechma" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello with custom voice effects!",
    "voice_id": "ava",
    "pitch": 0,
    "speed": 0,
    "volume": 100
  }' \
  --output speech_custom.mp3
```

### 4. Python Client Example

```python
import requests

# Configuration
API_KEY = "your-api-key"
BASE_URL = "http://localhost:8000"

# Generate speech
response = requests.post(
    f"{BASE_URL}/v1/text-to-speech/ava",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": "Hello, world!"}
)

# Save audio
with open("output.mp3", "wb") as f:
    f.write(response.content)

print("Audio saved!")
```

## Available Voices

### Default: Ava Multilingual

The default voice is **Ava Multilingual**, a high-quality female voice with multilingual capabilities.

### All Available Voices

| Voice ID | Name | Gender | Language | Country |
|----------|------|--------|----------|---------|
| `ava` | Ava Multilingual | Female | Multilingual | United States |
| `andrew` | Andrew Multilingual | Male | Multilingual | United States |
| `brian` | Brian Multilingual | Male | Multilingual | United States |
| `emma` | Emma Multilingual | Female | Multilingual | United Kingdom |
| `remy` | Remy Multilingual | Male | Multilingual | France |
| `vivienne` | Vivienne Multilingual | Female | Multilingual | United States |
| `daniel` | Daniel Multilingual | Male | Multilingual | United Kingdom |
| `serena` | Serena Multilingual | Female | Multilingual | United States |
| `matthew` | Matthew Multilingual | Male | Multilingual | United States |
| `jane` | Jane Multilingual | Female | Multilingual | United States |
| `alfonso` | Alfonso Multilingual | Male | Multilingual | Spain |
| `mario` | Mario Multilingual | Male | Multilingual | Italy |
| `klaus` | Klaus Multilingual | Male | Multilingual | Germany |
| `sakura` | Sakura Multilingual | Female | Multilingual | Japan |
| `xin` | Xin Multilingual | Female | Multilingual | China |
| `jose` | Jose Multilingual | Male | Multilingual | Brazil |
| `ines` | Ines Multilingual | Female | Multilingual | Portugal |
| `amira` | Amira Multilingual | Female | Multilingual | Saudi Arabia |
| `fatima` | Fatima Multilingual | Female | Multilingual | UAE |
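
The provider resolves a requested voice in three steps: exact ID match, then case-insensitive substring match against the display name, then the default (`ava`). A sketch of that lookup (the `VOICES` dict below is abbreviated for illustration):

```python
# Abbreviated voice map; the full table above lists all 19 voices
VOICES = {
    "ava": "Ava Multilingual",
    "emma": "Emma Multilingual",
    "klaus": "Klaus Multilingual",
}


def resolve_voice(voice_id: str, default: str = "ava") -> str:
    """Map a user-supplied voice_id to a known SpeechMA voice ID."""
    requested = voice_id.lower()
    if requested in VOICES:
        return requested              # exact ID match
    for vid, name in VOICES.items():
        if requested in name.lower():
            return vid                # substring match on display name
    return default                    # unknown voice falls back to Ava
```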

## Voice Effects (Direct API Only)

When using the `/v1/tts/speechma` endpoint, you can customize:

- **pitch**: Voice pitch adjustment (-10 to 10)
- **speed**: Speech speed adjustment (-10 to 10)
- **volume**: Volume percentage (0-200)

```json
{
  "text": "Custom voice settings",
  "voice_id": "ava",
  "pitch": 2,
  "speed": -1,
  "volume": 120
}
```
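
Out-of-range values are best clamped client-side before sending. A small helper sketch (hypothetical, not part of the API; the ranges come from the list above):

```python
def clamp_effects(pitch: int = 0, speed: int = 0, volume: int = 100) -> dict:
    """Clamp effect parameters to the ranges the endpoint accepts."""
    def clamp(value: int, lo: int, hi: int) -> int:
        return max(lo, min(hi, value))

    return {
        "pitch": clamp(pitch, -10, 10),   # pitch range: -10..10
        "speed": clamp(speed, -10, 10),   # speed range: -10..10
        "volume": clamp(volume, 0, 200),  # volume range: 0..200 percent
    }
```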

## CAPTCHA Handling

SpeechMA requires CAPTCHA verification. The provider automatically:

1. Extracts the CAPTCHA image from the page
2. Uses OCR (Tesseract or EasyOCR) to read the 5-digit code
3. Enters the code and submits
4. If OCR fails, refreshes the CAPTCHA and retries (up to 5 times)
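
The retry loop above reduces to this pattern (`solve` and `refresh` are injected callables here for illustration; the real implementation drives them through Playwright):

```python
from typing import Callable, Optional


def solve_with_retries(
    solve: Callable[[], Optional[str]],
    refresh: Callable[[], None],
    max_attempts: int = 5,
) -> Optional[str]:
    """Run OCR up to max_attempts times, refreshing the CAPTCHA between tries."""
    for attempt in range(max_attempts):
        code = solve()
        if code is not None and len(code) == 5 and code.isdigit():
            return code                # valid 5-digit code found
        if attempt < max_attempts - 1:
            refresh()                  # request a new CAPTCHA image for the next try
    return None
```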

### Manual CAPTCHA Solving (If OCR Fails)

If OCR consistently fails, you can:

1. Check the CAPTCHA image manually at https://speechma.com
2. Call the API with a pre-solved CAPTCHA (future enhancement)
3. Ensure Tesseract is properly installed

## Testing

Run the test suite:

```bash
python test_tts_api.py
```

This will test:
- ✅ Health check
- ✅ List voices and models
- ✅ Get voice details
- ✅ Generate audio samples
- ✅ Direct SpeechMA API

## Limitations

1. **Character Limit**: Maximum 2000 characters per request
2. **Rate Limits**: Depends on SpeechMA's server capacity
3. **CAPTCHA**: May occasionally fail if OCR can't read the image
4. **Audio Format**: Returns MP3 only (`output_format` is accepted for compatibility but ignored)
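
Longer inputs are silently truncated at 2000 characters, so long documents should be split client-side and sent as separate requests. A sketch that breaks on word boundaries (the 2000-character limit comes from the docs; `chunk_text` itself is illustrative and hard-truncates any single word longer than the limit):

```python
def chunk_text(text: str, limit: int = 2000) -> list[str]:
    """Split text into chunks of at most `limit` characters, breaking on spaces."""
    chunks: list[str] = []
    current = ""
    for word in text.split():
        word = word[:limit]  # hard-truncate any single word longer than the limit
        candidate = f"{current} {word}".strip()
        if len(candidate) <= limit:
            current = candidate       # word still fits in the current chunk
        else:
            chunks.append(current)    # flush the full chunk
            current = word
    if current:
        chunks.append(current)
    return chunks
```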

## Troubleshooting

### CAPTCHA Not Solving

1. **Install Tesseract OCR:**
   ```bash
   # macOS
   brew install tesseract

   # Ubuntu
   sudo apt-get install tesseract-ocr
   ```

2. **Try EasyOCR instead:**
   ```bash
   pip install easyocr
   ```

3. **Check browser automation:**
   ```bash
   playwright install chromium
   ```

### Audio Not Generating

1. Check that SpeechMA is accessible: `GET /v1/tts/health`
2. Check that Playwright browsers are installed: `playwright install`
3. Try refreshing the CAPTCHA manually on speechma.com

### Import Errors

```bash
# Install missing OCR libraries
pip install pytesseract pillow

# Or
pip install easyocr
```

## Architecture

```
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│ API Client  │────▶│  TTS Router  │────▶│  SpeechMA   │
│             │     │ (11Labs API) │     │  Provider   │
└─────────────┘     └──────────────┘     └──────┬──────┘
                                                │
                                         ┌──────▼──────┐
                                         │ Playwright  │
                                         │   Browser   │
                                         └──────┬──────┘
                                                │
                                         ┌──────▼──────┐
                                         │  OCR Utils  │
                                         │ (Tesseract/ │
                                         │  EasyOCR)   │
                                         └─────────────┘
```

## API Compatibility

This implementation aims to be compatible with ElevenLabs API v1:

- ✅ Text-to-Speech conversion
- ✅ Voice listing
- ✅ Voice details
- ✅ Model listing
- ✅ Subscription info (mock)
- ❌ Voice cloning (not supported by SpeechMA)
- ❌ Real-time streaming (returns the complete file)
- ❌ Pronunciation dictionaries (ignored)
- ⚠️ Voice settings (stored but not fully applied)

## Credits

- **SpeechMA**: https://speechma.com - free TTS service
- **ElevenLabs**: API structure inspiration
- **Tesseract OCR**: open-source OCR engine
- **EasyOCR**: alternative OCR library

## License

This code is part of the KAI API project. Follow your project's license terms.
deploy-microservice.sh CHANGED
File without changes
main.py CHANGED
@@ -54,6 +54,7 @@ from models import (
 from services import engine, search_engine
 from v1_router import router as v1_router
 from admin_router import router as admin_router
+from tts_router import router as tts_router

 # ---------- Logging ----------
 logging.basicConfig(
@@ -110,6 +111,7 @@ app.add_middleware(
 # Include OpenAI Router
 app.include_router(v1_router)
 app.include_router(admin_router)
+app.include_router(tts_router)


 # ---------- Admin Routes ----------
ocr_utils.py ADDED
@@ -0,0 +1,236 @@
"""
OCR Utilities for CAPTCHA Solving
---------------------------------
Helper functions to solve CAPTCHA images from SpeechMA.
"""

import io
import re
from typing import Optional


async def extract_digits_from_image(image_data: bytes, method: str = "auto") -> Optional[str]:
    """
    Extract 5-digit CAPTCHA code from image.

    Args:
        image_data: Raw image bytes
        method: OCR method to use - "tesseract", "easyocr", or "auto"

    Returns:
        5-digit code or None if extraction failed
    """

    if method == "auto":
        # Try tesseract first, then easyocr
        result = await _try_tesseract(image_data)
        if result:
            return result
        return await _try_easyocr(image_data)

    elif method == "tesseract":
        return await _try_tesseract(image_data)

    elif method == "easyocr":
        return await _try_easyocr(image_data)

    return None


async def _try_tesseract(image_data: bytes) -> Optional[str]:
    """Try extracting digits using pytesseract."""
    try:
        import pytesseract
        from PIL import Image, ImageEnhance, ImageFilter

        # Load image
        image = Image.open(io.BytesIO(image_data))

        # Preprocess for better OCR:
        # convert to grayscale
        image = image.convert('L')

        # Enhance contrast
        enhancer = ImageEnhance.Contrast(image)
        image = enhancer.enhance(2.0)

        # Denoise
        image = image.filter(ImageFilter.MedianFilter(size=3))

        # Binarize
        threshold = 128
        image = image.point(lambda x: 0 if x < threshold else 255, '1')

        # OCR config optimized for a single line of digits
        custom_config = r'--oem 3 --psm 7 -c tessedit_char_whitelist=0123456789'
        text = pytesseract.image_to_string(image, config=custom_config)

        # Extract exactly 5 digits
        digits = re.findall(r'\d', text)
        if len(digits) >= 5:
            return ''.join(digits[:5])

        return None

    except ImportError:
        return None
    except Exception as e:
        print(f"Tesseract OCR error: {e}")
        return None


async def _try_easyocr(image_data: bytes) -> Optional[str]:
    """Try extracting digits using EasyOCR."""
    try:
        import easyocr
        import tempfile
        import os

        # EasyOCR requires a file path, so save temporarily
        with tempfile.NamedTemporaryFile(suffix='.png', delete=False) as tmp:
            tmp.write(image_data)
            tmp_path = tmp.name

        try:
            # Initialize reader (English only)
            reader = easyocr.Reader(['en'], gpu=False)

            # Read text
            results = reader.readtext(tmp_path)

            if results:
                # Take the first detected text region
                text = results[0][1]

                # Extract exactly 5 digits
                digits = re.findall(r'\d', text)
                if len(digits) >= 5:
                    return ''.join(digits[:5])

        finally:
            # Clean up temp file
            if os.path.exists(tmp_path):
                os.remove(tmp_path)

        return None

    except ImportError:
        return None
    except Exception as e:
        print(f"EasyOCR error: {e}")
        return None


def preprocess_captcha_image(image_data: bytes) -> bytes:
    """
    Preprocess CAPTCHA image for better OCR results.

    Args:
        image_data: Raw image bytes

    Returns:
        Preprocessed image bytes
    """
    try:
        from PIL import Image, ImageEnhance, ImageFilter

        # Load image
        image = Image.open(io.BytesIO(image_data))

        # Convert to grayscale
        image = image.convert('L')

        # Enhance contrast
        enhancer = ImageEnhance.Contrast(image)
        image = enhancer.enhance(2.0)

        # Sharpen
        image = image.filter(ImageFilter.SHARPEN)

        # Upscale 2x for better OCR
        width, height = image.size
        image = image.resize((width * 2, height * 2), Image.Resampling.LANCZOS)

        # Save to bytes
        output = io.BytesIO()
        image.save(output, format='PNG')
        return output.getvalue()

    except Exception as e:
        print(f"Image preprocessing error: {e}")
        return image_data


# Simple fallback digit recognition (very basic)
def simple_digit_recognition(image_data: bytes) -> Optional[str]:
    """
    Very simple fallback digit recognition.
    Not accurate, but doesn't require external OCR libraries.

    Args:
        image_data: Raw image bytes

    Returns:
        Guessed 5-digit code or None
    """
    try:
        from PIL import Image

        image = Image.open(io.BytesIO(image_data))
        image = image.convert('L')

        # Get image dimensions
        width, height = image.size

        # Simple heuristic: look for 5 vertical segments with high contrast.
        # This is a very naive approach and won't work well for complex CAPTCHAs.

        pixels = list(image.getdata())

        # Divide image into 5 equal vertical segments
        segment_width = width // 5
        digits = []

        for i in range(5):
            # Get center of each segment
            x = i * segment_width + segment_width // 2

            # Count dark pixels in this column
            dark_count = 0
            for y in range(height):
                idx = y * width + x
                if idx < len(pixels) and pixels[idx] < 128:
                    dark_count += 1

            # Simple classification based on darkness.
            # This is extremely basic and won't work reliably.
            darkness_ratio = dark_count / height

            # Guess digit based on darkness (very rough)
            if darkness_ratio < 0.1:
                digits.append('1')
            elif darkness_ratio < 0.2:
                digits.append('7')
            elif darkness_ratio < 0.3:
                digits.append('4')
            elif darkness_ratio < 0.4:
                digits.append('2')
            elif darkness_ratio < 0.5:
                digits.append('3')
            elif darkness_ratio < 0.6:
                digits.append('5')
            elif darkness_ratio < 0.7:
                digits.append('6')
            elif darkness_ratio < 0.8:
                digits.append('9')
            elif darkness_ratio < 0.9:
                digits.append('8')
            else:
                digits.append('0')

        return ''.join(digits)

    except Exception as e:
        print(f"Simple recognition error: {e}")
        return None
providers/speechma_tts_provider.py ADDED
@@ -0,0 +1,367 @@
1
+ """
2
+ SpeechMA TTS Provider
3
+ ---------------------
4
+ Uses Playwright to automate speechma.com TTS generation.
5
+ Handles CAPTCHA solving via OCR and voice selection.
6
+ """
7
+
8
+ import asyncio
9
+ import base64
10
+ import re
11
+ import time
12
+ from typing import Optional
13
+ from playwright.async_api import async_playwright, Page, ElementHandle
14
+ import io
15
+
16
+ try:
17
+ from PIL import Image
18
+ HAS_PIL = True
19
+ except ImportError:
20
+ HAS_PIL = False
21
+
22
+ from ocr_utils import extract_digits_from_image
23
+
24
+
25
+ # SpeechMA Voice IDs mapping to their display names
26
+ SPEECHMA_VOICES = {
27
+ "andrew": {"name": "Andrew Multilingual", "gender": "Male", "language": "Multilingual", "country": "United States"},
28
+ "ava": {"name": "Ava Multilingual", "gender": "Female", "language": "Multilingual", "country": "United States"},
29
+ "brian": {"name": "Brian Multilingual", "gender": "Male", "language": "Multilingual", "country": "United States"},
30
+ "emma": {"name": "Emma Multilingual", "gender": "Female", "language": "Multilingual", "country": "United Kingdom"},
31
+ "remy": {"name": "Remy Multilingual", "gender": "Male", "language": "Multilingual", "country": "France"},
32
+ "vivienne": {"name": "Vivienne Multilingual", "gender": "Female", "language": "Multilingual", "country": "United States"},
33
+ "daniel": {"name": "Daniel Multilingual", "gender": "Male", "language": "Multilingual", "country": "United Kingdom"},
34
+ "serena": {"name": "Serena Multilingual", "gender": "Female", "language": "Multilingual", "country": "United States"},
35
+ "matthew": {"name": "Matthew Multilingual", "gender": "Male", "language": "Multilingual", "country": "United States"},
36
+ "jane": {"name": "Jane Multilingual", "gender": "Female", "language": "Multilingual", "country": "United States"},
37
+ "alfonso": {"name": "Alfonso Multilingual", "gender": "Male", "language": "Multilingual", "country": "Spain"},
38
+ "mario": {"name": "Mario Multilingual", "gender": "Male", "language": "Multilingual", "country": "Italy"},
39
+ "klaus": {"name": "Klaus Multilingual", "gender": "Male", "language": "Multilingual", "country": "Germany"},
40
+ "sakura": {"name": "Sakura Multilingual", "gender": "Female", "language": "Multilingual", "country": "Japan"},
41
+ "xin": {"name": "Xin Multilingual", "gender": "Female", "language": "Multilingual", "country": "China"},
42
+ "jose": {"name": "Jose Multilingual", "gender": "Male", "language": "Multilingual", "country": "Brazil"},
43
+ "ines": {"name": "Ines Multilingual", "gender": "Female", "language": "Multilingual", "country": "Portugal"},
44
+ "amira": {"name": "Amira Multilingual", "gender": "Female", "language": "Multilingual", "country": "Saudi Arabia"},
45
+ "fatima": {"name": "Fatima Multilingual", "gender": "Female", "language": "Multilingual", "country": "UAE"},
46
+ }
47
+
48
+
49
+ class SpeechMATTSProvider:
50
+ """SpeechMA Text-to-Speech Provider using Playwright automation."""
51
+
52
+ def __init__(self):
53
+ self.base_url = "https://speechma.com"
54
+ self.default_voice = "ava"
55
+ self.browser = None
56
+ self.context = None
57
+
58
+ def get_voice_info(self, voice_id: str) -> Optional[dict]:
59
+ """Get voice information by voice_id."""
60
+ voice_id_lower = voice_id.lower()
61
+
62
+ # Try direct match first
63
+ if voice_id_lower in SPEECHMA_VOICES:
64
+ return {"voice_id": voice_id_lower, **SPEECHMA_VOICES[voice_id_lower]}
65
+
66
+ # Try to find by partial match in name
67
+ for vid, info in SPEECHMA_VOICES.items():
68
+ if voice_id_lower in info["name"].lower():
69
+ return {"voice_id": vid, **info}
70
+
71
+ # Return default if not found
72
+ return {"voice_id": self.default_voice, **SPEECHMA_VOICES[self.default_voice]}
73
+
74
+ def get_available_voices(self) -> list[dict]:
75
+ """Return all available voices."""
76
+ return [{"voice_id": vid, **info} for vid, info in SPEECHMA_VOICES.items()]
77
+
78
+ async def _extract_captcha_code(self, page: Page) -> Optional[str]:
79
+ """
80
+ Extract CAPTCHA code from the image using OCR.
81
+ Returns the 5-digit code or None if failed.
82
+ """
83
+ try:
84
+ # Find the CAPTCHA image element
85
+ captcha_img = await page.wait_for_selector('img[alt="captcha"], .captcha-image, [class*="captcha"] img', timeout=5000)
86
+ if not captcha_img:
87
+ return None
88
+
89
+ # Get the image src
90
+ src = await captcha_img.get_attribute('src')
91
+ if not src:
92
+ return None
93
+
94
+ # If it's a data URL, extract base64
95
+ if src.startswith('data:image'):
96
+ base64_data = src.split(',')[1]
97
+ image_data = base64.b64decode(base64_data)
98
+ else:
99
+ # Otherwise download it
100
+ import aiohttp
101
+ async with aiohttp.ClientSession() as session:
102
+ async with session.get(src) as response:
103
+ image_data = await response.read()
104
+
105
+ # Use OCR utilities to extract digits
106
+ code = await extract_digits_from_image(image_data, method="auto")
107
+ return code
108
+
109
+ except Exception as e:
110
+ print(f"CAPTCHA extraction error: {e}")
111
+ return None
112
+
113
+ async def _refresh_captcha(self, page: Page) -> bool:
114
+ """Click the refresh button to get a new CAPTCHA."""
115
+ try:
116
+ # Find and click refresh button
117
+ refresh_btn = await page.query_selector('button[onclick*="refreshCaptcha"], button.captcha-refresh, button:has-text("↻")')
118
+ if refresh_btn:
119
+ await refresh_btn.click()
120
+ await asyncio.sleep(1)
121
+ return True
122
+
123
+ # Try finding by icon/aria-label
124
+ refresh_btn = await page.query_selector('button[aria-label*="refresh"], button[title*="refresh"]')
125
+ if refresh_btn:
126
+ await refresh_btn.click()
127
+ await asyncio.sleep(1)
128
+ return True
129
+
130
+ except Exception as e:
131
+ print(f"CAPTCHA refresh error: {e}")
132
+ return False
133
+
134
+ async def _select_voice(self, page: Page, voice_id: str) -> bool:
135
+ """Select the specified voice."""
136
+ try:
137
+ voice_info = self.get_voice_info(voice_id)
138
+ voice_name = voice_info["name"]
139
+
140
+ # Wait for voice selection area to load
141
+ await page.wait_for_selector('[class*="voice"]', timeout=10000)
142
+
143
+ # Find the voice card by name
144
+ voice_selector = f'text={voice_name}'
145
+ voice_element = await page.query_selector(voice_selector)
146
+
147
+ if voice_element:
148
+ await voice_element.click()
149
+ await asyncio.sleep(0.5)
150
+ return True
151
+
152
+ # Try alternative selectors
153
+ voice_cards = await page.query_selector_all('[class*="voice-card"], [class*="voice-item"], div[class*="voice"]')
154
+ for card in voice_cards:
155
+ text = await card.inner_text()
156
+ if voice_name.lower() in text.lower():
157
+ await card.click()
158
+ await asyncio.sleep(0.5)
159
+ return True
160
+
161
+ return False
162
+
163
+ except Exception as e:
164
+ print(f"Voice selection error: {e}")
165
+ return False
166
+
167
+ async def _set_voice_effects(self, page: Page, pitch: int = 0, speed: int = 0, volume: int = 100) -> bool:
168
+ """Set voice effects (pitch, speed, volume)."""
169
+ try:
170
+ # Click Voice Effects button
171
+ effects_btn = await page.query_selector('button:has-text("Voice Effects"), [class*="voice-effects"]')
172
+ if effects_btn:
173
+ await effects_btn.click()
174
+ await asyncio.sleep(0.5)
175
+
176
+ # Set pitch if not 0
177
+ if pitch != 0:
178
+ pitch_input = await page.query_selector('input[placeholder*="pitch"], input[name*="pitch"], [class*="pitch"] input')
179
+ if pitch_input:
180
+ await pitch_input.fill(str(pitch))
181
+
182
+ # Set speed if not 0
183
+ if speed != 0:
184
+ speed_input = await page.query_selector('input[placeholder*="speed"], input[name*="speed"], [class*="speed"] input')
185
+ if speed_input:
186
+ await speed_input.fill(str(speed))
187
+
188
+ # Set volume
189
+ if volume != 100:
190
+ volume_input = await page.query_selector('input[placeholder*="volume"], input[name*="volume"], [class*="volume"] input')
191
+ if volume_input:
192
+ await volume_input.fill(str(volume))
193
+
194
+ return True
195
+
196
+ except Exception as e:
197
+ print(f"Voice effects error: {e}")
198
+ return False
199
+
200
+ async def generate_speech(
201
+ self,
202
+ text: str,
203
+ voice_id: str = "ava",
204
+ output_format: str = "mp3",
205
+ pitch: int = 0,
206
+ speed: int = 0,
207
+ volume: int = 100
208
+ ) -> Optional[bytes]:
209
+ """
210
+ Generate speech from text using SpeechMA.
211
+
212
+ Args:
213
+ text: Text to convert to speech (max 2000 chars)
214
+ voice_id: Voice ID to use
215
+ output_format: Output audio format
216
+ pitch: Voice pitch adjustment (-10 to 10)
217
+ speed: Speech speed adjustment (-10 to 10)
218
+ volume: Volume percentage (0-200)
219
+
220
+ Returns:
221
+ Audio data as bytes or None if failed
222
+ """
223
+ # Limit text length
224
+ if len(text) > 2000:
225
+ text = text[:2000]
226
+
227
+ async with async_playwright() as p:
228
+ browser = None
229
+ try:
230
+ # Launch browser
231
+ browser = await p.chromium.launch(
232
+ headless=True,
233
+ args=['--no-sandbox', '--disable-setuid-sandbox']
234
+ )
235
+
236
+ context = await browser.new_context(
237
+ viewport={'width': 1280, 'height': 800},
238
+ user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
239
+ )
240
+
241
+ page = await context.new_page()
242
+
243
+ # Navigate to SpeechMA
244
+ await page.goto(self.base_url, wait_until='networkidle', timeout=60000)
245
+ await asyncio.sleep(2) # Wait for page to fully load
246
+
247
+ # Enter text
248
+ text_area = await page.wait_for_selector('textarea[placeholder*="text"], textarea[name*="text"], #text-input', timeout=10000)
249
+ if not text_area:
250
+ raise Exception("Could not find text input area")
251
+
252
+ await text_area.fill(text)
253
+ await asyncio.sleep(0.5)
254
+
255
+ # Select voice
256
+ voice_selected = await self._select_voice(page, voice_id)
257
+ if not voice_selected:
258
+ print(f"Warning: Could not select voice {voice_id}, using default")
259
+
260
+ # Set voice effects if needed
261
+ if pitch != 0 or speed != 0 or volume != 100:
262
+ await self._set_voice_effects(page, pitch, speed, volume)
263
+
264
+ # Solve CAPTCHA
265
+ max_captcha_attempts = 5
266
+ captcha_solved = False
267
+
268
+ for attempt in range(max_captcha_attempts):
269
+ # Extract CAPTCHA code
270
+ captcha_code = await self._extract_captcha_code(page)
271
+
272
+                 if captcha_code and len(captcha_code) == 5:
+                     # Enter CAPTCHA
+                     captcha_input = await page.query_selector('input[placeholder*="captcha"], input[name*="captcha"], #captcha-input')
+                     if captcha_input:
+                         await captcha_input.fill(captcha_code)
+                         await asyncio.sleep(0.5)
+                         captcha_solved = True
+                         break
+ 
+                 # If CAPTCHA extraction failed, try refreshing
+                 if attempt < max_captcha_attempts - 1:
+                     refreshed = await self._refresh_captcha(page)
+                     if refreshed:
+                         await asyncio.sleep(2)  # Wait for new CAPTCHA
+                         continue
+                     else:
+                         # Try reloading the page
+                         await page.reload(wait_until='networkidle')
+                         await asyncio.sleep(2)
+                         # Re-enter text
+                         await text_area.fill(text)
+                         await asyncio.sleep(0.5)
+ 
+             if not captcha_solved:
+                 raise Exception("Could not solve CAPTCHA after multiple attempts")
+ 
+             # Click Generate Audio button
+             generate_btn = await page.wait_for_selector('button:has-text("Generate Audio"), button[type="submit"]', timeout=10000)
+             if not generate_btn:
+                 raise Exception("Could not find Generate Audio button")
+ 
+             # Set up download handler before clicking
+             download_future = asyncio.get_running_loop().create_future()
+ 
+             async def handle_download(download):
+                 try:
+                     path = await download.path()
+                     with open(path, 'rb') as f:
+                         data = f.read()
+                     download_future.set_result(data)
+                 except Exception as e:
+                     download_future.set_exception(e)
+ 
+             page.on('download', lambda d: asyncio.create_task(handle_download(d)))
+ 
+             # Click generate
+             await generate_btn.click()
+ 
+             # Wait for generation and download
+             try:
+                 audio_data = await asyncio.wait_for(download_future, timeout=60)
+                 return audio_data
+             except asyncio.TimeoutError:
+                 # Alternative: try to get audio from the on-page audio player element
+                 audio_element = await page.wait_for_selector('audio[src], source[type="audio/mp3"]', timeout=10000)
+                 if audio_element:
+                     audio_src = await audio_element.get_attribute('src')
+                     if audio_src:
+                         # Download audio from URL
+                         import aiohttp
+                         async with aiohttp.ClientSession() as session:
+                             async with session.get(audio_src) as response:
+                                 return await response.read()
+ 
+                 raise Exception("Audio generation timeout - download not detected")
+ 
+         except Exception as e:
+             print(f"SpeechMA generation error: {e}")
+             return None
+ 
+         finally:
+             if browser:
+                 await browser.close()
+ 
+     async def health_check(self) -> bool:
+         """Check if SpeechMA is accessible."""
+         try:
+             async with async_playwright() as p:
+                 browser = await p.chromium.launch(headless=True)
+                 page = await browser.new_page()
+                 await page.goto(self.base_url, timeout=30000)
+                 await browser.close()
+                 return True
+         except Exception:
+             return False
+ 
+ 
+ # Global provider instance
+ _speechma_provider = None
+ 
+ 
+ def get_speechma_provider() -> SpeechMATTSProvider:
+     """Get or create the SpeechMA provider singleton."""
+     global _speechma_provider
+     if _speechma_provider is None:
+         _speechma_provider = SpeechMATTSProvider()
+     return _speechma_provider
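The module-level accessor above lazily builds one provider per process, so the voice catalog and browser setup are reused across requests. A minimal standalone sketch of the same pattern (`ExpensiveProvider` is a hypothetical stand-in for `SpeechMATTSProvider`, not part of this diff):

```python
# Lazy module-level singleton: construct once, return the same object after.
class ExpensiveProvider:
    def __init__(self):
        # Imagine a costly setup step here (e.g. launching a browser).
        self.ready = True

_provider = None

def get_provider() -> ExpensiveProvider:
    """Create the provider on first call, then reuse the same instance."""
    global _provider
    if _provider is None:
        _provider = ExpensiveProvider()
    return _provider
```

Every caller then shares one instance, which is what makes `get_speechma_provider()` safe to call from each endpoint handler.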
test_tts_api.py ADDED
@@ -0,0 +1,280 @@
+ #!/usr/bin/env python3
+ """
+ Test Script for SpeechMA TTS API
+ --------------------------------
+ Example usage of the 11Labs-compatible TTS endpoints.
+ """
+ 
+ import requests
+ 
+ # Configuration
+ BASE_URL = "http://localhost:8000"  # Change to your API URL
+ API_KEY = "your-api-key-here"       # Your KAI API key
+ 
+ 
+ def test_list_voices():
+     """Test listing available voices."""
+     print("\n🎙️ Testing: List Voices")
+     print("-" * 50)
+ 
+     response = requests.get(
+         f"{BASE_URL}/v1/voices",
+         headers={"Authorization": f"Bearer {API_KEY}"}
+     )
+ 
+     if response.status_code == 200:
+         data = response.json()
+         print(f"✅ Found {len(data['voices'])} voices")
+ 
+         # Print first 5 voices
+         for voice in data['voices'][:5]:
+             print(f"  - {voice['voice_id']}: {voice['name']}")
+             if voice.get('labels'):
+                 print(f"    Gender: {voice['labels'].get('gender', 'N/A')}, "
+                       f"Accent: {voice['labels'].get('accent', 'N/A')}")
+     else:
+         print(f"❌ Failed: {response.status_code}")
+         print(response.text)
+ 
+ 
+ def test_list_models():
+     """Test listing TTS models."""
+     print("\n🤖 Testing: List Models")
+     print("-" * 50)
+ 
+     response = requests.get(
+         f"{BASE_URL}/v1/models",
+         headers={"Authorization": f"Bearer {API_KEY}"}
+     )
+ 
+     if response.status_code == 200:
+         data = response.json()
+         print(f"✅ Found {len(data['models'])} models")
+         for model in data['models']:
+             print(f"  - {model['model_id']}: {model['name']}")
+     else:
+         print(f"❌ Failed: {response.status_code}")
+         print(response.text)
+ 
+ 
+ def test_get_voice(voice_id: str = "ava"):
+     """Test getting a specific voice."""
+     print(f"\n🎭 Testing: Get Voice '{voice_id}'")
+     print("-" * 50)
+ 
+     response = requests.get(
+         f"{BASE_URL}/v1/voices/{voice_id}",
+         headers={"Authorization": f"Bearer {API_KEY}"}
+     )
+ 
+     if response.status_code == 200:
+         voice = response.json()
+         print(f"✅ Found voice: {voice['name']}")
+         print(f"   Category: {voice['category']}")
+         if voice.get('labels'):
+             print(f"   Labels: {voice['labels']}")
+     else:
+         print(f"❌ Failed: {response.status_code}")
+         print(response.text)
+ 
+ 
+ def test_text_to_speech(voice_id: str = "ava", text: str = "Hello, this is a test."):
+     """Test text-to-speech conversion."""
+     print(f"\n🔊 Testing: Text-to-Speech with '{voice_id}'")
+     print("-" * 50)
+     print(f"Text: {text}")
+ 
+     payload = {
+         "text": text,
+         "model_id": "eleven_multilingual_v2",
+         "voice_settings": {
+             "stability": 0.5,
+             "similarity_boost": 0.75
+         }
+     }
+ 
+     response = requests.post(
+         f"{BASE_URL}/v1/text-to-speech/{voice_id}",
+         headers={
+             "Authorization": f"Bearer {API_KEY}",
+             "Content-Type": "application/json"
+         },
+         json=payload
+     )
+ 
+     if response.status_code == 200:
+         # Save audio file
+         output_file = f"test_output_{voice_id}.mp3"
+         with open(output_file, "wb") as f:
+             f.write(response.content)
+ 
+         file_size = len(response.content)
+         print(f"✅ Success! Saved to {output_file}")
+         print(f"   File size: {file_size:,} bytes")
+ 
+         # Show headers
+         if 'X-Character-Count' in response.headers:
+             print(f"   Character count: {response.headers['X-Character-Count']}")
+         if 'Request-Id' in response.headers:
+             print(f"   Request ID: {response.headers['Request-Id']}")
+ 
+         return output_file
+     else:
+         print(f"❌ Failed: {response.status_code}")
+         print(response.text)
+         return None
+ 
+ 
+ def test_speechma_direct(text: str = "Hello from SpeechMA direct API.", voice_id: str = "ava"):
+     """Test the direct SpeechMA endpoint with more options."""
+     print("\n🎯 Testing: SpeechMA Direct API")
+     print("-" * 50)
+     print(f"Text: {text}")
+     print(f"Voice: {voice_id}")
+ 
+     payload = {
+         "text": text,
+         "voice_id": voice_id,
+         "pitch": 0,
+         "speed": 0,
+         "volume": 100
+     }
+ 
+     response = requests.post(
+         f"{BASE_URL}/v1/tts/speechma",
+         headers={
+             "Authorization": f"Bearer {API_KEY}",
+             "Content-Type": "application/json"
+         },
+         json=payload
+     )
+ 
+     if response.status_code == 200:
+         output_file = f"test_speechma_direct_{voice_id}.mp3"
+         with open(output_file, "wb") as f:
+             f.write(response.content)
+ 
+         file_size = len(response.content)
+         print(f"✅ Success! Saved to {output_file}")
+         print(f"   File size: {file_size:,} bytes")
+ 
+         if 'X-Voice-Used' in response.headers:
+             print(f"   Voice used: {response.headers['X-Voice-Used']}")
+ 
+         return output_file
+     else:
+         print(f"❌ Failed: {response.status_code}")
+         print(response.text)
+         return None
+ 
+ 
+ def test_speechma_voices():
+     """Test getting the SpeechMA-specific voice list."""
+     print("\n🎙️ Testing: SpeechMA Voices List")
+     print("-" * 50)
+ 
+     response = requests.get(
+         f"{BASE_URL}/v1/tts/speechma/voices",
+         headers={"Authorization": f"Bearer {API_KEY}"}
+     )
+ 
+     if response.status_code == 200:
+         data = response.json()
+         print(f"✅ Found {data['count']} voices")
+         print(f"   Default: {data['default_voice']}")
+ 
+         # Print the first 10 voices
+         print("\n   Available Voices:")
+         for voice in data['voices'][:10]:
+             print(f"   - {voice['voice_id']}: {voice['name']} ({voice['gender']}, {voice['country']})")
+ 
+         if data['count'] > 10:
+             print(f"   ... and {data['count'] - 10} more")
+     else:
+         print(f"❌ Failed: {response.status_code}")
+         print(response.text)
+ 
+ 
+ def test_health():
+     """Test TTS health check."""
+     print("\n🏥 Testing: Health Check")
+     print("-" * 50)
+ 
+     response = requests.get(f"{BASE_URL}/v1/tts/health")
+ 
+     if response.status_code == 200:
+         data = response.json()
+         print(f"✅ Status: {data['status']}")
+         print(f"   Provider: {data['provider']}")
+     else:
+         print(f"❌ Failed: {response.status_code}")
+         print(response.text)
+ 
+ 
+ def test_user_subscription():
+     """Test the user subscription endpoint."""
+     print("\n👤 Testing: User Subscription")
+     print("-" * 50)
+ 
+     response = requests.get(
+         f"{BASE_URL}/v1/user/subscription",
+         headers={"Authorization": f"Bearer {API_KEY}"}
+     )
+ 
+     if response.status_code == 200:
+         data = response.json()
+         print(f"✅ Tier: {data['tier']}")
+         print(f"   Character limit: {data['character_limit']:,}")
+         print(f"   Characters used: {data['character_count']:,}")
+     else:
+         print(f"❌ Failed: {response.status_code}")
+         print(response.text)
+ 
+ 
+ def main():
+     """Run all tests."""
+     print("\n" + "=" * 60)
+     print("🧪 SpeechMA TTS API Test Suite")
+     print("=" * 60)
+     print(f"Base URL: {BASE_URL}")
+ 
+     # Health check first
+     test_health()
+ 
+     # List resources
+     test_list_models()
+     test_list_voices()
+     test_speechma_voices()
+ 
+     # Get specific voices
+     test_get_voice("ava")
+     test_get_voice("andrew")
+ 
+     # User info
+     test_user_subscription()
+ 
+     # TTS generation (comment out if you don't want to generate audio)
+     print("\n" + "=" * 60)
+     print("🎵 Generating Audio Samples...")
+     print("=" * 60)
+ 
+     # Test different voices
+     test_text_to_speech("ava", "Hello! I am Ava, a multilingual voice.")
+     test_text_to_speech("andrew", "Greetings! I am Andrew, ready to help you.")
+     test_text_to_speech("emma", "Hi there! I'm Emma with a British accent.")
+ 
+     # Test direct API with effects
+     test_speechma_direct(
+         "This is a test with custom voice settings.",
+         "brian"
+     )
+ 
+     print("\n" + "=" * 60)
+     print("✅ All tests completed!")
+     print("=" * 60)
+ 
+ 
+ if __name__ == "__main__":
+     main()
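The router rejects empty text and anything over 2,000 characters, so a client can mirror that check locally and skip the round trip for input the API would refuse anyway. A minimal sketch of such pre-validation (the helper name is ours; the limit matches the one enforced server-side):

```python
MAX_TTS_CHARS = 2000  # matches the limit enforced by the TTS router

def validate_tts_text(text: str) -> str:
    """Raise ValueError for input the TTS API would reject anyway."""
    if not text:
        raise ValueError("Text is required")
    if len(text) > MAX_TTS_CHARS:
        raise ValueError(f"Text exceeds {MAX_TTS_CHARS} character limit")
    return text
```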
tts_router.py ADDED
@@ -0,0 +1,500 @@
+ """
2
+ TTS Router - 11Labs Compatible API
3
+ ----------------------------------
4
+ Text-to-Speech endpoints compatible with ElevenLabs API structure.
5
+ Uses SpeechMA as the backend provider.
6
+ """
7
+
8
+ from fastapi import APIRouter, Depends, HTTPException, Header, Request, Response
9
+ from fastapi.responses import StreamingResponse, JSONResponse
10
+ from pydantic import BaseModel, Field
11
+ from typing import List, Optional, Dict, Any, Literal
12
+ import time
13
+ import uuid
14
+ import json
15
+
16
+ from auth import verify_api_key
17
+ from providers.speechma_tts_provider import get_speechma_provider
18
+
19
+ router = APIRouter()
20
+
21
+
22
+ # --- Pydantic Models (11Labs Compatible) ---
23
+
24
+ class VoiceSettings(BaseModel):
25
+ """Voice settings for TTS."""
26
+ stability: float = Field(default=0.5, ge=0.0, le=1.0, description="Voice stability")
27
+ similarity_boost: float = Field(default=0.75, ge=0.0, le=1.0, description="Similarity boost")
28
+ style: float = Field(default=0.0, ge=0.0, le=1.0, description="Style exaggeration")
29
+ use_speaker_boost: bool = Field(default=True, description="Use speaker boost")
30
+
31
+
32
+ class TextToSpeechRequest(BaseModel):
33
+ """11Labs-compatible TTS request."""
34
+ text: str = Field(..., max_length=2000, description="Text to convert to speech")
35
+ model_id: Optional[str] = Field("eleven_multilingual_v2", description="Model ID (ignored, uses SpeechMA)")
36
+ voice_settings: Optional[VoiceSettings] = Field(None, description="Voice settings")
37
+ pronunciation_dictionary_locators: Optional[List[Dict[str, str]]] = None
38
+ seed: Optional[int] = None
39
+ previous_text: Optional[str] = None
40
+ language_code: Optional[str] = None
41
+
42
+ # SpeechMA-specific fields
43
+ voice_id: Optional[str] = Field("ava", description="Voice ID to use")
44
+ output_format: Optional[str] = Field("mp3_44100_128", description="Output format")
45
+ optimize_streaming_latency: Optional[int] = Field(0, ge=0, le=4)
46
+
47
+
48
+ class VoiceResponse(BaseModel):
49
+ """Voice information response."""
50
+ voice_id: str
51
+ name: str
52
+ samples: Optional[List[Dict[str, Any]]] = None
53
+ category: str = "premade"
54
+ fine_tuning: Optional[Dict[str, Any]] = None
55
+ labels: Optional[Dict[str, str]] = None
56
+ description: Optional[str] = None
57
+ preview_url: Optional[str] = None
58
+ available_for_tiers: List[str] = ["free", "starter", "creator", "enterprise"]
59
+ settings: Optional[VoiceSettings] = None
60
+ sharing: Optional[Dict[str, Any]] = None
61
+ high_quality_base_model_ids: Optional[List[str]] = None
62
+ safety_control: Optional[str] = None
63
+ voice_verification: Optional[Dict[str, Any]] = None
64
+ permission_on_resource: Optional[str] = None
65
+ is_legacy: bool = False
66
+ is_mixed: bool = False
67
+
68
+
69
+ class VoicesListResponse(BaseModel):
70
+ """List of voices response."""
71
+ voices: List[VoiceResponse]
72
+
73
+
74
+ class TTSModelInfo(BaseModel):
75
+ """TTS model information."""
76
+ model_id: str
77
+ name: str
78
+ description: str
79
+ can_do_text_to_speech: bool = True
80
+ can_do_voice_conversion: bool = False
81
+ can_use_style: bool = True
82
+ can_use_speaker_boost: bool = True
83
+ serves_pro_voices: bool = True
84
+ serves_v2_models: bool = True
85
+ token_cost_factor: float = 1.0
86
+ requires_alpha_access: bool = False
87
+ max_characters_request_free_user: int = 2000
88
+ max_characters_request_subscribed_user: int = 2000
89
+ languages: List[Dict[str, str]]
90
+
91
+
92
+ class TTSModelsResponse(BaseModel):
93
+ """TTS models list response."""
94
+ models: List[TTSModelInfo]
95
+
96
+
97
+ class UserSubscriptionResponse(BaseModel):
98
+ """User subscription info (mock for compatibility)."""
99
+ tier: str = "free"
100
+ character_count: int = 0
101
+ character_limit: int = 1000000
102
+ can_extend_character_limit: bool = True
103
+ allowed_to_extend_character_limit: bool = True
104
+ next_character_count_reset_unix: int = 0
105
+ voice_slots_used: int = 1
106
+ voice_slots_available: int = 100
107
+ professional_voice_slots_used: int = 0
108
+ professional_voice_slots_available: int = 5
109
+ can_use_delayed_payment_methods: bool = False
110
+ can_use_instant_voice_cloning: bool = True
111
+ can_use_professional_voice_cloning: bool = False
112
+ currency: Dict[str, Any] = {"usd": "USD"}
113
+ status: str = "active"
114
+ has_open_invoices: bool = False
115
+
116
+
117
+ # --- Helper Functions ---
118
+
119
+ def format_voice_to_11labs(voice_id: str, voice_info: dict) -> VoiceResponse:
120
+ """Convert SpeechMA voice to 11Labs format."""
121
+ return VoiceResponse(
122
+ voice_id=voice_id,
123
+ name=voice_info["name"],
124
+ category="premade",
125
+ labels={
126
+ "accent": voice_info.get("country", "Multilingual"),
127
+ "description": f"{voice_info['gender']} {voice_info['language']} voice",
128
+ "age": "adult",
129
+ "gender": voice_info["gender"].lower(),
130
+ "use_case": "general"
131
+ },
132
+ description=f"{voice_info['gender']} {voice_info['language']} voice from {voice_info.get('country', 'Unknown')}",
133
+ settings=VoiceSettings()
134
+ )
135
+
136
+
137
+ # --- Endpoints ---
138
+
139
+ @router.get("/v1/user/subscription", response_model=UserSubscriptionResponse)
140
+ async def get_user_subscription(
141
+ key_data: dict = Depends(verify_api_key)
142
+ ):
143
+ """
144
+ Get user subscription information.
145
+ Mock endpoint for 11Labs compatibility.
146
+ """
147
+ return UserSubscriptionResponse(
148
+ tier="free",
149
+ character_count=0,
150
+ character_limit=1000000,
151
+ next_character_count_reset_unix=int(time.time()) + 86400 * 30
152
+ )
153
+
154
+
155
+ @router.get("/v1/models", response_model=TTSModelsResponse)
156
+ async def list_tts_models(
157
+ key_data: dict = Depends(verify_api_key)
158
+ ):
159
+ """
160
+ List available TTS models.
161
+ """
162
+ models = [
163
+ TTSModelInfo(
164
+ model_id="eleven_multilingual_v2",
165
+ name="Eleven Multilingual v2",
166
+ description="Our most advanced multilingual model with highest quality",
167
+ can_do_text_to_speech=True,
168
+ can_do_voice_conversion=False,
169
+ can_use_style=True,
170
+ can_use_speaker_boost=True,
171
+ serves_pro_voices=True,
172
+ serves_v2_models=True,
173
+ token_cost_factor=1.0,
174
+ requires_alpha_access=False,
175
+ max_characters_request_free_user=2000,
176
+ max_characters_request_subscribed_user=2000,
177
+ languages=[
178
+ {"language_id": "en", "name": "English"},
179
+ {"language_id": "es", "name": "Spanish"},
180
+ {"language_id": "fr", "name": "French"},
181
+ {"language_id": "de", "name": "German"},
182
+ {"language_id": "it", "name": "Italian"},
183
+ {"language_id": "pt", "name": "Portuguese"},
184
+ {"language_id": "ja", "name": "Japanese"},
185
+ {"language_id": "zh", "name": "Chinese"},
186
+ {"language_id": "ar", "name": "Arabic"},
187
+ {"language_id": "hi", "name": "Hindi"},
188
+ ]
189
+ ),
190
+ TTSModelInfo(
191
+ model_id="eleven_flash_v2_5",
192
+ name="Eleven Flash v2.5",
193
+ description="Ultra-low latency model (~75ms)",
194
+ can_do_text_to_speech=True,
195
+ can_do_voice_conversion=False,
196
+ can_use_style=False,
197
+ can_use_speaker_boost=True,
198
+ serves_pro_voices=True,
199
+ serves_v2_models=True,
200
+ token_cost_factor=0.5,
201
+ requires_alpha_access=False,
202
+ max_characters_request_free_user=2000,
203
+ max_characters_request_subscribed_user=2000,
204
+ languages=[
205
+ {"language_id": "en", "name": "English"},
206
+ {"language_id": "es", "name": "Spanish"},
207
+ {"language_id": "fr", "name": "French"},
208
+ ]
209
+ )
210
+ ]
211
+
212
+ return TTSModelsResponse(models=models)
213
+
214
+
215
+ @router.get("/v1/voices", response_model=VoicesListResponse)
216
+ async def list_voices(
217
+ key_data: dict = Depends(verify_api_key)
218
+ ):
219
+ """
220
+ List all available voices.
221
+ """
222
+ provider = get_speechma_provider()
223
+ voices_data = provider.get_available_voices()
224
+
225
+ voices = []
226
+ for voice_data in voices_data:
227
+ voice_id = voice_data["voice_id"]
228
+ info = {
229
+ "name": voice_data["name"],
230
+ "gender": voice_data["gender"],
231
+ "language": voice_data["language"],
232
+ "country": voice_data.get("country", "Unknown")
233
+ }
234
+ voices.append(format_voice_to_11labs(voice_id, info))
235
+
236
+ return VoicesListResponse(voices=voices)
237
+
238
+
239
+ @router.get("/v1/voices/{voice_id}", response_model=VoiceResponse)
240
+ async def get_voice(
241
+ voice_id: str,
242
+ key_data: dict = Depends(verify_api_key)
243
+ ):
244
+ """
245
+ Get information about a specific voice.
246
+ """
247
+ provider = get_speechma_provider()
248
+ voice_info = provider.get_voice_info(voice_id)
249
+
250
+ if not voice_info:
251
+ raise HTTPException(status_code=404, detail=f"Voice '{voice_id}' not found")
252
+
253
+ return format_voice_to_11labs(voice_info["voice_id"], {
254
+ "name": voice_info["name"],
255
+ "gender": voice_info["gender"],
256
+ "language": voice_info["language"],
257
+ "country": voice_info.get("country", "Unknown")
258
+ })
259
+
260
+
261
+ @router.get("/v1/voices/{voice_id}/settings", response_model=VoiceSettings)
262
+ async def get_voice_settings(
263
+ voice_id: str,
264
+ key_data: dict = Depends(verify_api_key)
265
+ ):
266
+ """
267
+ Get default settings for a voice.
268
+ """
269
+ provider = get_speechma_provider()
270
+ voice_info = provider.get_voice_info(voice_id)
271
+
272
+ if not voice_info:
273
+ raise HTTPException(status_code=404, detail=f"Voice '{voice_id}' not found")
274
+
275
+ return VoiceSettings()
276
+
277
+
278
+ @router.post("/v1/text-to-speech/{voice_id}")
279
+ async def text_to_speech(
280
+ voice_id: str,
281
+ request: TextToSpeechRequest,
282
+ key_data: dict = Depends(verify_api_key)
283
+ ):
284
+ """
285
+ Convert text to speech.
286
+
287
+ This endpoint is compatible with 11Labs API:
288
+ POST /v1/text-to-speech/{voice_id}
289
+
290
+ Returns audio data as MP3.
291
+ """
292
+ provider = get_speechma_provider()
293
+
294
+ # Validate voice
295
+ voice_info = provider.get_voice_info(voice_id)
296
+ if not voice_info:
297
+ raise HTTPException(status_code=404, detail=f"Voice '{voice_id}' not found")
298
+
299
+ # Use provided voice_id or from request
300
+ actual_voice_id = voice_id
301
+
302
+ # Generate speech
303
+ try:
304
+ audio_data = await provider.generate_speech(
305
+ text=request.text,
306
+ voice_id=actual_voice_id,
307
+ output_format=request.output_format or "mp3"
308
+ )
309
+
310
+ if audio_data is None:
311
+ raise HTTPException(
312
+ status_code=500,
313
+ detail="Failed to generate speech. This could be due to CAPTCHA issues or site changes."
314
+ )
315
+
316
+ # Return audio with proper headers
317
+ headers = {
318
+ "Content-Type": "audio/mpeg",
319
+ "X-Character-Count": str(len(request.text)),
320
+ "Request-Id": f"tts-{uuid.uuid4().hex[:12]}"
321
+ }
322
+
323
+ return Response(
324
+ content=audio_data,
325
+ media_type="audio/mpeg",
326
+ headers=headers
327
+ )
328
+
329
+ except Exception as e:
330
+ raise HTTPException(
331
+ status_code=500,
332
+ detail=f"Speech generation failed: {str(e)}"
333
+ )
334
+
335
+
336
+ @router.post("/v1/text-to-speech/{voice_id}/stream")
337
+ async def text_to_speech_stream(
338
+ voice_id: str,
339
+ request: TextToSpeechRequest,
340
+ key_data: dict = Depends(verify_api_key)
341
+ ):
342
+ """
343
+ Convert text to speech with streaming response.
344
+
345
+ Note: Since SpeechMA generates complete audio files,
346
+ this returns the full audio as a stream.
347
+ """
348
+ provider = get_speechma_provider()
349
+
350
+ # Validate voice
351
+ voice_info = provider.get_voice_info(voice_id)
352
+ if not voice_info:
353
+ raise HTTPException(status_code=404, detail=f"Voice '{voice_id}' not found")
354
+
355
+ try:
356
+ audio_data = await provider.generate_speech(
357
+ text=request.text,
358
+ voice_id=voice_id,
359
+ output_format=request.output_format or "mp3"
360
+ )
361
+
362
+ if audio_data is None:
363
+ raise HTTPException(
364
+ status_code=500,
365
+ detail="Failed to generate speech"
366
+ )
367
+
368
+ # Return as streaming response
369
+ def audio_generator():
370
+ # Yield audio data in chunks
371
+ chunk_size = 8192
372
+ for i in range(0, len(audio_data), chunk_size):
373
+ yield audio_data[i:i + chunk_size]
374
+
375
+ headers = {
376
+ "X-Character-Count": str(len(request.text)),
377
+ "Request-Id": f"tts-stream-{uuid.uuid4().hex[:12]}"
378
+ }
379
+
380
+ return StreamingResponse(
381
+ audio_generator(),
382
+ media_type="audio/mpeg",
383
+ headers=headers
384
+ )
385
+
386
+ except Exception as e:
387
+ raise HTTPException(
388
+ status_code=500,
389
+ detail=f"Speech generation failed: {str(e)}"
390
+ )
391
+
392
+
393
+ # Additional SpeechMA-specific endpoints
394
+
395
+ @router.post("/v1/tts/speechma")
396
+ async def speechma_tts(
397
+ request: Request,
398
+ key_data: dict = Depends(verify_api_key)
399
+ ):
400
+ """
401
+ Direct SpeechMA TTS endpoint with custom options.
402
+
403
+ Body: {
404
+ "text": "Hello world",
405
+ "voice_id": "ava",
406
+ "pitch": 0,
407
+ "speed": 0,
408
+ "volume": 100
409
+ }
410
+ """
411
+ data = await request.json()
412
+
413
+ text = data.get("text")
414
+ voice_id = data.get("voice_id", "ava")
415
+ pitch = data.get("pitch", 0)
416
+ speed = data.get("speed", 0)
417
+ volume = data.get("volume", 100)
418
+
419
+ if not text:
420
+ raise HTTPException(status_code=400, detail="Text is required")
421
+
422
+ if len(text) > 2000:
423
+ raise HTTPException(status_code=400, detail="Text exceeds 2000 character limit")
424
+
425
+ provider = get_speechma_provider()
426
+
427
+ # Validate voice
428
+ voice_info = provider.get_voice_info(voice_id)
429
+ if not voice_info:
430
+ raise HTTPException(status_code=404, detail=f"Voice '{voice_id}' not found")
431
+
432
+ try:
433
+ audio_data = await provider.generate_speech(
434
+ text=text,
435
+ voice_id=voice_id,
436
+ pitch=pitch,
437
+ speed=speed,
438
+ volume=volume
439
+ )
440
+
441
+ if audio_data is None:
442
+ raise HTTPException(
443
+ status_code=500,
444
+ detail="Failed to generate speech. This could be due to CAPTCHA issues."
445
+ )
446
+
447
+ return Response(
448
+ content=audio_data,
449
+ media_type="audio/mpeg",
450
+ headers={
451
+ "Content-Disposition": f'attachment; filename="speech_{voice_id}.mp3"',
452
+ "X-Voice-Used": voice_info["voice_id"]
453
+ }
454
+ )
455
+
456
+ except Exception as e:
457
+ raise HTTPException(
458
+ status_code=500,
459
+ detail=f"Speech generation failed: {str(e)}"
460
+ )
461
+
462
+
463
+ @router.get("/v1/tts/speechma/voices")
464
+ async def speechma_voices(
465
+ key_data: dict = Depends(verify_api_key)
466
+ ):
467
+ """
468
+ Get all available SpeechMA voices with full details.
469
+ """
470
+ provider = get_speechma_provider()
471
+ voices = provider.get_available_voices()
472
+
473
+ return JSONResponse({
474
+ "voices": voices,
475
+ "count": len(voices),
476
+ "default_voice": "ava"
477
+ })
478
+
479
+
480
+ @router.get("/v1/tts/health")
481
+ async def tts_health_check():
482
+ """
483
+ Check if TTS service is healthy.
484
+ """
485
+ try:
486
+ provider = get_speechma_provider()
487
+ is_healthy = await provider.health_check()
488
+
489
+ return JSONResponse({
490
+ "status": "healthy" if is_healthy else "unhealthy",
491
+ "provider": "speechma",
492
+ "timestamp": time.time()
493
+ })
494
+ except Exception as e:
495
+ return JSONResponse({
496
+ "status": "unhealthy",
497
+ "provider": "speechma",
498
+ "error": str(e),
499
+ "timestamp": time.time()
500
+ }, status_code=503)
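Because SpeechMA returns a complete file, the `/stream` endpoint above simulates streaming by slicing the finished audio into fixed-size chunks. The chunking logic in isolation, as a plain generator:

```python
def iter_chunks(data: bytes, chunk_size: int = 8192):
    """Yield fixed-size slices of a complete audio buffer; the last chunk may be shorter."""
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]
```

A 20,000-byte buffer, for example, yields two 8,192-byte chunks followed by one 3,616-byte chunk, and concatenating the chunks reproduces the original buffer exactly.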