Spaces: kazuhina/anime-tts


kazuhina committed on Oct 31

Commit ca70004

1 Parent(s): 6a984a1

Add requirements.txt and API manual for Anime TTS


Files changed (2)

API_MANUAL.md +616 -0
requirements.txt +24 -0

API_MANUAL.md ADDED

	@@ -0,0 +1,616 @@

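+# Anime TTS API User Manual
+## 📋 Table of Contents
+1. [Overview](#overview)
+2. [Features](#features)
+3. [Installation](#installation)
+4. [Usage](#usage)
+5. [API Reference](#api-reference)
+6. [Supported Formats](#supported-formats)
+7. [Error Handling](#error-handling)
+8. [Performance Optimization](#performance-optimization)
+9. [Troubleshooting](#troubleshooting)
+10. [Examples](#examples)
+---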
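+## 🎯 Overview
+**Anime TTS** is a speech recognition API that converts Japanese anime speech to text. It uses the `litagin/anime-whisper` model, which is tuned for anime and visual-novel audio, and aims for higher accuracy on that material than general-purpose speech recognition systems.
+### Key Features
+- 🗾 **Japanese-focused**: optimized for Japanese anime speech
+- 🎭 **Emotional speech support**: handles emotional delivery and non-speech sounds appropriately
+- 🎯 **High-accuracy recognition**: tuned specifically for anime dialogue
+- 📝 **Natural punctuation**: adds natural Japanese punctuation automatically
+- 🚀 **GPU support**: fast processing with CUDA
+---
+## ✨ Features
+### Core Features
+- **Speech-to-text**: converts Japanese anime speech to text
+- **Real-time processing**: supports streaming audio
+- **Batch processing**: processes multiple files together
+- **Error handling**: robust error handling with fallbacks
+### Technical Specifications
+- **Model**: `litagin/anime-whisper`
+- **Framework**: Gradio 5.20.0
+- **GPU support**: CUDA/ROCm
+- **Memory optimization**: chunked processing
+---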
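+## 📦 Installation
+### Requirements
+- Python 3.10+ (Gradio 5 does not support older versions)
+- CUDA-capable GPU (recommended)
+- 8 GB or more of RAM (recommended)
+### Installing Dependencies
+```bash
+# Core dependencies
+pip install gradio torch transformers spaces
+# Audio processing libraries (librosa is used by the preprocessing examples)
+pip install soundfile numpy librosa
+# For GPU use (CUDA 11.8 build)
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
+# Optional: faster Hugging Face Hub transfers (quoted so the shell does not expand the brackets)
+pip install "huggingface_hub[hf_xet]"
+```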
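After installing, it can help to confirm which packages are actually importable in the active environment. The following is a minimal sketch using only the standard library; the `REQUIRED` and `OPTIONAL` lists are assumptions drawn from the install commands in this manual, and `find_missing` and `report` are illustrative helper names, not part of `app.py`.

```python
import importlib.util
import sys

# Import names of the packages this manual installs or uses (an assumption
# based on the pip commands above, not a list taken from app.py).
REQUIRED = ["gradio", "torch", "transformers", "spaces", "soundfile", "numpy"]
OPTIONAL = ["librosa", "huggingface_hub"]


def find_missing(modules):
    """Return the module names from `modules` that cannot be found."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def report():
    """Print the Python version and any packages that are not installed."""
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    print("Missing required packages:", find_missing(REQUIRED) or "none")
    print("Missing optional packages:", find_missing(OPTIONAL) or "none")


if __name__ == "__main__":
    report()
```

The check only looks packages up without importing them, so it stays fast even when `torch` is installed.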
+### Environment Variables
+```bash
+# For GPU use
+export CUDA_VISIBLE_DEVICES=0
+# Reduce CUDA memory fragmentation (if needed)
+export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
+```
+---
+## 🚀 Usage
+### Basic Usage
+#### 1. Running Locally
+```bash
+python app.py
+```
+- Open `http://localhost:7860` in a browser
+- Upload an audio file to convert it to text
+#### 2. Calling from a Program
+```python
+from app import transcribe_audio
+# Path to the audio file
+result = transcribe_audio("path/to/audio.wav")
+print(result)
+```
+#### 3. Using as a Gradio Client
+```python
+import gradio as gr
+# Load the existing interface from the Hugging Face Space
+demo = gr.load("kazuhina/anime-tts", src="spaces")
+# Run it on an audio file
+result = demo("path/to/audio.wav")
+```
+---
+## 📚 API Reference
+### `transcribe_audio(audio_file)`
+#### Parameters
+| Parameter | Type | Required | Description |
+|-----------|------|----------|-------------|
+| `audio_file` | str/Path/File | ✅ | Path to the audio file, or a Gradio file object |
+#### Return Value
+```python
+str  # The transcribed Japanese text
+```
+#### Example
+```python
+from app import transcribe_audio
+# Using a file path
+result = transcribe_audio("anime_dialogue.wav")
+# Using a Gradio upload: gr.Audio(type="filepath") passes the handler a path string
+# (handle_upload is an illustrative handler name)
+def handle_upload(uploaded_path):
+    return transcribe_audio(uploaded_path)
+```
+### `create_demo()`
+#### Description
+Creates a test audio file for demo purposes.
+#### Return Value
+```python
+str  # Path to the generated demo audio file
+```
+#### Example
+```python
+from app import create_demo, transcribe_audio
+demo_file = create_demo()
+result = transcribe_audio(demo_file)
+```
+---
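+## 📁 Supported Formats
+### Input Formats
+| Format | Extension | Notes |
+|--------|-----------|-------|
+| **WAV** | `.wav` | Recommended; highest quality |
+| **MP3** | `.mp3` | High compression; longer processing time |
+| **M4A** | `.m4a` | Apple format |
+| **FLAC** | `.flac` | Lossless compression; preserves quality |
+### Recommended Settings
+- **Sample rate**: 16 kHz or higher
+- **Bit depth**: 16-bit or higher
+- **Channels**: mono or stereo
+- **File size**: 100 MB or less (recommended)
+### Audio Quality Requirements
+```python
+# Example of optimal settings
+sample_rate = 16000   # 16 kHz
+channels = 1          # mono
+bit_depth = 16        # 16-bit
+audio_format = "wav"  # WAV; named audio_format to avoid shadowing the built-in format()
+```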
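The recommended settings above can be checked before a file is submitted. The sketch below uses the standard-library `wave` module, so it only handles uncompressed PCM WAV files; `check_wav` and `MAX_BYTES` are illustrative names, not part of `app.py`.

```python
import os
import wave

MAX_BYTES = 100 * 1024 * 1024  # 100 MB, the recommended upper bound


def check_wav(path):
    """Return a list of warnings for a WAV file; an empty list means it fits."""
    warnings = []
    if os.path.getsize(path) > MAX_BYTES:
        warnings.append("file is larger than 100 MB")
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        bits = wf.getsampwidth() * 8  # sample width is reported in bytes
        channels = wf.getnchannels()
    if rate < 16000:
        warnings.append(f"sample rate {rate} Hz is below 16 kHz")
    if bits < 16:
        warnings.append(f"bit depth {bits} is below 16-bit")
    if channels not in (1, 2):
        warnings.append(f"{channels} channels (expected mono or stereo)")
    return warnings
```

A file that produces warnings may still transcribe; the list is meant as a hint for when results look poor.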
+---
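+## ⚠️ Error Handling
+### Common Errors and Fixes
+#### 1. Model Loading Error
+```python
+# Example error
+"Error loading model: Connection timeout"
+# What to do
+# 1. Check the internet connection
+# 2. Clear the model cache
+# 3. Restart
+```
+#### 2. Audio File Error
+```python
+# Example error
+"Audio file not found."
+# What to do
+# 1. Check the file path
+# 2. Check that the file exists
+# 3. Check permissions
+```
+#### 3. Format Error
+```python
+# Example error
+"Invalid audio file format."
+# What to do
+# 1. Check that the format is supported
+# 2. Check whether the file is corrupted
+# 3. Convert the file to another format
+```
+### Error Handling Code Example
+```python
+from app import transcribe_audio
+audio_file = "anime_dialogue.wav"
+try:
+    result = transcribe_audio(audio_file)
+    print(f"Transcription succeeded: {result}")
+except FileNotFoundError:
+    print("Audio file not found")
+except ValueError as e:
+    print(f"Invalid file format: {e}")
+except Exception as e:
+    print(f"Unexpected error: {e}")
+```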
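The manual does not state whether `transcribe_audio` raises these exceptions or returns the error messages as strings. One way to make the `try`/`except` pattern above work either way is to validate the path before calling it. This is a minimal sketch; `validate_audio_path` and `SUPPORTED_EXTENSIONS` are hypothetical names, with the extension list taken from the Supported Formats table.

```python
from pathlib import Path

# Extensions from the Supported Formats table in this manual.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac"}


def validate_audio_path(path):
    """Return a Path if the file exists and has a supported extension."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"Audio file not found: {p}")
    if p.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Invalid audio file format: {p.suffix or '(none)'}")
    return p
```

Calling `validate_audio_path(audio_file)` inside the `try` block, just before `transcribe_audio`, would route missing files and unsupported extensions to the matching `except` branches.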
+---
+## ⚡ Performance Optimization
+### Optimizing GPU Use
+```python
+# Check whether CUDA is available
+import torch
+if torch.cuda.is_available():
+    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
+else:
+    print("Using CPU")
+```
+### Memory Management
+```python
+import torch
+# Adjust the chunk and batch sizes
+chunk_length_s = 30.0  # process 30 seconds at a time
+batch_size = 64 if torch.cuda.is_available() else 8
+```
+### Batch Processing
+```python
+import glob
+import os
+from app import transcribe_audio
+def batch_transcribe(audio_dir):
+    """Transcribe every WAV file in a directory."""
+    audio_files = sorted(glob.glob(os.path.join(audio_dir, "*.wav")))
+    results = []
+    for audio_file in audio_files:
+        try:
+            result = transcribe_audio(audio_file)
+            results.append((audio_file, result))
+        except Exception as e:
+            print(f"Error {audio_file}: {e}")
+    return results
+```
+### Performance Settings
+```python
+import torch
+from transformers import pipeline
+# Generation settings
+generate_kwargs = {
+    "language": "Japanese",
+    "no_repeat_ngram_size": 0,
+    "repetition_penalty": 1.0,
+}
+# Pipeline settings
+pipe = pipeline(
+    "automatic-speech-recognition",
+    model="litagin/anime-whisper",
+    device="cuda" if torch.cuda.is_available() else "cpu",
+    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
+    chunk_length_s=30.0,
+    batch_size=64 if torch.cuda.is_available() else 8,
+)
+# generate_kwargs is passed when the pipeline is called:
+# result = pipe("audio.wav", generate_kwargs=generate_kwargs)
+```
+---
+## 🔧 Troubleshooting
+### Common Problems and Solutions
+#### 1. Model Download Error
+```bash
+# Fixes
+# 1. Clear the cached copy of the model (default cache location)
+rm -rf ~/.cache/huggingface/hub/models--litagin--anime-whisper
+# 2. Download manually
+python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='litagin/anime-whisper')"
+```
+#### 2. Out-of-Memory Error
+```python
+# Fixes
+# 1. Reduce the batch size
+batch_size = 1
+# 2. Shorten the chunk length
+chunk_length_s = 10.0
+# 3. Fall back to the CPU
+device = "cpu"
+```
+#### 3. Audio Quality Problems
+```python
+# Fixes
+# 1. Preprocess the audio
+import librosa
+import soundfile as sf
+def preprocess_audio(input_file, output_file):
+    """Resample the audio to 16 kHz mono and save it."""
+    audio, sr = librosa.load(input_file, sr=16000)
+    sf.write(output_file, audio, sr)
+    return output_file
+```
+#### 4. Processing Speed Problems
+```python
+# Fixes
+# 1. Use the GPU
+device = "cuda"
+# 2. Optimize file size
+# Split into files of 30 seconds or less
+# 3. Process in parallel
+from concurrent.futures import ThreadPoolExecutor
+from app import transcribe_audio
+def parallel_transcribe(files):
+    with ThreadPoolExecutor(max_workers=4) as executor:
+        results = list(executor.map(transcribe_audio, files))
+    return results
+```
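The suggestion to split audio into files of 30 seconds or less can be sketched with the standard-library `wave` module. This assumes uncompressed PCM WAV input; `split_wav` is an illustrative helper, not a function from `app.py`, and the fixed cut points may fall in the middle of a word.

```python
import wave
from pathlib import Path


def split_wav(path, out_dir, max_seconds=30):
    """Split a PCM WAV file into consecutive pieces of at most `max_seconds`."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pieces = []
    with wave.open(str(path), "rb") as src:
        params = src.getparams()
        frames_per_piece = int(src.getframerate() * max_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_piece)
            if not frames:
                break
            piece = out_dir / f"{Path(path).stem}_{index:03d}.wav"
            with wave.open(str(piece), "wb") as dst:
                # Copy channels, sample width, and rate from the source;
                # the frame count in the header is corrected when the file closes.
                dst.setparams(params)
                dst.writeframes(frames)
            pieces.append(str(piece))
            index += 1
    return pieces
```

The returned list of paths can be passed directly to `parallel_transcribe`, and the results joined in order to rebuild the full transcript.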
+### Debugging
+```python
+import logging
+import os
+from app import transcribe_audio
+# Set the log level
+logging.basicConfig(level=logging.DEBUG)
+# Print detailed information around a transcription call
+def debug_transcribe(audio_file):
+    print(f"Input file: {audio_file}")
+    print(f"File exists: {os.path.exists(audio_file)}")
+    print(f"File size: {os.path.getsize(audio_file)} bytes")
+    result = transcribe_audio(audio_file)
+    print(f"Transcription: {result}")
+    return result
+```
+---
+## 💡 Examples
+### Basic Examples
+#### 1. Single-File Processing
+```python
+from app import transcribe_audio
+# Basic usage
+audio_file = "anime_scene.wav"
+result = transcribe_audio(audio_file)
+print(f"Transcription: {result}")
+```
+#### 2. Handling File Uploads
+```python
+import gradio as gr
+from app import transcribe_audio
+def process_uploaded_file(file):
+    if file is None:
+        return "No file was uploaded"
+    try:
+        result = transcribe_audio(file)
+        return f"Transcription:\n{result}"
+    except Exception as e:
+        return f"Error: {e}"
+# Gradio interface
+demo = gr.Interface(
+    fn=process_uploaded_file,
+    inputs=gr.Audio(label="Audio file", type="filepath"),
+    outputs=gr.Textbox(label="Transcription", lines=10),
+    title="Anime TTS Demo"
+)
+demo.launch()
+```
+#### 3. Batch Processing
+```python
+import os
+import glob
+from app import transcribe_audio
+def batch_process_directory(directory_path):
+    """Process every audio file in a directory."""
+    audio_extensions = ['.wav', '.mp3', '.m4a', '.flac']
+    results = {}
+    for ext in audio_extensions:
+        files = glob.glob(os.path.join(directory_path, f"*{ext}"))
+        for file_path in files:
+            try:
+                print(f"Processing: {file_path}")
+                result = transcribe_audio(file_path)
+                results[file_path] = result
+            except Exception as e:
+                results[file_path] = f"Error: {e}"
+    return results
+# Example
+results = batch_process_directory("./audio_files")
+for file_path, result in results.items():
+    print(f"{file_path}: {result}")
+```
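For larger batches it may be more convenient to store the results than to print them. A minimal sketch, assuming the `{file_path: text}` dictionary returned by `batch_process_directory`; `save_results` is a hypothetical helper name.

```python
import json


def save_results(results, output_path):
    """Write a {file_path: text} mapping to a UTF-8 JSON file."""
    with open(output_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Japanese text readable instead of \u escapes
        json.dump(results, f, ensure_ascii=False, indent=2)
    return output_path
```

For example, `save_results(results, "transcripts.json")` after the batch run above would write one entry per audio file.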
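+#### 4. Using as a Web API
+```python
+import os
+import tempfile
+from fastapi import FastAPI, UploadFile, File
+from app import transcribe_audio
+app = FastAPI()
+@app.post("/transcribe")
+async def transcribe_endpoint(file: UploadFile = File(...)):
+    """API that receives an audio file and returns its text."""
+    temp_path = None
+    try:
+        # Save the upload to a temporary file, keeping the original extension
+        suffix = os.path.splitext(file.filename or "")[1]
+        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as temp_file:
+            temp_file.write(await file.read())
+            temp_path = temp_file.name
+        # Run speech recognition after the file has been closed
+        result = transcribe_audio(temp_path)
+        return {"text": result, "status": "success"}
+    except Exception as e:
+        return {"error": str(e), "status": "error"}
+    finally:
+        # Always delete the temporary file
+        if temp_path and os.path.exists(temp_path):
+            os.unlink(temp_path)
+# Start the server (assuming this file is saved as api_server.py, a hypothetical name,
+# so that it does not clash with the Gradio app.py it imports from)
+# uvicorn api_server:app --host 0.0.0.0 --port 8000
+```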
+#### 5. Real-Time Processing
+```python
+import os
+import tempfile
+import wave
+import pyaudio
+from app import transcribe_audio
+def real_time_transcribe(duration=10):
+    """Record from the microphone for `duration` seconds, then transcribe."""
+    # Audio input settings
+    chunk = 1024
+    sample_format = pyaudio.paInt16
+    channels = 1
+    rate = 16000
+    p = pyaudio.PyAudio()
+    # Look up the sample width before PyAudio is terminated
+    sample_width = p.get_sample_size(sample_format)
+    # Open the input stream
+    stream = p.open(format=sample_format,
+                    channels=channels,
+                    rate=rate,
+                    input=True,
+                    frames_per_buffer=chunk)
+    print("Recording...")
+    frames = []
+    # Collect audio data
+    for _ in range(int(rate / chunk * duration)):
+        frames.append(stream.read(chunk))
+    # Close the stream
+    stream.stop_stream()
+    stream.close()
+    p.terminate()
+    # Save to a temporary WAV file
+    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
+        temp_path = f.name
+    try:
+        with wave.open(temp_path, "wb") as wf:
+            wf.setnchannels(channels)
+            wf.setsampwidth(sample_width)
+            wf.setframerate(rate)
+            wf.writeframes(b"".join(frames))
+        # Run speech recognition
+        return transcribe_audio(temp_path)
+    finally:
+        # Delete the temporary file
+        os.unlink(temp_path)
+# Example
+# result = real_time_transcribe(duration=5)
+# print(f"Recognition result: {result}")
+```
+### Advanced Examples
+#### 1. Custom Preprocessing
+```python
+import librosa
+import soundfile as sf
+from app import transcribe_audio
+def advanced_preprocess(audio_file, output_file):
+    """Advanced audio preprocessing."""
+    # Load the audio, resampled to 16 kHz mono
+    audio, sr = librosa.load(audio_file, sr=16000)
+    # Pre-emphasis: boosts high frequencies (a filter, not true noise reduction)
+    audio = librosa.effects.preemphasis(audio)
+    # Normalize the volume
+    audio = librosa.util.normalize(audio)
+    # Trim leading and trailing silence
+    audio, _ = librosa.effects.trim(audio, top_db=20)
+    # Save
+    sf.write(output_file, audio, sr)
+    return output_file
+# Example
+processed_file = advanced_preprocess("noisy_audio.wav", "clean_audio.wav")
+result = transcribe_audio(processed_file)
+```
+#### 2. Post-Processing Results
+```python
+import re
+from app import transcribe_audio
+def postprocess_result(text):
+    """Post-process a transcription result."""
+    # Add a space after sentence-ending punctuation
+    text = re.sub(r'([。！？])', r'\1 ', text)
+    # Collapse runs of whitespace (including newlines) into one space
+    text = re.sub(r'\s+', ' ', text)
+    # Remove everything except word characters, whitespace, and 。！？、ー
+    # (this also drops brackets such as 「」)
+    text = re.sub(r'[^\w\s。！？、ー]', '', text)
+    return text.strip()
+# Example
+raw_result = transcribe_audio("audio.wav")
+clean_result = postprocess_result(raw_result)
+print(f"Post-processed result: {clean_result}")
+```
+---
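+## 📞 Support
+### Reporting Problems
+If you run into a problem, please include the following information in your report:
+1. **Error details**: the complete error message
+2. **Environment**: Python version, OS, GPU information
+3. **Audio file details**: format, size, duration
+4. **Steps to reproduce**: how to reproduce the problem
+### Performance Optimization Support
+Support for getting the best performance is also available.
+---
+## 📄 License
+This project is published on Hugging Face Spaces and is subject to the applicable license terms.
+---
+**Last updated**: 2025-10-31
+**Version**: 1.0.0
+**Author**: Anime TTS Development Team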

requirements.txt ADDED

	@@ -0,0 +1,24 @@

+# Anime TTS - Required Dependencies
+# Core ML/AI libraries
+torch>=2.0.0
+transformers>=4.30.0
+spaces>=0.19.0
+# Gradio and UI
+gradio>=5.0.0
+# Audio processing
+soundfile>=0.12.0
+numpy>=1.21.0
+librosa>=0.10.0
+# Utilities
+# pathlib, tempfile, and os are part of the Python standard library; no install needed
+# Optional: Hugging Face Hub enhancements
+huggingface_hub>=0.15.0
+# GPU support (optional)
+# torchaudio  # Uncomment if needed; install the build matching your torch/CUDA version