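# Anime TTS API User Manual

## 📋 Table of Contents
1. [Overview](#overview)
2. [Features](#features)
3. [Installation](#installation)
4. [Usage](#usage)
5. [API Reference](#api-reference)
6. [Supported Formats](#supported-formats)
7. [Error Handling](#error-handling)
8. [Performance Optimization](#performance-optimization)
9. [Troubleshooting](#troubleshooting)
10. [Examples](#examples)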

---

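## 🎯 Overview

**Anime TTS** is a speech recognition API that transcribes Japanese anime speech to text (despite the "TTS" in its name, it performs speech-to-text). It uses the `litagin/anime-whisper` model, which is tuned for anime and visual-novel audio, and targets higher accuracy on this kind of audio than general-purpose speech recognition systems.

### Key features
- 🗾 **Japanese-focused**: optimized for Japanese anime speech
- 🎭 **Emotional speech**: handles emotional delivery and non-speech sounds appropriately
- 🎯 **High-accuracy recognition**: tuned specifically for anime dialogue
- 📝 **Natural punctuation**: adds natural Japanese punctuation automatically
- 🚀 **GPU support**: fast processing with CUDA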

---

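## ✨ Features

### Core features
- **Speech-to-text**: transcribes Japanese anime speech to text
- **Real-time processing**: supports streaming audio
- **Batch processing**: processes multiple files in one run
- **Error handling**: robust error handling with fallbacks

### Technical specifications
- **Model**: `litagin/anime-whisper`
- **Framework**: Gradio 5.20.0
- **GPU support**: CUDA/ROCm
- **Memory optimization**: chunked processing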

---

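## 📦 Installation

### Requirements
- Python 3.10+ (Gradio 5.x does not support earlier versions)
- CUDA-capable GPU (recommended)
- 8 GB of RAM or more (recommended)

### Installing dependencies

```bash
# Core dependencies
pip install gradio torch transformers spaces

# Audio processing libraries
pip install soundfile numpy

# For GPU use (CUDA 11.8 builds)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Optional: faster Hugging Face Hub downloads
pip install "huggingface_hub[hf_xet]"

# Optional: used by the audio preprocessing examples
pip install librosa
```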

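### Environment variables
```bash
# For GPU use: expose only the first GPU
export CUDA_VISIBLE_DEVICES=0

# Optional: cap the block size the CUDA allocator will split, to reduce fragmentation
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
```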

---

## 🚀 Usage

### Basic usage

#### 1. Running locally
```bash
python app.py
```
- Open `http://localhost:7860` in a browser
- Upload an audio file to transcribe it

#### 2. Calling from a program
```python
from app import transcribe_audio

# Path to the audio file
result = transcribe_audio("path/to/audio.wav")
print(result)
```

#### 3. Using the hosted Space through Gradio
```python
import gradio as gr

# Load the hosted Space as a callable interface
demo = gr.load("kazuhina/anime-tts", src="spaces")

# Run it on an audio file
result = demo("path/to/audio.wav")
print(result)
```

---

## 📚 API Reference

### `transcribe_audio(audio_file)`

#### Parameters
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `audio_file` | str/Path/File | ✅ | Path to an audio file, or a Gradio file object |

#### Return value
```python
str  # Transcribed Japanese text
```

#### Example
```python
from app import transcribe_audio

# Using a file path
result = transcribe_audio("anime_dialogue.wav")

# Inside a Gradio app, gr.Audio(type="filepath") passes the uploaded
# file's path to the handler, so it can be forwarded directly
def handle_upload(audio_path):
    return transcribe_audio(audio_path)
```

### `create_demo()`

#### Description
Creates a test audio file for demo purposes.

#### Return value
```python
str  # Path to the generated demo audio file
```

#### Example
```python
from app import create_demo, transcribe_audio

demo_file = create_demo()
result = transcribe_audio(demo_file)
```

---

## 📁 Supported Formats

### Input formats
| Format | Extension | Notes |
|--------|-----------|-------|
| **WAV** | `.wav` | Recommended; highest quality |
| **MP3** | `.mp3` | Highly compressed; longer processing time |
| **M4A** | `.m4a` | Apple format |
| **FLAC** | `.flac` | Lossless compression; preserves quality |

### Recommended settings
- **Sample rate**: 16 kHz or higher
- **Bit depth**: 16-bit or higher
- **Channels**: mono or stereo
- **File size**: 100 MB or less (recommended)

### Audio quality requirements
```python
# Example of optimal settings
sample_rate = 16000     # 16 kHz
channels = 1            # mono
bit_depth = 16          # 16-bit
audio_format = "wav"    # WAV format
```
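
The recommended settings can be checked before a file is sent for transcription. The sketch below uses only the standard-library `wave` module, so it applies to uncompressed WAV files only. The helper name `check_wav` and its return shape are illustrative and not part of `app.py`.

```python
import wave


def check_wav(path, min_rate=16000, min_bits=16):
    """Return a list of warnings for a WAV file; an empty list means it looks fine."""
    warnings = []
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        bits = wf.getsampwidth() * 8  # sample width is reported in bytes
        channels = wf.getnchannels()
    if rate < min_rate:
        warnings.append(f"sample rate {rate} Hz is below {min_rate} Hz")
    if bits < min_bits:
        warnings.append(f"bit depth {bits} is below {min_bits}")
    if channels not in (1, 2):
        warnings.append(f"{channels} channels; mono or stereo is recommended")
    return warnings
```

A call such as `check_wav("anime_dialogue.wav")` returns an empty list when the file meets the recommendations.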

---

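## ⚠️ Error Handling

### Common errors and fixes

#### 1. Model loading error
```python
# Example error
"Error loading model: Connection timeout"

# What to do
# 1. Check the internet connection
# 2. Clear the model cache
# 3. Restart the application
```

#### 2. Audio file error
```python
# Example error
"Audio file not found."

# What to do
# 1. Check the file path
# 2. Check that the file exists
# 3. Check file permissions
```

#### 3. Format error
```python
# Example error
"Invalid audio file format."

# What to do
# 1. Check that the format is supported
# 2. Check whether the file is corrupted
# 3. Convert the file to a supported format
```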

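### Error handling code example
```python
from app import transcribe_audio

audio_file = "anime_dialogue.wav"

try:
    result = transcribe_audio(audio_file)
    print(f"Transcription succeeded: {result}")
except FileNotFoundError:
    print("Audio file not found")
except ValueError as e:
    print(f"Invalid file format: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
```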
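
Some failures can be caught before the model is invoked. The sketch below checks that the path exists and has one of the supported extensions, then delegates to whatever transcription callable it is given. `safe_transcribe` and `SUPPORTED_EXTENSIONS` are illustrative names. Whether `transcribe_audio` raises these exceptions itself or returns an error string is not specified in this manual, so the checks are done up front.

```python
from pathlib import Path

# Extensions listed in the supported formats table
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".flac"}


def safe_transcribe(transcribe, audio_path):
    """Validate the path, then call transcribe(path) and return its result."""
    path = Path(audio_path)
    if not path.is_file():
        raise FileNotFoundError(f"Audio file not found: {path}")
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported audio format: {path.suffix or '(none)'}")
    return transcribe(str(path))
```

Typical use would be `safe_transcribe(transcribe_audio, "anime_dialogue.wav")` inside the `try` block shown earlier in this section.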

---

## ⚡ Performance Optimization

### GPU usage
```python
# Check whether CUDA is available
import torch
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name()}")
else:
    print("Using CPU")
```

### Memory management
```python
import torch

# Adjust chunk length and batch size
chunk_length_s = 30.0  # process 30 seconds at a time
batch_size = 64 if torch.cuda.is_available() else 8
```
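
To see what a 30-second chunk length means for a long recording, the sketch below computes the start and end times of consecutive chunks. It is a simplified illustration: the `transformers` pipeline also overlaps neighbouring chunks, which is omitted here, and `chunk_spans` is a hypothetical helper.

```python
def chunk_spans(duration_s, chunk_length_s=30.0):
    """Yield (start, end) times in seconds that cover the whole recording."""
    if chunk_length_s <= 0:
        raise ValueError("chunk_length_s must be positive")
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        yield (start, end)
        start = end


print(list(chunk_spans(75.0)))
# [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```

A smaller chunk length produces more, shorter chunks, which lowers peak memory per chunk at the cost of more model calls.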

### Batch processing
```python
import glob
import os

from app import transcribe_audio

def batch_transcribe(audio_dir):
    """Transcribe every WAV file in a directory."""
    audio_files = sorted(glob.glob(os.path.join(audio_dir, "*.wav")))
    results = []

    for audio_file in audio_files:
        try:
            result = transcribe_audio(audio_file)
            results.append((audio_file, result))
        except Exception as e:
            print(f"Error in {audio_file}: {e}")

    return results
```

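### Performance settings
```python
import torch
from transformers import pipeline

# Generation settings, passed to the pipeline at call time
generate_kwargs = {
    "language": "Japanese",
    "no_repeat_ngram_size": 0,
    "repetition_penalty": 1.0,
}

# Pipeline settings
pipe = pipeline(
    "automatic-speech-recognition",
    model="litagin/anime-whisper",
    device="cuda" if torch.cuda.is_available() else "cpu",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    chunk_length_s=30.0,
    batch_size=64 if torch.cuda.is_available() else 8,
)

# Example call
# result = pipe("audio.wav", generate_kwargs=generate_kwargs)
# print(result["text"])
```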

---

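## 🔧 Troubleshooting

### Common problems and fixes

#### 1. Model download errors
```bash
# Fixes
# 1. Clear the cache (removes all cached Hugging Face models and any saved login token)
rm -rf ~/.cache/huggingface/

# 2. Download the model manually
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='litagin/anime-whisper')"
```

#### 2. Out-of-memory errors
```python
# Fixes
# 1. Reduce the batch size
batch_size = 1

# 2. Shorten the chunk length
chunk_length_s = 10.0

# 3. Fall back to the CPU
device = "cpu"
```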

#### 3. Audio quality problems
```python
# Fix
# 1. Preprocess the audio (requires `pip install librosa`)
import librosa
import soundfile as sf

def preprocess_audio(input_file, output_file):
    """Resample the audio to 16 kHz mono and save it."""
    audio, sr = librosa.load(input_file, sr=16000)
    sf.write(output_file, audio, sr)
    return output_file
```

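#### 4. Slow processing
```python
# Fixes
# 1. Use the GPU
device = "cuda"

# 2. Optimize file size
# Split recordings into files of 30 seconds or less

# 3. Process files in parallel
# (threads that share a single GPU may give little or no speedup)
from concurrent.futures import ThreadPoolExecutor

from app import transcribe_audio

def parallel_transcribe(files):
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(transcribe_audio, files))
    return results
```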
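
One of the suggestions above is to split recordings into files of 30 seconds or less. A minimal sketch using the standard-library `wave` module follows. It works for uncompressed WAV only and cuts at fixed offsets, so a word may be split across two segments. `split_wav` and the output naming scheme are assumptions made for illustration.

```python
import os
import wave


def split_wav(path, out_dir, segment_s=30.0):
    """Write consecutive segments of at most segment_s seconds; return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(path))[0]
    outputs = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_segment = int(src.getframerate() * segment_s)
        index = 0
        while True:
            frames = src.readframes(frames_per_segment)
            if not frames:
                break
            out_path = os.path.join(out_dir, f"{stem}_{index:03d}.wav")
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)  # frame count in the header is fixed up on close
                dst.writeframes(frames)
            outputs.append(out_path)
            index += 1
    return outputs
```

The returned list can be passed straight to `parallel_transcribe` or processed in a loop, and the segment texts joined in order.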

### Debugging
```python
import logging
import os

from app import transcribe_audio

# Set the log level
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# Transcribe with detailed logging
def debug_transcribe(audio_file):
    logger.debug("Input file: %s", audio_file)
    logger.debug("File exists: %s", os.path.exists(audio_file))
    logger.debug("File size: %d bytes", os.path.getsize(audio_file))

    result = transcribe_audio(audio_file)
    logger.debug("Result: %s", result)
    return result
```

---

## 💡 Examples

### Basic examples

#### 1. Processing a single file
```python
from app import transcribe_audio

# Basic usage
audio_file = "anime_scene.wav"
result = transcribe_audio(audio_file)
print(f"Result: {result}")
```

#### 2. Handling file uploads
```python
import gradio as gr

from app import transcribe_audio

def process_uploaded_file(file):
    if file is None:
        return "No file was uploaded"

    try:
        result = transcribe_audio(file)
        return f"Result:\n{result}"
    except Exception as e:
        return f"Error: {e}"

# Gradio interface
demo = gr.Interface(
    fn=process_uploaded_file,
    inputs=gr.Audio(label="Audio file", type="filepath"),
    outputs=gr.Textbox(label="Result", lines=10),
    title="Anime TTS Demo"
)

if __name__ == "__main__":
    demo.launch()
```

#### 3. Batch processing
```python
import os
import glob
from app import transcribe_audio

def batch_process_directory(directory_path):
    """Transcribe every supported audio file in a directory."""
    audio_extensions = ['.wav', '.mp3', '.m4a', '.flac']
    results = {}

    for ext in audio_extensions:
        files = glob.glob(os.path.join(directory_path, f"*{ext}"))
        for file_path in files:
            try:
                print(f"Processing: {file_path}")
                result = transcribe_audio(file_path)
                results[file_path] = result
            except Exception as e:
                results[file_path] = f"Error: {e}"

    return results

# Example
results = batch_process_directory("./audio_files")
for file_path, result in results.items():
    print(f"{file_path}: {result}")
```
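
Batch results are often easier to review from a file than from console output. The sketch below writes a `results` dictionary like the one built by `batch_process_directory` to JSON. `save_results` and the default file name are illustrative, not part of `app.py`.

```python
import json


def save_results(results, output_path="transcripts.json"):
    """Write a {file_path: text} mapping to a UTF-8 JSON file and return its path."""
    with open(output_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Japanese text readable instead of \u escapes
        json.dump(results, f, ensure_ascii=False, indent=2)
    return output_path
```

Calling `save_results(results)` after the batch run leaves one JSON object keyed by file path.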

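#### 4. Using it as a web API
```python
# Save as server.py (app.py is already the Gradio app)
# Requires: pip install fastapi uvicorn python-multipart
import os
import tempfile

from fastapi import FastAPI, File, UploadFile

from app import transcribe_audio

app = FastAPI()

@app.post("/transcribe")
async def transcribe_endpoint(file: UploadFile = File(...)):
    """Accept an audio file and return the transcribed text."""
    # Keep the original extension so the audio loader can detect the format
    suffix = os.path.splitext(file.filename or "")[1]
    temp_path = None
    try:
        # Save the upload to a temporary file
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as temp_file:
            temp_file.write(await file.read())
            temp_path = temp_file.name

        # Run speech recognition once the file has been closed
        result = transcribe_audio(temp_path)
        return {"text": result, "status": "success"}

    except Exception as e:
        return {"error": str(e), "status": "error"}

    finally:
        # Remove the temporary file
        if temp_path and os.path.exists(temp_path):
            os.unlink(temp_path)

# Start the server
# uvicorn server:app --host 0.0.0.0 --port 8000
```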

#### 5. Real-time processing (microphone input)
```python
import os
import tempfile
import wave

import pyaudio

from app import transcribe_audio

def real_time_transcribe(duration=10):
    """Record from the microphone for `duration` seconds, then transcribe."""
    # Audio input settings
    chunk = 1024
    sample_format = pyaudio.paInt16
    channels = 1
    rate = 16000

    p = pyaudio.PyAudio()
    sample_width = p.get_sample_size(sample_format)  # read before terminate()

    # Open the input stream
    stream = p.open(format=sample_format,
                    channels=channels,
                    rate=rate,
                    input=True,
                    frames_per_buffer=chunk)

    print("Recording...")
    frames = []

    # Collect audio data
    for _ in range(int(rate / chunk * duration)):
        frames.append(stream.read(chunk))

    # Close the stream
    stream.stop_stream()
    stream.close()
    p.terminate()

    # Reserve a temporary WAV file path
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        temp_path = f.name
    try:
        with wave.open(temp_path, "wb") as wf:
            wf.setnchannels(channels)
            wf.setsampwidth(sample_width)
            wf.setframerate(rate)
            wf.writeframes(b"".join(frames))

        # Run speech recognition
        return transcribe_audio(temp_path)
    finally:
        # Remove the temporary file
        os.unlink(temp_path)

# Example
# result = real_time_transcribe(duration=5)
# print(f"Recognized text: {result}")
```

### Advanced examples

#### 1. Custom preprocessing
```python
import librosa
import soundfile as sf
import numpy as np
from app import transcribe_audio

def advanced_preprocess(audio_file, output_file):
    """Apply additional preprocessing before transcription."""
    # Load the audio at 16 kHz
    audio, sr = librosa.load(audio_file, sr=16000)

    # Pre-emphasis filter (boosts high frequencies; this is not true noise reduction)
    audio = librosa.effects.preemphasis(audio)

    # Normalize the volume
    audio = librosa.util.normalize(audio)

    # Trim leading and trailing silence
    audio, _ = librosa.effects.trim(audio, top_db=20)

    # Save the result
    sf.write(output_file, audio, sr)
    return output_file

# Example
processed_file = advanced_preprocess("noisy_audio.wav", "clean_audio.wav")
result = transcribe_audio(processed_file)
```

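#### 2. Post-processing the result
```python
import re
from app import transcribe_audio

def postprocess_result(text):
    """Clean up the transcribed text."""
    # Add a space after sentence-ending punctuation
    text = re.sub(r'([。！？])', r'\1 ', text)

    # Collapse runs of whitespace (including line breaks) into one space
    text = re.sub(r'\s+', ' ', text)

    # Remove special characters (keeps letters, digits, kana, kanji, and 。！？、ー)
    text = re.sub(r'[^\w\s。！？、ー]', '', text)

    return text.strip()

# Example
raw_result = transcribe_audio("audio.wav")
clean_result = postprocess_result(raw_result)
print(f"Post-processed result: {clean_result}")
```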

---

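## 📞 Support

### Reporting problems
If you run into a problem, include the following information in your report:

1. **Error details**: the complete error message
2. **Environment**: Python version, OS, and GPU details
3. **Audio file details**: format, size, and duration
4. **Steps to reproduce**: how to trigger the problem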
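
The environment details requested above can be gathered with a short script. This is a sketch: `collect_environment_info` is a hypothetical helper, and the `torch` and `gpu` fields are filled in only when `torch` is installed and a CUDA device is visible.

```python
import platform
import sys


def collect_environment_info():
    """Return Python, OS, torch, and GPU details as a dictionary."""
    info = {
        "python": sys.version.split()[0],
        "os": f"{platform.system()} {platform.release()}",
        "torch": None,
        "gpu": None,
    }
    try:
        import torch
    except ImportError:
        return info  # torch is not installed; leave the GPU fields empty
    info["torch"] = torch.__version__
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


for key, value in collect_environment_info().items():
    print(f"{key}: {value}")
```

The printed lines can be pasted directly into a problem report.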

### Performance tuning support
Support for tuning performance is also available.

---

## 📄 License

This project is published on Hugging Face Spaces and is subject to the applicable license terms.

---

**Last updated**: 2025-10-31  
**Version**: 1.0.0  
**Author**: Anime TTS Development Team