Anime-Speech-Japanese-Captioner

This model is a fine-tuned version of Qwen/Qwen3-Omni-30B-A3B-Captioner.

This is an audio captioning model specialized for Japanese anime-style or game-style speech. It takes an audio input and generates a detailed description in Japanese, including emotion, speaker profile, mood, speed, prosody, pitch/timbre, style, and an overall caption.

It was fine-tuned using the NandemoGHS/Galgame_Gemini_Captions dataset.

Training was done with the ms-swift library using the Megatron backend.

Intended Use and Limitations

This model is specifically designed for Japanese game-style or anime-style speech.

Due to the nature of its training data, it is not expected to perform well on:

  • Languages other than Japanese.
  • General conversational speech (e.g., meetings, casual dialogue).

How to Use (Inference)

We recommend using vLLM for inference.

vLLM Installation Requirements

This model requires building vLLM from a recent development commit as it is not yet supported in the latest stable release (v0.11.0 as of this writing).

It has been tested and confirmed to work with commit 18961c5ea62976efc50525b72e40337993c5e4f9. You must build vLLM from source:

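git clone https://github.com/vllm-project/vllm.git
cd vllm
# Optional: pin to the commit this model was tested with
git checkout 18961c5ea62976efc50525b72e40337993c5e4f9
uv pip install . --torch-backend=auto -v --prerelease=allow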

This requirement will likely be unnecessary after the v0.11.1 release.

Inference Example

Here is a simple inference script using vLLM:

import os
import torch

from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

if __name__ == '__main__':
    # vLLM engine v1 not supported yet
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "NandemoGHS/Anime-Speech-Japanese-Captioner-FP8-DYNAMIC"

    llm = LLM(
            model=MODEL_PATH, trust_remote_code=True, gpu_memory_utilization=0.95,
            tensor_parallel_size=torch.cuda.device_count(),
            limit_mm_per_prompt={'audio': 1},
            max_num_seqs=8,
            max_model_len=8192,
            seed=100,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=4096,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    # Example audio file
    audio_path = "https://huggingface.co/NandemoGHS/Anime-Speech-Japanese-Captioner/resolve/main/examples/example1.wav"

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": audio_path}
            ],
        }
    ]

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, _, _ = process_mm_info(messages, use_audio_in_video=False)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
    }

    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)

    print(outputs[0].outputs[0].text)

Example Output

The model generated the following caption for the example audio file used in the script above:

emotion: ecstatic
profile: お嬢様風の女性声
mood: 快楽、絶頂
speed: 途切れ途切れ
prosody: 息遣いが荒く、感情の起伏が激しい
pitch_timbre: 高め、息多め、喘ぎ声
style: 喘ぎ
notes: 喘ぎ声と吐息が混じり、性的興奮が非常に高い状態。
caption: お嬢様風の女性が快楽に溺れ、喘ぎながら話す。息遣いが荒く、途切れ途切れに感情を爆発させる。性的興奮が最高潮に達している。

Notebook Example

For a more detailed walkthrough, please see the inference_example.ipynb notebook.

Output Format

The model outputs a structured description of the audio in Japanese, following this format:

emotion: {Emotion of the speech}
profile: {Speaker profile}
mood: {Mood of the speech}
speed: {Speaking speed}
prosody: {Prosody, rhythm}
pitch_timbre: {Pitch, voice quality}
style: {Style of utterance}
notes: {Other relevant notes}
caption: {A comprehensive caption integrating all elements}
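
Because the output is a list of `key: value` lines, it can be turned into a dictionary for downstream use. The following is a minimal sketch, not part of the model's own tooling: the helper name `parse_caption` and the constant `CAPTION_FIELDS` are hypothetical, and the handling of continuation lines and full-width colons is an assumption about how real outputs might vary.

```python
import re

# Field names taken from the "Output Format" section of this card.
CAPTION_FIELDS = (
    "emotion", "profile", "mood", "speed", "prosody",
    "pitch_timbre", "style", "notes", "caption",
)

# Matches "key: value"; also tolerates a missing space after the colon
# and a full-width colon (assumption: outputs may vary slightly).
_FIELD_RE = re.compile(r"^\s*([a-z_]+)\s*[:：]\s*(.*)$")


def parse_caption(text: str) -> dict[str, str]:
    """Split the model's text output into a {field: value} dict.

    Lines that do not start with a known field name are appended to the
    previous field (an assumption, not documented model behavior).
    """
    result: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        match = _FIELD_RE.match(line)
        if match and match.group(1) in CAPTION_FIELDS:
            current = match.group(1)
            result[current] = match.group(2).strip()
        elif current is not None and line.strip():
            # Continuation of a multi-line value.
            result[current] = (result[current] + "\n" + line.strip()).strip()
    return result


def missing_fields(parsed: dict[str, str]) -> list[str]:
    """Return the expected fields that are absent from a parsed caption."""
    return [name for name in CAPTION_FIELDS if name not in parsed]
```

With the inference script above, `parse_caption(outputs[0].outputs[0].text)` would give a dictionary keyed by field name, and `missing_fields` can flag generations that did not follow the format.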

License

This model is licensed under the CC-BY-NC-4.0 license.

In addition, the training data was built from outputs of Gemini 2.5 Pro. Any use that competes with Gemini or violates its terms of service is therefore strictly prohibited.
