
FlashLabs Chroma 1.0: A Real-Time End-to-End Spoken Dialogue Model with Personalized Voice Cloning

Chroma by FlashLabs

🚀 Get Started with Voice Agents!

Production-ready voice AI solutions powered by Chroma | Open-source model for developers & researchers

Voice Agents | Download Model | Technical Report | Playground

Model Description

Chroma 1.0 is a multimodal model developed by FlashLabs, designed to understand and generate content across multiple modalities, including text and audio. As a virtual human model, Chroma can process auditory input and respond with both text and synthesized speech, enabling natural voice interactions.

  • Model Type: Multimodal Causal Language Model
  • Developed by: FlashLabs
  • Language(s): English
  • License: Apache-2.0
  • Model Architecture:
    • Reasoner: Based on Qwen2.5-Omni-3B
    • Backbone: Based on Llama3 (16 layers, 2048 hidden size)
    • Decoder: Based on Llama3 (4 layers, 1024 hidden size)
    • Codec: Mimi (24kHz sampling rate)

Model Architecture

(Architecture diagram)
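Purely as an illustrative summary of the components listed above (the names and structure below are our own shorthand, not the model's actual configuration keys), the architecture can be sketched like this:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Component:
    base: str                        # underlying architecture
    num_layers: Optional[int] = None
    hidden_size: Optional[int] = None

# Shorthand summary of the components described in the model card;
# these are NOT the model's real config fields.
CHROMA_COMPONENTS = {
    "reasoner": Component(base="Qwen2.5-Omni-3B"),
    "backbone": Component(base="Llama3", num_layers=16, hidden_size=2048),
    "decoder":  Component(base="Llama3", num_layers=4, hidden_size=1024),
    "codec":    Component(base="Mimi (24 kHz)"),
}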

Capabilities

Chroma 1.0 is capable of:

  • Speech Understanding: Processing user audio input directly.
  • Multimodal Generation: Generating coherent text and speech responses simultaneously.
  • Voice Cloning: Utilizing reference audio prompts to guide the style of the generated speech.

Usage

Installation

Ensure you have the necessary dependencies installed. You may need the latest versions of transformers and torch.

pip install transformers torch
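As an optional sanity check, you can confirm that both packages import correctly and print their installed versions:

import torch
import transformers

# Verify the environment before loading the model.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())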

Loading the Model

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "FlashLabs/Chroma-4B" # Or local path

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    trust_remote_code=True, 
    device_map="auto"
)

# Load processor
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
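Depending on your hardware, you may also want to load the weights in half precision. This is an optional tweak (not required by the model card), using the standard torch_dtype argument of from_pretrained:

# Optional: load in bfloat16 on GPUs that support it to reduce memory usage.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)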

Inference Example

Here is how to perform a simple conversation with audio input and audio output:

import torch
from IPython.display import Audio

# Construct conversation history
system_prompt = (
    "You are Chroma, an advanced virtual human created by the FlashLabs. "
    "You possess the ability to understand auditory inputs and generate both text and speech."
)
conversation = [[
    {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt}
        ],
    },
    {
        "role": "user",
        "content": [
            # Input audio file path
            {"type": "audio", "audio": "assets/make_taco.wav"}, 
        ],
    },
]]

# Provide reference audio/text for style or context
prompt_text = ["War and bloodshed throughout the world."]
prompt_audio = ["assets/reference_audio.wav"]

# 1. Process inputs
inputs = processor(
    conversation,
    add_generation_prompt=True, 
    tokenize=False,
    prompt_audio=prompt_audio,
    prompt_text=prompt_text
)

# Move inputs to device
device = model.device
inputs = {k: v.to(device) for k, v in inputs.items()}

# 2. Generate
output = model.generate(
    **inputs, 
    max_new_tokens=100, 
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    use_cache=True
)

# 3. Decode Audio
# The model outputs raw tokens; we decode the audio part using the codec
audio_values = model.codec_model.decode(output.permute(0, 2, 1)).audio_values

# Save or play audio (e.g., in Jupyter)
Audio(audio_values[0].cpu().detach().numpy(), rate=24_000)
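To save the waveform to disk instead of (or in addition to) playing it inline, one option is the soundfile package (an assumption on our part; it is not listed among the dependencies above):

import soundfile as sf

# Flatten the decoded waveform to 1-D and write it as a 24 kHz WAV file.
waveform = audio_values[0].cpu().detach().numpy().squeeze()
sf.write("chroma_response.wav", waveform, samplerate=24_000)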

Citation

If you use Chroma in your research, please cite:

@misc{chen2026flashlabschroma10realtime,
      title={FlashLabs Chroma 1.0: A Real-Time End-to-End Spoken Dialogue Model with Personalized Voice Cloning}, 
      author={Tanyu Chen and Tairan Chen and Kai Shen and Zhenghua Bao and Zhihui Zhang and Man Yuan and Yi Shi},
      year={2026},
      eprint={2601.11141},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2601.11141}, 
}

Contact

For questions or issues, please contact: chroma@flashlabs.ai
