Space: alethanhson/csm-1b-gradio-v2


Le Thanh Son committed on Mar 17

Commit e02c9de (1 parent: a6403d5)

fix

Files changed (4)

README.md +39 -39
app.py +195 -139
generator.py +10 -10
test_model.py +22 -22

README.md CHANGED

@@ -11,35 +11,35 @@ pinned: false
 # CSM-1B Text-to-Speech Demo
-Ứng dụng này sử dụng mô hình CSM-1B (Collaborative Speech Model) để chuyển đổi văn bản thành giọng nói với chất lượng cao.
-## Tính năng
-- **Tạo âm thanh đơn giản**: Chuyển đổi văn bản thành giọng nói với các tùy chọn về ID người nói, thời lượng, temperature và top-k.
-- **Tạo âm thanh với ngữ cảnh**: Cung cấp các đoạn âm thanh và văn bản làm ngữ cảnh để mô hình tạo ra âm thanh phù hợp hơn.
-- **Tối ưu GPU**: Sử dụng ZeroGPU của Hugging Face Spaces để tối ưu việc sử dụng GPU.
-## Cài đặt và Cấu hình
-### Yêu cầu truy cập
-Để sử dụng mô hình CSM-1B, bạn cần có quyền truy cập vào các mô hình sau trên Hugging Face:
 - [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B)
 - [sesame/csm-1b](https://huggingface.co/sesame/csm-1b)
-### Cấu hình Hugging Face Token
-1. Tạo tài khoản Hugging Face nếu bạn chưa có.
-2. Truy cập vào [Hugging Face Settings](https://huggingface.co/settings/tokens) để tạo token.
-3. Yêu cầu quyền truy cập vào các mô hình nếu cần.
-4. Đặt biến môi trường `HF_TOKEN` với giá trị là token của bạn:
    ```bash
    export HF_TOKEN=your_token_here
    ```
-5. Hoặc bạn có thể nhập token trực tiếp trong tab "Cấu hình" của ứng dụng.
-### Cài đặt
 ```bash
 git clone https://github.com/yourusername/csm-1b-gradio.git
@@ -47,54 +47,54 @@ cd csm-1b-gradio
 pip install -r requirements.txt
 ```
-## Cách sử dụng
-1. Khởi động ứng dụng:
    ```bash
    python app.py
    ```
-2. Mở trình duyệt web và truy cập địa chỉ được hiển thị (thường là http://127.0.0.1:7860).
-3. Nhập văn bản bạn muốn chuyển thành giọng nói.
-4. Chọn ID người nói (từ 0-10).
-5. Điều chỉnh các tham số như thời lượng tối đa, temperature và top-k.
-6. Nhấn nút "Tạo âm thanh" để tạo giọng nói.
-## Thông tin về mô hình
-CSM-1B là một mô hình text-to-speech tiên tiến được phát triển bởi Sesame AI Labs. Mô hình này có khả năng tạo giọng nói tự nhiên từ văn bản với nhiều giọng nói khác nhau.
 ## ZeroGPU
-Ứng dụng này sử dụng ZeroGPU của Hugging Face Spaces để tối ưu việc sử dụng GPU. ZeroGPU giúp giải phóng bộ nhớ GPU khi không sử dụng, giúp tiết kiệm tài nguyên và cải thiện hiệu suất.
 ```python
 import spaces
 @spaces.GPU
 def my_gpu_function():
-    # Hàm này sẽ chỉ sử dụng GPU khi được gọi
-    # và giải phóng GPU sau khi hoàn thành
     pass
 ```
-Khi triển khai trên Hugging Face Spaces, ZeroGPU sẽ tự động quản lý việc sử dụng GPU, giúp ứng dụng hoạt động hiệu quả hơn.
-## Lưu ý
-- Mô hình này sử dụng watermarking để đánh dấu âm thanh được tạo ra bởi AI.
-- Thời gian tạo âm thanh phụ thuộc vào độ dài văn bản và cấu hình phần cứng.
-- Bạn cần có quyền truy cập vào mô hình CSM-1B trên Hugging Face để sử dụng ứng dụng này.
-## Triển khai trên Hugging Face Spaces
-Để triển khai ứng dụng này trên Hugging Face Spaces:
-1. Tạo một Space mới trên Hugging Face với SDK là Gradio.
-2. Tải lên tất cả các file của dự án.
-3. Trong phần cài đặt của Space, thêm biến môi trường `HF_TOKEN` với giá trị là token của bạn.
-4. Chọn cấu hình phần cứng phù hợp (khuyến nghị sử dụng GPU).
-## Tài nguyên
 - [GitHub Repository](https://github.com/SesameAILabs/csm-1b)
 - [Hugging Face Model](https://huggingface.co/sesame/csm-1b)

 # CSM-1B Text-to-Speech Demo
+This application uses the CSM-1B (Collaborative Speech Model) to convert text to high-quality speech.
+## Features
+- **Simple Audio Generation**: Convert text to speech with options for speaker ID, duration, temperature, and top-k.
+- **Audio Generation with Context**: Provide audio clips and text as context to help the model generate more appropriate speech.
+- **GPU Optimization**: Uses Hugging Face Spaces' ZeroGPU to optimize GPU usage.
+## Installation and Configuration
+### Access Requirements
+To use the CSM-1B model, you need access to the following models on Hugging Face:
 - [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B)
 - [sesame/csm-1b](https://huggingface.co/sesame/csm-1b)
+### Hugging Face Token Configuration
+1. Create a Hugging Face account if you don't have one.
+2. Go to [Hugging Face Settings](https://huggingface.co/settings/tokens) to create a token.
+3. Request access to the models if needed.
+4. Set the `HF_TOKEN` environment variable with your token:
    ```bash
    export HF_TOKEN=your_token_here
    ```
+5. Or you can enter your token directly in the "Configuration" tab of the application.
+### Installation
 ```bash
 git clone https://github.com/yourusername/csm-1b-gradio.git
 pip install -r requirements.txt
 ```
+## How to Use
+1. Start the application:
    ```bash
    python app.py
    ```
+2. Open a web browser and go to the displayed address (usually http://127.0.0.1:7860).
+3. Enter the text you want to convert to speech.
+4. Choose a speaker ID (from 0-10).
+5. Adjust parameters like maximum duration, temperature, and top-k.
+6. Click the "Generate Audio" button to create speech.
+## About the Model
+CSM-1B is an advanced text-to-speech model developed by Sesame AI Labs. This model can generate natural speech from text with various voices.
 ## ZeroGPU
+This application uses Hugging Face Spaces' ZeroGPU to optimize GPU usage. ZeroGPU helps free up GPU memory when not in use, saving resources and improving performance.
 ```python
 import spaces
 @spaces.GPU
 def my_gpu_function():
+    # This function will only use GPU when called
+    # and release GPU after completion
     pass
 ```
+When deployed on Hugging Face Spaces, ZeroGPU will automatically manage GPU usage, making the application more efficient.
+## Notes
+- This model uses watermarking to mark audio generated by AI.
+- Audio generation time depends on text length and hardware configuration.
+- You need access to the CSM-1B model on Hugging Face to use this application.
+## Deployment on Hugging Face Spaces
+To deploy this application on Hugging Face Spaces:
+1. Create a new Space on Hugging Face with Gradio SDK.
+2. Upload all project files.
+3. In the Space settings, add the `HF_TOKEN` environment variable with your token.
+4. Choose appropriate hardware configuration (GPU recommended).
+## Resources
 - [GitHub Repository](https://github.com/SesameAILabs/csm-1b)
 - [Hugging Face Model](https://huggingface.co/sesame/csm-1b)
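
Review note on the token configuration steps: the README describes two token sources, the `HF_TOKEN` environment variable (step 4) and the "Configuration" tab (step 5). A minimal sketch of how the two could be reconciled is given here. `resolve_hf_token` is a hypothetical helper that is not part of this commit, and the precedence (UI input over environment) is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(
    ui_token: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the token to use, or None if neither source provides one.

    Hypothetical helper: a token typed into the UI wins over HF_TOKEN.
    Surrounding whitespace is stripped so that a pasted token with a
    trailing newline is still usable.
    """
    for candidate in (ui_token, env.get("HF_TOKEN")):
        if candidate and candidate.strip():
            return candidate.strip()
    return None
```

The result could then be passed to `login(token=...)` from `huggingface_hub`, which app.py already calls.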

app.py CHANGED

@@ -11,64 +11,82 @@ from dataclasses import dataclass
 from generator import Segment, load_csm_1b
 from huggingface_hub import login
-# Tắt tính năng compile của torch để tránh lỗi triton
 torch._dynamo.config.suppress_errors = True
-# Kiểm tra xem có GPU không và cấu hình thiết bị phù hợp
 device = "cuda" if torch.cuda.is_available() else "cpu"
-print(f"Sử dụng thiết bị: {device}")
-# Đăng nhập vào Hugging Face Hub nếu có token
 def login_huggingface():
     hf_token = os.environ.get("HF_TOKEN")
     if hf_token:
-        print("Đang đăng nhập vào Hugging Face Hub...")
         login(token=hf_token)
-        print("Đã đăng nhập thành công!")
     else:
-        print("Không tìm thấy HF_TOKEN trong biến môi trường. Một số mô hình có thể không truy cập được.")
-# Đăng nhập khi khởi động
 login_huggingface()
-# Biến toàn cục để theo dõi trạng thái mô hình
 generator = None
 model_loaded = False
-# Hàm tải mô hình được gọi trong ZeroGPU
-@spaces.GPU
 def initialize_model():
     global generator, model_loaded
     if not model_loaded:
-        print("Đang tải mô hình CSM-1B trong GPU...")
         generator = load_csm_1b(device="cuda")
         model_loaded = True
-        print("Đã tải xong mô hình!")
     return generator
-# Hàm lấy mô hình đã tải
-@spaces.GPU
 def get_model():
     global generator, model_loaded
     if not model_loaded:
         return initialize_model()
     return generator
-# Hàm chuyển đổi âm thanh thành tensor
 def audio_to_tensor(audio_path: str) -> Tuple[torch.Tensor, int]:
     waveform, sample_rate = torchaudio.load(audio_path)
-    waveform = waveform.mean(dim=0)  # Chuyển stereo thành mono nếu cần
     return waveform, sample_rate
-# Hàm lưu tensor âm thanh thành file
 def save_audio(audio_tensor: torch.Tensor, sample_rate: int) -> str:
     temp_dir = tempfile.gettempdir()
     output_path = os.path.join(temp_dir, f"csm1b_output_{int(time.time())}.wav")
     torchaudio.save(output_path, audio_tensor.unsqueeze(0), sample_rate)
     return output_path
-# Hàm tạo âm thanh từ văn bản sử dụng ZeroGPU
-@spaces.GPU
 def generate_speech(
     text: str,
     speaker_id: int,
@@ -83,49 +101,62 @@ def generate_speech(
     top_k: int = 50,
     progress=gr.Progress()
 ) -> str:
-    # Lấy mô hình đã tải
-    generator = get_model()
-    # Chuẩn bị ngữ cảnh (context)
-    context = []
-    progress(0.1, "Đang xử lý ngữ cảnh...")
-    # Xử lý ngữ cảnh 1
-    if context_audio_path1 and context_text1:
-        waveform, sample_rate = audio_to_tensor(context_audio_path1)
-        # Resample nếu cần
-        if sample_rate != generator.sample_rate:
-            waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=generator.sample_rate)
-        context.append(Segment(speaker=context_speaker1, text=context_text1, audio=waveform))
-    # Xử lý ngữ cảnh 2
-    if context_audio_path2 and context_text2:
-        waveform, sample_rate = audio_to_tensor(context_audio_path2)
-        # Resample nếu cần
-        if sample_rate != generator.sample_rate:
-            waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=generator.sample_rate)
-        context.append(Segment(speaker=context_speaker2, text=context_text2, audio=waveform))
-    progress(0.3, "Đang tạo âm thanh...")
-    # Tạo âm thanh từ văn bản
-    audio = generator.generate(
-        text=text,
-        speaker=speaker_id,
-        context=context,
-        max_audio_length_ms=max_duration_ms,
-        temperature=temperature,
-        topk=top_k
-    )
-    progress(0.8, "Đang lưu âm thanh...")
-    # Lưu âm thanh thành file
-    output_path = save_audio(audio, generator.sample_rate)
-    progress(1.0, "Hoàn thành!")
-    return output_path
-# Hàm tạo âm thanh đơn giản không có ngữ cảnh
-@spaces.GPU
 def generate_speech_simple(
     text: str,
     speaker_id: int,
@@ -134,43 +165,56 @@ def generate_speech_simple(
     top_k: int = 50,
     progress=gr.Progress()
 ) -> str:
-    # Lấy mô hình đã tải
-    generator = get_model()
-    progress(0.3, "Đang tạo âm thanh...")
-    # Tạo âm thanh từ văn bản
-    audio = generator.generate(
-        text=text,
-        speaker=speaker_id,
-        context=[],  # Không có ngữ cảnh
-        max_audio_length_ms=max_duration_ms,
-        temperature=temperature,
-        topk=top_k
-    )
-    progress(0.8, "Đang lưu âm thanh...")
-    # Lưu âm thanh thành file
-    output_path = save_audio(audio, generator.sample_rate)
-    progress(1.0, "Hoàn thành!")
-    return output_path
-# Tạo giao diện Gradio
 def create_demo():
     with gr.Blocks(title="CSM-1B Text-to-Speech") as demo:
         gr.Markdown("# CSM-1B Text-to-Speech Demo")
-        gr.Markdown("Mô hình CSM-1B (Collaborative Speech Model) là một mô hình text-to-speech tiên tiến có khả năng tạo giọng nói tự nhiên từ văn bản.")
-        with gr.Tab("Tạo âm thanh đơn giản"):
             with gr.Row():
                 with gr.Column():
                     text_input = gr.Textbox(
-                        label="Văn bản cần chuyển thành giọng nói",
-                        placeholder="Nhập văn bản bạn muốn chuyển thành giọng nói...",
                         lines=5
                     )
                     speaker_id = gr.Number(
-                        label="ID người nói",
                         value=0,
                         precision=0,
                         minimum=0,
@@ -179,7 +223,7 @@ def create_demo():
                     with gr.Row():
                         max_duration = gr.Slider(
-                            label="Thời lượng tối đa (ms)",
                             minimum=1000,
                             maximum=90000,
                             value=30000,
@@ -200,38 +244,38 @@ def create_demo():
                             step=1
                         )
-                    generate_btn = gr.Button("Tạo âm thanh")
                 with gr.Column():
-                    output_audio = gr.Audio(label="Âm thanh đầu ra", type="filepath")
-        with gr.Tab("Tạo âm thanh với ngữ cảnh"):
-            gr.Markdown("Tính năng này cho phép bạn cung cấp các đoạn âm thanh và văn bản làm ngữ cảnh để mô hình tạo ra âm thanh phù hợp hơn.")
             with gr.Row():
                 with gr.Column():
-                    context_text1 = gr.Textbox(label="Văn bản ngữ cảnh 1", lines=2)
-                    context_audio1 = gr.Audio(label="Âm thanh ngữ cảnh 1", type="filepath")
-                    context_speaker1 = gr.Number(label="ID người nói 1", value=0, precision=0)
-                    context_text2 = gr.Textbox(label="Văn bản ngữ cảnh 2", lines=2)
-                    context_audio2 = gr.Audio(label="Âm thanh ngữ cảnh 2", type="filepath")
-                    context_speaker2 = gr.Number(label="ID người nói 2", value=1, precision=0)
                     text_input_context = gr.Textbox(
-                        label="Văn bản cần chuyển thành giọng nói",
-                        placeholder="Nhập văn bản bạn muốn chuyển thành giọng nói...",
                         lines=3
                     )
                     speaker_id_context = gr.Number(
-                        label="ID người nói",
                         value=0,
                         precision=0
                     )
                     with gr.Row():
                         max_duration_context = gr.Slider(
-                            label="Thời lượng tối đa (ms)",
                             minimum=1000,
                             maximum=90000,
                             value=30000,
@@ -252,27 +296,27 @@ def create_demo():
                             step=1
                         )
-                    generate_context_btn = gr.Button("Tạo âm thanh với ngữ cảnh")
                 with gr.Column():
-                    output_audio_context = gr.Audio(label="Âm thanh đầu ra", type="filepath")
-        # Thêm tab cấu hình Hugging Face
-        with gr.Tab("Cấu hình"):
-            gr.Markdown("### Cấu hình Hugging Face Token")
             gr.Markdown("""
-            Để sử dụng mô hình CSM-1B, bạn cần có quyền truy cập vào mô hình trên Hugging Face.
-            Bạn có thể cấu hình token của mình bằng cách:
-            1. Tạo token tại [Hugging Face Settings](https://huggingface.co/settings/tokens)
-            2. Đặt biến môi trường `HF_TOKEN` với giá trị là token của bạn
-            Lưu ý: Trong Hugging Face Spaces, bạn có thể đặt biến môi trường trong phần Cài đặt của Space.
             """)
             hf_token_input = gr.Textbox(
-                label="Hugging Face Token (Chỉ sử dụng trong phiên này)",
-                placeholder="Nhập token của bạn...",
                 type="password"
             )
@@ -280,57 +324,69 @@ def create_demo():
                 if token:
                     os.environ["HF_TOKEN"] = token
                     login(token=token)
-                    return "Đã đặt token thành công! Bạn có thể tải mô hình bây giờ."
-                return "Token không hợp lệ. Vui lòng nhập token hợp lệ."
-            set_token_btn = gr.Button("Đặt Token")
-            token_status = gr.Textbox(label="Trạng thái", interactive=False)
             set_token_btn.click(fn=set_token, inputs=hf_token_input, outputs=token_status)
-        # Thêm tab thông tin về ZeroGPU
-        with gr.Tab("Thông tin GPU"):
-            gr.Markdown("### Thông tin về ZeroGPU")
             gr.Markdown("""
-            Ứng dụng này sử dụng ZeroGPU của Hugging Face Spaces để tối ưu việc sử dụng GPU.
-            ZeroGPU giúp giải phóng bộ nhớ GPU khi không sử dụng, giúp tiết kiệm tài nguyên và cải thiện hiệu suất.
-            Khi bạn tạo âm thanh, GPU sẽ được sử dụng tự động và giải phóng sau khi hoàn thành.
-            Lưu ý: Trong môi trường ZeroGPU, CUDA không được khởi tạo trong quá trình chính, mà chỉ trong các hàm có decorator @spaces.GPU.
             """)
-            @spaces.GPU
             def check_gpu():
                 if torch.cuda.is_available():
                     gpu_name = torch.cuda.get_device_name(0)
                     gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)
-                    return f"GPU: {gpu_name}\nBộ nhớ: {gpu_memory:.2f} GB"
                 else:
-                    return "Không tìm thấy GPU. Ứng dụng sẽ chạy trên CPU."
-            check_gpu_btn = gr.Button("Kiểm tra GPU")
-            gpu_info = gr.Textbox(label="Thông tin GPU", interactive=False)
             check_gpu_btn.click(fn=check_gpu, inputs=None, outputs=gpu_info)
-            # Thêm nút tải mô hình
-            load_model_btn = gr.Button("Tải mô hình")
-            model_status = gr.Textbox(label="Trạng thái mô hình", interactive=False)
-            @spaces.GPU
             def load_model_and_report():
                 global model_loaded
                 if model_loaded:
-                    return "Mô hình đã được tải trước đó!"
                 else:
                     initialize_model()
-                    return "Mô hình đã được tải thành công!"
             load_model_btn.click(fn=load_model_and_report, inputs=None, outputs=model_status)
-        # Kết nối các thành phần
         generate_btn.click(
             fn=generate_speech_simple,
             inputs=[
@@ -363,7 +419,7 @@ def create_demo():
     return demo
-# Khởi chạy ứng dụng
 if __name__ == "__main__":
     demo = create_demo()
     demo.queue().launch()

 from generator import Segment, load_csm_1b
 from huggingface_hub import login
+# Disable torch compile feature to avoid triton error
 torch._dynamo.config.suppress_errors = True
+# Check if GPU is available and configure the device
 device = "cuda" if torch.cuda.is_available() else "cpu"
+print(f"Using device: {device}")
+# Login to Hugging Face Hub if token is available
 def login_huggingface():
     hf_token = os.environ.get("HF_TOKEN")
     if hf_token:
+        print("Logging in to Hugging Face Hub...")
         login(token=hf_token)
+        print("Login successful!")
     else:
+        print("HF_TOKEN not found in environment variables. Some models may not be accessible.")
+# Login at startup
 login_huggingface()
+# Global variables to track model state
 generator = None
 model_loaded = False
+# Function to load model in ZeroGPU
+@spaces.GPU(duration=30)
 def initialize_model():
     global generator, model_loaded
     if not model_loaded:
+        print("Loading CSM-1B model in GPU...")
         generator = load_csm_1b(device="cuda")
         model_loaded = True
+        print("Model loaded successfully!")
     return generator
+# Function to get the loaded model
+@spaces.GPU(duration=30)
 def get_model():
     global generator, model_loaded
     if not model_loaded:
         return initialize_model()
     return generator
+# Preload model if environment variable is set
+def preload_model_if_needed():
+    if os.environ.get("PRELOAD_MODEL", "").lower() in ("true", "1", "yes"):
+        print("PRELOAD_MODEL is set. Attempting to preload model...")
+        try:
+            # We can't directly call initialize_model() here because it's decorated with @spaces.GPU
+            # Instead, we'll set a flag that will be checked when the first request comes in
+            global model_loaded
+            model_loaded = False
+            print("Model will be loaded on first request.")
+        except Exception as e:
+            print(f"Error during model preloading setup: {e}")
+    else:
+        print("PRELOAD_MODEL is not set. Model will be loaded on demand.")
+# Call preload function at startup
+preload_model_if_needed()
+# Function to convert audio to tensor
 def audio_to_tensor(audio_path: str) -> Tuple[torch.Tensor, int]:
     waveform, sample_rate = torchaudio.load(audio_path)
+    waveform = waveform.mean(dim=0)  # Convert stereo to mono if needed
     return waveform, sample_rate
+# Function to save audio tensor to file
 def save_audio(audio_tensor: torch.Tensor, sample_rate: int) -> str:
     temp_dir = tempfile.gettempdir()
     output_path = os.path.join(temp_dir, f"csm1b_output_{int(time.time())}.wav")
     torchaudio.save(output_path, audio_tensor.unsqueeze(0), sample_rate)
     return output_path
+# Function to generate speech from text using ZeroGPU
+@spaces.GPU(duration=30)
 def generate_speech(
     text: str,
     speaker_id: int,
     top_k: int = 50,
     progress=gr.Progress()
 ) -> str:
+    try:
+        # Get the loaded model
+        generator = get_model()
+        # Prepare context
+        context = []
+        progress(0.1, "Processing context...")
+        # Process context 1
+        if context_audio_path1 and context_text1:
+            waveform, sample_rate = audio_to_tensor(context_audio_path1)
+            # Resample if needed
+            if sample_rate != generator.sample_rate:
+                waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=generator.sample_rate)
+            context.append(Segment(speaker=context_speaker1, text=context_text1, audio=waveform))
+        # Process context 2
+        if context_audio_path2 and context_text2:
+            waveform, sample_rate = audio_to_tensor(context_audio_path2)
+            # Resample if needed
+            if sample_rate != generator.sample_rate:
+                waveform = torchaudio.functional.resample(waveform, orig_freq=sample_rate, new_freq=generator.sample_rate)
+            context.append(Segment(speaker=context_speaker2, text=context_text2, audio=waveform))
+        progress(0.3, "Generating audio...")
+        # Generate audio from text
+        audio = generator.generate(
+            text=text,
+            speaker=speaker_id,
+            context=context,
+            max_audio_length_ms=max_duration_ms,
+            temperature=temperature,
+            topk=top_k
+        )
+        progress(0.8, "Saving audio...")
+        # Save audio to file
+        output_path = save_audio(audio, generator.sample_rate)
+        progress(1.0, "Completed!")
+        return output_path
+    except spaces.zero.gradio.HTMLError as e:
+        # Handle ZeroGPU quota exceeded error
+        error_message = str(e)
+        if "GPU quota exceeded" in error_message:
+            # Extract wait time from error message
+            import re
+            wait_time_match = re.search(r"Try again in (\d+:\d+:\d+)", error_message)
+            wait_time = wait_time_match.group(1) if wait_time_match else "some time"
+            return f"GPU quota exceeded. Please try again in {wait_time}."
+        return f"GPU error: {error_message}"
+    except Exception as e:
+        return f"Error generating speech: {str(e)}"
+# Function to generate simple speech without context
+@spaces.GPU(duration=30)
 def generate_speech_simple(
     text: str,
     speaker_id: int,
     top_k: int = 50,
     progress=gr.Progress()
 ) -> str:
+    try:
+        # Get the loaded model
+        generator = get_model()
+        progress(0.3, "Generating audio...")
+        # Generate audio from text
+        audio = generator.generate(
+            text=text,
+            speaker=speaker_id,
+            context=[],  # No context
+            max_audio_length_ms=max_duration_ms,
+            temperature=temperature,
+            topk=top_k
+        )
+        progress(0.8, "Saving audio...")
+        # Save audio to file
+        output_path = save_audio(audio, generator.sample_rate)
+        progress(1.0, "Completed!")
+        return output_path
+    except spaces.zero.gradio.HTMLError as e:
+        # Handle ZeroGPU quota exceeded error
+        error_message = str(e)
+        if "GPU quota exceeded" in error_message:
+            # Extract wait time from error message
+            import re
+            wait_time_match = re.search(r"Try again in (\d+:\d+:\d+)", error_message)
+            wait_time = wait_time_match.group(1) if wait_time_match else "some time"
+            return f"GPU quota exceeded. Please try again in {wait_time}."
+        return f"GPU error: {error_message}"
+    except Exception as e:
+        return f"Error generating speech: {str(e)}"
+# Create Gradio interface
 def create_demo():
     with gr.Blocks(title="CSM-1B Text-to-Speech") as demo:
         gr.Markdown("# CSM-1B Text-to-Speech Demo")
+        gr.Markdown("CSM-1B (Collaborative Speech Model) is an advanced text-to-speech model capable of generating natural-sounding speech from text.")
+        with gr.Tab("Simple Audio Generation"):
             with gr.Row():
                 with gr.Column():
                     text_input = gr.Textbox(
+                        label="Text to convert to speech",
+                        placeholder="Enter the text you want to convert to speech...",
                         lines=5
                     )
                     speaker_id = gr.Number(
+                        label="Speaker ID",
                         value=0,
                         precision=0,
                         minimum=0,
                     with gr.Row():
                         max_duration = gr.Slider(
+                            label="Maximum Duration (ms)",
                             minimum=1000,
                             maximum=90000,
                             value=30000,
                             step=1
                         )
+                    generate_btn = gr.Button("Generate Audio")
                 with gr.Column():
+                    output_audio = gr.Audio(label="Output Audio", type="filepath")
+        with gr.Tab("Audio Generation with Context"):
+            gr.Markdown("This feature allows you to provide audio clips and text as context to help the model generate more appropriate speech.")
             with gr.Row():
                 with gr.Column():
+                    context_text1 = gr.Textbox(label="Context Text 1", lines=2)
+                    context_audio1 = gr.Audio(label="Context Audio 1", type="filepath")
+                    context_speaker1 = gr.Number(label="Speaker ID 1", value=0, precision=0)
+                    context_text2 = gr.Textbox(label="Context Text 2", lines=2)
+                    context_audio2 = gr.Audio(label="Context Audio 2", type="filepath")
+                    context_speaker2 = gr.Number(label="Speaker ID 2", value=1, precision=0)
                     text_input_context = gr.Textbox(
+                        label="Text to convert to speech",
+                        placeholder="Enter the text you want to convert to speech...",
                         lines=3
                     )
                     speaker_id_context = gr.Number(
+                        label="Speaker ID",
                         value=0,
                         precision=0
                     )
                     with gr.Row():
                         max_duration_context = gr.Slider(
+                            label="Maximum Duration (ms)",
                             minimum=1000,
                             maximum=90000,
                             value=30000,
                             step=1
                         )
+                    generate_context_btn = gr.Button("Generate Audio with Context")
                 with gr.Column():
+                    output_audio_context = gr.Audio(label="Output Audio", type="filepath")
+        # Add Hugging Face configuration tab
+        with gr.Tab("Configuration"):
+            gr.Markdown("### Hugging Face Token Configuration")
             gr.Markdown("""
+            To use the CSM-1B model, you need access to the model on Hugging Face.
+            You can configure your token by:
+            1. Create a token at [Hugging Face Settings](https://huggingface.co/settings/tokens)
+            2. Set the `HF_TOKEN` environment variable with your token value
+            Note: In Hugging Face Spaces, you can set environment variables in the Space Settings.
             """)
             hf_token_input = gr.Textbox(
+                label="Hugging Face Token (Only for this session)",
+                placeholder="Enter your token...",
                 type="password"
             )
                 if token:
                     os.environ["HF_TOKEN"] = token
                     login(token=token)
+                    return "Token set successfully! You can now load the model."
+                return "Invalid token. Please enter a valid token."
+            set_token_btn = gr.Button("Set Token")
+            token_status = gr.Textbox(label="Status", interactive=False)
             set_token_btn.click(fn=set_token, inputs=hf_token_input, outputs=token_status)
+        # Add GPU information tab
+        with gr.Tab("GPU Information"):
+            gr.Markdown("### About ZeroGPU")
             gr.Markdown("""
+            This application uses Hugging Face Spaces' ZeroGPU to optimize GPU usage.
+            ZeroGPU helps free up GPU memory when not in use, saving resources and improving performance.
+            When you generate audio, the GPU will be used automatically and released after completion.
+            Note: In the ZeroGPU environment, CUDA is not initialized in the main process, but only in functions with the @spaces.GPU decorator.
+            """)
+            gr.Markdown("### GPU Quota Information")
+            gr.Markdown("""
+            Hugging Face Spaces has GPU quota limitations:
+            - Each GPU operation has a default duration of 60 seconds
+            - We've reduced this to 30 seconds for audio generation and 10 seconds for GPU checks
+            - If you exceed your quota, you'll need to wait for it to reset (usually a few hours)
+            - For better performance, try generating shorter audio clips
+            If you encounter a "GPU quota exceeded" error, please wait for the specified time and try again.
             """)
+            @spaces.GPU(duration=10)
             def check_gpu():
                 if torch.cuda.is_available():
                     gpu_name = torch.cuda.get_device_name(0)
                     gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)
+                    return f"GPU: {gpu_name}\nMemory: {gpu_memory:.2f} GB"
                 else:
+                    return "No GPU found. The application will run on CPU."
+            check_gpu_btn = gr.Button("Check GPU")
+            gpu_info = gr.Textbox(label="GPU Information", interactive=False)
             check_gpu_btn.click(fn=check_gpu, inputs=None, outputs=gpu_info)
+            # Add model loading button
+            load_model_btn = gr.Button("Load Model")
+            model_status = gr.Textbox(label="Model Status", interactive=False)
+            @spaces.GPU(duration=10)
             def load_model_and_report():
                 global model_loaded
                 if model_loaded:
+                    return "Model has already been loaded!"
                 else:
                     initialize_model()
+                    return "Model loaded successfully!"
             load_model_btn.click(fn=load_model_and_report, inputs=None, outputs=model_status)
+        # Connect components
         generate_btn.click(
             fn=generate_speech_simple,
             inputs=[
     return demo
+# Launch the application
 if __name__ == "__main__":
     demo = create_demo()
     demo.queue().launch()
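
Review note on the error handling added to `generate_speech` and `generate_speech_simple`: both functions repeat the same quota-parsing block. The sketch below factors it into one function. `format_gpu_error` is a hypothetical name; the regular expression and the output strings are copied from the diff, while the sample messages in the usage lines are made up for illustration and are not real ZeroGPU output.

```python
import re

# Pattern copied from the diff: the wait time follows "Try again in" as H:MM:SS.
_WAIT_TIME = re.compile(r"Try again in (\d+:\d+:\d+)")


def format_gpu_error(error_message: str) -> str:
    """Map a ZeroGPU error message to the user-facing text used in app.py."""
    if "GPU quota exceeded" in error_message:
        match = _WAIT_TIME.search(error_message)
        # Fall back to a vague phrase when no wait time is present.
        wait_time = match.group(1) if match else "some time"
        return f"GPU quota exceeded. Please try again in {wait_time}."
    return f"GPU error: {error_message}"


# Illustrative inputs only; the real message format is not shown in the diff.
print(format_gpu_error("GPU quota exceeded. Try again in 0:12:34"))
print(format_gpu_error("unexpected failure"))
```

Each `except spaces.zero.gradio.HTMLError as e` branch could then reduce to `return format_gpu_error(str(e))`. Separately, both handlers feed a `gr.Audio(type="filepath")` output, so a returned message string would be interpreted as a file path; raising `gr.Error(message)` instead is one option, assuming the Gradio version in use displays it to the user.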

generator.py CHANGED

@@ -10,7 +10,7 @@ from tokenizers.processors import TemplateProcessing
 from transformers import AutoTokenizer
 from watermarking import CSM_1B_GH_WATERMARK, load_watermarker, watermark
-# Tắt tính năng compile của torch để tránh lỗi triton
 torch._dynamo.config.suppress_errors = True
 @dataclass
@@ -167,19 +167,19 @@ class Generator:
 def load_csm_1b(device: str = "cuda") -> Generator:
     """
-    Tải mô hình CSM-1B từ Hugging Face Hub.
     Args:
-        device: Thiết bị để chạy mô hình (cuda hoặc cpu)
     Returns:
-        Generator: Đối tượng Generator để tạo âm thanh từ văn bản
     """
     try:
-        # Trong ZeroGPU, không nên khởi tạo CUDA trong quá trình chính
-        # Chỉ chuyển mô hình sang GPU khi được gọi trong hàm có decorator @spaces.GPU
         if 'cuda' in device and not torch.cuda.is_initialized():
-            # Sử dụng CPU cho quá trình chính
             model = Model.from_pretrained("sesame/csm-1b")
         else:
             model = Model.from_pretrained("sesame/csm-1b")
@@ -188,7 +188,7 @@ def load_csm_1b(device: str = "cuda") -> Generator:
         generator = Generator(model)
         return generator
     except Exception as e:
-        print(f"Lỗi khi tải mô hình: {e}")
-        print("Vui lòng kiểm tra xem bạn đã đăng nhập vào Hugging Face Hub chưa.")
-        print("Bạn có thể cần phải yêu cầu quyền truy cập vào mô hình tại: https://huggingface.co/sesame/csm-1b")
         raise e

 from transformers import AutoTokenizer
 from watermarking import CSM_1B_GH_WATERMARK, load_watermarker, watermark
+# Disable torch compile feature to avoid triton error
 torch._dynamo.config.suppress_errors = True
 @dataclass
 def load_csm_1b(device: str = "cuda") -> Generator:
     """
+    Load the CSM-1B model from Hugging Face Hub.
     Args:
+        device: Device to run the model on (cuda or cpu)
     Returns:
+        Generator: Generator object to create audio from text
     """
     try:
+        # In ZeroGPU, CUDA should not be initialized in the main process
+        # Only move the model to GPU when called in a function with the @spaces.GPU decorator
         if 'cuda' in device and not torch.cuda.is_initialized():
+            # Use CPU for the main process
             model = Model.from_pretrained("sesame/csm-1b")
         else:
             model = Model.from_pretrained("sesame/csm-1b")
         generator = Generator(model)
         return generator
     except Exception as e:
+        print(f"Error loading model: {e}")
+        print("Please check if you are logged in to Hugging Face Hub.")
+        print("You may need to request access to the model at: https://huggingface.co/sesame/csm-1b")
         raise e
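
Note on the generator.py change: the comments in `load_csm_1b` say that under ZeroGPU the main process should not initialize CUDA. The following is a rough sketch of that device decision as a pure function, under the assumption that the intent is to keep the weights on CPU until the code runs inside a `@spaces.GPU` function. `pick_load_device` is a hypothetical helper that is not part of generator.py, and its `cuda_initialized` argument would come from `torch.cuda.is_initialized()`.

```python
def pick_load_device(requested: str, cuda_initialized: bool) -> str:
    """Return the device to load weights on (hypothetical helper).

    requested: the device the caller asked for, e.g. "cuda", "cuda:0" or "cpu".
    cuda_initialized: the value of torch.cuda.is_initialized() at call time.
    """
    if "cuda" in requested and not cuda_initialized:
        # CUDA has not been set up in this process yet, so stay on CPU.
        return "cpu"
    return requested


print(pick_load_device("cuda", cuda_initialized=False))
print(pick_load_device("cuda", cuda_initialized=True))
print(pick_load_device("cpu", cuda_initialized=False))
```

Keeping the decision separate from the actual `from_pretrained` call makes the branch easy to check without a GPU.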

test_model.py CHANGED

@@ -6,29 +6,29 @@ from generator import Segment, load_csm_1b
 from huggingface_hub import login
 def login_huggingface():
-    """Đăng nhập vào Hugging Face Hub sử dụng token từ biến môi trường hoặc nhập từ người dùng"""
     hf_token = os.environ.get("HF_TOKEN")
     if not hf_token:
-        print("Không tìm thấy HF_TOKEN trong biến môi trường.")
-        hf_token = input("Vui lòng nhập Hugging Face token của bạn: ")
     if hf_token:
-        print("Đang đăng nhập vào Hugging Face Hub...")
         login(token=hf_token)
-        print("Đã đăng nhập thành công!")
         return True
     else:
-        print("Không có token. Một số mô hình có thể không truy cập được.")
         return False
 @spaces.GPU
 def generate_test_audio(text, speaker_id, device):
-    """Tạo âm thanh kiểm tra sử dụng ZeroGPU"""
     generator = load_csm_1b(device=device)
-    print("Đã tải xong mô hình!")
-    print(f"Đang tạo âm thanh cho văn bản: '{text}'")
     audio = generator.generate(
         text=text,
         speaker=speaker_id,
@@ -41,33 +41,33 @@ def generate_test_audio(text, speaker_id, device):
     return audio, generator.sample_rate
 def test_model():
-    print("Kiểm tra mô hình CSM-1B...")
-    # Đăng nhập vào Hugging Face Hub
     login_huggingface()
-    # Kiểm tra xem có GPU không và cấu hình thiết bị phù hợp
     device = "cuda" if torch.cuda.is_available() else "cpu"
-    print(f"Sử dụng thiết bị: {device}")
-    # Tải mô hình CSM-1B và tạo âm thanh
-    print("Đang tải mô hình CSM-1B...")
     try:
-        # Sử dụng ZeroGPU để tạo âm thanh
-        text = "Xin chào, đây là bài kiểm tra mô hình CSM-1B."
         speaker_id = 0
         audio, sample_rate = generate_test_audio(text, speaker_id, device)
-        # Lưu âm thanh thành file
         output_path = "test_output.wav"
         torchaudio.save(output_path, audio.unsqueeze(0), sample_rate)
-        print(f"Đã lưu âm thanh vào file: {output_path}")
-        print("Kiểm tra hoàn tất!")
     except Exception as e:
-        print(f"Lỗi khi kiểm tra mô hình: {e}")
-        print("Vui lòng kiểm tra lại token và quyền truy cập của bạn.")
 if __name__ == "__main__":
     test_model()

 from huggingface_hub import login
 def login_huggingface():
+    """Login to Hugging Face Hub using token from environment variable or user input"""
     hf_token = os.environ.get("HF_TOKEN")
     if not hf_token:
+        print("HF_TOKEN not found in environment variables.")
+        hf_token = input("Please enter your Hugging Face token: ")
     if hf_token:
+        print("Logging in to Hugging Face Hub...")
         login(token=hf_token)
+        print("Login successful!")
         return True
     else:
+        print("No token provided. Some models may not be accessible.")
         return False
 @spaces.GPU
 def generate_test_audio(text, speaker_id, device):
+    """Generate test audio using ZeroGPU"""
     generator = load_csm_1b(device=device)
+    print("Model loaded successfully!")
+    print(f"Generating audio for text: '{text}'")
     audio = generator.generate(
         text=text,
         speaker=speaker_id,
     return audio, generator.sample_rate
 def test_model():
+    print("Testing CSM-1B model...")
+    # Login to Hugging Face Hub
     login_huggingface()
+    # Check if GPU is available and configure the device
     device = "cuda" if torch.cuda.is_available() else "cpu"
+    print(f"Using device: {device}")
+    # Load CSM-1B model and generate audio
+    print("Loading CSM-1B model...")
     try:
+        # Use ZeroGPU to generate audio
+        text = "Hello, this is a test of the CSM-1B model."
         speaker_id = 0
         audio, sample_rate = generate_test_audio(text, speaker_id, device)
+        # Save audio to file
         output_path = "test_output.wav"
         torchaudio.save(output_path, audio.unsqueeze(0), sample_rate)
+        print(f"Audio saved to file: {output_path}")
+        print("Test completed!")
     except Exception as e:
+        print(f"Error testing model: {e}")
+        print("Please check your token and access permissions.")
 if __name__ == "__main__":
     test_model()
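
Note on the test_model.py change: `login_huggingface` reads `HF_TOKEN` from the environment and falls back to prompting the user. The following is a small sketch of that lookup order, with the environment and the prompt passed in as arguments so it can be exercised without a real token. `resolve_hf_token` and its parameters are hypothetical and not part of test_model.py.

```python
from typing import Callable, Mapping, Optional


def resolve_hf_token(
    env: Mapping[str, str],
    prompt: Callable[[str], str],
) -> Optional[str]:
    """Return a token from env["HF_TOKEN"], else from the prompt, else None."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        # Only ask the user when the environment has no usable value.
        token = prompt("Please enter your Hugging Face token: ").strip()
    return token or None


# Example calls with fake inputs (no real credentials involved).
from_env = resolve_hf_token({"HF_TOKEN": "hf_example"}, lambda msg: "unused")
from_prompt = resolve_hf_token({}, lambda msg: " hf_typed ")
missing = resolve_hf_token({}, lambda msg: "")
print(from_env, from_prompt, missing)
```

In interactive use this could be called as `resolve_hf_token(os.environ, getpass.getpass)`, which would keep the typed token from being echoed to the terminal.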