jordand committed
Commit 60cc71a · verified · 1 Parent(s): 14e4ac4

Upload 21 files

.gitattributes CHANGED
@@ -33,3 +33,9 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ prompt_audio/EARS[[:space:]]p004[[:space:]]freeform.mp3 filter=lfs diff=lfs merge=lfs -text
37
+ prompt_audio/EARS[[:space:]]p005[[:space:]]freeform.mp3 filter=lfs diff=lfs merge=lfs -text
38
+ prompt_audio/EARS[[:space:]]p028[[:space:]]freeform.mp3 filter=lfs diff=lfs merge=lfs -text
39
+ prompt_audio/EARS[[:space:]]p036[[:space:]]freeform.mp3 filter=lfs diff=lfs merge=lfs -text
40
+ prompt_audio/expresso_02_ex03-ex01_calm_005.wav filter=lfs diff=lfs merge=lfs -text
41
+ prompt_audio/freesound_demon_chant(use_forcespeaker).mp3 filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,9 @@
1
+
2
+
3
+ Copyright 2025 Jordan Darefsky
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
6
+
7
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
8
+
9
+ THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
LICENSE-APACHE ADDED
@@ -0,0 +1,203 @@
1
+ This Apache 2.0 license applies only to autoencoder.py
2
+
3
+ Apache License
4
+ Version 2.0, January 2004
5
+ http://www.apache.org/licenses/
6
+
7
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
8
+
9
+ 1. Definitions.
10
+
11
+ "License" shall mean the terms and conditions for use, reproduction,
12
+ and distribution as defined by Sections 1 through 9 of this document.
13
+
14
+ "Licensor" shall mean the copyright owner or entity authorized by
15
+ the copyright owner that is granting the License.
16
+
17
+ "Legal Entity" shall mean the union of the acting entity and all
18
+ other entities that control, are controlled by, or are under common
19
+ control with that entity. For the purposes of this definition,
20
+ "control" means (i) the power, direct or indirect, to cause the
21
+ direction or management of such entity, whether by contract or
22
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
23
+ outstanding shares, or (iii) beneficial ownership of such entity.
24
+
25
+ "You" (or "Your") shall mean an individual or Legal Entity
26
+ exercising permissions granted by this License.
27
+
28
+ "Source" form shall mean the preferred form for making modifications,
29
+ including but not limited to software source code, documentation
30
+ source, and configuration files.
31
+
32
+ "Object" form shall mean any form resulting from mechanical
33
+ transformation or translation of a Source form, including but
34
+ not limited to compiled object code, generated documentation,
35
+ and conversions to other media types.
36
+
37
+ "Work" shall mean the work of authorship, whether in Source or
38
+ Object form, made available under the License, as indicated by a
39
+ copyright notice that is included in or attached to the work
40
+ (an example is provided in the Appendix below).
41
+
42
+ "Derivative Works" shall mean any work, whether in Source or Object
43
+ form, that is based on (or derived from) the Work and for which the
44
+ editorial revisions, annotations, elaborations, or other modifications
45
+ represent, as a whole, an original work of authorship. For the purposes
46
+ of this License, Derivative Works shall not include works that remain
47
+ separable from, or merely link (or bind by name) to the interfaces of,
48
+ the Work and Derivative Works thereof.
49
+
50
+ "Contribution" shall mean any work of authorship, including
51
+ the original version of the Work and any modifications or additions
52
+ to that Work or Derivative Works thereof, that is intentionally
53
+ submitted to Licensor for inclusion in the Work by the copyright owner
54
+ or by an individual or Legal Entity authorized to submit on behalf of
55
+ the copyright owner. For the purposes of this definition, "submitted"
56
+ means any form of electronic, verbal, or written communication sent
57
+ to the Licensor or its representatives, including but not limited to
58
+ communication on electronic mailing lists, source code control systems,
59
+ and issue tracking systems that are managed by, or on behalf of, the
60
+ Licensor for the purpose of discussing and improving the Work, but
61
+ excluding communication that is conspicuously marked or otherwise
62
+ designated in writing by the copyright owner as "Not a Contribution."
63
+
64
+ "Contributor" shall mean Licensor and any individual or Legal Entity
65
+ on behalf of whom a Contribution has been received by Licensor and
66
+ subsequently incorporated within the Work.
67
+
68
+ 2. Grant of Copyright License. Subject to the terms and conditions of
69
+ this License, each Contributor hereby grants to You a perpetual,
70
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
71
+ copyright license to reproduce, prepare Derivative Works of,
72
+ publicly display, publicly perform, sublicense, and distribute the
73
+ Work and such Derivative Works in Source or Object form.
74
+
75
+ 3. Grant of Patent License. Subject to the terms and conditions of
76
+ this License, each Contributor hereby grants to You a perpetual,
77
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
78
+ (except as stated in this section) patent license to make, have made,
79
+ use, offer to sell, sell, import, and otherwise transfer the Work,
80
+ where such license applies only to those patent claims licensable
81
+ by such Contributor that are necessarily infringed by their
82
+ Contribution(s) alone or by combination of their Contribution(s)
83
+ with the Work to which such Contribution(s) was submitted. If You
84
+ institute patent litigation against any entity (including a
85
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
86
+ or a Contribution incorporated within the Work constitutes direct
87
+ or contributory patent infringement, then any patent licenses
88
+ granted to You under this License for that Work shall terminate
89
+ as of the date such litigation is filed.
90
+
91
+ 4. Redistribution. You may reproduce and distribute copies of the
92
+ Work or Derivative Works thereof in any medium, with or without
93
+ modifications, and in Source or Object form, provided that You
94
+ meet the following conditions:
95
+
96
+ (a) You must give any other recipients of the Work or
97
+ Derivative Works a copy of this License; and
98
+
99
+ (b) You must cause any modified files to carry prominent notices
100
+ stating that You changed the files; and
101
+
102
+ (c) You must retain, in the Source form of any Derivative Works
103
+ that You distribute, all copyright, patent, trademark, and
104
+ attribution notices from the Source form of the Work,
105
+ excluding those notices that do not pertain to any part of
106
+ the Derivative Works; and
107
+
108
+ (d) If the Work includes a "NOTICE" text file as part of its
109
+ distribution, then any Derivative Works that You distribute must
110
+ include a readable copy of the attribution notices contained
111
+ within such NOTICE file, excluding those notices that do not
112
+ pertain to any part of the Derivative Works, in at least one
113
+ of the following places: within a NOTICE text file distributed
114
+ as part of the Derivative Works; within the Source form or
115
+ documentation, if provided along with the Derivative Works; or,
116
+ within a display generated by the Derivative Works, if and
117
+ wherever such third-party notices normally appear. The contents
118
+ of the NOTICE file are for informational purposes only and
119
+ do not modify the License. You may add Your own attribution
120
+ notices within Derivative Works that You distribute, alongside
121
+ or as an addendum to the NOTICE text from the Work, provided
122
+ that such additional attribution notices cannot be construed
123
+ as modifying the License.
124
+
125
+ You may add Your own copyright statement to Your modifications and
126
+ may provide additional or different license terms and conditions
127
+ for use, reproduction, or distribution of Your modifications, or
128
+ for any such Derivative Works as a whole, provided Your use,
129
+ reproduction, and distribution of the Work otherwise complies with
130
+ the conditions stated in this License.
131
+
132
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
133
+ any Contribution intentionally submitted for inclusion in the Work
134
+ by You to the Licensor shall be under the terms and conditions of
135
+ this License, without any additional terms or conditions.
136
+ Notwithstanding the above, nothing herein shall supersede or modify
137
+ the terms of any separate license agreement you may have executed
138
+ with Licensor regarding such Contributions.
139
+
140
+ 6. Trademarks. This License does not grant permission to use the trade
141
+ names, trademarks, service marks, or product names of the Licensor,
142
+ except as required for reasonable and customary use in describing the
143
+ origin of the Work and reproducing the content of the NOTICE file.
144
+
145
+ 7. Disclaimer of Warranty. Unless required by applicable law or
146
+ agreed to in writing, Licensor provides the Work (and each
147
+ Contributor provides its Contributions) on an "AS IS" BASIS,
148
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
149
+ implied, including, without limitation, any warranties or conditions
150
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
151
+ PARTICULAR PURPOSE. You are solely responsible for determining the
152
+ appropriateness of using or redistributing the Work and assume any
153
+ risks associated with Your exercise of permissions under this License.
154
+
155
+ 8. Limitation of Liability. In no event and under no legal theory,
156
+ whether in tort (including negligence), contract, or otherwise,
157
+ unless required by applicable law (such as deliberate and grossly
158
+ negligent acts) or agreed to in writing, shall any Contributor be
159
+ liable to You for damages, including any direct, indirect, special,
160
+ incidental, or consequential damages of any character arising as a
161
+ result of this License or out of the use or inability to use the
162
+ Work (including but not limited to damages for loss of goodwill,
163
+ work stoppage, computer failure or malfunction, or any and all
164
+ other commercial damages or losses), even if such Contributor
165
+ has been advised of the possibility of such damages.
166
+
167
+ 9. Accepting Warranty or Additional Liability. While redistributing
168
+ the Work or Derivative Works thereof, You may choose to offer,
169
+ and charge a fee for, acceptance of support, warranty, indemnity,
170
+ or other liability obligations and/or rights consistent with this
171
+ License. However, in accepting such obligations, You may act only
172
+ on Your own behalf and on Your sole responsibility, not on behalf
173
+ of any other Contributor, and only if You agree to indemnify,
174
+ defend, and hold each Contributor harmless for any liability
175
+ incurred by, or claims asserted against, such Contributor by reason
176
+ of your accepting any such warranty or additional liability.
177
+
178
+ END OF TERMS AND CONDITIONS
179
+
180
+ APPENDIX: How to apply the Apache License to your work.
181
+
182
+ To apply the Apache License to your work, attach the following
183
+ boilerplate notice, with the fields enclosed by brackets "[]"
184
+ replaced with your own identifying information. (Don't include
185
+ the brackets!) The text should be enclosed in the appropriate
186
+ comment syntax for the file format. We also recommend that a
187
+ file or class name and description of purpose be included on the
188
+ same "printed page" as the copyright notice for easier
189
+ identification within third-party archives.
190
+
191
+ Copyright 2024 Fish Audio
192
+
193
+ Licensed under the Apache License, Version 2.0 (the "License");
194
+ you may not use this file except in compliance with the License.
195
+ You may obtain a copy of the License at
196
+
197
+ http://www.apache.org/licenses/LICENSE-2.0
198
+
199
+ Unless required by applicable law or agreed to in writing, software
200
+ distributed under the License is distributed on an "AS IS" BASIS,
201
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
202
+ See the License for the specific language governing permissions and
203
+ limitations under the License.
app.py ADDED
The diff for this file is too large to render. See raw diff
 
autoencoder.py ADDED
@@ -0,0 +1,1227 @@
1
+ # SPDX-FileCopyrightText: 2025 Jordan Darefsky
2
+ # SPDX-License-Identifier: Apache-2.0
3
+ #
4
+ # This file contains portions adapted from:
5
+ # • Descript Audio Codec (DAC) — MIT License (full text appended below)
6
+ # • Fish-Speech S1 DAC Autoencoder — reference implementation (Apache-2.0 / CC-BY-NC),
7
+ # rewritten here in a single-file Torch module for interoperability and transparency.
8
+ #
9
+ # OVERALL LICENSE (this file): Apache-2.0, except where explicitly marked:
10
+ # # SPDX-License-Identifier: MIT
11
+ # Keep these notices and the embedded MIT text if you redistribute this file.
12
+
13
+ # NOTE (style/provenance):
14
+ # Code in this module has been largely copy-and-pasted from the Fish-S1-DAC and DAC repositories,
15
+ # and refactored with help from ChatGPT/Claude (these models also helped with licensing).
16
+ # Thus, it stylistically differs from the rest of the codebase (I'm not even sure about internal consistency)
17
+ # and is likely much messier than it would have been had it been written from scratch.
18
+
19
+
20
+ from __future__ import annotations
21
+
22
+ import math
23
+ from dataclasses import dataclass
24
+ from typing import List, Optional, Tuple, Union
25
+
26
+ import numpy as np
27
+ import torch
28
+ from torch import Tensor, nn
29
+ from torch.nn import functional as F
30
+ from torch.nn.utils.parametrizations import weight_norm
31
+ from torch.nn.utils.parametrize import remove_parametrizations
32
+
33
+ from einops import rearrange
34
+
35
+
36
+ # --------------------------------------------------------------------
37
+ # Shared helpers
38
+ # --------------------------------------------------------------------
39
+
40
+ def find_multiple(n: int, k: int) -> int:
41
+ return n if n % k == 0 else n + k - (n % k)
42
+
43
+ def unpad1d(x: Tensor, paddings: Tuple[int, int]) -> Tensor:
44
+ """Remove padding from x, handling properly zero padding. Only for 1d!"""
45
+ padding_left, padding_right = paddings
46
+ assert padding_left >= 0 and padding_right >= 0, (padding_left, padding_right)
47
+ assert (padding_left + padding_right) <= x.shape[-1]
48
+ end = x.shape[-1] - padding_right
49
+ return x[..., padding_left:end]
50
+
51
+ def get_extra_padding_for_conv1d(
52
+ x: Tensor, kernel_size: int, stride: int, padding_total: int = 0
53
+ ) -> int:
54
+ """See pad_for_conv1d; enough right pad so striding evenly covers length."""
55
+ length = x.shape[-1]
56
+ n_frames = (length - kernel_size + padding_total) / stride + 1
57
+ ideal_length = (math.ceil(n_frames) - 1) * stride + (kernel_size - padding_total)
58
+ return ideal_length - length
59
+
60
+ def pad1d(
61
+ x: Tensor,
62
+ paddings: Tuple[int, int],
63
+ mode: str = "zeros",
64
+ value: float = 0.0,
65
+ ) -> Tensor:
66
+ """
67
+ Reflect-safe 1D pad: if reflect padding would underflow on small inputs, insert a
68
+ temporary right zero-pad before reflecting, then trim it afterwards.
69
+ """
70
+ length = x.shape[-1]
71
+ padding_left, padding_right = paddings
72
+ assert padding_left >= 0 and padding_right >= 0, (padding_left, padding_right)
73
+ if mode == "reflect":
74
+ max_pad = max(padding_left, padding_right)
75
+ extra_pad = 0
76
+ if length <= max_pad:
77
+ extra_pad = max_pad - length + 1
78
+ x = F.pad(x, (0, extra_pad))
79
+ padded = F.pad(x, (padding_left, padding_right), mode, value)
80
+ end = padded.shape[-1] - extra_pad
81
+ return padded[..., :end]
82
+ else:
83
+ return F.pad(x, (padding_left, padding_right), mode, value)
84
+
85
+
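
A small usage sketch (illustrative only, not part of the committed file; sizes are arbitrary) of how the three helpers above compose: pad on the right so a strided conv tiles the whole signal, then undo the pad exactly.

import torch

x = torch.randn(1, 1, 101)                                    # (B, C, T)
kernel_size, stride = 8, 4
extra = get_extra_padding_for_conv1d(x, kernel_size, stride)  # extra right pad for full coverage
y = pad1d(x, (0, extra), mode="reflect")                      # reflect-safe padding
assert (y.shape[-1] - kernel_size) % stride == 0              # strides now cover the signal evenly
assert unpad1d(y, (0, extra)).shape == x.shape                # the pad is exactly reversible
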
86
+ # --------------------------------------------------------------------
87
+ # DAC Layers (adapted) — MIT
88
+ # Original: https://github.com/descriptinc/descript-audio-codec/blob/main/dac/nn/layers.py
89
+ # SPDX-License-Identifier: MIT
90
+ # --------------------------------------------------------------------
91
+
92
+ def WNConv1d(*args, **kwargs):
93
+ return weight_norm(nn.Conv1d(*args, **kwargs))
94
+
95
+ def WNConvTranspose1d(*args, **kwargs):
96
+ return weight_norm(nn.ConvTranspose1d(*args, **kwargs))
97
+
98
+ @torch.jit.script
99
+ def snake(x: Tensor, alpha: Tensor) -> Tensor:
100
+ shape = x.shape
101
+ x = x.reshape(shape[0], shape[1], -1)
102
+ x = x + (alpha + 1e-9).reciprocal() * torch.sin(alpha * x).pow(2)
103
+ x = x.reshape(shape)
104
+ return x
105
+
106
+ class Snake1d(nn.Module):
107
+ def __init__(self, channels: int):
108
+ super().__init__()
109
+ self.alpha = nn.Parameter(torch.ones(1, channels, 1))
110
+ def forward(self, x: Tensor) -> Tensor:
111
+ return snake(x, self.alpha)
112
+
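
A quick sanity check for the snake activation (illustrative, arbitrary sizes): it acts pointwise with one learnable alpha per channel, so the output shape always matches the input.

import torch

act = Snake1d(channels=16)
x = torch.randn(2, 16, 50)      # (B, C, T)
assert act(x).shape == x.shape
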
113
+ # --------------------------------------------------------------------
114
+ # DAC Vector Quantize (adapted) — MIT
115
+ # Original: https://github.com/descriptinc/descript-audio-codec/blob/main/dac/nn/quantize.py
116
+ # SPDX-License-Identifier: MIT
117
+ # --------------------------------------------------------------------
118
+
119
+ class VectorQuantize(nn.Module):
120
+ """
121
+ VQ with factorized, l2-normalized codes (ViT‑VQGAN style).
122
+ I/O in (B, D, T).
123
+ """
124
+ def __init__(self, input_dim: int, codebook_size: int, codebook_dim: int):
125
+ super().__init__()
126
+ self.codebook_size = codebook_size
127
+ self.codebook_dim = codebook_dim
128
+ self.in_proj = WNConv1d(input_dim, codebook_dim, kernel_size=1)
129
+ self.out_proj = WNConv1d(codebook_dim, input_dim, kernel_size=1)
130
+ self.codebook = nn.Embedding(codebook_size, codebook_dim)
131
+
132
+ def forward(self, z: Tensor):
133
+ z_e = self.in_proj(z) # (B, D, T)
134
+ z_q, indices = self.decode_latents(z_e)
135
+ commitment_loss = F.mse_loss(z_e, z_q.detach(), reduction="none").mean([1, 2])
136
+ codebook_loss = F.mse_loss(z_q, z_e.detach(), reduction="none").mean([1, 2])
137
+ z_q = z_e + (z_q - z_e).detach() # straight‑through
138
+ z_q = self.out_proj(z_q)
139
+ return z_q, commitment_loss, codebook_loss, indices, z_e
140
+
141
+ def embed_code(self, embed_id: Tensor) -> Tensor:
142
+ return F.embedding(embed_id, self.codebook.weight)
143
+
144
+ def decode_code(self, embed_id: Tensor) -> Tensor:
145
+ return self.embed_code(embed_id).transpose(1, 2)
146
+
147
+ def decode_latents(self, latents: Tensor) -> Tuple[Tensor, Tensor]:
148
+ encodings = rearrange(latents, "b d t -> (b t) d")
149
+ codebook = self.codebook.weight
150
+ encodings = F.normalize(encodings)
151
+ codebook = F.normalize(codebook)
152
+ dist = (
153
+ encodings.pow(2).sum(1, keepdim=True)
154
+ - 2 * encodings @ codebook.t()
155
+ + codebook.pow(2).sum(1, keepdim=True).t()
156
+ )
157
+ indices = rearrange((-dist).max(1)[1], "(b t) -> b t", b=latents.size(0))
158
+ z_q = self.decode_code(indices)
159
+ return z_q, indices
160
+
161
+
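
An illustrative forward pass (arbitrary sizes, not part of the committed file) showing the I/O shapes of the factorized VQ: indices are one code per frame, z_e lives in the small codebook_dim space, and z_q is projected back to input_dim.

import torch

vq = VectorQuantize(input_dim=64, codebook_size=256, codebook_dim=8)
z = torch.randn(2, 64, 40)                       # (B, D, T)
z_q, commit_loss, cb_loss, indices, z_e = vq(z)
assert z_q.shape == z.shape                      # projected back to input_dim
assert indices.shape == (2, 40)                  # one code index per frame
assert z_e.shape == (2, 8, 40)                   # factorized (codebook_dim) latents
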
162
+ class ResidualVectorQuantize(nn.Module):
163
+ """SoundStream-style residual VQ stack."""
164
+ def __init__(
165
+ self,
166
+ input_dim: int = 512,
167
+ n_codebooks: int = 9,
168
+ codebook_size: int = 1024,
169
+ codebook_dim: Union[int, List[int]] = 8,
170
+ quantizer_dropout: float = 0.0,
171
+ ):
172
+ super().__init__()
173
+ if isinstance(codebook_dim, int):
174
+ codebook_dim = [codebook_dim for _ in range(n_codebooks)]
175
+
176
+ self.n_codebooks = n_codebooks
177
+ self.codebook_dim = codebook_dim
178
+ self.codebook_size = codebook_size
179
+
180
+ self.quantizers = nn.ModuleList([
181
+ VectorQuantize(input_dim, codebook_size, codebook_dim[i])
182
+ for i in range(n_codebooks)
183
+ ])
184
+ self.quantizer_dropout = quantizer_dropout
185
+
186
+ def forward(self, z: Tensor, n_quantizers: Optional[int] = None):
187
+ z_q = 0
188
+ residual = z
189
+ commitment_loss = 0
190
+ codebook_loss = 0
191
+
192
+ codebook_indices = []
193
+ latents = []
194
+
195
+ if n_quantizers is None:
196
+ n_quantizers = self.n_codebooks
197
+ if self.training:
198
+ n_quantizers = torch.ones((z.shape[0],)) * self.n_codebooks + 1
199
+ dropout = torch.randint(1, self.n_codebooks + 1, (z.shape[0],))
200
+ n_dropout = int(z.shape[0] * self.quantizer_dropout)
201
+ n_quantizers[:n_dropout] = dropout[:n_dropout]
202
+ n_quantizers = n_quantizers.to(z.device)
203
+
204
+ for i, quantizer in enumerate(self.quantizers):
205
+ if self.training is False and i >= n_quantizers:
206
+ break
207
+
208
+ z_q_i, commit_i, codebk_i, indices_i, z_e_i = quantizer(residual)
209
+
210
+ mask = (torch.full((z.shape[0],), fill_value=i, device=z.device) < n_quantizers)
211
+ z_q = z_q + z_q_i * mask[:, None, None]
212
+ residual = residual - z_q_i
213
+
214
+ commitment_loss += (commit_i * mask).mean()
215
+ codebook_loss += (codebk_i * mask).mean()
216
+
217
+ codebook_indices.append(indices_i)
218
+ latents.append(z_e_i)
219
+
220
+ codes = torch.stack(codebook_indices, dim=1)
221
+ latents = torch.cat(latents, dim=1)
222
+
223
+ return z_q, codes, latents, commitment_loss, codebook_loss
224
+
225
+ def from_codes(self, codes: Tensor) -> Tuple[Tensor, Tensor, Tensor]:
226
+ z_q = 0.0
227
+ z_p = []
228
+ n_codebooks = codes.shape[1]
229
+ for i in range(n_codebooks):
230
+ z_p_i = self.quantizers[i].decode_code(codes[:, i, :])
231
+ z_p.append(z_p_i)
232
+ z_q_i = self.quantizers[i].out_proj(z_p_i)
233
+ z_q = z_q + z_q_i
234
+ return z_q, torch.cat(z_p, dim=1), codes
235
+
236
+ def from_latents(self, latents: Tensor) -> Tuple[Tensor, Tensor, Tensor]:
237
+ z_q = 0
238
+ z_p = []
239
+ codes = []
240
+ dims = np.cumsum([0] + [q.codebook_dim for q in self.quantizers])
241
+ n_codebooks = np.where(dims <= latents.shape[1])[0].max(axis=0, keepdims=True)[0]
242
+ for i in range(n_codebooks):
243
+ j, k = dims[i], dims[i + 1]
244
+ z_p_i, codes_i = self.quantizers[i].decode_latents(latents[:, j:k, :])
245
+ z_p.append(z_p_i)
246
+ codes.append(codes_i)
247
+ z_q_i = self.quantizers[i].out_proj(z_p_i)
248
+ z_q = z_q + z_q_i
249
+ return z_q, torch.cat(z_p, dim=1), torch.stack(codes, dim=1)
250
+
251
+
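
A minimal round trip through the residual VQ stack (illustrative sizes; eval mode so the quantizer-dropout branch is skipped): each codebook quantizes the residual left by the previous one, and from_codes rebuilds z_q from the indices alone.

import torch

rvq = ResidualVectorQuantize(input_dim=64, n_codebooks=4, codebook_size=256, codebook_dim=8)
rvq.eval()
z = torch.randn(2, 64, 40)
z_q, codes, latents, commit_loss, cb_loss = rvq(z)
assert codes.shape == (2, 4, 40)                    # one index per codebook per frame
assert latents.shape == (2, 4 * 8, 40)              # concatenated factorized latents
assert rvq.from_codes(codes)[0].shape == z_q.shape  # decode straight from indices
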
252
+ # --------------------------------------------------------------------
253
+ # S1 DAC rvq
254
+ # --------------------------------------------------------------------
255
+
256
+ @dataclass
257
+ class VQResult:
258
+ z: Tensor
259
+ codes: Tensor
260
+ latents: Tensor
261
+ codebook_loss: Tensor
262
+ commitment_loss: Tensor
263
+ semantic_distill_z: Optional[Tensor] = None
264
+
265
+
266
+ class CausalConvNet(nn.Module):
267
+ def __init__(
268
+ self,
269
+ in_channels,
270
+ out_channels,
271
+ kernel_size,
272
+ dilation=1,
273
+ stride=1,
274
+ groups=1,
275
+ padding=None,
276
+ ):
277
+ super().__init__()
278
+ self.conv = nn.Conv1d(
279
+ in_channels, out_channels, kernel_size,
280
+ stride=stride, dilation=dilation, groups=groups,
281
+ )
282
+ self.stride = stride
283
+ self.kernel_size = (kernel_size - 1) * dilation + 1
284
+ self.dilation = dilation
285
+ self.padding = self.kernel_size - self.stride
286
+
287
+ def forward(self, x: Tensor) -> Tensor:
288
+ pad = self.padding
289
+ extra = get_extra_padding_for_conv1d(x, self.kernel_size, self.stride, pad)
290
+ x = pad1d(x, (pad, extra), mode="constant", value=0)
291
+ return self.conv(x).contiguous()
292
+
293
+ def weight_norm(self, name="weight", dim=0):
294
+ self.conv = weight_norm(self.conv, name=name, dim=dim)
295
+ return self
296
+
297
+ def remove_weight_norm(self):
298
+ self.conv = remove_parametrizations(self.conv)
299
+ return self
300
+
301
+
302
+ class CausalTransConvNet(nn.Module):
303
+ def __init__(self, in_channels, out_channels, kernel_size, dilation=1, stride=1, padding=None):
304
+ super().__init__()
305
+ self.conv = nn.ConvTranspose1d(
306
+ in_channels, out_channels, kernel_size,
307
+ stride=stride, dilation=dilation
308
+ )
309
+ self.stride = stride
310
+ self.kernel_size = kernel_size
311
+
312
+ def forward(self, x: Tensor) -> Tensor:
313
+ x = self.conv(x)
314
+ pad = self.kernel_size - self.stride
315
+ padding_right = math.ceil(pad)
316
+ padding_left = pad - padding_right
317
+ x = unpad1d(x, (padding_left, padding_right))
318
+ return x.contiguous()
319
+
320
+ def weight_norm(self, name="weight", dim=0):
321
+ self.conv = weight_norm(self.conv, name=name, dim=dim)
322
+ return self
323
+
324
+ def remove_weight_norm(self):
325
+ self.conv = remove_parametrizations(self.conv)
326
+ return self
327
+
328
+
329
+ def CausalWNConv1d(*args, **kwargs):
330
+ return CausalConvNet(*args, **kwargs).weight_norm()
331
+
332
+ def CausalWNConvTranspose1d(*args, **kwargs):
333
+ return CausalTransConvNet(*args, **kwargs).weight_norm()
334
+
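
A quick length check for the causal wrappers (illustrative): left-only padding plus the computed extra right pad gives ceil(T / stride) output frames, and the transposed version restores the stride factor.

import math
import torch

conv = CausalWNConv1d(8, 16, kernel_size=4, stride=2)
x = torch.randn(1, 8, 101)
y = conv(x)
assert y.shape[-1] == math.ceil(101 / 2)    # ceil(T / stride) frames

up = CausalWNConvTranspose1d(16, 8, kernel_size=4, stride=2)
assert up(y).shape[-1] == 2 * y.shape[-1]   # upsampled back by the stride factor
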
335
+ class ConvNeXtBlock(nn.Module):
336
+ r"""ConvNeXt Block (1D).
337
+ DwConv -> (N, C, L) → (N, L, C) -> LN -> Linear -> GELU -> Linear -> (N, C, L) with residual
338
+ """
339
+ def __init__(
340
+ self,
341
+ dim: int,
342
+ layer_scale_init_value: float = 1e-6,
343
+ mlp_ratio: float = 4.0,
344
+ kernel_size: int = 7,
345
+ dilation: int = 1,
346
+ ):
347
+ super().__init__()
348
+ convnet_type = CausalConvNet
349
+ self.dwconv = convnet_type(
350
+ dim, dim, kernel_size=kernel_size,
351
+ groups=dim, dilation=dilation,
352
+ ) # depthwise conv
353
+ self.norm = nn.LayerNorm(dim, eps=1e-6)
354
+ self.pwconv1 = nn.Linear(dim, int(mlp_ratio * dim))
355
+ self.act = nn.GELU()
356
+ self.pwconv2 = nn.Linear(int(mlp_ratio * dim), dim)
357
+ self.gamma = (
358
+ nn.Parameter(layer_scale_init_value * torch.ones((dim)), requires_grad=True)
359
+ if layer_scale_init_value > 0 else None
360
+ )
361
+
362
+ def forward(self, x: Tensor, apply_residual: bool = True) -> Tensor:
363
+ inp = x
364
+ x = self.dwconv(x)
365
+ x = x.permute(0, 2, 1) # (N, C, L) -> (N, L, C)
366
+ x = self.norm(x)
367
+ x = self.pwconv1(x)
368
+ x = self.act(x)
369
+ x = self.pwconv2(x)
370
+ if self.gamma is not None:
371
+ x = self.gamma * x
372
+ x = x.permute(0, 2, 1) # (N, L, C) -> (N, C, L)
373
+ if apply_residual:
374
+ x = inp + x
375
+ return x
376
+
377
+
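
A shape check for the 1D ConvNeXt block (illustrative): the depthwise conv is causal and length-preserving, so the residual output matches the channels-first input shape.

import torch

block = ConvNeXtBlock(dim=32)
x = torch.randn(2, 32, 50)        # (B, C, T), channels-first like the surrounding conv stack
assert block(x).shape == x.shape
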
378
+ class DownsampleResidualVectorQuantize(nn.Module):
379
+ def __init__(
380
+ self,
381
+ input_dim: int = 1024,
382
+ n_codebooks: int = 9,
383
+ codebook_dim: int = 8,
384
+ quantizer_dropout: float = 0.5,
385
+ codebook_size: int = 1024,
386
+ semantic_codebook_size: int = 4096,
387
+ downsample_factor: Tuple[int, ...] = (2, 2),
388
+ downsample_dims: Optional[Tuple[int, ...]] = None,
389
+ pre_module: Optional[nn.Module] = None,
390
+ post_module: Optional[nn.Module] = None,
391
+ semantic_predictor_module: Optional[nn.Module] = None,
392
+ ):
393
+ super().__init__()
394
+
395
+ if downsample_dims is None:
396
+ downsample_dims = tuple(input_dim for _ in range(len(downsample_factor)))
397
+
398
+ all_dims = (input_dim,) + tuple(downsample_dims)
399
+
400
+ self.semantic_quantizer = ResidualVectorQuantize(
401
+ input_dim=input_dim,
402
+ n_codebooks=1,
403
+ codebook_size=semantic_codebook_size,
404
+ codebook_dim=codebook_dim,
405
+ quantizer_dropout=0.0,
406
+ )
407
+
408
+ self.quantizer = ResidualVectorQuantize(
409
+ input_dim=input_dim,
410
+ n_codebooks=n_codebooks,
411
+ codebook_size=codebook_size,
412
+ codebook_dim=codebook_dim,
413
+ quantizer_dropout=quantizer_dropout,
414
+ )
415
+
416
+ convnet_type = CausalConvNet
417
+ transconvnet_type = CausalTransConvNet
418
+
419
+ self.downsample = nn.Sequential(
420
+ *[
421
+ nn.Sequential(
422
+ convnet_type(all_dims[idx], all_dims[idx + 1], kernel_size=factor, stride=factor),
423
+ ConvNeXtBlock(dim=all_dims[idx + 1]),
424
+ )
425
+ for idx, factor in enumerate(downsample_factor)
426
+ ]
427
+ )
428
+
429
+ self.upsample = nn.Sequential(
430
+ *[
431
+ nn.Sequential(
432
+ transconvnet_type(all_dims[idx + 1], all_dims[idx], kernel_size=factor, stride=factor),
433
+ ConvNeXtBlock(dim=all_dims[idx]),
434
+ )
435
+ for idx, factor in reversed(list(enumerate(downsample_factor)))
436
+ ]
437
+ )
438
+
439
+ self.apply(self._init_weights)
440
+ self.pre_module = pre_module if pre_module is not None else nn.Identity()
441
+ self.post_module = post_module if post_module is not None else nn.Identity()
442
+ self.semantic_predictor_module = (
443
+ semantic_predictor_module if semantic_predictor_module is not None else nn.Identity()
444
+ )
445
+
446
+ @staticmethod
447
+ def _init_weights(m):
448
+ if isinstance(m, (nn.Conv1d, nn.Linear)):
449
+ nn.init.trunc_normal_(m.weight, std=0.02)
450
+ if getattr(m, "bias", None) is not None:
451
+ nn.init.constant_(m.bias, 0)
452
+
453
+ def forward(self, z: Tensor, n_quantizers: Optional[int] = None, semantic_len: Optional[Tensor] = None, **kwargs):
454
+ # z: (B, D, T)
455
+ original_shape = z.shape
456
+ if semantic_len is None:
457
+ semantic_len = torch.LongTensor([z.shape[-1]])
458
+
459
+ z = self.downsample(z)
460
+ z = self.pre_module(z) # (B, D, T) or (B, T, D) depending on module; original uses channels-first in/out
461
+
462
+ semantic_z, semantic_codes, semantic_latents, semantic_commitment_loss, semantic_codebook_loss = \
463
+ self.semantic_quantizer(z)
464
+ residual_z = z - semantic_z
465
+ residual_z, codes, latents, commitment_loss, codebook_loss = self.quantizer(residual_z, n_quantizers=n_quantizers)
466
+ z = semantic_z + residual_z
467
+ commitment_loss = commitment_loss + semantic_commitment_loss
468
+ codebook_loss = codebook_loss + semantic_codebook_loss
469
+ codes = torch.cat([semantic_codes, codes], dim=1)
470
+ latents = torch.cat([semantic_latents, latents], dim=1)
471
+ z = self.post_module(z)
472
+ z = self.upsample(z)
473
+
474
+ # Pad or crop z to match original shape (time dimension)
475
+ diff = original_shape[-1] - z.shape[-1]
476
+ right = 0
477
+ left = abs(diff) - right
478
+ if diff > 0:
479
+ z = F.pad(z, (left, right))
480
+ elif diff < 0:
481
+ z = z[..., left:]
482
+
483
+ return VQResult(
484
+ z=z, codes=codes, latents=latents,
485
+ commitment_loss=commitment_loss, codebook_loss=codebook_loss,
486
+ )
487
+
488
+ def decode(self, indices: Tensor) -> Tensor:
489
+ new_indices = torch.zeros_like(indices)
490
+ new_indices[:, 0] = torch.clamp(indices[:, 0], max=self.semantic_quantizer.codebook_size - 1)
491
+ new_indices[:, 1:] = torch.clamp(indices[:, 1:], max=self.quantizer.codebook_size - 1)
492
+
493
+ z_q_semantic = self.semantic_quantizer.from_codes(new_indices[:, :1])[0]
494
+ z_q_residual = self.quantizer.from_codes(new_indices[:, 1:])[0]
495
+ z_q = z_q_semantic + z_q_residual
496
+ z_q = self.post_module(z_q)
497
+ z_q = self.upsample(z_q)
498
+ return z_q
499
+
500
+
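
An illustrative end-to-end pass through the downsampling RVQ (small arbitrary dims, eval mode): features are downsampled 4x, quantized by one semantic codebook plus the residual stack, then upsampled back to the input rate; decode() rebuilds features from indices alone.

import torch

dq = DownsampleResidualVectorQuantize(
    input_dim=64, n_codebooks=3, codebook_size=256, codebook_dim=8,
    semantic_codebook_size=512, downsample_factor=(2, 2),
)
dq.eval()
z = torch.randn(1, 64, 40)                 # (B, D, T) encoder features
out = dq(z)
assert out.codes.shape == (1, 1 + 3, 10)   # 1 semantic + 3 residual codebooks, T/4 frames
assert out.z.shape == z.shape              # upsampled back to the encoder frame rate
assert dq.decode(out.codes).shape == z.shape
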
501
+ # --------------------------------------------------------------------
502
+ # Transformer stack
503
+ # --------------------------------------------------------------------
504
+
505
+ @dataclass
506
+ class ModelArgs:
507
+ block_size: int = 2048
508
+ n_layer: int = 8
509
+ n_head: int = 8
510
+ dim: int = 512
511
+ intermediate_size: int = 1536
512
+ n_local_heads: int = -1
513
+ head_dim: int = 64
514
+ rope_base: float = 10000
515
+ norm_eps: float = 1e-5
516
+ dropout_rate: float = 0.1
517
+ attn_dropout_rate: float = 0.1
518
+ channels_first: bool = True # to be compatible with conv1d input/output
519
+ pos_embed_type: str = "rope" # "rope" or "conformer"
520
+ max_relative_position: int = 128
521
+
522
+ def __post_init__(self):
523
+ if self.n_local_heads == -1:
524
+ self.n_local_heads = self.n_head
525
+ if self.intermediate_size is None:
526
+ hidden_dim = 4 * self.dim
527
+ n_hidden = int(2 * hidden_dim / 3)
528
+ self.intermediate_size = find_multiple(n_hidden, 256)
529
+ assert self.pos_embed_type in ["rope", "conformer"]
530
+
531
+
532
+ class KVCache(nn.Module):
533
+ def __init__(self, max_batch_size, max_seq_length, n_heads, head_dim, dtype=torch.bfloat16):
534
+ super().__init__()
535
+ cache_shape = (max_batch_size, n_heads, max_seq_length, head_dim)
536
+ self.register_buffer("k_cache", torch.zeros(cache_shape, dtype=dtype))
537
+ self.register_buffer("v_cache", torch.zeros(cache_shape, dtype=dtype))
538
+
539
+ def update(self, input_pos: Tensor, k_val: Tensor, v_val: Tensor):
540
+ # input_pos: [S], k_val: [B, H, S, D]
541
+ assert input_pos.shape[0] == k_val.shape[2]
542
+ k_out = self.k_cache
543
+ v_out = self.v_cache
544
+ k_out[:, :, input_pos] = k_val
545
+ v_out[:, :, input_pos] = v_val
546
+ return (
547
+ k_out[:, :, : input_pos.max() + 1, :],
548
+ v_out[:, :, : input_pos.max() + 1, :],
549
+ )
550
+
551
+ def clear_cache(self, prompt_len: int):
552
+ self.k_cache[:, :, prompt_len:, :].fill_(0)
553
+ self.v_cache[:, :, prompt_len:, :].fill_(0)
554
+
555
+
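
A small sketch of the cache contract (illustrative sizes): update() writes keys/values at the given positions and returns views truncated at max(input_pos) + 1.

import torch

cache = KVCache(max_batch_size=1, max_seq_length=16, n_heads=4, head_dim=8, dtype=torch.float32)
pos = torch.arange(3)
k = torch.randn(1, 4, 3, 8)          # (B, H, S, D)
v = torch.randn(1, 4, 3, 8)
k_out, v_out = cache.update(pos, k, v)
assert k_out.shape == (1, 4, 3, 8)   # truncated at max(input_pos) + 1
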
556
+ class Transformer(nn.Module):
557
+ def __init__(self, config: ModelArgs) -> None:
558
+ super().__init__()
559
+ self.config = config
560
+
561
+ self.layers = nn.ModuleList(TransformerBlock(config) for _ in range(config.n_layer))
562
+ self.norm = RMSNorm(config.dim, eps=config.norm_eps)
563
+
564
+ if config.pos_embed_type == "rope":
565
+ freqs_cis = precompute_freqs_cis(self.config.block_size, self.config.head_dim, self.config.rope_base)
566
+ self.register_buffer("freqs_cis", freqs_cis)
567
+ else:
568
+ self.register_buffer("freqs_cis", None)
569
+
570
+ causal_mask = torch.tril(torch.ones(self.config.block_size, self.config.block_size, dtype=torch.bool))
571
+ self.register_buffer("causal_mask", causal_mask)
572
+
573
+ self.max_batch_size = -1
574
+ self.max_seq_length = -1
575
+ self.use_kv_cache = False
576
+
577
+ def setup_caches(self, max_batch_size, max_seq_length):
578
+ head_dim = self.config.dim // self.config.n_head
579
+ max_seq_length = find_multiple(max_seq_length, 8)
580
+ self.max_seq_length = max_seq_length
581
+ self.max_batch_size = max_batch_size
582
+ dtype = self.norm.weight.dtype
583
+ device = self.norm.weight.device
584
+
585
+ for b in self.layers:
586
+ b.attention.kv_cache = KVCache(
587
+ max_batch_size, max_seq_length, self.config.n_local_heads, head_dim, dtype
588
+ ).to(device)
589
+
590
+ self.use_kv_cache = True
591
+
592
+ def forward(self, x: Tensor, input_pos: Optional[Tensor] = None, mask: Optional[Tensor] = None) -> Tensor:
593
+ if self.config.pos_embed_type == "rope":
594
+ assert self.freqs_cis is not None
595
+ freqs_cis = self.freqs_cis[input_pos]
596
+ else:
597
+ freqs_cis = None
598
+
599
+ if mask is None:
600
+ if not self.training and self.use_kv_cache:
601
+ mask = self.causal_mask[None, None, input_pos]
602
+ mask = mask[..., : input_pos.max() + 1]
603
+ else:
604
+ mask = self.causal_mask[None, None, input_pos]
605
+ mask = mask[..., input_pos]
606
+
607
+ for layer in self.layers:
608
+ x = layer(x, input_pos, freqs_cis, mask)
609
+ x = self.norm(x)
610
+ return x
611
+
612
+
613
+ class TransformerBlock(nn.Module):
614
+ def __init__(self, config: ModelArgs) -> None:
615
+ super().__init__()
616
+ self.attention = Attention(config)
617
+ self.feed_forward = FeedForward(config)
618
+ self.ffn_norm = RMSNorm(config.dim, eps=config.norm_eps)
619
+ self.attention_norm = RMSNorm(config.dim, eps=config.norm_eps)
620
+ self.attention_layer_scale = LayerScale(config.dim, inplace=True)
621
+ self.ffn_layer_scale = LayerScale(config.dim, inplace=True)
622
+
623
+ def forward(self, x: Tensor, input_pos: Tensor, freqs_cis: Tensor, mask: Tensor) -> Tensor:
624
+ h = x + self.attention_layer_scale(
625
+ self.attention(self.attention_norm(x), freqs_cis, mask, input_pos)
626
+ )
627
+ out = h + self.ffn_layer_scale(self.feed_forward(self.ffn_norm(h)))
628
+ return out
629
+
630
+
631
+ class Attention(nn.Module):
632
+ def __init__(self, config: ModelArgs):
633
+ super().__init__()
634
+ assert config.dim % config.n_head == 0
635
+
636
+ total_head_dim = (config.n_head + 2 * config.n_local_heads) * config.head_dim
637
+ self.wqkv = nn.Linear(config.dim, total_head_dim, bias=False)
638
+ self.wo = nn.Linear(config.head_dim * config.n_head, config.dim, bias=False)
639
+ self.kv_cache = None
640
+
641
+ self.n_head = config.n_head
642
+ self.head_dim = config.head_dim
643
+ self.n_local_heads = config.n_local_heads
644
+ self.dim = config.dim
645
+ self.attn_dropout_rate = config.attn_dropout_rate
646
+ self.pos_embed_type = config.pos_embed_type
647
+
648
+ if self.pos_embed_type == "conformer":
649
+ self.max_relative_position = config.max_relative_position
650
+ num_pos_embeddings = 2 * config.max_relative_position + 1
651
+ self.rel_pos_embeddings = nn.Parameter(torch.zeros(num_pos_embeddings, self.head_dim))
652
+ nn.init.normal_(self.rel_pos_embeddings, mean=0.0, std=0.02)
653
+
654
+ def _compute_conformer_pos_scores(self, q: Tensor, seqlen: int) -> Tensor:
655
+ positions = torch.arange(seqlen, device=q.device)
656
+ relative_positions = positions.unsqueeze(1) - positions.unsqueeze(0) # [S, S]
657
+ relative_positions = torch.clamp(relative_positions + self.max_relative_position,
658
+ 0, 2 * self.max_relative_position)
659
+ rel_embeddings = self.rel_pos_embeddings[relative_positions] # [S, S, D]
660
+ q = q.transpose(1, 2) # [B, S, H, D]
661
+ rel_logits = torch.matmul(q, rel_embeddings.transpose(-2, -1)) # [B, S, H, S]
662
+ rel_logits = rel_logits.transpose(1, 2) # [B, H, S, S]
663
+ return rel_logits
664
+
665
+ def forward(self, x: Tensor, freqs_cis: Tensor, mask: Tensor, input_pos: Optional[Tensor] = None) -> Tensor:
666
+ bsz, seqlen, _ = x.shape
667
+
668
+ kv_size = self.n_local_heads * self.head_dim
669
+ q, k, v = self.wqkv(x).split([kv_size, kv_size, kv_size], dim=-1)
670
+ context_seqlen = seqlen
671
+
672
+ q = q.view(bsz, seqlen, self.n_head, self.head_dim)
673
+ k = k.view(bsz, context_seqlen, self.n_local_heads, self.head_dim)
674
+ v = v.view(bsz, context_seqlen, self.n_local_heads, self.head_dim)
675
+
676
+ if self.pos_embed_type == "rope":
677
+ q = apply_rotary_emb(q, freqs_cis)
678
+ k = apply_rotary_emb(k, freqs_cis)
679
+
680
+ q, k, v = map(lambda t: t.transpose(1, 2), (q, k, v))
681
+
682
+ if self.kv_cache is not None:
683
+ k, v = self.kv_cache.update(input_pos, k, v)
684
+
685
+ k = k.repeat_interleave(self.n_head // self.n_local_heads, dim=1)
686
+ v = v.repeat_interleave(self.n_head // self.n_local_heads, dim=1)
687
+
688
+ if self.pos_embed_type == "conformer":
689
+ scale = 1.0 / math.sqrt(self.head_dim)
690
+ scores = torch.matmul(q, k.transpose(-2, -1)) * scale
691
+ rel_scores = self._compute_conformer_pos_scores(q, seqlen)
692
+ scores = scores + rel_scores
693
+ if mask is not None:
694
+ scores = scores.masked_fill(~mask, float("-inf"))
695
+ attn = F.softmax(scores, dim=-1)
696
+ if self.attn_dropout_rate > 0 and self.training:
697
+ attn = F.dropout(attn, p=self.attn_dropout_rate)
698
+ y = torch.matmul(attn, v)
699
+ else:
700
+ y = F.scaled_dot_product_attention(
701
+ q, k, v,
702
+ dropout_p=self.attn_dropout_rate if self.training else 0.0,
703
+ attn_mask=mask,
704
+ )
705
+ y = y.transpose(1, 2).contiguous().view(bsz, seqlen, self.head_dim * self.n_head)
706
+ y = self.wo(y)
707
+ return y
708
+
709
+
710
+ class FeedForward(nn.Module):
711
+ def __init__(self, config: ModelArgs) -> None:
712
+ super().__init__()
713
+ self.w1 = nn.Linear(config.dim, config.intermediate_size, bias=False)
714
+ self.w3 = nn.Linear(config.dim, config.intermediate_size, bias=False)
715
+ self.w2 = nn.Linear(config.intermediate_size, config.dim, bias=False)
716
+ self.dropout = nn.Dropout(config.dropout_rate)
717
+
718
+ def forward(self, x: Tensor) -> Tensor:
719
+ return self.w2(self.dropout(F.silu(self.w1(x)) * self.w3(x)))
720
+
721
+
722
+ class RMSNorm(nn.Module):
723
+ def __init__(self, dim: int, eps: float = 1e-5):
724
+ super().__init__()
725
+ self.eps = eps
726
+ self.weight = nn.Parameter(torch.ones(dim))
727
+
728
+ def _norm(self, x):
729
+ return x * torch.rsqrt(torch.mean(x * x, dim=-1, keepdim=True) + self.eps)
730
+
731
+ def forward(self, x: Tensor) -> Tensor:
732
+ output = self._norm(x.float()).type_as(x)
733
+ return output * self.weight
734
+
735
+
736
+ class LayerScale(nn.Module):
737
+ def __init__(self, dim: int, init_values: Union[float, Tensor] = 1e-2, inplace: bool = False) -> None:
738
+ super().__init__()
739
+ self.inplace = inplace
740
+ self.gamma = nn.Parameter(init_values * torch.ones(dim))
741
+
742
+ def forward(self, x: Tensor) -> Tensor:
743
+ return x.mul_(self.gamma) if self.inplace else x * self.gamma
744
+
745
+
746
+ class WindowLimitedTransformer(Transformer):
747
+ """Transformer with window-limited causal attention."""
748
+ def __init__(
749
+ self,
750
+ config: ModelArgs,
751
+ input_dim: int = 512,
752
+ window_size: Optional[int] = None,
753
+ causal: bool = True,
754
+ look_ahead_conv: Optional[nn.Module] = None,
755
+ ):
756
+ super().__init__(config)
757
+ self.window_size = window_size
758
+ self.causal = causal
759
+ self.channels_first = config.channels_first
760
+ self.look_ahead_conv = look_ahead_conv if look_ahead_conv is not None else nn.Identity()
761
+ self.input_proj = nn.Linear(input_dim, config.dim) if input_dim != config.dim else nn.Identity()
762
+ self.output_proj = nn.Linear(config.dim, input_dim) if input_dim != config.dim else nn.Identity()
763
+
764
+ def make_window_limited_mask(self, max_length: int, x_lens: Optional[Tensor] = None) -> Tensor:
765
+ if self.causal:
766
+ mask = torch.tril(torch.ones(max_length, max_length))
767
+ row_indices = torch.arange(max_length).view(-1, 1)
768
+ window_size = self.window_size or max_length
769
+ valid_range = (row_indices - window_size + 1).clamp(min=0)
770
+ column_indices = torch.arange(max_length)
771
+ mask = (column_indices >= valid_range) & mask.bool()
772
+ else:
773
+ raise NotImplementedError
774
+ mask = mask.bool()[None, None]
775
+ return mask
776
+
777
+ def make_mask(self, max_length: int, x_lens: Optional[Tensor] = None) -> Tensor:
778
+ if self.causal:
779
+ mask = torch.tril(torch.ones(max_length, max_length))
780
+ else:
781
+ mask = torch.ones(max_length, max_length)
782
+ mask = mask.bool()[None, None]
783
+ for i, x_len in enumerate(x_lens):
784
+ mask[:x_len, i] = 0
785
+ mask = mask.bool()[None, None]
786
+ return mask
787
+
788
+ def forward(self, x: Tensor, x_lens: Optional[Tensor] = None) -> Tensor:
789
+ if self.channels_first:
790
+ x = x.transpose(1, 2)
791
+ x = self.input_proj(x)
792
+ x = self.look_ahead_conv(x)
793
+ input_pos = torch.arange(x.shape[1], device=x.device)
794
+ max_length = x.shape[1]
795
+ if self.window_size is not None:
796
+ mask = self.make_window_limited_mask(max_length, x_lens)
797
+ else:
798
+ mask = self.make_mask(max_length, x_lens)
799
+ mask = mask.to(x.device)
800
+ x = super().forward(x, input_pos, mask)
801
+ x = self.output_proj(x)
802
+ if self.channels_first:
803
+ x = x.transpose(1, 2)
804
+ return x
805
+
806
+
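
An illustrative look at the window-limited causal mask (tiny arbitrary config): each position may attend only to itself and the previous window_size - 1 positions.

import torch

args = ModelArgs(block_size=64, n_layer=1, n_head=2, dim=32, intermediate_size=64, head_dim=16)
wlt = WindowLimitedTransformer(args, input_dim=32, window_size=4)
mask = wlt.make_window_limited_mask(max_length=8)
assert mask.shape == (1, 1, 8, 8)
assert not mask[0, 0, 7, 3] and mask[0, 0, 7, 4]   # row 7 attends to positions 4..7 only
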
807
+ def precompute_freqs_cis(
808
+ seq_len: int, n_elem: int, base: int = 10000, dtype: torch.dtype = torch.bfloat16
809
+ ) -> Tensor:
810
+ freqs = 1.0 / (base ** (torch.arange(0, n_elem, 2)[: (n_elem // 2)].float() / n_elem))
811
+ t = torch.arange(seq_len, device=freqs.device)
812
+ freqs = torch.outer(t, freqs)
813
+ freqs_cis = torch.polar(torch.ones_like(freqs), freqs)
814
+ cache = torch.stack([freqs_cis.real, freqs_cis.imag], dim=-1)
815
+ return cache.to(dtype=dtype)
816
+
817
+ def apply_rotary_emb(x: Tensor, freqs_cis: Tensor) -> Tensor:
818
+ xshaped = x.float().reshape(*x.shape[:-1], -1, 2)
819
+ freqs_cis = freqs_cis.view(1, xshaped.size(1), 1, xshaped.size(3), 2)
820
+ x_out2 = torch.stack(
821
+ [
822
+ xshaped[..., 0] * freqs_cis[..., 0] - xshaped[..., 1] * freqs_cis[..., 1],
823
+ xshaped[..., 1] * freqs_cis[..., 0] + xshaped[..., 0] * freqs_cis[..., 1],
824
+ ],
825
+ -1,
826
+ )
827
+ x_out2 = x_out2.flatten(3)
828
+ return x_out2.type_as(x)
829
+
830
+
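
A shape check for the rotary helpers (illustrative): the cache holds (seq_len, head_dim / 2, 2) cos/sin pairs, and applying it preserves the (B, S, H, head_dim) query/key shape.

import torch

freqs = precompute_freqs_cis(seq_len=32, n_elem=16, dtype=torch.float32)  # (32, 8, 2)
q = torch.randn(1, 32, 4, 16)                                             # (B, S, H, head_dim)
assert apply_rotary_emb(q, freqs).shape == q.shape
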
831
+ def init_weights(m):
832
+ if isinstance(m, nn.Conv1d):
833
+ nn.init.trunc_normal_(m.weight, std=0.02)
834
+ nn.init.constant_(m.bias, 0)
835
+
836
+
837
+ # --------------------------------------------------------------------
838
+ # Top-level AE
839
+ # --------------------------------------------------------------------
840
+
841
+ class EncoderBlock(nn.Module):
842
+ def __init__(
843
+ self,
844
+ dim: int = 16,
845
+ stride: int = 1,
846
+ causal: bool = False,
847
+ n_t_layer: int = 0,
848
+ transformer_general_config=None,
849
+ ):
850
+ super().__init__()
851
+ conv_class = CausalWNConv1d if causal else WNConv1d
852
+ transformer_module = (
853
+ nn.Identity()
854
+ if n_t_layer == 0
855
+ else WindowLimitedTransformer(
856
+ causal=causal,
857
+ input_dim=dim,
858
+ window_size=512,
859
+ config=transformer_general_config(
860
+ n_layer=n_t_layer,
861
+ n_head=dim // 64,
862
+ dim=dim,
863
+ intermediate_size=dim * 3,
864
+ ),
865
+ )
866
+ )
867
+ self.block = nn.Sequential(
868
+ # three multi‑receptive‑field residual units
869
+ ResidualUnit(dim // 2, dilation=1, causal=causal),
870
+ ResidualUnit(dim // 2, dilation=3, causal=causal),
871
+ ResidualUnit(dim // 2, dilation=9, causal=causal),
872
+ Snake1d(dim // 2),
873
+ conv_class(dim // 2, dim, kernel_size=2 * stride, stride=stride, padding=math.ceil(stride / 2)),
874
+ transformer_module,
875
+ )
876
+
877
+ def forward(self, x: Tensor) -> Tensor:
878
+ return self.block(x)
879
+
880
+
881
+ class ResidualUnit(nn.Module):
882
+ def __init__(self, dim: int = 16, dilation: int = 1, causal: bool = False):
883
+ super().__init__()
884
+ conv_class = CausalWNConv1d if causal else WNConv1d
885
+ pad = ((7 - 1) * dilation) // 2
886
+ self.block = nn.Sequential(
887
+ Snake1d(dim),
888
+ conv_class(dim, dim, kernel_size=7, dilation=dilation, padding=pad),
889
+ Snake1d(dim),
890
+ conv_class(dim, dim, kernel_size=1),
891
+ )
892
+ self.causal = causal
893
+
894
+ def forward(self, x: Tensor) -> Tensor:
895
+ y = self.block(x)
896
+ pad = x.shape[-1] - y.shape[-1]
897
+ if pad > 0:
898
+ if self.causal:
899
+ x = x[..., :-pad]
900
+ else:
901
+ x = x[..., pad // 2 : -pad // 2]
902
+ return x + y
903
+
904
+
905
+ class Encoder(nn.Module):
906
+ def __init__(
907
+ self,
908
+ d_model: int = 64,
909
+ strides: List[int] = [2, 4, 8, 8],
910
+ d_latent: int = 64,
911
+ n_transformer_layers: List[int] = [0, 0, 4, 4],
912
+ transformer_general_config: Optional[ModelArgs] = None,
913
+ causal: bool = False,
914
+ ):
915
+ super().__init__()
916
+ conv_class = CausalWNConv1d if causal else WNConv1d
917
+ layers: List[nn.Module] = [conv_class(1, d_model, kernel_size=7, padding=3)]
918
+ for stride, n_t_layer in zip(strides, n_transformer_layers):
919
+ d_model *= 2
920
+ layers.append(
921
+ EncoderBlock(
922
+ d_model, stride=stride, causal=causal,
923
+ n_t_layer=n_t_layer, transformer_general_config=transformer_general_config,
924
+ )
925
+ )
926
+ layers += [Snake1d(d_model), conv_class(d_model, d_latent, kernel_size=3, padding=1)]
927
+ self.block = nn.Sequential(*layers)
928
+ self.enc_dim = d_model
929
+
930
+ def forward(self, x: Tensor) -> Tensor:
931
+ return self.block(x)
932
+
933
+
934
+ class DecoderBlock(nn.Module):
935
+ def __init__(
936
+ self,
937
+ input_dim: int = 16,
938
+ output_dim: int = 8,
939
+ stride: int = 1,
940
+ causal: bool = False,
941
+ n_t_layer: int = 0,
942
+ transformer_general_config=None,
943
+ ):
944
+ super().__init__()
945
+ conv_trans_class = CausalWNConvTranspose1d if causal else WNConvTranspose1d
946
+ transformer_module = (
947
+ nn.Identity()
948
+ if n_t_layer == 0
949
+ else WindowLimitedTransformer(
950
+ causal=causal,
951
+ input_dim=input_dim,
952
+ window_size=None,
953
+ config=transformer_general_config(
954
+ n_layer=n_t_layer,
955
+ n_head=input_dim // 64,
956
+ dim=input_dim,
957
+ intermediate_size=input_dim * 3,
958
+ ),
959
+ )
960
+ )
961
+ self.block = nn.Sequential(
962
+ Snake1d(input_dim),
963
+ conv_trans_class(input_dim, output_dim, kernel_size=2 * stride, stride=stride, padding=math.ceil(stride / 2)),
964
+ ResidualUnit(output_dim, dilation=1, causal=causal),
965
+ ResidualUnit(output_dim, dilation=3, causal=causal),
966
+ ResidualUnit(output_dim, dilation=9, causal=causal),
967
+ )
968
+
969
+ def forward(self, x: Tensor) -> Tensor:
970
+ return self.block(x)
971
+
972
+
973
+ class Decoder(nn.Module):
974
+ def __init__(
975
+ self,
976
+ input_channel: int,
977
+ channels: int,
978
+ rates: List[int],
979
+ d_out: int = 1,
980
+ causal: bool = False,
981
+ n_transformer_layers: List[int] = [0, 0, 0, 0],
982
+ transformer_general_config=None,
983
+ ):
984
+ super().__init__()
985
+ conv_class = CausalWNConv1d if causal else WNConv1d
986
+ layers: List[nn.Module] = [conv_class(input_channel, channels, kernel_size=7, padding=3)]
987
+ for i, (stride, n_t_layer) in enumerate(zip(rates, n_transformer_layers)):
988
+ input_dim = channels // 2**i
989
+ output_dim = channels // 2 ** (i + 1)
990
+ layers.append(
991
+ DecoderBlock(
992
+ input_dim, output_dim, stride, causal=causal,
993
+ n_t_layer=n_t_layer, transformer_general_config=transformer_general_config,
994
+ )
995
+ )
996
+ layers += [Snake1d(output_dim), conv_class(output_dim, d_out, kernel_size=7, padding=3), nn.Tanh()]
997
+ self.model = nn.Sequential(*layers)
998
+
999
+ def forward(self, x: Tensor) -> Tensor:
1000
+ return self.model(x)
1001
+
1002
+
1003
+ class DAC(nn.Module):
1004
+ def __init__(
1005
+ self,
1006
+ encoder_dim: int = 64,
1007
+ encoder_rates: List[int] = [2, 4, 8, 8],
1008
+ latent_dim: Optional[int] = None,
1009
+ decoder_dim: int = 1536,
1010
+ decoder_rates: List[int] = [8, 8, 4, 2],
1011
+ quantizer: Optional[nn.Module] = None,
1012
+ sample_rate: int = 44100,
1013
+ causal: bool = True,
1014
+ encoder_transformer_layers: List[int] = [0, 0, 0, 0],
1015
+ decoder_transformer_layers: List[int] = [0, 0, 0, 0],
1016
+ transformer_general_config=None,
1017
+ ):
1018
+ super().__init__()
1019
+
1020
+ self.encoder_dim = encoder_dim
1021
+ self.encoder_rates = encoder_rates
1022
+ self.decoder_dim = decoder_dim
1023
+ self.decoder_rates = decoder_rates
1024
+ self.sample_rate = sample_rate
1025
+
1026
+ if latent_dim is None:
1027
+ latent_dim = encoder_dim * (2 ** len(encoder_rates))
1028
+ self.latent_dim = latent_dim
1029
+
1030
+ self.hop_length = int(np.prod(encoder_rates))
1031
+ self.encoder = Encoder(
1032
+ encoder_dim, encoder_rates, latent_dim, causal=causal,
1033
+ n_transformer_layers=encoder_transformer_layers,
1034
+ transformer_general_config=transformer_general_config,
1035
+ )
1036
+ self.quantizer = quantizer
1037
+ self.decoder = Decoder(
1038
+ latent_dim, decoder_dim, decoder_rates, causal=causal,
1039
+ n_transformer_layers=decoder_transformer_layers,
1040
+ transformer_general_config=transformer_general_config,
1041
+ )
1042
+ self.sample_rate = sample_rate
1043
+ self.apply(init_weights)
1044
+
1045
+ self.delay = self.get_delay()
1046
+ self.frame_length = self.hop_length * 4
1047
+
1048
+ def get_output_length(self, input_length: int) -> int:
1049
+ length = input_length
1050
+ for stride in self.encoder_rates:
1051
+ length = math.ceil(length / stride)
1052
+ return length
1053
+
1054
+ def get_delay(self) -> int:
1055
+ l_out = self.get_output_length(0)
1056
+ L = l_out
1057
+
1058
+ layers = [layer for layer in self.modules() if isinstance(layer, (nn.Conv1d, nn.ConvTranspose1d))]
1059
+ for layer in reversed(layers):
1060
+ d = layer.dilation[0]
1061
+ k = layer.kernel_size[0]
1062
+ s = layer.stride[0]
1063
+ if isinstance(layer, nn.ConvTranspose1d):
1064
+ L = ((L - d * (k - 1) - 1) / s) + 1
1065
+ elif isinstance(layer, nn.Conv1d):
1066
+ L = (L - 1) * s + d * (k - 1) + 1
1067
+ L = math.ceil(L)
1068
+
1069
+ l_in = L
1070
+ return (l_in - l_out) // 2
1071
+
1072
+ def preprocess(self, audio_data: Tensor, sample_rate: Optional[int]) -> Tensor:
1073
+ if sample_rate is None:
1074
+ sample_rate = self.sample_rate
1075
+ assert sample_rate == self.sample_rate
1076
+
1077
+ length = audio_data.shape[-1]
1078
+ right_pad = math.ceil(length / self.hop_length) * self.hop_length - length
1079
+ audio_data = F.pad(audio_data, (0, right_pad))
1080
+ return audio_data
1081
+
1082
+ def encode(
1083
+ self,
1084
+ audio_data: Tensor,
1085
+ audio_lengths: Optional[Tensor] = None,
1086
+ n_quantizers: Optional[int] = None,
1087
+ **kwargs,
1088
+ ):
1089
+ """Encode audio to quantized code indices."""
1090
+ if audio_data.ndim == 2:
1091
+ audio_data = audio_data.unsqueeze(1)
1092
+ length = audio_data.shape[-1]
1093
+ right_pad = math.ceil(length / self.frame_length) * self.frame_length - length
1094
+ audio_data = F.pad(audio_data, (0, right_pad))
1095
+ if audio_lengths is None:
1096
+ audio_lengths = torch.LongTensor([length + right_pad]).to(audio_data.device)
1097
+
1098
+ z = self.encoder(audio_data)
1099
+ vq_results = self.quantizer(z, n_quantizers, **kwargs)
1100
+ indices = vq_results.codes
1101
+ indices_lens = torch.ceil(audio_lengths / self.frame_length).long()
1102
+ return indices, indices_lens
1103
+
1104
+ def decode(self, indices: Tensor, feature_lengths: Tensor):
1105
+ """Decode code indices to audio."""
1106
+ if indices.ndim == 2:
1107
+ indices = indices[None]
1108
+ z = self.quantizer.decode(indices)
1109
+ audio_lengths = feature_lengths * self.frame_length
1110
+ return self.decoder(z), audio_lengths
1111
+
1112
+ def encode_to_codes(self, audio: Tensor, audio_lengths: Optional[Tensor] = None, n_quantizers: Optional[int] = None, **kw):
1113
+ return self.encode(audio, audio_lengths, n_quantizers, **kw)
1114
+
1115
+ def decode_codes(self, indices: Tensor, feature_lengths: Tensor):
1116
+ return self.decode(indices, feature_lengths)
1117
+
1118
+ @torch.no_grad()
1119
+ def encode_zq(self, audio_data: Tensor) -> Tensor:
1120
+ indices, _ = self.encode(audio_data)
1121
+ new_indices = torch.zeros_like(indices)
1122
+ new_indices[:, 0] = torch.clamp(indices[:, 0], max=self.quantizer.semantic_quantizer.codebook_size - 1)
1123
+ new_indices[:, 1:] = torch.clamp(indices[:, 1:], max=self.quantizer.quantizer.codebook_size - 1)
1124
+
1125
+ z_q_semantic = self.quantizer.semantic_quantizer.from_codes(new_indices[:, :1])[0]
1126
+ z_q_residual = self.quantizer.quantizer.from_codes(new_indices[:, 1:])[0]
1127
+ z_q = z_q_semantic + z_q_residual
1128
+ return z_q
1129
+
1130
+ @torch.no_grad()
1131
+ def decode_zq(self, z_q: Tensor) -> Tensor:
1132
+ z_q = self.quantizer.post_module(z_q)
1133
+ z_q = self.quantizer.upsample(z_q)
1134
+ return self.decoder(z_q)
1135
+
1136
+ @property
1137
+ def device(self) -> torch.device: return next(self.parameters()).device
1138
+
1139
+ @property
1140
+ def dtype(self) -> torch.dtype: return next(self.parameters()).dtype
1141
+
1142
+ # --------------------------------------------------------------------
1143
+ # Build helpers
1144
+ # --------------------------------------------------------------------
1145
+
1146
+ def build_ae(**cfg) -> DAC:
1147
+ """
1148
+ Factory used by external loaders
1149
+ """
1150
+ # Shared transformer config for the RVQ pre/post modules
1151
+ q_config = ModelArgs(
1152
+ block_size=4096, n_layer=8, n_head=16, dim=1024,
1153
+ intermediate_size=3072, head_dim=64, norm_eps=1e-5,
1154
+ dropout_rate=0.1, attn_dropout_rate=0.1, channels_first=True
1155
+ )
1156
+
1157
+ def make_transformer():
1158
+ return WindowLimitedTransformer(
1159
+ causal=True, window_size=128, input_dim=1024, config=q_config
1160
+ )
1161
+
1162
+ quantizer = DownsampleResidualVectorQuantize(
1163
+ input_dim=1024, n_codebooks=9, codebook_size=1024, codebook_dim=8,
1164
+ quantizer_dropout=0.5, downsample_factor=(2, 2),
1165
+ semantic_codebook_size=4096,
1166
+ pre_module=make_transformer(),
1167
+ post_module=make_transformer(),
1168
+ )
1169
+
1170
+ def transformer_general_config(**kw):
1171
+ return ModelArgs(
1172
+ block_size=kw.get("block_size", 16384),
1173
+ n_layer=kw.get("n_layer", 8),
1174
+ n_head=kw.get("n_head", 8),
1175
+ dim=kw.get("dim", 512),
1176
+ intermediate_size=kw.get("intermediate_size", 1536),
1177
+ n_local_heads=kw.get("n_local_heads", -1),
1178
+ head_dim=kw.get("head_dim", 64),
1179
+ rope_base=kw.get("rope_base", 10000),
1180
+ norm_eps=kw.get("norm_eps", 1e-5),
1181
+ dropout_rate=kw.get("dropout_rate", 0.1),
1182
+ attn_dropout_rate=kw.get("attn_dropout_rate", 0.1),
1183
+ channels_first=kw.get("channels_first", True),
1184
+ )
1185
+
1186
+ dac = DAC(
1187
+ encoder_dim=64, encoder_rates=[2, 4, 8, 8], latent_dim=1024,
1188
+ decoder_dim=1536, decoder_rates=[8, 8, 4, 2],
1189
+ quantizer=quantizer, sample_rate=44100, causal=True,
1190
+ encoder_transformer_layers=[0, 0, 0, 4],
1191
+ decoder_transformer_layers=[4, 0, 0, 0],
1192
+ transformer_general_config=transformer_general_config,
1193
+ )
1194
+ return dac
1195
+
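+ # Illustrative usage sketch (not executed anywhere in this repo): build_ae()
+ # only constructs the architecture the pretrained checkpoint expects; weights
+ # are loaded separately, e.g. via inference.load_fish_ae_from_hf. Roughly:
+ #
+ #   import safetensors.torch as st
+ #   ae = build_ae()
+ #   state = st.load_file("pytorch_model.safetensors")  # local path is an assumption
+ #   ae.load_state_dict(state, strict=False)
+ #   ae = ae.eval()
+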
1196
+ __all__ = [
1197
+ "DAC",
1198
+ "build_ae",
1199
+ "VectorQuantize",
1200
+ "ResidualVectorQuantize",
1201
+ "DownsampleResidualVectorQuantize",
1202
+ ]
1203
+
1204
+
1205
+ # ----- BEGIN DAC MIT LICENSE -----
1206
+ # MIT License
1207
+ # Copyright (c) 2023-present, Descript
1208
+ #
1209
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
1210
+ # of this software and associated documentation files (the "Software"), to deal
1211
+ # in the Software without restriction, including without limitation the rights
1212
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
1213
+ # copies of the Software, and to permit persons to whom the Software is
1214
+ # furnished to do so, subject to the following conditions:
1215
+ #
1216
+ # The above copyright notice and this permission notice shall be included in all
1217
+ # copies or substantial portions of the Software.
1218
+ #
1219
+ # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
1220
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
1221
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
1222
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
1223
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
1224
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
1225
+ # SOFTWARE.
1226
+ # ----- END DAC MIT LICENSE -----
1227
+
inference.py ADDED
@@ -0,0 +1,290 @@
1
+ from dataclasses import dataclass
2
+ from typing import Callable, List, Tuple
3
+ import torch
4
+ import safetensors.torch as st
5
+ from huggingface_hub import hf_hub_download
6
+
7
+ from model import EchoDiT
8
+ from autoencoder import build_ae, DAC
9
+
10
+ import torchaudio
11
+ from torchcodec.decoders import AudioDecoder
12
+
13
+ # from samplers import Sampler
14
+
15
+ SampleFn = Callable[
16
+ [EchoDiT, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, int],
17
+ torch.Tensor
18
+ ]
19
+ ### Loading
20
+
21
+ def load_model_from_hf(repo_id: str = 'jordand/echo-tts', device: str = 'cuda', dtype: torch.dtype | None = torch.bfloat16, compile: bool = False, token: str | None = None) -> EchoDiT:
22
+ with torch.device('meta'):
23
+ model = EchoDiT(
24
+ latent_size=80, model_size=2048, num_layers=24, num_heads=16,
25
+ intermediate_size=5888, norm_eps=1e-5, max_seq_len=640,
26
+ text_vocab_size=256, text_model_size=1280, text_num_layers=14,
27
+ text_num_heads=10, text_intermediate_size=3328, text_max_seq_len=768,
28
+ speaker_patch_size=4, speaker_model_size=1280, speaker_num_layers=14,
29
+ speaker_num_heads=10, speaker_intermediate_size=3328,
30
+ speaker_max_patched_seq_len=640, timestep_embed_size=512, adaln_rank=256,
31
+ )
32
+ w_path = hf_hub_download(repo_id, 'pytorch_model.safetensors', token=token)
33
+
34
+ # Load to CPU first
35
+ state = st.load_file(w_path, device='cpu')
36
+
37
+ # Convert dtype on CPU if needed
38
+ if dtype is not None:
39
+ state = {k: v.to(dtype=dtype) for k, v in state.items()}
40
+
41
+ # Now move to device
42
+ state = {k: v.to(device=device) for k, v in state.items()}
43
+
44
+ model.load_state_dict(state, strict=True, assign=True)
45
+ model = model.eval()
46
+
47
+ if compile:
48
+ model = torch.compile(model)
49
+ model.get_kv_cache = torch.compile(model.get_kv_cache)
50
+
51
+ return model
52
+
53
+ def load_fish_ae_from_hf(repo_id: str = 'jordand/fish-s1-dac-min', device: str = 'cuda', dtype: torch.dtype | None = torch.float32, compile: bool = False, token: str | None = None) -> DAC:
54
+ # have not tested lower precisions with fish AE yet
55
+
56
+ with torch.device('meta'):
57
+ fish_ae = build_ae()
58
+
59
+ w_path = hf_hub_download(repo_id, 'pytorch_model.safetensors', token=token)
60
+ if dtype is not None and dtype != torch.float32:
61
+ state = st.load_file(w_path, device='cpu')
62
+ state = {k: v.to(dtype=dtype) for k, v in state.items()}
63
+ state = {k: v.to(device=device) for k, v in state.items()}
64
+ fish_ae.load_state_dict(state, strict=False, assign=True)
65
+ else:
66
+ state = st.load_file(w_path, device=device)
67
+ fish_ae.load_state_dict(state, strict=False, assign=True)
68
+
69
+ fish_ae = fish_ae.eval().to(device)
70
+
71
+ if compile:
72
+ fish_ae.encoder = torch.compile(fish_ae.encoder)
73
+ fish_ae.decoder = torch.compile(fish_ae.decoder)
74
+
75
+ return fish_ae
76
+
77
+
78
+ @dataclass
79
+ class PCAState:
80
+ pca_components: torch.Tensor
81
+ pca_mean: torch.Tensor
82
+ latent_scale: float
83
+
84
+ def load_pca_state_from_hf(repo_id: str = 'jordand/echo-tts', device: str = 'cuda', filename: str = 'pca_state.safetensors', token: str | None = None) -> PCAState:
85
+ p_path = hf_hub_download(repo_id, filename, token=token)
86
+ t = st.load_file(p_path, device=device)
87
+ return PCAState(
88
+ pca_components=t["pca_components"],
89
+ pca_mean=t["pca_mean"],
90
+ latent_scale=float(t["latent_scale"].item()),
91
+ )
92
+
93
+ ### default load audio
94
+
95
+ def load_audio(path: str) -> torch.Tensor:
96
+
97
+ decoder = AudioDecoder(path)
98
+ sr = decoder.metadata.sample_rate
99
+ audio = decoder.get_samples_played_in_range(0, 120)
100
+ audio = audio.data.mean(dim=0).unsqueeze(0)
101
+ audio = torchaudio.functional.resample(audio, sr, 44_100)
102
+ audio = audio / torch.maximum(audio.abs().max(), torch.tensor(1.))
103
+ # TODO is this better than clipping? should we target a specific energy level?
104
+ return audio
105
+
106
+
107
+
108
+ ### Text helpers
109
+
110
+ def tokenizer_encode(text: str, append_bos: bool = True, normalize: bool = True) -> torch.Tensor:
111
+
112
+ if normalize:
113
+ text = text.replace('…', '...')
114
+ text = text.replace('“', '"')
115
+ text = text.replace('”', '"')
116
+ text = text.replace('’', "'")
117
+ text = text.replace('\n', " ")
118
+
119
+ b = list(text.encode('utf-8'))
120
+ if append_bos:
121
+ b.insert(0, 0)
122
+ return torch.tensor(b)
123
+
124
+ def get_text_input_ids_and_mask(text_arr: List[str], max_length: int | None, device: str | None = None) -> tuple[torch.Tensor, torch.Tensor]:
125
+ batch_size = len(text_arr)
126
+ if max_length is None:
127
+ max_length = max(len(tokenizer_encode(text)) for text in text_arr) # obviously bad...
128
+
129
+ tokens = torch.zeros((batch_size, max_length), dtype=torch.int32)
130
+ mask = torch.zeros((batch_size, max_length), dtype=torch.bool)
131
+
132
+ for i, text in enumerate(text_arr):
133
+ encoded = tokenizer_encode(text)
134
+ length = min(len(encoded), max_length)
135
+ tokens[i, :length] = encoded[:length]
136
+ mask[i, :length] = 1
137
+
138
+ if device is not None:
139
+ tokens = tokens.to(device)
140
+ mask = mask.to(device)
141
+
142
+ return tokens, mask
143
+
144
+
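+ # Worked example (values follow directly from the byte-level scheme above):
+ #   tokenizer_encode("Hi")                  -> tensor([0, 72, 105])   (BOS=0, then UTF-8 bytes)
+ #   get_text_input_ids_and_mask(["Hi"], 5)  -> ids  [[0, 72, 105, 0, 0]],
+ #                                              mask [[True, True, True, False, False]]
+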
145
+ ### Autoencoder Inference
146
+
147
+ @torch.inference_mode()
148
+ def ae_encode(fish_ae: DAC, pca_state: PCAState, audio: torch.Tensor) -> torch.Tensor:
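+ # Maps (b, 1, samples) audio to the 80-dim diffusion latent: the fish AE's
+ # 1024-dim quantized latent z_q is projected through the stored PCA basis
+ # (pca_components has shape (80, 1024) here) and scaled by latent_scale,
+ # yielding (b, frames, 80) with one frame per 2048 samples.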
149
+ assert audio.ndim == 3 and audio.shape[1] == 1 # (b, 1, length)
150
+ z_q = fish_ae.encode_zq(audio).float()
151
+ z_q = (z_q.transpose(1, 2) - pca_state.pca_mean) @ pca_state.pca_components.T
152
+ z_q = z_q * pca_state.latent_scale
153
+ return z_q
154
+
155
+ @torch.inference_mode()
156
+ def ae_decode(fish_ae: DAC, pca_state: PCAState, z_q: torch.Tensor) -> torch.Tensor:
157
+ z_q = (z_q / pca_state.latent_scale) @ pca_state.pca_components + pca_state.pca_mean
158
+ return fish_ae.decode_zq(z_q.transpose(1, 2).to(fish_ae.dtype)).float()
159
+
160
+ @torch.inference_mode()
161
+ def ae_reconstruct(fish_ae: DAC, pca_state: PCAState, audio: torch.Tensor) -> torch.Tensor:
162
+ # (audio is (b, 1, length))
163
+ z_q = ae_encode(fish_ae, pca_state, audio.to(fish_ae.dtype))
164
+ return ae_decode(fish_ae, pca_state, z_q)
165
+
166
+
167
+ @torch.inference_mode()
168
+ def get_speaker_latent_and_mask(
169
+ fish_ae: DAC,
170
+ pca_state: PCAState,
171
+ audio: torch.Tensor, # (1, length)
172
+ max_speaker_latent_len: int = 2560, # pretrained max length
173
+ audio_chunk_size: int = 640 * 2048 # (~30 seconds, 1/4 max speaker condition size)
174
+ ) -> tuple[torch.Tensor, torch.Tensor]:
175
+
176
+ # gets speaker latent and mask from audio, computes in chunks and concatenates (similar to pretraining setup)
177
+
178
+ AE_DOWNSAMPLE_FACTOR = 2048
179
+ max_audio_len = max_speaker_latent_len * AE_DOWNSAMPLE_FACTOR
180
+
181
+ assert audio.ndim == 2 and audio.shape[0] == 1 # (1, length)
182
+ audio = audio[:, :max_audio_len]
183
+ audio_len = audio.shape[1]
184
+
185
+ latent_arr = []
186
+
187
+ for i in range(0, audio_len, audio_chunk_size):
188
+ audio_chunk = audio[:, i:i + audio_chunk_size]
189
+ if audio_chunk.shape[1] < audio_chunk_size:
190
+ audio_chunk = torch.nn.functional.pad(audio_chunk, (0, audio_chunk_size - audio_chunk.shape[1]))
191
+
192
+ latent_chunk = ae_encode(fish_ae, pca_state, audio_chunk.unsqueeze(0))
193
+ latent_arr.append(latent_chunk)
194
+
195
+ speaker_latent = torch.cat(latent_arr, dim=1)
196
+
197
+ actual_latent_len = audio_len // AE_DOWNSAMPLE_FACTOR
198
+ speaker_mask = (torch.arange(speaker_latent.shape[1], device=speaker_latent.device) < actual_latent_len).unsqueeze(0)
199
+
200
+ if speaker_latent.shape[1] < max_speaker_latent_len:
201
+ speaker_latent = torch.nn.functional.pad(speaker_latent, (0, 0, 0, max_speaker_latent_len - speaker_latent.shape[1]))
202
+ speaker_mask = torch.nn.functional.pad(speaker_mask, (0, max_speaker_latent_len - speaker_mask.shape[1]))
203
+
204
+ return speaker_latent, speaker_mask
205
+
206
+
207
+ ### Full sample pipeline
208
+
209
+ def find_flattening_point(data, target_value=0.0, window_size=20, std_threshold=0.05):
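+ # Heuristic end-of-speech detector: scans the (frames, 80) latent for the first
+ # window whose values have flat-lined near target_value (low std, mean close to it),
+ # so sample_pipeline can trim the silent tail of the fixed 640-frame block from
+ # the decoded audio.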
210
+ padded_data = torch.cat([data, torch.zeros(window_size, *data.shape[1:], device=data.device, dtype=data.dtype)])
211
+ for i in range(len(padded_data) - window_size):
212
+ window = padded_data[i:i + window_size]
213
+ if window.std() < std_threshold and abs(window.mean() - target_value) < 0.1:
214
+ return i
215
+ return len(data)
216
+
217
+
218
+ @torch.inference_mode()
219
+ def sample_pipeline(
220
+ model: EchoDiT,
221
+ fish_ae: DAC,
222
+ pca_state: PCAState,
223
+ sample_fn: SampleFn,
224
+ text_prompt: str,
225
+ speaker_audio: torch.Tensor | None,
226
+ rng_seed: int,
227
+ pad_to_max_speaker_latent_len: int | None = 2560,
228
+ pad_to_max_text_seq_len: int | None = 768,
229
+ ) -> torch.Tensor | tuple[torch.Tensor, torch.Tensor]:
230
+
231
+ MAX_SPEAKER_LATENT_LEN = 2560
232
+ MAX_TEXT_SEQ_LEN = 768
233
+
234
+ device, dtype = model.device, model.dtype
235
+
236
+ text_input_ids, text_mask = get_text_input_ids_and_mask([text_prompt], min(pad_to_max_text_seq_len or MAX_TEXT_SEQ_LEN, MAX_TEXT_SEQ_LEN), device=device)
237
+
238
+ # print('initial text input ids length: ', text_input_ids.shape[1])
239
+ # torch.cuda.synchronize()
240
+
241
+ # import time
242
+
243
+ # t0 = time.time()
244
+
245
+ if speaker_audio is None:
246
+ # No speaker prompt - use zero speaker latent and mask
247
+ speaker_latent = torch.zeros((1, pad_to_max_speaker_latent_len if pad_to_max_speaker_latent_len else MAX_SPEAKER_LATENT_LEN, 80), device=device, dtype=dtype)
248
+ speaker_mask = torch.zeros((1, pad_to_max_speaker_latent_len if pad_to_max_speaker_latent_len else MAX_SPEAKER_LATENT_LEN), device=device, dtype=torch.bool)
249
+ # print("Using zero speaker latent and mask (no speaker prompt)")
250
+ else:
251
+ speaker_latent, speaker_mask = get_speaker_latent_and_mask(
252
+ fish_ae,
253
+ pca_state,
254
+ speaker_audio.to(fish_ae.dtype),
255
+ max_speaker_latent_len=pad_to_max_speaker_latent_len if pad_to_max_speaker_latent_len else MAX_SPEAKER_LATENT_LEN
256
+ )
257
+ speaker_latent = speaker_latent.to(device)
258
+ speaker_mask = speaker_mask.to(device)
259
+
260
+ # print('speaker latent shape: ', speaker_latent.shape)
261
+ # print('speaker mask shape: ', speaker_mask.shape)
262
+
263
+ # torch.cuda.synchronize()
264
+ # t1 = time.time()
265
+ # print(f"Time taken encode: {t1 - t0} seconds")
266
+
267
+ latent_out = sample_fn(model, speaker_latent, speaker_mask, text_input_ids, text_mask, rng_seed)
268
+
269
+ # torch.cuda.synchronize()
270
+ # t2 = time.time()
271
+
272
+ # print(f"Time taken sample: {t2 - t1} seconds")
273
+
274
+ audio_out = ae_decode(fish_ae, pca_state, latent_out)
275
+ # torch.cuda.synchronize()
276
+ # t3 = time.time()
277
+ # print(f"Time taken decode: {t3 - t2} seconds")
278
+
279
+ flattening_point = find_flattening_point(latent_out[0])
280
+ audio_out = audio_out[..., :flattening_point * 2048]
281
+
282
+ # print(f"\nTime taken total: {t3 - t0} seconds")
283
+
284
+ # peak_mem = torch.cuda.max_memory_allocated()
285
+ # print(f"Peak memory: {peak_mem / 1024**2:.2f} MB")
286
+ # print(torch.cuda.memory_summary(abbreviated=True))
287
+
288
+ return audio_out
289
+
290
+
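+ # Minimal end-to-end sketch (illustrative only; sampler arguments follow the
+ # "Independent" preset in sampler_presets.json, and the prompt/output paths are
+ # just examples):
+ #
+ #   from functools import partial
+ #   import torchaudio
+ #   from samplers import sample_euler_cfg_independent_guidances
+ #
+ #   model = load_model_from_hf()
+ #   fish_ae = load_fish_ae_from_hf()
+ #   pca_state = load_pca_state_from_hf()
+ #   sample_fn = partial(
+ #       sample_euler_cfg_independent_guidances,
+ #       num_steps=40, cfg_scale_text=3.0, cfg_scale_speaker=5.0,
+ #       cfg_min_t=0.5, cfg_max_t=1.0, truncation_factor=1.0,
+ #       rescale_k=1.0, rescale_sigma=3.0,
+ #       speaker_k_scale=None, speaker_k_max_layers=None, speaker_k_min_t=None,
+ #   )
+ #   speaker = load_audio("prompt_audio/expresso_02_ex03-ex01_calm_005.wav").to(model.device)
+ #   audio = sample_pipeline(model, fish_ae, pca_state, sample_fn,
+ #                           "Hello there.", speaker, rng_seed=0)
+ #   torchaudio.save("out.wav", audio[0].cpu(), 44100)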
model.py ADDED
@@ -0,0 +1,650 @@
1
+ from typing import Tuple, List
2
+
3
+ import torch
4
+ import torch.nn as nn
5
+ import torch.optim as optim
6
+
7
+ import torch.nn.functional as F
8
+
9
+ def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0) -> torch.Tensor:
10
+ freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)] / dim))
11
+ t = torch.arange(end)
12
+ freqs = torch.outer(t, freqs)
13
+ freqs_cis = torch.complex(torch.cos(freqs), torch.sin(freqs))
14
+ return freqs_cis
15
+
16
+
17
+ def apply_rotary_emb(
18
+ x: torch.Tensor,
19
+ freqs_cis: torch.Tensor,
20
+ ) -> torch.Tensor:
21
+ x_ = torch.view_as_complex(x.float().reshape(*x.shape[:3], -1, 2))
22
+ x_ = x_ * freqs_cis[..., None, :]
23
+ x_ = torch.view_as_real(x_).reshape(x.shape)
24
+ return x_.type_as(x)
25
+
26
+
27
+ def get_timestep_embedding(
28
+ timestep: torch.Tensor,
29
+ embed_size: int,
30
+ ) -> torch.Tensor:
31
+ assert embed_size % 2 == 0
32
+
33
+ half = embed_size // 2
34
+
35
+ freqs = 1000 * torch.exp(
36
+ -torch.log(torch.tensor(10000.0)) *
37
+ torch.arange(start=0, end=half, dtype=torch.float32) / half
38
+ ).to(timestep.device)
39
+
40
+ args = timestep[..., None] * freqs[None]
41
+ embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
42
+
43
+ return embedding.to(timestep.dtype)
44
+
45
+
46
+ class LowRankAdaLN(nn.Module):
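+ # AdaLN-style modulation with a low-rank, per-block correction: the shared
+ # conditioning embedding supplies base shift/scale/gate vectors, and each block
+ # adds its own (model_size -> rank -> model_size) residual refinement before
+ # applying RMS normalization, scale/shift, and a tanh gate.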
47
+ def __init__(
48
+ self,
49
+ model_size: int,
50
+ rank: int,
51
+ eps: float
52
+ ):
53
+ super().__init__()
54
+ self.eps = eps
55
+
56
+ self.shift_down = nn.Linear(model_size, rank, bias=False)
57
+ self.scale_down = nn.Linear(model_size, rank, bias=False)
58
+ self.gate_down = nn.Linear(model_size, rank, bias=False)
59
+
60
+ self.shift_up = nn.Linear(rank, model_size, bias=True)
61
+ self.scale_up = nn.Linear(rank, model_size, bias=True)
62
+ self.gate_up = nn.Linear(rank, model_size, bias=True)
63
+
64
+ def forward(
65
+ self,
66
+ x: torch.Tensor,
67
+ cond_embed: torch.Tensor,
68
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
69
+
70
+ shift, scale, gate = cond_embed.chunk(3, dim=-1)
71
+
72
+ shift = self.shift_up(self.shift_down(F.silu(shift))) + shift
73
+ scale = self.scale_up(self.scale_down(F.silu(scale))) + scale
74
+ gate = self.gate_up(self.gate_down(F.silu(gate))) + gate
75
+
76
+ x_dtype = x.dtype
77
+ x = x.float()
78
+ x = x * torch.rsqrt(torch.pow(x.float(), 2).mean(dim=-1, keepdim=True) + self.eps)
79
+ x = x * (scale + 1) + shift
80
+
81
+ gate = torch.tanh(gate)
82
+
83
+ return x.to(x_dtype), gate
84
+
85
+
86
+ class RMSNorm(nn.Module): # could also just use torch rmsnorm
87
+ def __init__(
88
+ self,
89
+ model_size: int | Tuple[int, int],
90
+ eps: float
91
+ ):
92
+ super().__init__()
93
+ self.eps = eps
94
+
95
+ if isinstance(model_size, int):
96
+ model_size = (model_size, )
97
+ self.weight = nn.Parameter(torch.ones(model_size))
98
+
99
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
100
+ x_dtype = x.dtype
101
+ x = x.float()
102
+ x = x * torch.rsqrt(torch.pow(x.float(), 2).mean(dim=-1, keepdim=True) + self.eps)
103
+ x = x * self.weight
104
+ return x.to(x_dtype)
105
+
106
+ class SelfAttention(nn.Module):
107
+ def __init__(
108
+ self,
109
+ model_size: int,
110
+ num_heads: int,
111
+ is_causal: bool,
112
+ norm_eps: float
113
+ ):
114
+ super().__init__()
115
+ self.num_heads = num_heads
116
+ self.is_causal = is_causal
117
+
118
+ self.wq = nn.Linear(model_size, model_size, bias=False)
119
+ self.wk = nn.Linear(model_size, model_size, bias=False)
120
+ self.wv = nn.Linear(model_size, model_size, bias=False)
121
+ self.wo = nn.Linear(model_size, model_size, bias=False)
122
+ self.gate = nn.Linear(model_size, model_size, bias=False)
123
+
124
+ assert model_size % num_heads == 0
125
+ self.q_norm = RMSNorm((num_heads, model_size // num_heads), eps=norm_eps)
126
+ self.k_norm = RMSNorm((num_heads, model_size // num_heads), eps=norm_eps)
127
+
128
+ def forward(self, x: torch.Tensor, mask: torch.Tensor | None, freqs_cis: torch.Tensor) -> torch.Tensor:
129
+
130
+ batch_size, seq_len = x.shape[:2]
131
+
132
+ xq = self.wq(x).reshape(batch_size, seq_len, self.num_heads, -1)
133
+ xk = self.wk(x).reshape(batch_size, seq_len, self.num_heads, -1)
134
+ xv = self.wv(x).reshape(batch_size, seq_len, self.num_heads, -1)
135
+
136
+ gate = self.gate(x)
137
+
138
+ xq = self.q_norm(xq)
139
+ xk = self.k_norm(xk)
140
+
141
+ xq = apply_rotary_emb(xq, freqs_cis[:seq_len])
142
+ xk = apply_rotary_emb(xk, freqs_cis[:seq_len])
143
+
144
+ if mask is not None:
145
+ assert mask.ndim == 2 # (b, s)
146
+ mask = mask[:, None, None]
147
+
148
+ output = F.scaled_dot_product_attention(
149
+ query=xq.transpose(1, 2),
150
+ key=xk.transpose(1, 2),
151
+ value=xv.transpose(1, 2),
152
+ attn_mask=mask,
153
+ is_causal=self.is_causal
154
+ ).transpose(1, 2)
155
+
156
+ output = output.reshape(batch_size, seq_len, -1)
157
+ output = output * torch.sigmoid(gate)
158
+
159
+ output = self.wo(output)
160
+
161
+ return output
162
+
163
+ class JointAttention(nn.Module):
164
+ def __init__(
165
+ self,
166
+ model_size: int,
167
+ num_heads: int,
168
+ text_model_size: int,
169
+ speaker_model_size: int,
170
+ speaker_patch_size: int,
171
+ norm_eps: float
172
+ ):
173
+ super().__init__()
174
+ self.speaker_patch_size = speaker_patch_size
175
+ self.num_heads = num_heads
176
+
177
+ self.wq = nn.Linear(model_size, model_size, bias=False)
178
+ self.wk = nn.Linear(model_size, model_size, bias=False)
179
+ self.wv = nn.Linear(model_size, model_size, bias=False)
180
+
181
+ self.wk_text = nn.Linear(text_model_size, model_size, bias=False)
182
+ self.wv_text = nn.Linear(text_model_size, model_size, bias=False)
183
+
184
+ self.wk_speaker = nn.Linear(speaker_model_size, model_size, bias=False)
185
+ self.wv_speaker = nn.Linear(speaker_model_size, model_size, bias=False)
186
+
187
+ assert model_size % num_heads == 0
188
+ self.q_norm = RMSNorm((num_heads, model_size // num_heads), eps=norm_eps)
189
+ self.k_norm = RMSNorm((num_heads, model_size // num_heads), eps=norm_eps)
190
+
191
+ self.gate = nn.Linear(model_size, model_size, bias=False)
192
+
193
+ self.wo = nn.Linear(model_size, model_size, bias=False)
194
+
195
+ def forward(
196
+ self,
197
+ x: torch.Tensor,
198
+ text_state: torch.Tensor | None,
199
+ text_mask: torch.Tensor,
200
+ speaker_state: torch.Tensor | None,
201
+ speaker_mask: torch.Tensor,
202
+ freqs_cis: torch.Tensor,
203
+ kv_cache: Tuple[torch.Tensor, torch.Tensor] | None = None,
204
+ ) -> torch.Tensor:
205
+ batch_size, seq_len = x.shape[:2]
206
+
207
+ xq = self.wq(x).reshape(batch_size, seq_len, self.num_heads, -1)
208
+ xk_self = self.wk(x).reshape(batch_size, seq_len, self.num_heads, -1)
209
+ xv_self = self.wv(x).reshape(batch_size, seq_len, self.num_heads, -1)
210
+
211
+ xq = self.q_norm(xq)
212
+ xk_self = self.k_norm(xk_self)
213
+
214
+ gate = self.gate(x)
215
+
216
+
217
+ def _apply_rotary_half(y: torch.Tensor, fc: torch.Tensor) -> torch.Tensor:
218
+ y1, y2 = y.chunk(2, dim=-2)
219
+ y1 = apply_rotary_emb(y1, fc)
220
+ return torch.cat([y1, y2], dim=-2)
221
+
222
+ xq = _apply_rotary_half(xq, freqs_cis)
223
+ xk_self = _apply_rotary_half(xk_self, freqs_cis)
224
+
225
+ if kv_cache is None:
226
+
227
+ xk_text = self.wk_text(text_state).reshape(batch_size, text_state.shape[1], self.num_heads, -1)
228
+ xv_text = self.wv_text(text_state).reshape(batch_size, text_state.shape[1], self.num_heads, -1)
229
+
230
+ xk_speaker = self.wk_speaker(speaker_state).reshape(batch_size, speaker_state.shape[1], self.num_heads, -1)
231
+ xv_speaker = self.wv_speaker(speaker_state).reshape(batch_size, speaker_state.shape[1], self.num_heads, -1)
232
+
233
+ xk_text = self.k_norm(xk_text)
234
+ xk_speaker = self.k_norm(xk_speaker)
235
+
236
+ xk = torch.cat([xk_self, xk_text, xk_speaker], dim=1)
237
+ xv = torch.cat([xv_self, xv_text, xv_speaker], dim=1)
238
+
239
+ else:
240
+ xk_cross, xv_cross = kv_cache
241
+ xk = torch.cat([xk_self, xk_cross], dim=1)
242
+ xv = torch.cat([xv_self, xv_cross], dim=1)
243
+
244
+ self_mask = torch.ones((batch_size, seq_len), dtype=torch.bool, device=x.device)
245
+ mask = torch.cat([self_mask, text_mask, speaker_mask], dim=1)
246
+ mask = mask[:, None, None]
247
+
248
+ output = F.scaled_dot_product_attention(
249
+ query=xq.transpose(1, 2),
250
+ key=xk.transpose(1, 2),
251
+ value=xv.transpose(1, 2),
252
+ attn_mask=mask,
253
+ is_causal=False
254
+ ).transpose(1, 2)
255
+
256
+ output = output.reshape(batch_size, seq_len, -1)
257
+ output = output * torch.sigmoid(gate)
258
+
259
+ output = self.wo(output)
260
+
261
+ return output
262
+
263
+ def get_kv_cache(
264
+ self,
265
+ text_state: torch.Tensor,
266
+ speaker_state: torch.Tensor,
267
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
268
+
269
+ batch_size = text_state.shape[0]
270
+
271
+ xk_text = self.wk_text(text_state).reshape(batch_size, text_state.shape[1], self.num_heads, -1)
272
+ xv_text = self.wv_text(text_state).reshape(batch_size, text_state.shape[1], self.num_heads, -1)
273
+
274
+ xk_speaker = self.wk_speaker(speaker_state).reshape(batch_size, speaker_state.shape[1], self.num_heads, -1)
275
+ xv_speaker = self.wv_speaker(speaker_state).reshape(batch_size, speaker_state.shape[1], self.num_heads, -1)
276
+
277
+ xk = torch.cat([xk_text, xk_speaker], dim=1)
278
+ xv = torch.cat([xv_text, xv_speaker], dim=1)
279
+
280
+ xk = self.k_norm(xk)
281
+
282
+ return xk, xv
283
+
284
+ class MLP(nn.Module):
285
+ def __init__(
286
+ self,
287
+ model_size: int,
288
+ intermediate_size: int
289
+ ):
290
+ super().__init__()
291
+ self.w1 = nn.Linear(model_size, intermediate_size, bias=False)
292
+ self.w3 = nn.Linear(model_size, intermediate_size, bias=False)
293
+ self.w2 = nn.Linear(intermediate_size, model_size, bias=False)
294
+
295
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
296
+ return self.w2(F.silu(self.w1(x)) * self.w3(x))
297
+
298
+
299
+ class EncoderTransformerBlock(nn.Module):
300
+ def __init__(
301
+ self,
302
+ model_size: int,
303
+ num_heads: int,
304
+ intermediate_size: int,
305
+ is_causal: bool,
306
+ norm_eps: float
307
+ ):
308
+ super().__init__()
309
+ self.attention = SelfAttention(
310
+ model_size=model_size,
311
+ num_heads=num_heads,
312
+ is_causal=is_causal,
313
+ norm_eps=norm_eps
314
+ )
315
+ self.mlp = MLP(
316
+ model_size=model_size,
317
+ intermediate_size=intermediate_size
318
+ )
319
+
320
+ self.attention_norm = RMSNorm(model_size, norm_eps)
321
+ self.mlp_norm = RMSNorm(model_size, norm_eps)
322
+
323
+ def forward(self, x: torch.Tensor, mask: torch.Tensor | None, freqs_cis: torch.Tensor) -> torch.Tensor:
324
+ x = x + self.attention(self.attention_norm(x), mask, freqs_cis)
325
+ x = x + self.mlp(self.mlp_norm(x))
326
+
327
+ return x
328
+
329
+ class TransformerBlock(nn.Module):
330
+ def __init__(
331
+ self,
332
+ model_size: int,
333
+ num_heads: int,
334
+ intermediate_size: int,
335
+ norm_eps: float,
336
+ text_model_size: int,
337
+ speaker_model_size: int,
338
+ speaker_patch_size: int,
339
+ adaln_rank: int,
340
+ ):
341
+ super().__init__()
342
+ self.attention = JointAttention(
343
+ model_size=model_size,
344
+ num_heads=num_heads,
345
+ text_model_size=text_model_size,
346
+ speaker_model_size=speaker_model_size,
347
+ speaker_patch_size=speaker_patch_size,
348
+ norm_eps=norm_eps
349
+ )
350
+
351
+ self.mlp = MLP(
352
+ model_size=model_size,
353
+ intermediate_size=intermediate_size
354
+ )
355
+
356
+ self.attention_adaln = LowRankAdaLN(model_size=model_size, rank=adaln_rank, eps=norm_eps)
357
+ self.mlp_adaln = LowRankAdaLN(model_size=model_size, rank=adaln_rank, eps=norm_eps)
358
+
359
+ def forward(
360
+ self,
361
+ x: torch.Tensor,
362
+ cond_embed: torch.Tensor,
363
+ text_state: torch.Tensor | None,
364
+ text_mask: torch.Tensor,
365
+ speaker_state: torch.Tensor | None,
366
+ speaker_mask: torch.Tensor,
367
+ freqs_cis: torch.Tensor,
368
+ kv_cache: Tuple[torch.Tensor, torch.Tensor] | None = None,
369
+ ) -> torch.Tensor:
370
+
371
+ x_norm, attention_gate = self.attention_adaln(x, cond_embed)
372
+ x = x + attention_gate * self.attention(x_norm, text_state, text_mask, speaker_state, speaker_mask, freqs_cis, kv_cache)
373
+
374
+ x_norm, mlp_gate = self.mlp_adaln(x, cond_embed)
375
+ x = x + mlp_gate * self.mlp(x_norm)
376
+
377
+ return x
378
+
379
+ def get_kv_cache(
380
+ self,
381
+ text_state: torch.Tensor,
382
+ speaker_state: torch.Tensor,
383
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
384
+ return self.attention.get_kv_cache(text_state, speaker_state)
385
+
386
+ class TextEncoder(nn.Module):
387
+ def __init__(
388
+ self,
389
+ vocab_size: int,
390
+ model_size: int,
391
+ num_layers: int,
392
+ num_heads: int,
393
+ intermediate_size: int,
394
+ norm_eps: float,
395
+ max_seq_len: int,
396
+ ):
397
+ super().__init__()
398
+ self.text_embedding = nn.Embedding(vocab_size, model_size)
399
+
400
+ self.blocks = nn.ModuleList()
401
+ for i in range(num_layers):
402
+ block = EncoderTransformerBlock(
403
+ model_size=model_size,
404
+ num_heads=num_heads,
405
+ intermediate_size=intermediate_size,
406
+ is_causal=False,
407
+ norm_eps=norm_eps
408
+ )
409
+ self.blocks.append(block)
410
+
411
+ self.head_dim = model_size // num_heads
412
+
413
+
414
+ def forward(self, input_ids: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
415
+ x = self.text_embedding(input_ids)
416
+
417
+ freqs_cis = precompute_freqs_cis(self.head_dim, input_ids.shape[1]).to(x.device) # see below about avoiding recomputation
418
+ for block in self.blocks:
419
+ x = block(x, mask, freqs_cis)
420
+
421
+ return x
422
+
423
+ class SpeakerEncoder(nn.Module):
424
+ def __init__(
425
+ self,
426
+ latent_size: int,
427
+ patch_size: int,
428
+ model_size: int,
429
+ num_layers: int,
430
+ num_heads: int,
431
+ intermediate_size: int,
432
+ norm_eps: float,
433
+ max_patched_seq_len: int,
434
+ ):
435
+ super().__init__()
436
+ self.patch_size = patch_size
437
+
438
+ self.in_proj = nn.Linear(latent_size * patch_size, model_size, bias=True)
439
+
440
+ self.blocks = nn.ModuleList()
441
+ for i in range(num_layers):
442
+ block = EncoderTransformerBlock(
443
+ model_size=model_size,
444
+ num_heads=num_heads,
445
+ intermediate_size=intermediate_size,
446
+ is_causal=True,
447
+ norm_eps=norm_eps
448
+ )
449
+ self.blocks.append(block)
450
+
451
+ self.head_dim = model_size // num_heads
452
+
453
+ def forward(self, latent: torch.Tensor) -> torch.Tensor:
454
+ x = latent.reshape(*latent.shape[:-2], latent.shape[-2] // self.patch_size, latent.shape[-1] * self.patch_size)
455
+
456
+ x = self.in_proj(x)
457
+ x = x / 6. # this helped with initial activation dynamics in early ablations, could also bake into in_proj
458
+
459
+ freqs_cis = precompute_freqs_cis(self.head_dim, x.shape[1]).to(x.device) # see below about avoiding recomputation
460
+
461
+ for block in self.blocks:
462
+ x = block(x, None, freqs_cis)
463
+
464
+ return x
465
+
466
+
467
+ class EchoDiT(nn.Module):
468
+ def __init__(
469
+ self,
470
+ latent_size: int,
471
+ #
472
+ model_size: int,
473
+ num_layers: int,
474
+ num_heads: int,
475
+ intermediate_size: int,
476
+ norm_eps: float,
477
+ max_seq_len: int,
478
+ #
479
+ text_vocab_size: int,
480
+ text_model_size: int,
481
+ text_num_layers: int,
482
+ text_num_heads: int,
483
+ text_intermediate_size: int,
484
+ text_max_seq_len: int,
485
+ #
486
+ speaker_patch_size: int,
487
+ speaker_model_size: int,
488
+ speaker_num_layers: int,
489
+ speaker_num_heads: int,
490
+ speaker_intermediate_size: int,
491
+ speaker_max_patched_seq_len: int,
492
+ #
493
+ timestep_embed_size: int,
494
+ adaln_rank: int,
495
+ ):
496
+ super().__init__()
497
+ self.speaker_patch_size = speaker_patch_size
498
+ self.timestep_embed_size = timestep_embed_size
499
+
500
+ self.text_encoder = TextEncoder(
501
+ vocab_size=text_vocab_size,
502
+ model_size=text_model_size,
503
+ num_layers=text_num_layers,
504
+ num_heads=text_num_heads,
505
+ intermediate_size=text_intermediate_size,
506
+ norm_eps=norm_eps,
507
+ max_seq_len=text_max_seq_len,
508
+ )
509
+ self.speaker_encoder = SpeakerEncoder(
510
+ latent_size=latent_size,
511
+ patch_size=speaker_patch_size,
512
+ model_size=speaker_model_size,
513
+ num_layers=speaker_num_layers,
514
+ num_heads=speaker_num_heads,
515
+ intermediate_size=speaker_intermediate_size,
516
+ norm_eps=norm_eps,
517
+ max_patched_seq_len=speaker_max_patched_seq_len,
518
+ )
519
+
520
+ self.text_norm = RMSNorm(text_model_size, norm_eps)
521
+ self.speaker_norm = RMSNorm(speaker_model_size, norm_eps)
522
+
523
+ self.cond_module = nn.Sequential(
524
+ nn.Linear(timestep_embed_size, model_size, bias=False),
525
+ nn.SiLU(),
526
+ nn.Linear(model_size, model_size, bias=False),
527
+ nn.SiLU(),
528
+ nn.Linear(model_size, model_size * 3, bias=False),
529
+ )
530
+
531
+ self.in_proj = nn.Linear(latent_size, model_size, bias=True)
532
+
533
+ self.blocks = nn.ModuleList()
534
+ for i in range(num_layers):
535
+ block = TransformerBlock(
536
+ model_size=model_size,
537
+ num_heads=num_heads,
538
+ intermediate_size=intermediate_size,
539
+ norm_eps=norm_eps,
540
+ text_model_size=text_model_size,
541
+ speaker_model_size=speaker_model_size,
542
+ speaker_patch_size=speaker_patch_size,
543
+ adaln_rank=adaln_rank,
544
+ )
545
+ self.blocks.append(block)
546
+
547
+ self.out_norm = RMSNorm(model_size, norm_eps)
548
+ self.out_proj = nn.Linear(model_size, latent_size, bias=True)
549
+
550
+ self.head_dim = model_size // num_heads
551
+
552
+
553
+ def forward(
554
+ self,
555
+ x: torch.Tensor,
556
+ t: torch.Tensor,
557
+ text_input_ids: torch.Tensor,
558
+ text_mask: torch.Tensor | None,
559
+ speaker_latent: torch.Tensor,
560
+ speaker_mask: torch.Tensor | None,
561
+ kv_cache: List[Tuple[torch.Tensor, torch.Tensor]] | None = None,
562
+ ) -> torch.Tensor:
563
+ """
564
+ x: (b, s, d)
565
+ t: (b,)
566
+ text_input_ids: (b, s_t) # not used when kv_cache is provided
567
+ text_mask: (b, s_t)
568
+ speaker_latent: (b, s_r, d) # not used when kv_cache is provided
569
+ speaker_mask: (b, s_r)
570
+ kv_cache: List[Tuple[torch.Tensor, torch.Tensor]]
571
+
572
+ returns: (b, s, d)
573
+ """
574
+
575
+ freqs_cis = precompute_freqs_cis(self.head_dim, x.shape[1]).to(x.device)
576
+ # can't register as buffer because we'd like it to stay in fp32; however, could optionally pass in to avoid recomputing
577
+
578
+ if kv_cache is None:
579
+ text_state = self.text_encoder(text_input_ids, text_mask)
580
+ text_state = self.text_norm(text_state)
581
+ speaker_state = self.speaker_encoder(speaker_latent)
582
+ speaker_state = self.speaker_norm(speaker_state)
583
+ else:
584
+ text_state, speaker_state = None, None
585
+
586
+ speaker_mask = speaker_mask[..., ::self.speaker_patch_size]
587
+
588
+ cond_embed = self.cond_module(get_timestep_embedding(t, self.timestep_embed_size))
589
+
590
+ assert cond_embed.ndim == 2
591
+ cond_embed = cond_embed[:, None]
592
+
593
+ x = self.in_proj(x)
594
+
595
+ for i, block in enumerate(self.blocks):
596
+ x = block(
597
+ x=x,
598
+ cond_embed=cond_embed,
599
+ text_state=text_state,
600
+ text_mask=text_mask,
601
+ speaker_state=speaker_state,
602
+ speaker_mask=speaker_mask,
603
+ freqs_cis=freqs_cis,
604
+ kv_cache=kv_cache[i] if kv_cache is not None else None,
605
+ )
606
+
607
+ x = self.out_norm(x)
608
+ x = self.out_proj(x)
609
+
610
+ return x.float()
611
+
612
+ def get_kv_cache(
613
+ self,
614
+ speaker_latent: torch.Tensor,
615
+ speaker_mask: torch.Tensor,
616
+ text_input_ids: torch.Tensor,
617
+ text_mask: torch.Tensor,
618
+ ) -> List[Tuple[torch.Tensor, torch.Tensor]]:
619
+
620
+ speaker_state = self.speaker_encoder(speaker_latent)
621
+ speaker_state = self.speaker_norm(speaker_state)
622
+
623
+ text_state = self.text_encoder(text_input_ids, text_mask)
624
+ text_state = self.text_norm(text_state)
625
+
626
+ return [self.blocks[i].get_kv_cache(text_state, speaker_state) for i in range(len(self.blocks))]
627
+
628
+
629
+ def get_kv_cache_from_precomputed_speaker_state(
630
+ self,
631
+ speaker_state: torch.Tensor,
632
+ speaker_mask: torch.Tensor,
633
+ text_input_ids: torch.Tensor,
634
+ text_mask: torch.Tensor,
635
+ ) -> List[Tuple[torch.Tensor, torch.Tensor]]:
636
+
637
+ # here, speaker state is already computed from the speaker latent encoder transformer
638
+
639
+ text_state = self.text_encoder(text_input_ids, text_mask)
640
+ text_state = self.text_norm(text_state)
641
+
642
+ return [self.blocks[i].get_kv_cache(text_state, speaker_state) for i in range(len(self.blocks))]
643
+
644
+
645
+
646
+ @property
647
+ def device(self) -> torch.device: return next(self.parameters()).device
648
+
649
+ @property
650
+ def dtype(self) -> torch.dtype: return next(self.parameters()).dtype
packages.txt ADDED
@@ -0,0 +1 @@
1
+ ffmpeg
prompt_audio/EARS p004 freeform.mp3 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:68947a209bc11064f749ca0a61b7959243df83565a0e462b87dfc0ffe03aa7b0
3
+ size 1526439
prompt_audio/EARS p005 freeform.mp3 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:07344d073eb3e22c249ebfe15f31f4ba63fd9f17c71aeee93da199ff3b53fc45
3
+ size 1351147
prompt_audio/EARS p028 freeform.mp3 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8351eed5982f1fb5763a475c0fb69dba98a4bb49b0f2bbab12b978ff2b0fedeb
3
+ size 1211565
prompt_audio/EARS p036 freeform.mp3 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ce77dbb86ea7c29edf2b9804ce9c9315334e9cfeef532dc0c50898a09bae1583
3
+ size 1227585
prompt_audio/expresso_02_ex03-ex01_calm_005.wav ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f2be4d1cb5646b3523a460ec40bf171f959a9b33bde918e6d0f795d00284f52a
3
+ size 21168080
prompt_audio/freesound_demon_chant(use_forcespeaker).mp3 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:471f67fff5ea613ec4617b9822b1396da123a1133f199925436a2c40e5d1eb91
3
+ size 303438
requirements.txt ADDED
@@ -0,0 +1,8 @@
1
+ torch
2
+ torchaudio
3
+ torchcodec
4
+ gradio>=5.49
5
+ huggingface-hub
6
+ numpy
7
+ safetensors
8
+ einops
sampler_presets.json ADDED
@@ -0,0 +1,120 @@
1
+ {
2
+
3
+ "Independent (High Speaker CFG)": {
4
+ "num_steps": "40",
5
+ "cfg_mode": "independent",
6
+ "cfg_scale_text": "3.0",
7
+ "cfg_scale_speaker": "8.0",
8
+ "cfg_min_t": "0.5",
9
+ "cfg_max_t": "1.0",
10
+ "truncation_factor": "1.",
11
+ "rescale_k": "1.",
12
+ "rescale_sigma": "3.0"
13
+ },
14
+ "Independent (High Speaker CFG) Flat": {
15
+ "num_steps": "40",
16
+ "cfg_mode": "independent",
17
+ "cfg_scale_text": "3.0",
18
+ "cfg_scale_speaker": "8.0",
19
+ "cfg_min_t": "0.5",
20
+ "cfg_max_t": "1.0",
21
+ "truncation_factor": "0.8",
22
+ "rescale_k": "1.2",
23
+ "rescale_sigma": "3.0"
24
+ },
25
+ "APG": {
26
+ "num_steps": "40",
27
+ "cfg_mode": "apg-independent",
28
+ "cfg_scale_text": "8.0",
29
+ "cfg_scale_speaker": "8.0",
30
+ "cfg_min_t": "0.5",
31
+ "cfg_max_t": "1.0",
32
+ "truncation_factor": "1.",
33
+ "rescale_k": "1.",
34
+ "rescale_sigma": "3.0",
35
+ "speaker_k_enable": false,
36
+ "speaker_k_scale": "1.5",
37
+ "speaker_k_min_t": "0.9",
38
+ "speaker_k_max_layers": "24",
39
+ "apg_eta_text": "0.5",
40
+ "apg_eta_speaker": "0.5",
41
+ "apg_momentum_text": "0.0",
42
+ "apg_momentum_speaker": "0.0"
43
+ },
44
+ "APG Flat": {
45
+ "num_steps": "40",
46
+ "cfg_mode": "apg-independent",
47
+ "cfg_scale_text": "8.0",
48
+ "cfg_scale_speaker": "8.0",
49
+ "cfg_min_t": "0.5",
50
+ "cfg_max_t": "1.0",
51
+ "truncation_factor": "0.8",
52
+ "rescale_k": "1.2",
53
+ "rescale_sigma": "3.0",
54
+ "speaker_k_enable": false,
55
+ "speaker_k_scale": "1.5",
56
+ "speaker_k_min_t": "0.9",
57
+ "speaker_k_max_layers": "24",
58
+ "apg_eta_text": "0.5",
59
+ "apg_eta_speaker": "0.5",
60
+ "apg_momentum_text": "0.0",
61
+ "apg_momentum_speaker": "0.0"
62
+ },
63
+ "Independent (High CFG)": {
64
+ "num_steps": "40",
65
+ "cfg_mode": "independent",
66
+ "cfg_scale_text": "8.0",
67
+ "cfg_scale_speaker": "8.0",
68
+ "cfg_min_t": "0.5",
69
+ "cfg_max_t": "1.0",
70
+ "truncation_factor": "1.",
71
+ "rescale_k": "1.",
72
+ "rescale_sigma": "3.0"
73
+ },
74
+ "Independent (High CFG) Flat": {
75
+ "num_steps": "40",
76
+ "cfg_mode": "independent",
77
+ "cfg_scale_text": "8.0",
78
+ "cfg_scale_speaker": "8.0",
79
+ "cfg_min_t": "0.5",
80
+ "cfg_max_t": "1.0",
81
+ "truncation_factor": "0.8",
82
+ "rescale_k": "1.2",
83
+ "rescale_sigma": "3.0"
84
+ },
85
+
86
+ "Independent": {
87
+ "num_steps": "40",
88
+ "cfg_mode": "independent",
89
+ "cfg_scale_text": "3.0",
90
+ "cfg_scale_speaker": "5.0",
91
+ "cfg_min_t": "0.5",
92
+ "cfg_max_t": "1.0",
93
+ "truncation_factor": "1.",
94
+ "rescale_k": "1.",
95
+ "rescale_sigma": "3.0"
96
+ },
97
+ "Independent Flat": {
98
+ "num_steps": "40",
99
+ "cfg_mode": "independent",
100
+ "cfg_scale_text": "3.0",
101
+ "cfg_scale_speaker": "5.0",
102
+ "cfg_min_t": "0.5",
103
+ "cfg_max_t": "1.0",
104
+ "truncation_factor": "0.8",
105
+ "rescale_k": "1.2",
106
+ "rescale_sigma": "3.0"
107
+ },
108
+ "Joint 20-step Flat": {
109
+ "num_steps": "20",
110
+ "cfg_mode": "joint-unconditional",
111
+ "cfg_scale_text": "3.0",
112
+ "cfg_scale_speaker": "3.0",
113
+ "cfg_min_t": "0.5",
114
+ "cfg_max_t": "1.0",
115
+ "truncation_factor": "0.8",
116
+ "rescale_k": "1.2",
117
+ "rescale_sigma": "3.0"
118
+ }
119
+ }
120
+
samplers.py ADDED
@@ -0,0 +1,690 @@
1
+ from typing import List, Tuple
2
+ from enum import Enum
3
+
4
+ import torch
5
+ from model import EchoDiT
6
+
7
+ # helper
8
+ def _get_uncond_text_input_ids_and_mask(batch_size: int, max_length: int, device: str | None = None) -> tuple[torch.Tensor, torch.Tensor]:
9
+ # returns zeros for text input ids, and (True, False, False, ... ) for text mask
10
+ text_input_ids_uncond = torch.zeros((batch_size, max_length), dtype=torch.int32)
11
+ text_mask_uncond = torch.zeros((batch_size, max_length), dtype=torch.bool)
12
+ text_mask_uncond[:, 0] = True
13
+ if device is not None:
14
+ text_input_ids_uncond = text_input_ids_uncond.to(device)
15
+ text_mask_uncond = text_mask_uncond.to(device)
16
+ return text_input_ids_uncond, text_mask_uncond
17
+
18
+
19
+ # SIMPLE SAMPLER FOR REFERENCE, SHOULD PROBABLY AVOID
20
+ @torch.inference_mode()
21
+ def sample_euler_cfg_simple(
22
+ model: EchoDiT,
23
+ speaker_latent: torch.Tensor,
24
+ speaker_mask: torch.Tensor,
25
+ text_input_ids: torch.Tensor,
26
+ text_mask: torch.Tensor,
27
+ rng_seed: int,
28
+ num_steps: int,
29
+ cfg_scale: float,
30
+ ) -> torch.Tensor:
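+ # Runs the conditional and unconditional branches as one batch of size 2 * b
+ # that shares a single precomputed cross-attention KV cache, then integrates
+ # dx/dt = v_pred with Euler steps from t = 1 (noise) down to t = 0 (data).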
31
+
32
+ device, dtype = model.device, model.dtype
33
+
34
+ batch_size = text_input_ids.shape[0]
35
+
36
+ torch.manual_seed(rng_seed)
37
+
38
+ t_schedule = torch.linspace(1., 0., num_steps + 1, device=device)
39
+
40
+ text_input_ids_uncond, text_mask_uncond = _get_uncond_text_input_ids_and_mask(text_input_ids.shape[0], text_input_ids.shape[1], device=device)
41
+
42
+ speaker_latent_uncond, speaker_mask_uncond = torch.zeros_like(speaker_latent), torch.zeros_like(speaker_mask)
43
+
44
+ full_text_input_ids = torch.cat([text_input_ids, text_input_ids_uncond], dim=0)
45
+ full_text_mask = torch.cat([text_mask, text_mask_uncond], dim=0)
46
+
47
+ full_speaker_latent = torch.cat([speaker_latent, speaker_latent_uncond], dim=0)
48
+ full_speaker_mask = torch.cat([speaker_mask, speaker_mask_uncond], dim=0)
49
+
50
+ kv_cache = model.get_kv_cache(
51
+ speaker_latent=full_speaker_latent.to(dtype),
52
+ speaker_mask=full_speaker_mask,
53
+ text_input_ids=full_text_input_ids,
54
+ text_mask=full_text_mask,
55
+ )
56
+
57
+ x_t = torch.randn((batch_size, 640, 80), device=device, dtype=torch.float32)
58
+
59
+ for i in range(num_steps):
60
+ t, t_next = t_schedule[i], t_schedule[i+1]
61
+ v_cond, v_uncond = model(
62
+ x=torch.cat([x_t, x_t], dim=0).to(dtype),
63
+ t=(torch.ones((batch_size * 2,), device=device) * t).to(dtype),
64
+ text_input_ids=None,
65
+ text_mask=full_text_mask,
66
+ speaker_latent=None,
67
+ speaker_mask=full_speaker_mask,
68
+ kv_cache=kv_cache,
69
+ ).float().chunk(2, dim=0)
70
+
71
+ v_pred = v_cond + cfg_scale * (v_cond - v_uncond)
72
+ # note: x_0_pred is x_t - v_pred * t
73
+ x_t = x_t + v_pred * (t_next - t)
74
+
75
+ return x_t
76
+
77
+
78
+ ######
79
+
80
+ def _temporal_score_rescale(v_pred: torch.Tensor, x_t: torch.Tensor, t: float, rescale_k: float, rescale_sigma: float) -> torch.Tensor:
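+ # With the flow interpolation x_t = (1 - t) * x_0 + t * eps, the implied noise
+ # prediction is eps_pred = x_t + (1 - t) * v_pred and snr = ((1 - t) / t)^2.
+ # This scales eps_pred by an SNR-dependent ratio (set by rescale_k and
+ # rescale_sigma) and maps the result back to a velocity.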
81
+ if t < 1:
82
+ snr = (1 - t) ** 2 / (t ** 2)
83
+ ratio = (snr * rescale_sigma ** 2 + 1) / (snr * rescale_sigma ** 2 / rescale_k + 1)
84
+ return 1 / (1 - t) * (ratio * ((1 - t) * v_pred + x_t) - x_t)
85
+ return v_pred
86
+
87
+
88
+ def _get_first_n_kv_cache(kv_cache: List[List[torch.Tensor]], n: int) -> List[List[torch.Tensor]]:
89
+ return [[kv_cache[i][0][:n], kv_cache[i][1][:n]] for i in range(len(kv_cache))]
90
+
91
+ def _multiply_speaker_kv_cache(
92
+ kv_cache: List[List[torch.Tensor]],
93
+ scale: float,
94
+ text_length: int,
95
+ max_layers: int = 24,
96
+ ) -> List[List[torch.Tensor]]:
97
+ # multiplies speaker kv cache by scale
98
+ # speaker keys start after text keys (at position text_length)
99
+ for i in range(min(max_layers, len(kv_cache))):
100
+ for j in range(len(kv_cache[i])):
101
+ kv_cache[i][j][:, text_length:] *= scale
102
+
103
+
104
+ @torch.inference_mode()
105
+ def sample_euler_cfg(
106
+ model: EchoDiT,
107
+ speaker_latent: torch.Tensor,
108
+ speaker_mask: torch.Tensor,
109
+ text_input_ids: torch.Tensor,
110
+ text_mask: torch.Tensor,
111
+ rng_seed: int,
112
+ num_steps: int,
113
+ cfg_scale: float,
114
+ cfg_min_t: float,
115
+ cfg_max_t: float,
116
+ truncation_factor: float | None,
117
+ rescale_k: float | None,
118
+ rescale_sigma: float | None,
119
+ speaker_k_scale: float | None,
120
+ speaker_k_max_layers: int | None,
121
+ speaker_k_min_t: float | None,
122
+ block_size: int | None = None,
123
+ ) -> torch.Tensor:
124
+
125
+ if block_size is None:
126
+ block_size = 640
127
+
128
+ torch.manual_seed(rng_seed)
129
+
130
+ INIT_SCALE = 0.999
131
+
132
+ device, dtype = model.device, model.dtype
133
+
134
+ batch_size = text_input_ids.shape[0]
135
+
136
+ t_schedule = torch.linspace(1., 0., num_steps + 1, device=device) * INIT_SCALE
137
+
138
+ text_input_ids_uncond, text_mask_uncond = _get_uncond_text_input_ids_and_mask(text_input_ids.shape[0], text_input_ids.shape[1], device=device)
139
+
140
+ speaker_latent_uncond, speaker_mask_uncond = torch.zeros_like(speaker_latent), torch.zeros_like(speaker_mask)
141
+
142
+ full_text_input_ids = torch.cat([text_input_ids, text_input_ids_uncond], dim=0)
143
+ full_text_mask = torch.cat([text_mask, text_mask_uncond], dim=0)
144
+
145
+ full_speaker_latent = torch.cat([speaker_latent, speaker_latent_uncond], dim=0)
146
+ full_speaker_mask = torch.cat([speaker_mask, speaker_mask_uncond], dim=0)
147
+
148
+ kv_cache_full = model.get_kv_cache(
149
+ speaker_latent=full_speaker_latent.to(dtype),
150
+ speaker_mask=full_speaker_mask,
151
+ text_input_ids=full_text_input_ids,
152
+ text_mask=full_text_mask,
153
+ ) # could make faster by not computing fully / recomputing for unconditional batch elements
154
+ kv_cache = _get_first_n_kv_cache(kv_cache_full, batch_size)
155
+ if speaker_k_scale is not None:
156
+ _multiply_speaker_kv_cache(kv_cache_full, speaker_k_scale, text_input_ids.shape[-1], speaker_k_max_layers)
157
+
158
+ x_t = torch.randn((batch_size, block_size, 80), device=device, dtype=torch.float32)
159
+
160
+ if truncation_factor is not None:
161
+ x_t = x_t * truncation_factor
162
+
163
+ for i in range(num_steps):
164
+ t, t_next = t_schedule[i], t_schedule[i+1]
165
+
166
+ has_cfg = ((t >= cfg_min_t) * (t <= cfg_max_t)).item()
167
+
168
+ if has_cfg:
169
+ v_cond, v_uncond = model(
170
+ x=torch.cat([x_t, x_t], dim=0).to(dtype),
171
+ t=(torch.ones((batch_size * 2,), device=device) * t).to(dtype),
172
+ text_input_ids=None,
173
+ text_mask=full_text_mask,
174
+ speaker_latent=None,
175
+ speaker_mask=full_speaker_mask,
176
+ kv_cache=kv_cache_full,
177
+ ).float().chunk(2, dim=0)
178
+ v_pred = v_cond + cfg_scale * (v_cond - v_uncond)
179
+ else:
180
+ v_pred = model(
181
+ x=x_t.to(dtype),
182
+ t=(torch.ones((batch_size,), device=device) * t).to(dtype),
183
+ text_input_ids=None,
184
+ text_mask=text_mask,
185
+ speaker_latent=None,
186
+ speaker_mask=speaker_mask,
187
+ kv_cache=kv_cache,
188
+ ).float()
189
+
190
+ if rescale_k is not None and rescale_sigma is not None:
191
+ v_pred = _temporal_score_rescale(v_pred, x_t, t, rescale_k, rescale_sigma)
192
+
193
+ if speaker_k_scale is not None and t_next < speaker_k_min_t and t >= speaker_k_min_t:
194
+ _multiply_speaker_kv_cache(kv_cache_full, 1. / speaker_k_scale, text_input_ids.shape[-1], speaker_k_max_layers)
195
+
196
+ x_t = x_t + v_pred * (t_next - t)
197
+
198
+ return x_t
199
+
200
+
201
+ @torch.inference_mode()
202
+ def sample_euler_cfg_independent_guidances(
203
+ model: EchoDiT,
204
+ speaker_latent: torch.Tensor,
205
+ speaker_mask: torch.Tensor,
206
+ text_input_ids: torch.Tensor,
207
+ text_mask: torch.Tensor,
208
+ rng_seed: int,
209
+ num_steps: int,
210
+ cfg_scale_text: float,
211
+ cfg_scale_speaker: float,
212
+ cfg_min_t: float,
213
+ cfg_max_t: float,
214
+ truncation_factor: float | None,
215
+ rescale_k: float | None,
216
+ rescale_sigma: float | None,
217
+ speaker_k_scale: float | None,
218
+ speaker_k_max_layers: int | None,
219
+ speaker_k_min_t: float | None,
220
+ block_size: int | None = None,
221
+ ) -> torch.Tensor:
222
+
223
+ if block_size is None:
224
+ block_size = 640
225
+
226
+ torch.manual_seed(rng_seed)
227
+
228
+ INIT_SCALE = 0.999
229
+
230
+ device, dtype = model.device, model.dtype
231
+
232
+ batch_size = text_input_ids.shape[0]
233
+
234
+ t_schedule = torch.linspace(1., 0., num_steps + 1, device=device) * INIT_SCALE
235
+
236
+ text_input_ids_uncond, text_mask_uncond = _get_uncond_text_input_ids_and_mask(text_input_ids.shape[0], text_input_ids.shape[1], device=device)
237
+
238
+ speaker_latent_uncond, speaker_mask_uncond = torch.zeros_like(speaker_latent), torch.zeros_like(speaker_mask)
239
+
240
+ full_text_input_ids = torch.cat([text_input_ids, text_input_ids_uncond, text_input_ids], dim=0)
241
+ full_text_mask = torch.cat([text_mask, text_mask_uncond, text_mask], dim=0)
242
+
243
+ full_speaker_latent = torch.cat([speaker_latent, speaker_latent, speaker_latent_uncond], dim=0)
244
+ full_speaker_mask = torch.cat([speaker_mask, speaker_mask, speaker_mask_uncond], dim=0)
245
+
246
+ kv_cache_full = model.get_kv_cache(
247
+ speaker_latent=full_speaker_latent.to(dtype),
248
+ speaker_mask=full_speaker_mask,
249
+ text_input_ids=full_text_input_ids,
250
+ text_mask=full_text_mask,
251
+ )
252
+ kv_cache = _get_first_n_kv_cache(kv_cache_full, batch_size)
253
+
254
+ if speaker_k_scale is not None:
255
+ _multiply_speaker_kv_cache(kv_cache_full, speaker_k_scale, text_input_ids.shape[-1], speaker_k_max_layers)
256
+
257
+ x_t = torch.randn((batch_size, block_size, 80), device=device, dtype=torch.float32)
258
+ if truncation_factor is not None:
259
+ x_t = x_t * truncation_factor
260
+
261
+ for i in range(num_steps):
262
+ t, t_next = t_schedule[i], t_schedule[i+1]
263
+
264
+ has_cfg = ((t >= cfg_min_t) * (t <= cfg_max_t)).item()
265
+
266
+ if has_cfg:
267
+ v_cond, v_uncond_text, v_uncond_speaker = model(
268
+ x=torch.cat([x_t, x_t, x_t], dim=0).to(dtype),
269
+ t=(torch.ones((batch_size * 3,), device=device) * t).to(dtype),
270
+ text_input_ids=None,
271
+ text_mask=full_text_mask,
272
+ speaker_latent=None,
273
+ speaker_mask=full_speaker_mask,
274
+ kv_cache=kv_cache_full,
275
+ ).float().chunk(3, dim=0)
276
+ v_pred = v_cond + cfg_scale_text * (v_cond - v_uncond_text) + cfg_scale_speaker * (v_cond - v_uncond_speaker)
277
+ else:
278
+ v_pred = model(
279
+ x=x_t.to(dtype),
280
+ t=(torch.ones((batch_size,), device=device) * t).to(dtype),
281
+ text_input_ids=None,
282
+ text_mask=text_mask,
283
+ speaker_latent=None,
284
+ speaker_mask=speaker_mask,
285
+ kv_cache=kv_cache,
286
+ ).float()
287
+
288
+ if rescale_k is not None and rescale_sigma is not None:
289
+ v_pred = _temporal_score_rescale(v_pred, x_t, t, rescale_k, rescale_sigma)
290
+
291
+ if speaker_k_scale is not None and t_next < speaker_k_min_t and t >= speaker_k_min_t:
292
+ _multiply_speaker_kv_cache(kv_cache_full, 1. / speaker_k_scale, text_input_ids.shape[-1], speaker_k_max_layers)
293
+
294
+ x_t = x_t + v_pred * (t_next - t)
295
+
296
+ return x_t
297
+
298
+
299
+
300
+ @torch.inference_mode()
301
+ def sample_euler_cfg_alternating_guidances(
302
+ model: EchoDiT,
303
+ speaker_latent: torch.Tensor,
304
+ speaker_mask: torch.Tensor,
305
+ text_input_ids: torch.Tensor,
306
+ text_mask: torch.Tensor,
307
+ rng_seed: int,
308
+ num_steps: int,
309
+ cfg_scale_text: float,
310
+ cfg_scale_speaker: float,
311
+ cfg_min_t: float,
312
+ cfg_max_t: float,
313
+ truncation_factor: float | None,
314
+ rescale_k: float | None,
315
+ rescale_sigma: float | None,
316
+ speaker_k_scale: float | None,
317
+ speaker_k_max_layers: int | None,
318
+ speaker_k_min_t: float | None,
319
+ block_size: int | None = None,
320
+ ) -> torch.Tensor:
321
+
322
+ if block_size is None:
323
+ block_size = 640
324
+
325
+ torch.manual_seed(rng_seed)
326
+
327
+ INIT_SCALE = 0.999
328
+
329
+ device, dtype = model.device, model.dtype
330
+
331
+ batch_size = text_input_ids.shape[0]
332
+
333
+ t_schedule = torch.linspace(1., 0., num_steps + 1, device=device) * INIT_SCALE
334
+
335
+ text_input_ids_uncond, text_mask_uncond = _get_uncond_text_input_ids_and_mask(text_input_ids.shape[0], text_input_ids.shape[1], device=device)
336
+
337
+ # TODO THIS / THE BELOW IS TECHNICALLY INCORRECT, AS IT ASSUMES A CAUSAL TEXT ENCODER (which is not the case)
338
+ # IF THE TEXT ENCODER WERE CAUSAL, THEN USING AN UNCOND TEXT MASK ON COND TEXT INPUTS GIVES YOU AN UNCOND STATE DUE TO BOS=0
339
+ # HOWEVER, MIGHT NOT MAKE MUCH OF A DIFFERENCE
340
+ # CHANGED ALL OTHER SAMPLERS TO USE CORRECT UNCONDITIONAL CACHES
341
+
342
+ speaker_latent_uncond, speaker_mask_uncond = torch.zeros_like(speaker_latent), torch.zeros_like(speaker_mask)
343
+
344
+ full_text_input_ids = torch.cat([text_input_ids, text_input_ids], dim=0)
345
+ full_text_mask = torch.cat([text_mask, text_mask_uncond], dim=0)
346
+
347
+ full_speaker_latent = torch.cat([speaker_latent, speaker_latent_uncond], dim=0)
348
+ full_speaker_mask = torch.cat([speaker_mask, speaker_mask_uncond], dim=0)
349
+
350
+ kv_cache_full = model.get_kv_cache(
351
+ speaker_latent=full_speaker_latent.to(dtype),
352
+ speaker_mask=full_speaker_mask,
353
+ text_input_ids=full_text_input_ids,
354
+ text_mask=full_text_mask,
355
+ )
356
+ kv_cache = _get_first_n_kv_cache(kv_cache_full, batch_size)
357
+
358
+ if speaker_k_scale is not None:
359
+ _multiply_speaker_kv_cache(kv_cache_full, speaker_k_scale, text_input_ids.shape[-1], speaker_k_max_layers)
360
+
361
+ x_t = torch.randn((batch_size, block_size, 80), device=device, dtype=torch.float32)
362
+ if truncation_factor is not None:
363
+ x_t = x_t * truncation_factor
364
+
365
+ for i in range(num_steps):
366
+ t, t_next = t_schedule[i], t_schedule[i+1]
367
+
368
+ has_cfg = ((t >= cfg_min_t) * (t <= cfg_max_t)).item()
369
+
370
+ if has_cfg:
371
+ v_cond, v_uncond = model(
372
+ x=torch.cat([x_t, x_t], dim=0).to(dtype),
373
+ t=(torch.ones((batch_size * 2,), device=device) * t).to(dtype),
374
+ text_input_ids=None,
375
+ text_mask=torch.cat([text_mask, text_mask_uncond if i % 2 == 0 else text_mask], dim=0),
376
+ speaker_latent=None,
377
+ speaker_mask=torch.cat([speaker_mask, speaker_mask if i % 2 == 0 else speaker_mask_uncond], dim=0),
378
+ kv_cache=kv_cache_full,
379
+ ).float().chunk(2, dim=0)
380
+ v_pred = v_cond + (cfg_scale_text if i % 2 == 0 else cfg_scale_speaker) * (v_cond - v_uncond)
381
+ else:
382
+ v_pred = model(
383
+ x=x_t.to(dtype),
384
+ t=(torch.ones((batch_size,), device=device) * t).to(dtype),
385
+ text_input_ids=None,
386
+ text_mask=text_mask,
387
+ speaker_latent=None,
388
+ speaker_mask=speaker_mask,
389
+ kv_cache=kv_cache,
390
+ ).float()
391
+
392
+ if rescale_k is not None and rescale_sigma is not None:
393
+ v_pred = _temporal_score_rescale(v_pred, x_t, t, rescale_k, rescale_sigma)
394
+
395
+ if speaker_k_scale is not None and t_next < speaker_k_min_t and t >= speaker_k_min_t:
396
+ _multiply_speaker_kv_cache(kv_cache_full, 1. / speaker_k_scale, text_input_ids.shape[-1], speaker_k_max_layers)
397
+
398
+ x_t = x_t + v_pred * (t_next - t)
399
+
400
+ return x_t
401
+
402
+
403
+ @torch.inference_mode()
404
+ def sample_euler_apg_independent_guidances(
405
+ model: EchoDiT,
406
+ speaker_latent: torch.Tensor,
407
+ speaker_mask: torch.Tensor,
408
+ text_input_ids: torch.Tensor,
409
+ text_mask: torch.Tensor,
410
+ rng_seed: int,
411
+ num_steps: int,
412
+ cfg_scale_text: float,
413
+ cfg_scale_speaker: float,
414
+ cfg_min_t: float,
415
+ cfg_max_t: float,
416
+ truncation_factor: float | None,
417
+ rescale_k: float | None,
418
+ rescale_sigma: float | None,
419
+ apg_eta_text: float,
420
+ apg_eta_speaker: float,
421
+ apg_momentum_text: float | None,
422
+ apg_momentum_speaker: float | None,
423
+ apg_norm_text: float | None,
424
+ apg_norm_speaker: float | None,
425
+ speaker_k_scale: float | None,
426
+ speaker_k_max_layers: int | None,
427
+ speaker_k_min_t: float | None,
428
+ block_size: int | None = None,
429
+ ) -> torch.Tensor:
430
+
431
+ if block_size is None:
432
+ block_size = 640
433
+
434
+ if apg_momentum_text is None:
435
+ apg_momentum_text = 0.0
436
+ if apg_momentum_speaker is None:
437
+ apg_momentum_speaker = 0.0
438
+
439
+ torch.manual_seed(rng_seed)
440
+
441
+ INIT_SCALE = 0.999
442
+
443
+ device, dtype = model.device, model.dtype
444
+
445
+ batch_size = text_input_ids.shape[0]
446
+
447
+ t_schedule = torch.linspace(1., 0., num_steps + 1, device=device) * INIT_SCALE
448
+
449
+ text_input_ids_uncond, text_mask_uncond = _get_uncond_text_input_ids_and_mask(text_input_ids.shape[0], text_input_ids.shape[1], device=device)
450
+
451
+ speaker_latent_uncond, speaker_mask_uncond = torch.zeros_like(speaker_latent), torch.zeros_like(speaker_mask)
452
+
453
+ full_text_input_ids = torch.cat([text_input_ids, text_input_ids_uncond, text_input_ids], dim=0)
454
+ full_text_mask = torch.cat([text_mask, text_mask_uncond, text_mask], dim=0)
455
+
456
+ full_speaker_latent = torch.cat([speaker_latent, speaker_latent, speaker_latent_uncond], dim=0)
457
+ full_speaker_mask = torch.cat([speaker_mask, speaker_mask, speaker_mask_uncond], dim=0)
458
+
459
+ kv_cache_full = model.get_kv_cache(
460
+ speaker_latent=full_speaker_latent.to(dtype),
461
+ speaker_mask=full_speaker_mask,
462
+ text_input_ids=full_text_input_ids,
463
+ text_mask=full_text_mask,
464
+ )
465
+ kv_cache = _get_first_n_kv_cache(kv_cache_full, batch_size)
466
+
467
+ if speaker_k_scale is not None:
468
+ _multiply_speaker_kv_cache(kv_cache_full, speaker_k_scale, text_input_ids.shape[-1], speaker_k_max_layers)
469
+
470
+ x_t = torch.randn((batch_size, block_size, 80), device=device, dtype=torch.float32)
471
+ if truncation_factor is not None:
472
+ x_t = x_t * truncation_factor
473
+
474
+ buf_text = torch.zeros_like(x_t)
475
+ buf_speaker = torch.zeros_like(x_t)
476
+
477
+ for i in range(num_steps):
478
+ t, t_next = t_schedule[i], t_schedule[i+1]
479
+
480
+ has_cfg = ((t >= cfg_min_t) * (t <= cfg_max_t)).item()
481
+
482
+ if has_cfg:
483
+ v_cond, v_uncond_text, v_uncond_speaker = model(
484
+ x=torch.cat([x_t, x_t, x_t], dim=0).to(dtype),
485
+ t=(torch.ones((batch_size * 3,), device=device) * t).to(dtype),
486
+ text_input_ids=None,
487
+ text_mask=full_text_mask,
488
+ speaker_latent=None,
489
+ speaker_mask=full_speaker_mask,
490
+ kv_cache=kv_cache_full,
491
+ ).float().chunk(3, dim=0)
492
+
493
+ x0_cond = x_t - t * v_cond
494
+ x0_uncond_text = x_t - t * v_uncond_text
495
+ x0_uncond_speaker = x_t - t * v_uncond_speaker
496
+
497
+ diff_text = x0_cond - x0_uncond_text
498
+ diff_speaker = x0_cond - x0_uncond_speaker
499
+
500
+ buf_text = diff_text + apg_momentum_text * buf_text
501
+ diff_text = buf_text
502
+
503
+ buf_speaker = diff_speaker + apg_momentum_speaker * buf_speaker
504
+ diff_speaker = buf_speaker
505
+
506
+ if apg_norm_text is not None:
507
+ nt = torch.sqrt((diff_text * diff_text).sum(dim=tuple(range(1, diff_text.dim())), keepdim=True) + 1e-12)
508
+ s = torch.minimum(torch.ones_like(nt), (torch.as_tensor(apg_norm_text, device=device, dtype=diff_text.dtype) / nt))
509
+ diff_text = diff_text * s
510
+ if apg_norm_speaker is not None:
511
+ ns = torch.sqrt((diff_speaker * diff_speaker).sum(dim=tuple(range(1, diff_speaker.dim())), keepdim=True) + 1e-12)
512
+ s = torch.minimum(torch.ones_like(ns), (torch.as_tensor(apg_norm_speaker, device=device, dtype=diff_speaker.dtype) / ns))
513
+ diff_speaker = diff_speaker * s
514
+
515
+ c_norm = torch.sqrt((x0_cond * x0_cond).sum(dim=tuple(range(1, x0_cond.dim())), keepdim=True) + 1e-12)
516
+ c_hat = x0_cond / c_norm
517
+
518
+ par_text = (diff_text * c_hat).sum(dim=tuple(range(1, diff_text.dim())), keepdim=True) * c_hat
519
+ ort_text = diff_text - par_text
520
+ upd_text = ort_text + apg_eta_text * par_text
521
+
522
+ par_speaker = (diff_speaker * c_hat).sum(dim=tuple(range(1, diff_speaker.dim())), keepdim=True) * c_hat
523
+ ort_speaker = diff_speaker - par_speaker
524
+ upd_speaker = ort_speaker + apg_eta_speaker * par_speaker
525
+
526
+ x0_pred = x0_cond + cfg_scale_text * upd_text + cfg_scale_speaker * upd_speaker
527
+ v_pred = (x_t - x0_pred) / t
528
+ else:
529
+ v_pred = model(
530
+ x=x_t.to(dtype),
531
+ t=(torch.ones((batch_size,), device=device) * t).to(dtype),
532
+ text_input_ids=None,
533
+ text_mask=text_mask,
534
+ speaker_latent=None,
535
+ speaker_mask=speaker_mask,
536
+ kv_cache=kv_cache,
537
+ ).float()
538
+
539
+ if rescale_k is not None and rescale_sigma is not None:
540
+ v_pred = _temporal_score_rescale(v_pred, x_t, t, rescale_k, rescale_sigma)
541
+
542
+ if speaker_k_scale is not None and t_next < speaker_k_min_t and t >= speaker_k_min_t:
543
+ _multiply_speaker_kv_cache(kv_cache_full, 1. / speaker_k_scale, text_input_ids.shape[-1], speaker_k_max_layers)
544
+
545
+ x_t = x_t + v_pred * (t_next - t)
546
+
547
+ return x_t
548
+
549
+
550
+
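A note on the APG branch above: it differs from plain CFG only in how each guidance delta is applied. The delta is decomposed into a component parallel to the conditional x0 prediction, which is damped by eta, and an orthogonal component, which is kept as-is. A toy sketch of that projection on stand-in tensors (shapes and eta are illustrative only):

import torch

x0_cond = torch.randn(1, 640, 80)
diff = torch.randn(1, 640, 80)        # x0_cond - x0_uncond for one guidance branch
eta = 0.5

dims = tuple(range(1, diff.dim()))
c_hat = x0_cond / torch.sqrt((x0_cond ** 2).sum(dim=dims, keepdim=True) + 1e-12)
par = (diff * c_hat).sum(dim=dims, keepdim=True) * c_hat    # component along x0_cond
ort = diff - par                                            # component orthogonal to x0_cond
upd = ort + eta * par                                       # eta < 1 shrinks the parallel part

assert torch.allclose(par + ort, diff, atol=1e-5)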
551
+ # router
552
+
553
+ class GuidanceMode(Enum):
554
+ INDEPENDENT = "independent"
555
+ APG = "apg"
556
+ JOINT = "joint"
557
+ ALTERNATING = "alternating"
558
+
559
+
560
+ def sample_euler_cfg_any(
561
+ model: EchoDiT,
562
+ speaker_latent: torch.Tensor,
563
+ speaker_mask: torch.Tensor,
564
+ text_input_ids: torch.Tensor,
565
+ text_mask: torch.Tensor,
566
+ rng_seed: int,
567
+ guidance_mode: GuidanceMode,
568
+ num_steps: int,
569
+ cfg_scale_text: float,
570
+ cfg_scale_speaker: float | None,
571
+ cfg_min_t: float,
572
+ cfg_max_t: float,
573
+ truncation_factor: float | None,
574
+ rescale_k: float | None,
575
+ rescale_sigma: float | None,
576
+ speaker_k_scale: float | None,
577
+ speaker_k_min_t: float | None,
578
+ speaker_k_max_layers: int | None,
579
+ apg_eta_text: float | None,
580
+ apg_eta_speaker: float | None,
581
+ apg_momentum_text: float | None,
582
+ apg_momentum_speaker: float | None,
583
+ apg_norm_text: float | None,
584
+ apg_norm_speaker: float | None,
585
+ block_size: int | None = None,
586
+ ) -> torch.Tensor:
587
+
588
+ if guidance_mode == GuidanceMode.INDEPENDENT:
589
+ assert cfg_scale_speaker is not None, "cfg_scale_speaker must be provided for independent guidances"
590
+ return sample_euler_cfg_independent_guidances(
591
+ model=model,
592
+ speaker_latent=speaker_latent,
593
+ speaker_mask=speaker_mask,
594
+ text_input_ids=text_input_ids,
595
+ text_mask=text_mask,
596
+ rng_seed=rng_seed,
597
+ num_steps=num_steps,
598
+ cfg_scale_text=cfg_scale_text,
599
+ cfg_scale_speaker=cfg_scale_speaker,
600
+ cfg_min_t=cfg_min_t,
601
+ cfg_max_t=cfg_max_t,
602
+ truncation_factor=truncation_factor,
603
+ rescale_k=rescale_k,
604
+ rescale_sigma=rescale_sigma,
605
+ speaker_k_scale=speaker_k_scale,
606
+ speaker_k_max_layers=speaker_k_max_layers,
607
+ speaker_k_min_t=speaker_k_min_t,
608
+ block_size=block_size,
609
+ )
610
+
611
+ elif guidance_mode == GuidanceMode.APG:
612
+ assert cfg_scale_speaker is not None, "cfg_scale_speaker must be provided for APG"
613
+ assert apg_eta_text is not None, "apg_eta_text must be provided for APG"
614
+ assert apg_eta_speaker is not None, "apg_eta_speaker must be provided for APG"
615
+ return sample_euler_apg_independent_guidances(
616
+ model=model,
617
+ speaker_latent=speaker_latent,
618
+ speaker_mask=speaker_mask,
619
+ text_input_ids=text_input_ids,
620
+ text_mask=text_mask,
621
+ rng_seed=rng_seed,
622
+ num_steps=num_steps,
623
+ cfg_scale_text=cfg_scale_text,
624
+ cfg_scale_speaker=cfg_scale_speaker,
625
+ cfg_min_t=cfg_min_t,
626
+ cfg_max_t=cfg_max_t,
627
+ truncation_factor=truncation_factor,
628
+ rescale_k=rescale_k,
629
+ rescale_sigma=rescale_sigma,
630
+ apg_eta_text=apg_eta_text,
631
+ apg_eta_speaker=apg_eta_speaker,
632
+ apg_momentum_text=apg_momentum_text,
633
+ apg_momentum_speaker=apg_momentum_speaker,
634
+ apg_norm_text=apg_norm_text,
635
+ apg_norm_speaker=apg_norm_speaker,
636
+ speaker_k_scale=speaker_k_scale,
637
+ speaker_k_max_layers=speaker_k_max_layers,
638
+ speaker_k_min_t=speaker_k_min_t,
639
+ block_size=block_size,
640
+ )
641
+
642
+ elif guidance_mode == GuidanceMode.JOINT:
643
+ assert cfg_scale_text == cfg_scale_speaker or cfg_scale_speaker is None, "cfg_scale_text and cfg_scale_speaker must be the same or cfg_scale_speaker must be None"
644
+ return sample_euler_cfg(
645
+ model=model,
646
+ speaker_latent=speaker_latent,
647
+ speaker_mask=speaker_mask,
648
+ text_input_ids=text_input_ids,
649
+ text_mask=text_mask,
650
+ rng_seed=rng_seed,
651
+ num_steps=num_steps,
652
+ cfg_scale=cfg_scale_text,
653
+ cfg_min_t=cfg_min_t,
654
+ cfg_max_t=cfg_max_t,
655
+ truncation_factor=truncation_factor,
656
+ rescale_k=rescale_k,
657
+ rescale_sigma=rescale_sigma,
658
+ speaker_k_scale=speaker_k_scale,
659
+ speaker_k_max_layers=speaker_k_max_layers,
660
+ speaker_k_min_t=speaker_k_min_t,
661
+ block_size=block_size,
662
+ )
663
+
664
+ elif guidance_mode == GuidanceMode.ALTERNATING:
665
+ assert cfg_scale_speaker is not None, "cfg_scale_speaker must be provided for alternating guidances"
666
+ return sample_euler_cfg_alternating_guidances(
667
+ model=model,
668
+ speaker_latent=speaker_latent,
669
+ speaker_mask=speaker_mask,
670
+ text_input_ids=text_input_ids,
671
+ text_mask=text_mask,
672
+ rng_seed=rng_seed,
673
+ num_steps=num_steps,
674
+ cfg_scale_text=cfg_scale_text,
675
+ cfg_scale_speaker=cfg_scale_speaker,
676
+ cfg_min_t=cfg_min_t,
677
+ cfg_max_t=cfg_max_t,
678
+ truncation_factor=truncation_factor,
679
+ rescale_k=rescale_k,
680
+ rescale_sigma=rescale_sigma,
681
+ speaker_k_scale=speaker_k_scale,
682
+ speaker_k_max_layers=speaker_k_max_layers,
683
+ speaker_k_min_t=speaker_k_min_t,
684
+ block_size=block_size,
685
+ )
686
+
687
+ else:
688
+ raise ValueError(f"Unknown guidance mode: {guidance_mode}")
689
+
690
+
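For reference, sample_euler_cfg_any simply dispatches to the sampler matching the requested GuidanceMode. A minimal usage sketch, assuming an EchoDiT instance and pre-built speaker/text tensors are already available; the variable names, shapes, and parameter values below are illustrative, not prescribed defaults:

latents = sample_euler_cfg_any(
    model=model,                        # an EchoDiT instance on the target device
    speaker_latent=speaker_latent,      # speaker-prompt latents, (batch, prompt_len, latent_dim)
    speaker_mask=speaker_mask,          # (batch, prompt_len)
    text_input_ids=text_input_ids,      # tokenized transcript, (batch, text_len)
    text_mask=text_mask,                # (batch, text_len)
    rng_seed=0,
    guidance_mode=GuidanceMode.INDEPENDENT,
    num_steps=40,
    cfg_scale_text=2.0,
    cfg_scale_speaker=1.5,
    cfg_min_t=0.0,
    cfg_max_t=1.0,
    truncation_factor=None,
    rescale_k=None,
    rescale_sigma=None,
    speaker_k_scale=None,
    speaker_k_min_t=None,
    speaker_k_max_layers=None,
    apg_eta_text=None,
    apg_eta_speaker=None,
    apg_momentum_text=None,
    apg_momentum_speaker=None,
    apg_norm_text=None,
    apg_norm_speaker=None,
)                                       # -> (batch, block_size, 80) latent block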
silentcipher/__init__.py ADDED
@@ -0,0 +1,3 @@
1
+ from .server import get_model
2
+
3
+ __version__ = '1.0.4'
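Since the package's __init__ re-exports get_model from server.py, the vendored watermarker can be loaded with a single call. A minimal sketch, assuming the checkpoints are either present at the default local paths or downloadable from the Hugging Face Hub:

import silentcipher

# Falls back to downloading sony/silentcipher from the Hub when the default paths are missing.
model = silentcipher.get_model(model_type='44.1k', device='cpu')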
silentcipher/model.py ADDED
@@ -0,0 +1,95 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import numpy as np
4
+
5
+
6
+ class Layer(nn.Module):
7
+ def __init__(self, dim_in, dim_out, kernel_size, stride, padding):
8
+ super(Layer, self).__init__()
9
+ self.conv = nn.Conv2d(dim_in, dim_out, kernel_size=kernel_size, stride=stride, padding=padding, bias=True)
10
+ self.gate = nn.Conv2d(dim_in, dim_out, kernel_size=kernel_size, stride=stride, padding=padding, bias=True)
11
+ self.bn = nn.BatchNorm2d(dim_out)
12
+
13
+ def forward(self, x):
14
+ return self.bn(self.conv(x) * torch.sigmoid(self.gate(x)))
15
+
16
+ class Encoder(nn.Module):
17
+ def __init__(self, out_dim=32, n_layers=3, message_dim=0, message_band_size=None, n_fft=None):
18
+ super(Encoder, self).__init__()
19
+ assert message_band_size is not None
20
+ assert n_fft is not None
21
+ self.message_band_size = message_band_size
22
+ main = [Layer(dim_in=1, dim_out=32, kernel_size=3, stride=1, padding=1)]
23
+
24
+ for i in range(n_layers-2):
25
+ main.append(Layer(dim_in=32, dim_out=32, kernel_size=3, stride=1, padding=1))
26
+ main.append(Layer(dim_in=32, dim_out=out_dim, kernel_size=3, stride=1, padding=1))
27
+
28
+ self.main = nn.Sequential(*main)
29
+ self.linear = nn.Linear(message_dim, message_band_size)
30
+ self.n_fft = n_fft
31
+
32
+ def forward(self, x):
33
+ h = self.main(x)
34
+ return h
35
+
36
+ def transform_message(self, msg):
37
+ output = self.linear(msg.transpose(2, 3)).transpose(2, 3)
38
+ if self.message_band_size != self.n_fft // 2 + 1:
39
+ output = torch.nn.functional.pad(output, (0, 0, 0, self.n_fft // 2 + 1 - self.message_band_size))
40
+ return output
41
+
42
+ class CarrierDecoder(nn.Module):
43
+ def __init__(self, config, conv_dim, n_layers=4, message_band_size=1024):
44
+ super(CarrierDecoder, self).__init__()
45
+ self.config = config
46
+ self.message_band_size = message_band_size
47
+ layers = [Layer(dim_in=conv_dim, dim_out=96, kernel_size=3, stride=1, padding=1)]
48
+
49
+ for i in range(n_layers-2):
50
+ layers.append(Layer(dim_in=96, dim_out=96, kernel_size=3, stride=1, padding=1))
51
+
52
+ layers.append(Layer(dim_in=96, dim_out=1, kernel_size=1, stride=1, padding=0))
53
+
54
+ self.main = nn.Sequential(*layers)
55
+
56
+ def forward(self, x, message_sdr):
57
+ h = self.main(x)
58
+
59
+ if self.config.ensure_negative_message:
60
+ h = torch.abs(h)
61
+
62
+ h[:, :, self.message_band_size:, :] = 0
63
+
64
+ if not self.config.no_normalization:
65
+ h = h / torch.mean(h**2, dim=2, keepdim=True)**0.5 / (10**(message_sdr/20))
66
+
67
+ return h
68
+
69
+ class MsgDecoder(nn.Module):
70
+ def __init__(self, message_dim=0, message_band_size=None, channel_dim=128, num_layers=10):
71
+ super(MsgDecoder, self).__init__()
72
+ assert message_band_size is not None
73
+ self.message_band_size = message_band_size
74
+
75
+ main = [
76
+ nn.Dropout(0),
77
+ Layer(dim_in=1, dim_out=channel_dim, kernel_size=3, stride=1, padding=1)
78
+ ]
79
+ for l in range(num_layers - 2):
80
+ main += [
81
+ nn.Dropout(0),
82
+ Layer(dim_in=channel_dim, dim_out=channel_dim, kernel_size=3, stride=1, padding=1),
83
+ ]
84
+ main += [
85
+ nn.Dropout(0),
86
+ Layer(dim_in=channel_dim, dim_out=message_dim, kernel_size=3, stride=1, padding=1)
87
+ ]
88
+ self.main = nn.Sequential(*main)
89
+ self.linear = nn.Linear(self.message_band_size, 1)
90
+
91
+ def forward(self, x):
92
+
93
+ h = self.main(x[:, :, :self.message_band_size])
94
+ h = self.linear(h.transpose(2, 3)).squeeze(3).unsqueeze(1)
95
+ return h
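The modules above operate on 4-D spectrogram-shaped tensors, (batch, channels, frequency_bins, frames), and the stride-1, padding-1 convolutions preserve the spatial shape. A small shape-check sketch with hypothetical hyperparameters (the real values come from the checkpoint's hparams.yaml):

import torch

enc = Encoder(out_dim=32, n_layers=3, message_dim=40, message_band_size=1024, n_fft=2048).eval()
spec = torch.randn(2, 1, 1025, 200)    # (batch, 1, n_fft // 2 + 1, frames) magnitude spectrogram
with torch.no_grad():
    feat = enc(spec)                   # -> (2, 32, 1025, 200): same spatial size, out_dim channels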
silentcipher/server.py ADDED
@@ -0,0 +1,480 @@
1
+ from calendar import c
2
+ import os
3
+ import argparse
4
+ import re
5
+ from tabnanny import check
6
+ import yaml
7
+ import time
8
+ import numpy as np
9
+ import soundfile as sf
10
+ from scipy import stats as st
11
+ import librosa
12
+ from pydub import AudioSegment
13
+ import torch
14
+ from torch import nn
15
+
16
+ from .model import Encoder, CarrierDecoder, MsgDecoder
17
+ from .stft import STFT
18
+
19
+ class Model():
20
+
21
+ def __init__(self, config, device='cpu'):
22
+
23
+ self.config = config
24
+ self.device = device
25
+
26
+ self.n_messages = config.n_messages
27
+ self.model_type = config.model_type
28
+ self.message_dim = config.message_dim
29
+ self.message_len = config.message_len
30
+
31
+ # model dimensions
32
+ self.enc_conv_dim = 16
33
+ self.enc_num_repeat = 3
34
+ self.dec_c_num_repeat = self.enc_num_repeat
35
+ self.dec_m_conv_dim = 1
36
+ self.dec_m_num_repeat = 8
37
+ self.encoder_out_dim = 32
38
+ self.dec_c_conv_dim = 32*3
39
+
40
+ self.enc_c = Encoder(n_layers=self.config.enc_n_layers,
41
+ message_dim=self.message_dim,
42
+ out_dim=self.encoder_out_dim,
43
+ message_band_size=self.config.message_band_size,
44
+ n_fft=self.config.N_FFT)
45
+
46
+ self.dec_c = CarrierDecoder(config=self.config,
47
+ conv_dim=self.dec_c_conv_dim,
48
+ n_layers=self.config.dec_c_n_layers,
49
+ message_band_size=self.config.message_band_size)
50
+
51
+ self.dec_m = [MsgDecoder(message_dim=self.message_dim,
52
+ message_band_size=self.config.message_band_size) for _ in range(self.n_messages)]
53
+ # ------ make parallel ------
54
+ self.enc_c = self.enc_c.to(self.device)
55
+ self.dec_c = self.dec_c.to(self.device)
56
+ self.dec_m = [m.to(self.device) for m in self.dec_m]
57
+
58
+ self.average_energy_VCTK=0.002837200844477648
59
+ self.stft = STFT(self.config.N_FFT, self.config.HOP_LENGTH)
60
+ self.stft.to(self.device)
61
+ self.load_models(config.load_ckpt)
62
+ self.sr = self.config.SR
63
+
64
+ def letters_encoding(self, patch_len, message_lst):
65
+
66
+ """
67
+ Encodes a list of messages into a compact representation and a padded representation.
68
+
69
+ Args:
70
+ patch_len (int): The length of the patch.
71
+ message_lst (list): A list of messages to be encoded.
72
+
73
+ Returns:
74
+ tuple: A tuple containing two numpy arrays:
75
+ - message: A padded representation of the messages, where each message is repeated to match the patch length.
76
+ - message_compact: A compact representation of the messages, where each message is encoded as a one-hot vector.
77
+
78
+ Raises:
79
+ AssertionError: If the length of any message in message_lst is not equal to self.config.message_len - 1.
80
+ """
81
+
82
+ message = []
83
+ message_compact = []
84
+ for i in range(self.n_messages):
85
+
86
+ assert len(message_lst[i]) == self.config.message_len - 1
87
+ index = np.concatenate((np.array(message_lst[i])+1, [0]))
88
+ one_hot = np.identity(self.message_dim)[index]
89
+ message_compact.append(one_hot)
90
+ if patch_len % self.message_len == 0:
91
+ message.append(np.tile(one_hot.T, (1, patch_len // self.message_len)))
92
+ else:
93
+ _ = np.tile(one_hot.T, (1, patch_len // self.message_len))
94
+ _ = np.concatenate([_, one_hot.T[:, 0:patch_len % self.message_len]], axis=1)
95
+ message.append(_)
96
+ message = np.stack(message)
97
+ message_compact = np.stack(message_compact)
98
+ # message = np.pad(message, ((0, 0), (0, 129 - self.message_dim), (0, 0)), 'constant')
99
+ return message, message_compact
100
+
101
+ def get_best_ps(self, y_one_sec):
102
+
103
+ """
104
+ Calculates the best phase shift value for watermark decoding.
105
+
106
+ Args:
107
+ y_one_sec (numpy.ndarray): Input audio signal.
108
+
109
+ Returns:
110
+ int: The best phase shift value.
111
+
112
+ """
113
+
114
+ def check_accuracy(pred_values):
115
+
116
+ accuracy = 0
117
+ for i in range(pred_values.shape[1]):
118
+ unique, counts = np.unique(pred_values[:, i], return_counts=True)
119
+ accuracy += np.max(counts) / pred_values.shape[0]
120
+
121
+ return accuracy / pred_values.shape[1]
122
+
123
+ y = torch.FloatTensor(y_one_sec).unsqueeze(0).unsqueeze(0).to(self.device)
124
+ max_accuracy = 0
125
+ final_phase_shift = 0
126
+
127
+ for ps in range(0, self.config.HOP_LENGTH, 10):
128
+
129
+ carrier, _ = self.stft.transform(y[0:1, 0:1, ps:].squeeze(1))
130
+ carrier = carrier[:, None]
131
+
132
+ for i in range(self.n_messages): # decode each msg_i using decoder_m_i
133
+ msg_reconst = self.dec_m[i](carrier)
134
+ pred_values = torch.argmax(msg_reconst[0, 0], dim=0).data.cpu().numpy()
135
+ pred_values = pred_values[0:int(msg_reconst.shape[3]/self.config.message_len)*self.config.message_len]
136
+ pred_values = pred_values.reshape([-1, self.config.message_len])
137
+ cur_acc = check_accuracy(pred_values)
138
+ if cur_acc > max_accuracy:
139
+ max_accuracy = cur_acc
140
+ final_phase_shift = ps
141
+
142
+ return final_phase_shift
143
+
144
+ def get_confidence(self, pred_values, message):
145
+ """
146
+ Calculates the confidence of the predicted values based on the provided message.
147
+
148
+ Parameters:
149
+ pred_values (numpy.ndarray): The predicted values.
150
+ message (str): The message used for prediction.
151
+
152
+ Returns:
153
+ float: The confidence score.
154
+
155
+ Raises:
156
+ AssertionError: If the length of the message is not equal to the number of columns in pred_values.
157
+
158
+ """
159
+ assert len(message) == pred_values.shape[1], f'{len(message)} | {pred_values.shape}'
160
+ return np.mean((pred_values == message[None]).astype(np.float32)).item()
161
+
162
+ def sdr(self, orig, recon):
163
+ """
164
+ Calculate the Signal-to-Distortion Ratio (SDR) between the original and reconstructed signals.
165
+
166
+ Parameters:
167
+ orig (numpy.ndarray): The original signal.
168
+ recon (numpy.ndarray): The reconstructed signal.
169
+
170
+ Returns:
171
+ float: The Signal-to-Distortion Ratio (SDR) value.
172
+
173
+ """
174
+
175
+ rms1 = ((np.mean(orig ** 2)) ** 0.5)
176
+ rms2 = ((np.mean((orig - recon) ** 2)) ** 0.5)
177
+ sdr = 20 * np.log10(rms1 / rms2)
178
+ return sdr
179
+
180
+ def load_audio(self, path):
181
+ """
182
+ Load an audio file from the given path and return the audio array and sample rate.
183
+
184
+ Args:
185
+ path (str): The path to the audio file.
186
+
187
+ Returns:
188
+ tuple: A tuple containing the audio array and sample rate.
189
+
190
+ """
191
+ audio = AudioSegment.from_file(path)
192
+ audio_array, sr = (np.array(audio.get_array_of_samples(), dtype=np.float32).reshape((-1, audio.channels)) / (
193
+ 1 << (8 * audio.sample_width - 1))), audio.frame_rate
194
+ if audio_array.shape[1] == 1:
195
+ audio_array = audio_array[:, 0]
196
+
197
+ return audio_array, sr
198
+
199
+ def encode(self, in_path, out_path, message_list, message_sdr=None, calc_sdr=True, disable_checks=False):
200
+ """
201
+ Encodes a message into an audio file.
202
+
203
+ Parameters:
204
+ - in_path (str): The path to the input audio file.
205
+ - out_path (str): The path to save the output audio file.
206
+ - message_list (list): A list of messages to be encoded into the audio file.
207
+ - message_sdr (float, optional): The Signal-to-Distortion Ratio (SDR) of the message. Defaults to None.
208
+ - calc_sdr (bool, optional): Whether to calculate the SDR of the encoded audio. Defaults to True.
209
+ - disable_checks (bool, optional): Whether to disable input checks. Defaults to False.
210
+
211
+ Returns:
212
+ - dict: A dictionary containing the status of the encoding process, the SDR value(s), the time taken for encoding, and the time taken per second of audio.
213
+
214
+ """
215
+ y, orig_sr = self.load_audio(in_path)
216
+ start = time.time()
217
+ encoded_y, sdr = self.encode_wav(y, orig_sr, message_list=message_list, message_sdr=message_sdr, calc_sdr=calc_sdr, disable_checks=disable_checks)
218
+ time_taken = time.time() - start
219
+ sf.write(out_path, encoded_y, orig_sr)
220
+
221
+ if type(sdr) == list:
222
+ return {'status': True, 'sdr': [f'{sdr_i:.2f}' for sdr_i in sdr], 'time_taken': time_taken, 'time_taken_per_second': time_taken / (y.shape[0] / orig_sr)}
223
+ else:
224
+ return {'status': True, 'sdr': f'{sdr:.2f}', 'time_taken': time_taken, 'time_taken_per_second': time_taken / (y.shape[0] / orig_sr)}
225
+
226
+ def decode(self, path, phase_shift_decoding):
227
+ """
228
+ Decode the audio file at the given path using phase shift decoding.
229
+
230
+ Parameters:
231
+ path (str): The path to the audio file.
232
+ phase_shift_decoding (bool): Flag indicating whether to use phase shift decoding.
233
+
234
+ Returns:
235
+ dict or list: A dictionary (or a list of dictionaries for multi-channel input) containing the decoded messages, confidences, and status
236
+ """
237
+
238
+ y, orig_sr = self.load_audio(path)
239
+
240
+ return self.decode_wav(y, orig_sr, phase_shift_decoding)
241
+
242
+ def encode_wav(self, y_multi_channel, orig_sr, message_list, message_sdr=None, calc_sdr=True, disable_checks=False):
243
+
244
+ """
245
+ Encodes a multi-channel audio waveform with a given message.
246
+
247
+ Args:
248
+ y_multi_channel (numpy.ndarray): The multi-channel audio waveform to be encoded.
249
+ orig_sr (int): The original sampling rate of the audio waveform.
250
+ message_list (list): The list of messages to be encoded. Each message may correspond to a channel in the audio waveform.
251
+ message_sdr (float, optional): The signal-to-distortion ratio (SDR) of the message. If not provided, the default SDR from the configuration is used.
252
+ calc_sdr (bool, optional): Flag indicating whether to calculate the SDR of the encoded waveform. Defaults to True.
253
+ disable_checks (bool, optional): Flag indicating whether to disable input audio checks. Defaults to False.
254
+
255
+ Returns:
256
+ tuple: A tuple containing the encoded multi-channel audio waveform and the SDR (if calculated).
257
+
258
+ Raises:
259
+ AssertionError: If the number of messages does not match the number of channels in the input audio waveform.
260
+ """
261
+
262
+ single_channel = False
263
+ if len(y_multi_channel.shape) == 1:
264
+ single_channel = True
265
+ y_multi_channel = y_multi_channel[:, None]
266
+
267
+ if message_sdr is None:
268
+ message_sdr = self.config.message_sdr
269
+ print(f'Using the default SDR of {self.config.message_sdr} dB')
270
+
271
+ if type(message_list[0]) == int:
272
+ message_list = [message_list]*y_multi_channel.shape[1]
273
+
274
+ y_watermarked_multi_channel = []
275
+ sdrs = []
276
+
277
+ assert len(message_list) == y_multi_channel.shape[1], f'{len(message_list)} | {y_multi_channel.shape[1]} Mismatch in the number of messages and channels in the input audio.'
278
+
279
+ for channel_i in range(y_multi_channel.shape[1]):
280
+ y = y_multi_channel[:, channel_i]
281
+ message = message_list[channel_i]
282
+
283
+ with torch.no_grad():
284
+
285
+ orig_y = y.copy()
286
+ if orig_sr != self.sr:
287
+ if orig_sr > self.sr:
288
+ print(f'WARNING! Reducing the sampling rate of the original audio from {orig_sr} -> {self.sr}. High frequency components may be lost!')
289
+ y = librosa.resample(y, orig_sr = orig_sr, target_sr = self.sr)
290
+ original_power = np.mean(y**2)
291
+
292
+ if not disable_checks:
293
+ if original_power == 0:
294
+ print('WARNING! The input audio has a power of 0. This means the audio is likely just silence. Skipping encoding.')
295
+ return orig_y, 0
296
+
297
+ y = y * np.sqrt(self.average_energy_VCTK / original_power) # Noise has a power of 5% power of VCTK samples
298
+ y = torch.FloatTensor(y).unsqueeze(0).unsqueeze(0).to(self.device)
299
+ carrier, carrier_phase = self.stft.transform(y.squeeze(1))
300
+ carrier = carrier[:, None]
301
+ carrier_phase = carrier_phase[:, None]
302
+
303
+ def binary_encode(mes):
304
+ binary_message = ''.join(['{0:08b}'.format(mes_i) for mes_i in mes])
305
+ four_bit_msg = []
306
+ for i in range(len(binary_message)//2):
307
+ four_bit_msg.append(int(binary_message[i*2:i*2+2], 2))
308
+ return four_bit_msg
309
+
310
+ binary_encoded_message = binary_encode(message)
311
+
312
+ msgs, msgs_compact = self.letters_encoding(carrier.shape[3], [binary_encoded_message])
313
+ msg_enc = torch.from_numpy(msgs[None]).to(self.device).float()
314
+
315
+ carrier_enc = self.enc_c(carrier) # encode the carrier
316
+ msg_enc = self.enc_c.transform_message(msg_enc)
317
+
318
+ merged_enc = torch.cat((carrier_enc, carrier.repeat(1, 32, 1, 1), msg_enc.repeat(1, 32, 1, 1)), dim=1) # concat encodings on features axis
319
+
320
+ message_info = self.dec_c(merged_enc, message_sdr)
321
+ if self.config.frame_level_normalization:
322
+ message_info = message_info*(torch.mean((carrier**2), dim=2, keepdim=True)**0.5) # *time_weighing
323
+ elif self.config.utterance_level_normalization:
324
+ message_info = message_info*(torch.mean((carrier**2), dim=(2,3), keepdim=True)**0.5) # *time_weighing
325
+
326
+ if self.config.ensure_negative_message:
327
+ message_info = -message_info
328
+ carrier_reconst = torch.nn.functional.relu(message_info + carrier) # decode carrier, output in stft domain
329
+ elif self.config.ensure_constrained_message:
330
+ message_info[message_info > carrier] = carrier[message_info > carrier]
331
+ message_info[-message_info > carrier] = -carrier[-message_info > carrier]
332
+ carrier_reconst = message_info + carrier # decode carrier, output in stft domain
333
+ assert torch.all(carrier_reconst >= 0), 'negative values found in carrier_reconst'
334
+ else:
335
+ carrier_reconst = torch.abs(message_info + carrier) # decode carrier, output in stft domain
336
+
337
+ self.stft.num_samples = y.shape[2]
338
+
339
+ y = self.stft.inverse(carrier_reconst.squeeze(1), carrier_phase.squeeze(1)).data.cpu().numpy()[0, 0]
340
+ y = y * np.sqrt(original_power / (self.average_energy_VCTK)) # Noise has a power of 5% power of VCTK samples
341
+ if orig_sr != self.sr:
342
+ y = librosa.resample(y, orig_sr = self.sr, target_sr = orig_sr)
343
+
344
+ if calc_sdr:
345
+ sdr = self.sdr(orig_y, y)
346
+ else:
347
+ sdr = 0
348
+
349
+ y_watermarked_multi_channel.append(y[:, None])
350
+ sdrs.append(sdr)
351
+
352
+ y_watermarked_multi_channel = np.concatenate(y_watermarked_multi_channel, axis=1)
353
+
354
+ if single_channel:
355
+ y_watermarked_multi_channel = y_watermarked_multi_channel[:, 0]
356
+ sdrs = sdrs[0]
357
+
358
+ return y_watermarked_multi_channel, sdrs
359
+
360
+ def decode_wav(self, y_multi_channel, orig_sr, phase_shift_decoding):
361
+ """
362
+ Decode the given audio waveform to extract hidden messages.
363
+
364
+ Args:
365
+ y_multi_channel (numpy.ndarray): The multi-channel audio waveform.
366
+ orig_sr (int): The original sample rate of the audio waveform.
367
+ phase_shift_decoding (str): Flag indicating whether to perform phase shift decoding.
368
+
369
+ Returns:
370
+ dict or list: A list of dictionaries, one per channel, each containing the decoded messages, confidences, and status, if the input is multi-channel.
371
+ Otherwise, a dictionary containing the decoded messages, confidences, and status for a single channel.
372
+
373
+ Raises:
374
+ Exception: If the decoding process fails.
375
+
376
+ """
377
+ single_channel = False
378
+ if len(y_multi_channel.shape) == 1:
379
+ single_channel = True
380
+ y_multi_channel = y_multi_channel[:, None]
381
+
382
+ results = []
383
+
384
+ for channel_i in range(y_multi_channel.shape[1]):
385
+ y = y_multi_channel[:, channel_i]
386
+ try:
387
+ with torch.no_grad():
388
+ if orig_sr != self.sr:
389
+ y = librosa.resample(y, orig_sr = orig_sr, target_sr = self.sr)
390
+ original_power = np.mean(y**2)
391
+ y = y * np.sqrt(self.average_energy_VCTK / original_power) # Noise has a power of 5% power of VCTK samples
392
+ if phase_shift_decoding and phase_shift_decoding != 'false':
393
+ ps = self.get_best_ps(y)
394
+ else:
395
+ ps = 0
396
+ y = torch.FloatTensor(y[ps:]).unsqueeze(0).unsqueeze(0).to(self.device)
397
+ carrier, _ = self.stft.transform(y.squeeze(1))
398
+ carrier = carrier[:, None]
399
+
400
+ msg_reconst_list = []
401
+ confidence = []
402
+
403
+ for i in range(self.n_messages): # decode each msg_i using decoder_m_i
404
+ msg_reconst = self.dec_m[i](carrier)
405
+ pred_values = torch.argmax(msg_reconst[0, 0], dim=0).data.cpu().numpy()
406
+ pred_values = pred_values[0:int(msg_reconst.shape[3]/self.config.message_len)*self.config.message_len]
407
+ pred_values = pred_values.reshape([-1, self.config.message_len])
408
+
409
+ ord_values = st.mode(pred_values, keepdims=False).mode
410
+ end_char = np.min(np.nonzero(ord_values == 0)[0])
411
+ confidence.append(self.get_confidence(pred_values, ord_values))
412
+ if end_char == self.config.message_len:
413
+ ord_values = ord_values[:self.config.message_len-1]
414
+ else:
415
+ ord_values = np.concatenate([ord_values[end_char+1:], ord_values[:end_char]], axis=0)
416
+
417
+ # pred_values = ''.join([chr(v + 64) for v in ord_values])
418
+ msg_reconst_list.append((ord_values - 1).tolist())
419
+
420
+ def convert_to_8_bit_segments(msg_list):
421
+ segment_message_list = []
422
+ for msg_list_i in msg_list:
423
+ binary_format = ''.join(['{0:02b}'.format(mes_i) for mes_i in msg_list_i])
424
+ eight_bit_segments = [int(binary_format[i*8:i*8+8], 2) for i in range(len(binary_format)//8)]
425
+ segment_message_list.append(eight_bit_segments)
426
+ return segment_message_list
427
+ msg_reconst_list = convert_to_8_bit_segments(msg_reconst_list)
428
+
429
+ results.append({'messages': msg_reconst_list, 'confidences': confidence, 'status': True})
430
+ except Exception:
431
+ results.append({'messages': [], 'confidences': [], 'error': 'Could not find message', 'status': False})
432
+
433
+ if single_channel:
434
+ results = results[0]
435
+
436
+ return results
437
+
438
+ def convert_dataparallel_to_normal(self, checkpoint):
439
+
440
+ return {i[len('module.'):] if i.startswith('module.') else i: checkpoint[i] for i in checkpoint }
441
+
442
+ def load_models(self, ckpt_dir):
443
+
444
+ self.enc_c.load_state_dict(self.convert_dataparallel_to_normal(torch.load(os.path.join(ckpt_dir, "enc_c.ckpt"), map_location=self.device)))
445
+ self.dec_c.load_state_dict(self.convert_dataparallel_to_normal(torch.load(os.path.join(ckpt_dir, "dec_c.ckpt"), map_location=self.device)))
446
+ for i,m in enumerate(self.dec_m):
447
+ m.load_state_dict(self.convert_dataparallel_to_normal(torch.load(os.path.join(ckpt_dir, f"dec_m_{i}.ckpt"), map_location=self.device)))
448
+
449
+
450
+ def get_model(model_type='44.1k', ckpt_path='../Models/44_1_khz/73999_iteration', config_path='../Models/44_1_khz/73999_iteration/hparams.yaml', device='cpu'):
451
+
452
+ if model_type == '44.1k':
453
+ if not os.path.exists(ckpt_path) or not os.path.exists(config_path):
454
+ print('ckpt path or config path does not exist! Downloading the model from the Hugging Face Hub...')
455
+ from huggingface_hub import snapshot_download
456
+ folder_dir = snapshot_download(repo_id="sony/silentcipher")
457
+ ckpt_path = os.path.join(folder_dir, '44_1_khz/73999_iteration')
458
+ config_path = os.path.join(folder_dir, '44_1_khz/73999_iteration/hparams.yaml')
459
+
460
+ config = yaml.safe_load(open(config_path))
461
+ config = argparse.Namespace(**config)
462
+ config.load_ckpt = ckpt_path
463
+ model = Model(config, device)
464
+ elif model_type == '16k':
465
+ if not os.path.exists(ckpt_path) or not os.path.exists(config_path):
466
+ print('ckpt path or config path does not exist! Downloading the model from the Hugging Face Hub...')
467
+ from huggingface_hub import snapshot_download
468
+ folder_dir = snapshot_download(repo_id="sony/silentcipher")
469
+ ckpt_path = os.path.join(folder_dir, '16_khz/97561_iteration')
470
+ config_path = os.path.join(folder_dir, '16_khz/97561_iteration/hparams.yaml')
471
+
472
+ config = yaml.safe_load(open(config_path))
473
+ config = argparse.Namespace(**config)
474
+ config.load_ckpt = ckpt_path
475
+
476
+ model = Model(config, device)
477
+ else:
478
+ raise ValueError('Please specify a valid model_type [44.1k, 16k]')
479
+
480
+ return model
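Putting the pieces together, the Model class above watermarks an audio file and recovers the embedded bytes. A usage sketch, treating the file paths and the five-byte message as placeholders (the accepted message length is dictated by the checkpoint's message_len):

from silentcipher import get_model

model = get_model(model_type='44.1k', device='cpu')

result = model.encode('input.wav', 'watermarked.wav', [222, 173, 190, 239, 42])
print(result['sdr'], result['time_taken_per_second'])

decoded = model.decode('watermarked.wav', phase_shift_decoding=True)
print(decoded['messages'], decoded['confidences'])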
silentcipher/stft.py ADDED
@@ -0,0 +1,40 @@
1
+ import torch
2
+
3
+ class Singleton(type):
4
+ _instances = {}
5
+ def __call__(cls, *args, **kwargs):
6
+ if cls not in cls._instances:
7
+ cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
8
+ return cls._instances[cls]
9
+
10
+ class STFT(torch.nn.Module, metaclass=Singleton):
11
+ def __init__(self, filter_length=1024, hop_length=512):
12
+ super(STFT, self).__init__()
13
+
14
+ self.filter_length = filter_length
15
+ self.hop_len = hop_length
16
+ self.win_len = filter_length
17
+ self.window = torch.hann_window(self.win_len)
18
+ self.num_samples = -1
19
+
20
+ def transform(self, x):
21
+ x = torch.nn.functional.pad(x, (0, self.win_len - x.shape[1]%self.win_len))
22
+ fft = torch.stft(x, self.filter_length, self.hop_len, self.win_len, window=self.window.to(x.device), return_complex=True)
23
+
24
+ real_part, imag_part = fft.real, fft.imag
25
+
26
+ squared = real_part**2 + imag_part**2
27
+ additive_epsilon = torch.ones_like(squared) * (squared == 0).float() * 1e-24
28
+ magnitude = torch.sqrt(squared + additive_epsilon) - torch.sqrt(additive_epsilon)
29
+
30
+ phase = torch.atan2(imag_part, real_part).float()  # torch.autograd.Variable is deprecated and a no-op in modern PyTorch
31
+ return magnitude, phase
32
+
33
+ def inverse(self, magnitude, phase):
34
+
35
+ recombine_magnitude_phase = magnitude*torch.cos(phase) + 1j*magnitude*torch.sin(phase)
36
+ inverse_transform = torch.istft(recombine_magnitude_phase, self.filter_length, hop_length=self.hop_len, win_length=self.win_len, window=self.window.to(magnitude.device)).unsqueeze(1) # , length=self.num_samples
37
+ padding = self.win_len - (self.num_samples % self.win_len)
38
+ inverse_transform = inverse_transform[:, :, :-padding]
39
+ return inverse_transform
40
+
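The STFT wrapper keeps magnitude and phase separate so the watermark can be injected in the magnitude domain and resynthesized with the original phase. A round-trip sketch on a batch of mono audio (the tensor sizes are illustrative):

import torch
from silentcipher.stft import STFT

stft = STFT(filter_length=1024, hop_length=512)   # Singleton: repeated calls return the same instance
x = torch.randn(1, 44100)                         # (batch, samples)
mag, phase = stft.transform(x)                    # each (batch, 513, frames)
stft.num_samples = x.shape[1]                     # inverse() trims its internal padding based on this
y = stft.inverse(mag, phase)                      # -> (batch, 1, samples)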
text_presets.txt ADDED
@@ -0,0 +1,42 @@
1
+ Reading | [S1] The old lighthouse keeper had seen many storms in his thirty years on the rock, but nothing like this. The fog rolled in thick as wool, swallowing the beam of light before it could reach the churning waves below. Then he heard it, three short bells from the channel, where no ship should be at this hour. He grabbed his lantern and peered into the mist, his heart pounding. Something was out there, something that shouldn't exist.
2
+
3
+ Reading | [S1] Deep beneath the ocean's surface, where sunlight fades to perpetual twilight, extraordinary creatures have evolved in ways that defy imagination. Bioluminescent jellyfish pulse with ethereal blue light, while giant squid hunt in the crushing darkness. At depths of over two miles, the pressure is immense, enough to collapse a submarine, yet life persists.
4
+
5
+ Reading | [S1] The telegram arrived on a Tuesday morning in June, nineteen forty-three. Margaret's hands trembled as she tore open the envelope, dreading the words she knew might be inside. Her brother had shipped out to North Africa six months ago, and his letters had grown increasingly sparse.
6
+
7
+ Reading | [S1] The ancient map showed a path through the Whispering Mountains that no living traveler had taken in generations. Legends spoke of a hidden valley where time moved differently, where a single day in the outside world meant years had passed within. As dawn broke over the snow-capped peaks, Elena shouldered her pack and began the ascent. Whatever waited at the journey's end, whether treasure or peril,
8
+
9
+ Cartoon | [S1] After giving everything some more thought, I've decided it's in the best interest of humanity to acquire Nexus AI. (laughs) I've spoken with the CEO and he's on board. Well (laughs), at least that's the impression he gave initially.
10
+
11
+ Single (Disfluent) | [S1] ... explore how we can design, create interfaces that are not confusing, but at the same time can be powerful. Um, you know, I think, uh, in the, the famous, um, usability book, it's, uh, it's this, um, um, oh, geez, I'm, I'm blanking on the term, uh, uh, the, the rule about, um, uh, it's like the simplicity rule. I can't recall. Oh, cognitive load maybe.
12
+
13
+ Single (Disfluent) | [S1] Uh, complacency when the motivation isn't structured properly. Like for example, if you, if you're in the cor- if you work in the corporation for many years, a lot of corporate employees, they just, they're, they're aiming for that stock vesting and they're, they're doing just a sufficient job to, to, to reach that vesting and, and they don't, they're not performing any better than that. Um, and so I think, um, that showed me an important insight. Yeah.
14
+
15
+ Single (Disfluent) | [S1] We see the pattern of revelations, major shifts. I think Neptune in Pisces, which that transit has been happening all of 2021, and Neptune will remain in the sign of Pisces until March of 2029. So it's several years more of this transit. And what it brings is a lot of things, you know, the thing that I tend to emphasize is the profound dissolution or profound changes
16
+
17
+ Single (Disfluent) | [S1] I asked her, "Do you have like a phrase you use," and she mentioned she actually does. Like when things get tense, when there's like a moment, like if her, if her roommate is like venting about work drama or just like is stressed, and her, her roommate like deals with anxiety, I'm like, "Oh, this is probably how it feels to live with me." But, um, and like if, if, if things are rough, like she'll internally just like use this practice where she's like, like, "Not my problem, not mine to carry, not mine to handle, not mine to change." Like she'll sort of repeat that. So that's interesting.
18
+
19
+ Single (Disfluent) | [S1] If I examine the, the, if, if you examine the range of options, uh, beginning from, like, say, individual all the way, right? There will be some revenue stream, uh, there will be some purchase, there'll be some hardware profit margin for someone who creates a smart product, um, uh, there will be memberships, personal and business, uh, and then there'll be usage-based, right? So I still believe that that's kinda how, those are all the metrics. To your point, what is a membership? Up to now, folks
20
+
21
+ Single (Disfluent) | [S1] I think, if, if we can keep it under 25 points allowed, sure, our odds improve significantly. We wouldn't need to put up huge numbers ourselves, or at least that's the theory. And I should, I want to share some other stats which might be a bit outside our current discussion, but regarding this compared to 2018, the team's final four games that year, they managed 18 points total.
22
+
23
+ Singing | [S1] (singing) Amazing grace, how sweet the sound, that saved a wretch like me. I once was lost, but now am found, was blind, but now I see.
24
+
25
+ Conversation | [S1] Alright then. So, so 18 years you spent in that, uh, in that role, but alongside that in, in, was it while you were working that position in '93, you started doing some work with the network? [S2] Uh, yes. It was somewhere around '93. I, I, I played tennis pretty well, you know? I, I, I competed as a tennis player. And the, I got a chance to do some broadcasting over in Brisbane.
26
+
27
+ Conversation | [S1] ... that will provide the analytics component- [S2] Right. [S1] ... to ideally get you to adopt some of their other tools. And- [S2] (laughs) [S1] ... some of those features are valuable too. [S2] That's interesting. [S1] Mailchimp, I mean, that's campaign manage-, uh, not exactly campaign management, but messaging platforms. [S2] Uh-huh. [S1] The, the companies that are, you know,
28
+
29
+ Conversation | [S1] They were like, they were pumped for it, going wild for it, and it disappeared immediately. [S2] Yeah, I think it's about people understanding what's available first. Um... [S1] I think the finish on that one too was really nice. [S2] Yeah. [S1] I mean, that was pretty awesome. [S2] Have you seen those new editions?
30
+
31
+ Conversation | [S1] He was just practicing with them and they were on rotation. [S2] So that was probably in January. [S1] I think startup stereotypes, there is some like that, but some of them, I think they need to be changed. Like we don't all work twenty-hour days. [S2] No, they just need to, it's called not, it's based in Silicon Valley. [S1] Yeah. [S2] But the stereotypes would apply if they, it was called Techlife- [S1] Palo Alto. [S2] ... Cupertino or Mountain View, California.
32
+
33
+
34
+ Conversation | [S1] That's a nice overview. [S2] We were at the downtown cinema. [S1] By that, you mean the one in Riverside? [S2] Yeah. [S1] Yeah. So not exactly downtown. [S2] Not exactly downtown, yeah. [S1] I know a little bit about that area. [S2] (laughs) [S1] You know, Millbrook doesn't have a cinema. [S2] (laughs) It's the closest one for us. It's the closest. [S1] Yeah, that's true. [S2] The most nearby. [S1] Riverside is nearby. [S2] Riverside's close. [S1] That's fair. [S2] Support nearby. [S1] You can say, say Riverside, definitely. [S2] Well, yeah, fair enough.
35
+
36
+ Conversation | [S1] But they also, they also discovered, um, they also discovered like patterns in the desert, um, near Peru, like in the Atacama Desert. [S2] Yeah. [S1] Um, and like, it was like, of like perfectly, like, geo- geometric shapes. And they're like, "Yo, this is definitely not like formed by wind. This has to be artificial." [S2] Yeah, it's too precise.
37
+
38
+ Conversation | [S1] 'Cause I, yeah, there, there has to be a way that they can just make the, the system recognize that, no, you did not earn this- [S2] (laughs) [S1] ... on your own. You still have to go and complete one if you want it for your own- [S2] Right. [S1] ... like, profile. [S2] Right. Mm-hmm. [S1] So, yeah. [S2] Um, yeah. So let's actually move into multiplayer.
39
+
40
+ Conversation | [S1] Yeah. [S2] Yeah. TRS as a whole is just relaxed. [S1] But anyway, you know that Mirror app that launched and then got removed like a month later? [S2] Mirror, what, like, to your future? [S1] Yeah. [S2] Oh. [S1] So basically, there was an app, there's a show coming out. [S2] This is a show. [S1] Coming, I don't know what it is. [S2] Yeah, yeah, yeah. [S1] Like 2026 or something. Basically, Marcus, have you heard about this? [S2] I'm sorry, I don't know. No, I don't have an, it's an app- [S1] Okay, so I'll explain. I'll explain. [S2] Yeah. [S1] For context. So there's this app that launched in terms of the show called Mirror.
41
+
42
+ Conversation | [S1] Jamie Patterson, right? [S2] No, I know where- [S1] I know where- [S2] ... Patterson works as well. I know where- [S1] I know- I know he used to work near- on this street, and this is a weird street. [S2] The only person who I don't know where they work, Jamie. But anyway, why are we even talking about who works where? [S1] It was a- it was- it was a really weird street name where Jamie worked. [S2] I- I drove past this street on my commute. [S1] No, you didn't. [S2] Yeah, I did. [S1] No, you drove past the street that my street is down the street of. [S2] Nice. There's, like, one street in Oakfield, I think I'll be able to find it, mate.