# API Reference

## Module: `src.models.encoder`

### Class: `ByteLatentEncoder`

Converts byte sequences into latent patches with positional embeddings.

```python
class ByteLatentEncoder(nn.Module):
    def __init__(
        self,
        d_model: int = 512,
        patch_size: int = 4,
        dropout: float = 0.1
    )
```

**Parameters:**

- `d_model` (int): Latent dimension size
- `patch_size` (int): Number of bytes per patch
- `dropout` (float): Dropout probability

**Methods:**

```python
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """
    Args:
        x: (Batch, Seq_Len) - Input bytes [0-255]

    Returns:
        (Batch, Num_Patches, d_model) - Latent patches
    """
```

---

## Module: `src.models.layers`

### Class: `LinearAttention`

$O(N)$ causal attention using ELU feature maps.

```python
class LinearAttention(nn.Module):
    def __init__(
        self,
        d_model: int,
        num_heads: int = 8,
        dropout: float = 0.1
    )
```

**Methods:**

```python
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """
    Args:
        x: (Batch, Seq_Len, d_model)

    Returns:
        (Batch, Seq_Len, d_model)
    """
```

**Algorithm** (a reference sketch appears after the `LocalAutoregressiveHead` section below):

```
Q, K, V = elu(Wq x) + 1, elu(Wk x) + 1, Wv x
Attention = (Q @ cumsum(K ⊗ V)) / (Q @ cumsum(K) + ε)
```

---

### Class: `SlidingWindowAttention`

Causal attention with a fixed window size.

```python
class SlidingWindowAttention(nn.Module):
    def __init__(
        self,
        d_model: int,
        num_heads: int,
        window_size: int
    )
```

**Parameters:**

- `window_size` (int): Maximum distance for attention (no default; `AGIFORMER` passes 128 by default)

---

### Class: `HybridBlock`

Combines `LinearAttention` and `SlidingWindowAttention` in parallel.

```python
class HybridBlock(nn.Module):
    def __init__(
        self,
        d_model: int,
        num_heads: int,
        window_size: int,
        dropout: float
    )
```

**Methods:**

```python
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """
    Args:
        x: (Batch, Seq_Len, d_model)

    Returns:
        (Batch, Seq_Len, d_model)

    Algorithm:
        attn_out = SlidingWindowAttention(norm(x))
        ssm_out = LinearAttention(norm(x))
        x = x + out_proj(attn_out + ssm_out)
        x = x + MLP(norm(x))
    """
```

---

## Module: `src.models.reasoning`

### Class: `RecurrentReasoningBlock`

System 2 thinking loop with gated residual updates.

```python
class RecurrentReasoningBlock(nn.Module):
    def __init__(
        self,
        d_model: int,
        thinking_steps: int = 3,
        dropout: float = 0.1
    )
```

**Parameters:**

- `thinking_steps` (int): Number of refinement iterations

**Methods:**

```python
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """
    Args:
        x: (Batch, Seq_Len, d_model) - Initial latent

    Returns:
        (Batch, Seq_Len, d_model) - Refined latent

    Algorithm:
        for t in range(thinking_steps):
            update = MLP(norm(x))
            gate = sigmoid(W_gate @ norm(x))
            x = x + gate * update
    """
```

---

## Module: `src.models.agiformer`

### Class: `LocalAutoregressiveHead`

GRU-based byte decoder with teacher forcing.

```python
class LocalAutoregressiveHead(nn.Module):
    def __init__(
        self,
        d_model: int,
        patch_size: int,
        hidden_dim: int = 256
    )
```

**Methods:**

```python
def forward(
    self,
    latents: torch.Tensor,
    target_bytes: Optional[torch.Tensor] = None,
    temperature: float = 0.0
) -> torch.Tensor:
    """
    Args:
        latents: (Batch, Num_Patches, d_model)
        target_bytes: (Batch, Num_Patches * patch_size) - For training
        temperature: Sampling temperature (0 = greedy)

    Returns:
        Training: (Batch, Num_Patches, patch_size, 256) - Logits
        Inference: (Batch, Num_Patches, patch_size) - Byte IDs
    """
```

---
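**Reference sketch:** The `LinearAttention` recurrence documented above (feature map `φ(x) = elu(x) + 1` combined with prefix sums) can be written directly with `cumsum`. This is an illustrative sketch only, assuming per-head inputs of shape `(B, H, L, d_head)`; it materializes the running `K ⊗ V` term, which is simple but memory-hungry, and it is not taken from the repository's implementation.

```python
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v, eps: float = 1e-6):
    """O(L) causal attention with phi(x) = elu(x) + 1 feature maps.

    q, k, v: (B, H, L, d_head). Names and shapes are assumptions for this
    sketch, not a description of LinearAttention's internals.
    """
    q = F.elu(q) + 1
    k = F.elu(k) + 1
    # Prefix sums along the sequence axis enforce causality in a single pass.
    kv = torch.einsum("bhld,bhle->bhlde", k, v).cumsum(dim=2)  # Σ_{j≤i} k_j ⊗ v_j
    k_sum = k.cumsum(dim=2)                                    # Σ_{j≤i} k_j
    num = torch.einsum("bhld,bhlde->bhle", q, kv)              # numerator per position
    den = torch.einsum("bhld,bhld->bhl", q, k_sum).unsqueeze(-1) + eps
    return num / den                                           # (B, H, L, d_head)

# Quick shape check
q = k = v = torch.randn(2, 8, 16, 64)
assert causal_linear_attention(q, k, v).shape == (2, 8, 16, 64)
```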
### Class: `AGIFORMER`

Main model class.

```python
class AGIFORMER(nn.Module):
    def __init__(
        self,
        d_model: int = 512,
        n_layers: int = 6,
        num_heads: int = 8,
        patch_size: int = 4,
        window_size: int = 128,
        vocab_size: int = 256,
        dropout: float = 0.1,
        thinking_steps: int = 3
    )
```

**Parameters:**

- `d_model`: Latent dimension
- `n_layers`: Number of `HybridBlock`s
- `num_heads`: Attention heads per layer
- `patch_size`: Bytes per patch
- `window_size`: Local attention window
- `vocab_size`: Always 256 (bytes)
- `dropout`: Dropout probability
- `thinking_steps`: System 2 iterations

**Methods:**

```python
def forward(
    self,
    x: torch.Tensor,
    target_bytes: Optional[torch.Tensor] = None,
    temperature: float = 0.0
) -> torch.Tensor:
    """
    Full forward pass: Encoder → Backbone → Reasoning → Decoder

    Args:
        x: (Batch, Seq_Len) - Input bytes
        target_bytes: (Batch, Seq_Len_Target) - For training
        temperature: Sampling temperature

    Returns:
        Training: (Batch, Num_Patches, patch_size, 256)
        Inference: (Batch, Num_Patches, patch_size)
    """
```

---

## Module: `src.data.real_data`

### Class: `Enwik8Dataset`

PyTorch dataset for enwik8.

```python
class Enwik8Dataset(torch.utils.data.Dataset):
    def __init__(
        self,
        data_dir: str = "./data",
        split: str = "train",
        seq_len: int = 1024
    )
```

**Parameters:**

- `split`: "train", "val", or "test"
- `seq_len`: Sequence length per sample

**Methods:**

```python
def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Returns:
        input: (seq_len,) - Context bytes
        target: (seq_len,) - Next-patch bytes
    """
```

### Function: `get_enwik8_dataloader`

Creates a DataLoader with automatic download.

```python
def get_enwik8_dataloader(
    batch_size: int,
    seq_len: int,
    split: str = "train"
) -> torch.utils.data.DataLoader:
    """
    Args:
        batch_size: Batch size
        seq_len: Sequence length
        split: "train", "val", or "test"

    Returns:
        DataLoader yielding (input, target) batches
    """
```

---

## Utility Scripts

### `train.py`

Main training loop.

**Key Functions:**

```python
def train_step(model, batch, optimizer, criterion):
    """Single training step"""

def validate(model, val_loader, criterion):
    """Validation loop"""
```

### `generate.py`

Inference with temperature sampling.

**Key Function:**

```python
def generate_text(
    model_path: str,
    prompt_text: str,
    max_new_tokens: int = 200,
    temperature: float = 0.7
) -> None:
    """Generate text from prompt"""
```
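**Reference sketch:** `generate.py` and `LocalAutoregressiveHead` both use the convention that `temperature=0` means greedy decoding and any positive value samples from a temperature-scaled softmax. The helper below is a generic, hypothetical illustration of that convention over byte logits, not the repository's sampling code.

```python
import torch

def sample_bytes(logits: torch.Tensor, temperature: float = 0.7) -> torch.Tensor:
    """Turn (..., 256) byte logits into byte IDs; temperature == 0 is greedy."""
    if temperature == 0.0:
        return logits.argmax(dim=-1)                       # greedy decoding
    probs = torch.softmax(logits / temperature, dim=-1)    # temperature-scaled softmax
    flat = probs.reshape(-1, probs.shape[-1])              # (M, 256) for multinomial
    return torch.multinomial(flat, num_samples=1).reshape(probs.shape[:-1])

logits = torch.randn(2, 16, 4, 256)          # (B, N, P, V), as produced by the decoder head
ids = sample_bytes(logits, temperature=0.7)  # (2, 16, 4) byte IDs
```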
### `inspect_reasoning.py`

System 2 diagnostics.

**Key Function:**

```python
def inspect_system_2(model_path: str) -> None:
    """
    Measures:
    - Latent refinement (Δz)
    - Gate biases
    - Parameter health
    """
```

---

## Example Usage

### Training from Scratch

```python
import torch
import torch.nn.functional as F
import torch.optim as optim

from src.models.agiformer import AGIFORMER
from src.data.real_data import get_enwik8_dataloader

model = AGIFORMER(d_model=512, n_layers=6, thinking_steps=3)
train_loader = get_enwik8_dataloader(batch_size=4, seq_len=1024)
optimizer = optim.AdamW(model.parameters(), lr=3e-4)

for batch in train_loader:
    x, target = batch                       # (B, L) input bytes, (B, L) target bytes
    logits = model(x, target_bytes=target)  # (B, N, P, 256) logits
    loss = F.cross_entropy(logits.view(-1, 256), target.view(-1))

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
    optimizer.step()
```

### Custom Inference

```python
model = AGIFORMER()
model.load_state_dict(torch.load("best_model.pth"))
model.eval()

prompt_bytes = torch.tensor([ord(c) for c in "Hello world"])  # (L,) byte IDs

with torch.no_grad():
    output = model(prompt_bytes.unsqueeze(0), temperature=0.7)  # (1, N, P)

generated = output[0, -1, :].tolist()                          # bytes of the last patch
text = ''.join([chr(b) for b in generated if 32 <= b <= 126])  # keep printable ASCII
print(text)
```

---

## Type Hints Summary

```python
# Common types
Tensor = torch.Tensor
IntTensor = torch.LongTensor
FloatTensor = torch.FloatTensor

# Shapes (notation)
B = Batch size
L = Sequence length
N = Number of patches (L / patch_size)
P = Patch size
D = d_model
H = num_heads
V = Vocabulary size (256)

# Input/Output shapes
Input:  (B, L)       IntTensor
Latent: (B, N, D)    FloatTensor
Logits: (B, N, P, V) FloatTensor
Output: (B, N, P)    IntTensor
```
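**Reference sketch:** The shape table above can be checked end to end with a short script. This is a sanity check, not part of the library: it assumes the documented constructor defaults, an input length divisible by `patch_size`, and that `target_bytes` has the same length as the input (as in the training example).

```python
import torch
from src.models.agiformer import AGIFORMER

B, L, P = 2, 64, 4
model = AGIFORMER(patch_size=P)

x = torch.randint(0, 256, (B, L))       # (B, L) IntTensor of byte IDs
target = torch.randint(0, 256, (B, L))  # assumed same length as the input

logits = model(x, target_bytes=target)      # training path
assert logits.shape == (B, L // P, P, 256)  # (B, N, P, V)

with torch.no_grad():
    out = model(x, temperature=0.0)     # inference path, greedy
assert out.shape == (B, L // P, P)      # (B, N, P)
```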