---
library_name: transformers
license: apache-2.0
language:
- en
tags:
- monoid
- causal-lm
- linear-attention
- state-space
- O(1)-inference
- vector-decay
- reasoning
pipeline_tag: text-generation
model-index:
- name: Spartacus-1B-Instruct
results: []
---
# Spartacus-1B-Instruct: Causal Monoid Language Model
A 1.3B-parameter language model that replaces softmax attention with **causal monoid state compression**, achieving **O(1) time per token** and **O(1) memory** at inference, regardless of sequence length.
## Monoid Attention: Internal Structure
```
MonoidAttention (per layer, per head)

  x_t ∈ R^{2048}
   │
   ├──> q_proj ──> RMSNorm ─────────────> q_t ∈ R^d        (query, scaled 1/√d)
   │
   ├──> k_proj ──> RMSNorm ──> SiLU ────> k_t ∈ R^d        (key, non-negative)
   │
   ├──> v_proj ─────────────────────────> v_t ∈ R^d        (value)
   │
   └──> decay_proj ──> -Softplus ───────> log α_t ∈ R^d    (vector decay gate)

  State matrix S_t ∈ R^{d×d}   ("compressed causal history", α_t ∈ (0,1]^d per dimension)

       S_t = diag(α_t) · S_{t-1} + k_t ⊗ v_t      (state update)
       o_t = q_t · S_t  ──> o_proj ──> output     (readout)
```
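The same cell expressed as code: a minimal, self-contained single-head sketch using the shapes from the diagram (hidden size 2048, head dimension d = 64). Module names mirror the diagram, but the parameter-free RMSNorm, bias layout, and single-step interface are simplifying assumptions for illustration, not the repository's `MonoidAttention` implementation.
```python
# Illustrative single-head, single-step reconstruction of the cell in the diagram.
# Assumptions: parameter-free RMSNorm, biases only on decay_proj, d_k = d_v = d.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # parameter-free RMSNorm (the real layers likely carry a learned scale)
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

class MonoidHeadSketch(nn.Module):
    def __init__(self, hidden_size: int = 2048, d: int = 64):
        super().__init__()
        self.d = d
        self.q_proj = nn.Linear(hidden_size, d, bias=False)
        self.k_proj = nn.Linear(hidden_size, d, bias=False)
        self.v_proj = nn.Linear(hidden_size, d, bias=False)
        self.decay_proj = nn.Linear(hidden_size, d, bias=True)
        self.o_proj = nn.Linear(d, hidden_size, bias=False)

    def step(self, x_t: torch.Tensor, S_prev: torch.Tensor):
        """One token: x_t is (B, hidden_size), S_prev is (B, d, d)."""
        q = rms_norm(self.q_proj(x_t)) / self.d ** 0.5      # query, scaled 1/√d
        k = F.silu(rms_norm(self.k_proj(x_t)))              # key, pushed toward non-negative
        v = self.v_proj(x_t)                                # value
        log_alpha = -F.softplus(self.decay_proj(x_t))       # vector decay gate, log α ≤ 0
        # S_t = diag(α_t) · S_{t-1} + k_t ⊗ v_t
        S_t = log_alpha.exp().unsqueeze(-1) * S_prev + k.unsqueeze(-1) * v.unsqueeze(-2)
        o_t = torch.einsum("bk,bkv->bv", q, S_t)            # readout: o_t = q_t · S_t
        return self.o_proj(o_t), S_t

head = MonoidHeadSketch()
x_t = torch.randn(2, 2048)                                  # batch of 2 tokens
out, S_1 = head.step(x_t, torch.zeros(2, 64, 64))
print(out.shape, S_1.shape)                                 # (2, 2048) and (2, 64, 64)
```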
## Key Properties
| Property | Transformer (Llama) | Spartacus (Monoid) |
|---|---|---|
| Inference time per token | O(T): scans the full KV-cache | **O(1)**: single state update |
| Inference memory per layer | O(T): stores all past K, V | **O(1)**: fixed d×d state matrix |
| Sequence length extrapolation | Degrades beyond training length | **Unlimited**: state size is constant |
| Causality | Imposed via attention mask | **Built into the recurrence** |
| Training complexity | O(T²) | **O(T)** via parallel prefix scan |
## The Monoid Recurrence
Standard attention computes:
```
o_t = Σ_{i≤t} softmax(q_t · k_i) v_i        ← requires O(T) KV-cache
```
Monoid attention compresses the entire causal history into a **fixed-size state matrix** S_t per head:
```
S_t = diag(α_t) · S_{t-1} + k_t ⊗ v_t       ← vector-decay monoid recurrence
o_t = q_t · S_t                             ← state readout
```
This is a monoid because the binary operator `(log_α, S) ∘ (log_β, X) = (log_α + log_β, exp(log_β)·S + X)` is **associative**, enabling an O(T) parallel prefix scan for training and an O(1) sequential update for inference.
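The snippet below is a quick numerical check of that claim: a minimal NumPy sketch (the helper name `combine` and the shapes are illustrative, not from the repository) verifying both the associativity of the operator and that folding the per-token elements reproduces the sequential recurrence from a zero initial state.
```python
# NumPy sanity check: the combine operator above is associative, so per-token
# elements can be folded in any grouping (parallel prefix scan) and still match
# the left-to-right recurrence. Shapes are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
T, d_k, d_v = 5, 4, 3
log_alpha = -np.abs(rng.normal(size=(T, d_k)))     # log α ≤ 0, i.e. α ∈ (0, 1]
k = rng.normal(size=(T, d_k))
v = rng.normal(size=(T, d_v))

def combine(a, b):
    """(log_α, S) ∘ (log_β, X) = (log_α + log_β, exp(log_β)·S + X), per d_k row."""
    log_a, S = a
    log_b, X = b
    return log_a + log_b, np.exp(log_b)[:, None] * S + X

elems = [(log_alpha[t], np.outer(k[t], v[t])) for t in range(T)]

# Associativity: (e0 ∘ e1) ∘ e2 == e0 ∘ (e1 ∘ e2)
left = combine(combine(elems[0], elems[1]), elems[2])
right = combine(elems[0], combine(elems[1], elems[2]))
assert np.allclose(left[0], right[0]) and np.allclose(left[1], right[1])

# Folding all elements reproduces S_t = diag(α_t)·S_{t-1} + k_t ⊗ v_t from S_0 = 0
S = np.zeros((d_k, d_v))
for t in range(T):
    S = np.exp(log_alpha[t])[:, None] * S + np.outer(k[t], v[t])
acc = elems[0]
for t in range(1, T):
    acc = combine(acc, elems[t])
assert np.allclose(acc[1], S)
print("associativity and scan/recurrence equivalence hold")
```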
## Vector Decay: Per-Dimension Memory Lifetimes
Unlike scalar decay (one α per head), Spartacus uses **vector decay**: each dimension of the d-vector has its own independent decay rate α_t[i] ∈ (0, 1]:
```
S_t[i,j] = α_t[i] · S_{t-1}[i,j] + k_t[i] · v_t[j]
```
This allows different feature dimensions to specialize:
- **Fast-decaying dimensions** (α → 0): local syntax, punctuation, function words
- **Slow-decaying dimensions** (α → 1): entity memory, topic tracking, long-range facts
The decay gate uses **Negative Softplus** activation:
```
log α_t = -softplus(W·x_t + b)
```
| Property | Value |
|---|---|
| Range | α ∈ (0, 1]: bounded, no explosion |
| Perfect memory | W·x → -∞ ⇒ softplus → 0 ⇒ α → 1 (lossless retention) |
| Full forgetting | W·x → +∞ ⇒ softplus → ∞ ⇒ α → 0 (complete reset) |
| Stability | α ≤ 1 by construction: no divergence regardless of input magnitude |
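To make the boundary rows of this table concrete, here is a tiny standalone check of the gate alone; the pre-activation `z` stands in for W·x_t + b.
```python
# Boundary behaviour of the negative-softplus gate: log α = -softplus(z), α = exp(log α).
import torch
import torch.nn.functional as F

z = torch.tensor([-30.0, -5.0, 0.0, 5.0, 30.0])   # stand-in for W·x_t + b
alpha = torch.exp(-F.softplus(z))
for zi, ai in zip(z.tolist(), alpha.tolist()):
    print(f"z = {zi:6.1f}  ->  alpha = {ai:.8f}")
# Very negative z -> alpha ≈ 1 (lossless retention); very positive z -> alpha ≈ 0 (full reset).
# alpha never exceeds 1, so the update can never amplify S_{t-1}.
```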
## Attention Mask: Padding-Aware Recurrence
The monoid recurrence correctly handles `attention_mask` for padded batches (e.g., left-padding during `generate()`). For PAD positions (mask=0):
```
log α = 0       ⇒  α = 1         (preserve state unchanged)
k = 0, v = 0    ⇒  k ⊗ v = 0     (no information injected)
```
Net effect: `S_t = 1·S_{t-1} + 0 = S_{t-1}`, so PAD acts as the **monoid identity element**, completely invisible to the recurrence. This ensures identical outputs whether inputs are padded or not.
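A short standalone check of the identity-element claim, using the masking convention described above (an illustration, not the model's actual masking code):
```python
# PAD as the monoid identity: with log α = 0 and k = v = 0, the update
# S_t = diag(α)·S_{t-1} + k ⊗ v collapses to S_t = S_{t-1}.
import torch

d = 8
S_prev = torch.randn(d, d)
log_alpha = torch.zeros(d)           # α = 1: preserve the state unchanged
k = torch.zeros(d)                   # no key   -> nothing injected
v = torch.zeros(d)                   # no value
S_t = log_alpha.exp().unsqueeze(-1) * S_prev + torch.outer(k, v)
print(torch.allclose(S_t, S_prev))   # True: the PAD step is invisible to the recurrence
```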
## Design Choices
- **SiLU-activated keys**: `k = SiLU(k_proj(x))` keeps key entries essentially non-negative, so outer products accumulate into the state with a consistent sign along the key dimension. This limits "feature erasure", where one token's contribution cancels another's
- **QK-Norm**: RMSNorm on both q and k before readout, stabilizing the scale of q·S when the state matrix accumulates many outer products
- **Log-space decay**: Working in log-space `log(α)` avoids numerical underflow when α^T → 0 for long sequences
- **Learnable h0**: The initial state S₀ = h0 is a learnable parameter (zero-initialized), acting as a compressed "system prompt"
- **Negative Softplus gate**: Ensures α ∈ (0, 1] by construction, allowing perfect memory (α=1) while preventing state explosion (α>1)
## Three Forward Paths
| Path | Condition | Complexity | Description |
|---|---|---|---|
| Training | `use_cache=False` | O(T) parallel scan | Vectorized outer products → parallel prefix scan → vectorized readout |
| Inference prefill | `use_cache=True, T>1` | O(T) parallel scan | Same as training + extracts final state S_T for cache |
| Inference decode | `use_cache=True, T=1` | **O(1)** monoid_op | Single `monoid_op` to fold the new token into the state → one matmul readout (illustrated below) |
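The decode row is the key one: each generated token costs one state fold plus one readout, independent of how much context has already been consumed. A hedged illustration with random stand-ins for the layer's projections (`monoid_op` here is a local helper mirroring the fold described above, not the repository's function):
```python
# The decode path in code: one fold plus one matmul readout per new token,
# regardless of how many tokens came before. Inputs are random stand-ins.
import torch

d_k = d_v = 64

def monoid_op(S, log_alpha, k, v):
    # S <- diag(α)·S + k ⊗ v : the O(1) state update
    return log_alpha.exp().unsqueeze(-1) * S + torch.outer(k, v)

S = torch.zeros(d_k, d_v)                     # cached state: fixed 64×64 per head
for step in range(1000):                      # generate 1000 tokens...
    log_alpha = -torch.rand(d_k)              # stand-in for -softplus(decay_proj(x))
    k, q = torch.rand(d_k), torch.randn(d_k)  # stand-ins for key / query projections
    v = torch.randn(d_v)                      # stand-in for value projection
    S = monoid_op(S, log_alpha, k, v)         # O(1) fold of the new token
    o = q @ S                                 # constant-cost readout o_t = q_t · S_t
assert S.shape == (d_k, d_v)                  # ...and the cache never grows
```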
## Model Details
| Parameter | Value |
|---|---|
| Model | `NoesisLab/Spartacus-1B-Instruct` |
| Architecture | MonoidForCausalLM |
| Parameters | ~1.34B (tied embeddings) |
| Hidden size | 2048 |
| Intermediate size (MLP) | 8192 |
| Layers | 16 |
| Attention heads | 32 |
| Head dimension | 64 |
| Decay gate | Vector decay, d=64 per head |
| State matrix per head | 64 × 64 = 4,096 floats |
| Vocabulary | 128,256 (Llama-3.2 tokenizer) |
| Precision | bfloat16 |
## Benchmarks (0-shot)
| Task | Metric | Value | Stderr |
|---|---|---|---|
| ARC-Challenge | acc_norm | 0.3063 | ±0.0135 |
| ARC-Easy | acc | 0.5518 | ±0.0102 |
| HellaSwag | acc_norm | 0.4610 | ±0.0050 |
| PIQA | acc_norm | 0.6915 | ±0.0108 |
| WinoGrande | acc | 0.5225 | ±0.0140 |
### Comparison with ~1B Baselines (acc_norm, 0-shot)
| Task | Spartacus-1B | TinyLlama-1.1B | Llama 3.2-1B | Mamba-1.4B | RWKV-6-1.6B |
|---|---|---|---|---|---|
| ARC-C | **0.3063** | 0.3268 | ~0.359 | 0.284 | ~0.301 |
| ARC-E | **0.5518** | 0.5547 | ~0.752 | 0.512 | ~0.530 |
| HellaSwag | **0.4610** | 0.4670 | ~0.546 | 0.435 | ~0.450 |
| PIQA | **0.6915** | 0.7210 | ~0.740 | 0.655 | ~0.670 |
| WinoGrande | **0.5225** | 0.5040 | ~0.592 | 0.510 | ~0.515 |
> Spartacus achieves competitive performance with sub-quadratic models (Mamba, RWKV) while maintaining **O(1) inference time and memory per token**. Scores marked with ~ are approximate community-reported values.
## Parallel Scan Implementation
The `monoid_scan_cuda.py` module provides a Triton JIT-compiled parallel prefix scan for the vector-decay monoid (a pure-PyTorch reference version is sketched after this list):
- **Grid**: `(B*H*D_k, ceil(D_v/BLOCK_DV))`, i.e. one program per state-matrix row
- **Forward**: Sequential scan along T per row, parallelized across all (batch, head, d_k) dimensions
- **Backward**: Reverse-order adjoint scan with per-row D_v reduction (minimal atomic_add)
- **Fallback**: Pure PyTorch sequential scan for CPU/MPS
- **Auto-dispatch**: CUDA → Triton kernel, otherwise → PyTorch fallback
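For reference, here is a pure-PyTorch sequential scan in the spirit of the described fallback path. The `(B, H, T, D_k, D_v)` layout and the function name are assumptions for illustration, not the interface of `monoid_scan_cuda.py`:
```python
# Pure-PyTorch reference for the vector-decay scan that the Triton kernel parallelizes:
# sequential along T, vectorized across batch, heads and D_k (as the bullets describe).
import torch

def monoid_scan_reference(log_alpha, kv, h0=None):
    """log_alpha: (B, H, T, D_k); kv: (B, H, T, D_k, D_v) = k_t ⊗ v_t; h0: (B, H, D_k, D_v)."""
    B, H, T, Dk, Dv = kv.shape
    S = kv.new_zeros(B, H, Dk, Dv) if h0 is None else h0.clone()
    alpha = log_alpha.exp()                            # back from log-space decay
    states = []
    for t in range(T):
        S = alpha[:, :, t, :, None] * S + kv[:, :, t]  # S_t = diag(α_t)·S_{t-1} + k_t ⊗ v_t
        states.append(S)
    return torch.stack(states, dim=2)                  # (B, H, T, D_k, D_v)

# Readout after the scan: o_t = q_t · S_t at every position.
B, H, T, Dk, Dv = 1, 2, 16, 64, 64
log_alpha = -torch.rand(B, H, T, Dk)
k, v, q = torch.rand(B, H, T, Dk), torch.randn(B, H, T, Dv), torch.randn(B, H, T, Dk)
S_all = monoid_scan_reference(log_alpha, torch.einsum("bhtk,bhtv->bhtkv", k, v))
o = torch.einsum("bhtk,bhtkv->bhtv", q, S_all)
print(o.shape)   # torch.Size([1, 2, 16, 64])
```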
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"NoesisLab/Spartacus-1B-Instruct",
trust_remote_code=True,
torch_dtype="bfloat16",
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Spartacus-1B-Instruct")
messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## File Structure
```
MonoidForCausalLM.py # Model architecture (MonoidConfig, MonoidAttention, MonoidForCausalLM)
monoid_scan_cuda.py # Triton JIT parallel prefix scan (vector decay) + PyTorch fallback
model.safetensors # Model weights (bfloat16)
config.json # Model configuration
tokenizer.json # Llama-3.2 tokenizer
```
## Citation
```bibtex
@software{spartacus2025,
title={Spartacus: Causal Monoid Language Model with O(1) Inference},
author={NoesisLab},
year={2025},
url={https://huggingface.co/NoesisLab/Spartacus-1B-Instruct},
description={Replaces softmax attention with vector-decay monoid state compression for constant-time, constant-memory autoregressive generation}
}
```
## License
Apache 2.0