Spartacus-1B-Instruct β€” Causal Monoid Language Model

A 1.3B parameter language model that replaces softmax attention with causal monoid state compression, achieving O(1) time per token and O(1) memory at inference β€” regardless of sequence length.

Monoid Attention β€” Internal Structure

                          MonoidAttention (per layer, per head)
 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 β”‚                                                                         β”‚
 β”‚   x_t ∈ R^{2048}                                                       β”‚
 β”‚    β”‚                                                                    β”‚
 β”‚    β”œβ”€β”€> q_proj ──> RMSNorm ──> q_t ∈ R^d          (query, scaled 1/√d) β”‚
 β”‚    β”‚                                                                    β”‚
 β”‚    β”œβ”€β”€> k_proj ──> RMSNorm ──> SiLU ──> k_t ∈ R^d (key, non-negative) β”‚
 β”‚    β”‚                                                                    β”‚
 β”‚    β”œβ”€β”€> v_proj ──> v_t ∈ R^d                       (value)             β”‚
 β”‚    β”‚                                                                    β”‚
 β”‚    └──> decay_proj ──> -Softplus ──> log Ξ±_t ∈ R^d (vector decay gate) β”‚
 β”‚                                                                         β”‚
 β”‚         k_t βŠ— v_t                                                       β”‚
 β”‚            β”‚             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
 β”‚            β”‚             β”‚  State Matrix S_t ∈ R^{d x d}   β”‚            β”‚
 β”‚            v             β”‚  "Compressed causal history"    β”‚            β”‚
 β”‚    S_t = diag(Ξ±_t) Β· S_{t-1} + k_t βŠ— v_t                 β”‚            β”‚
 β”‚            β”‚             β”‚  Ξ±_t ∈ (0,1]^d per dimension    β”‚            β”‚
 β”‚            β”‚             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
 β”‚            v                                                            β”‚
 β”‚    o_t = q_t Β· S_t ──> o_proj ──> output                               β”‚
 β”‚                                                                         β”‚
 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
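
For reference, here is a minimal PyTorch sketch of the per-token, per-head update shown in the diagram. Variable names, shapes, and the random weights are illustrative assumptions, not the repository's actual MonoidAttention code.

import torch
import torch.nn.functional as F

# Minimal sketch of one head's per-token update as drawn above (illustrative only).
d = 64                                      # head dimension
x_t = torch.randn(2048)                     # token hidden state
Wq, Wk, Wv, Wd = (torch.randn(d, 2048) * 0.02 for _ in range(4))

def rmsnorm(u, eps=1e-6):
    return u * torch.rsqrt(u.pow(2).mean() + eps)

q_t = rmsnorm(Wq @ x_t) / d ** 0.5          # query, scaled 1/sqrt(d)
k_t = F.silu(rmsnorm(Wk @ x_t))             # key (RMSNorm, then SiLU)
v_t = Wv @ x_t                              # value
alpha_t = torch.exp(-F.softplus(Wd @ x_t))  # vector decay gate, alpha in (0, 1]

S_prev = torch.zeros(d, d)                  # compressed causal history (the learnable h0 at t=0)
S_t = alpha_t[:, None] * S_prev + torch.outer(k_t, v_t)  # S_t = diag(alpha_t) S_{t-1} + k_t outer v_t
o_t = q_t @ S_t                             # state readout; o_proj follows in the full model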

Key Properties

| Property | Transformer (Llama) | Spartacus (Monoid) |
|---|---|---|
| Inference time per token | O(T) β€” scans full KV-cache | O(1) β€” single state update |
| Inference memory per layer | O(T) β€” stores all past K,V | O(1) β€” fixed dΓ—d state matrix |
| Sequence length extrapolation | Degrades beyond training length | Unlimited β€” state size is constant |
| Causality | Imposed via attention mask | Built into the recurrence |
| Training complexity | O(TΒ²) | O(T) via parallel prefix scan |

The Monoid Recurrence

Standard attention computes:

o_t = Ξ£_{i≀t} softmax(q_t Β· k_i) v_i    β€” requires O(T) KV-cache

Monoid attention compresses the entire causal history into a fixed-size state matrix S_t per head:

S_t = diag(Ξ±_t) Β· S_{t-1} + k_t βŠ— v_t     β€” vector decay monoid recurrence
o_t = q_t Β· S_t                              β€” state readout

This is a monoid because the binary operator (log_Ξ±, S) βŠ• (log_Ξ², X) = (log_Ξ± + log_Ξ², exp(log_Ξ²)Β·S + X) is associative and has an identity element (log_Ξ± = 0, S = 0), enabling an O(T) parallel prefix scan for training and an O(1) sequential update for inference.
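
A minimal PyTorch sketch of this operator, checking the associativity that licenses the prefix scan. The pairing and tensor layout here are assumptions for illustration, not the repository's actual monoid_op.

import torch

def monoid_op(a, b):
    # Combine two elements (log_alpha, S): apply a first, then b.
    log_a, S_a = a
    log_b, S_b = b
    return log_a + log_b, torch.exp(log_b) * S_a + S_b

d = 4
e1, e2, e3 = [(-torch.rand(d, 1), torch.randn(d, d)) for _ in range(3)]  # (log_alpha_t, k_t outer v_t)

# Associativity: (e1 + e2) + e3 == e1 + (e2 + e3), so any bracketing of the T per-token
# elements yields the same S_T. That is what allows a parallel prefix scan at training
# time and a one-element-per-step fold at inference time.
left = monoid_op(monoid_op(e1, e2), e3)
right = monoid_op(e1, monoid_op(e2, e3))
assert torch.allclose(left[0], right[0]) and torch.allclose(left[1], right[1], atol=1e-5)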

Vector Decay β€” Per-Dimension Memory Lifetimes

Unlike scalar decay (one α per head), Spartacus uses vector decay: each dimension of the d-vector has its own independent decay rate α_t[i] ∈ (0, 1]:

S_t[i,j] = Ξ±_t[i] Β· S_{t-1}[i,j] + k_t[i] Β· v_t[j]

This allows different feature dimensions to specialize:

  • Fast-decaying dimensions (Ξ± β‰ˆ 0) β€” local syntax, punctuation, function words
  • Slow-decaying dimensions (Ξ± β‰ˆ 1) β€” entity memory, topic tracking, long-range facts

The decay gate uses Negative Softplus activation:

log Ξ±_t = -softplus(WΒ·x_t + b)

| Property | Value |
|---|---|
| Range | Ξ± ∈ (0, 1] β€” bounded, no explosion |
| Perfect memory | WΒ·x β†’ -∞ ⟹ softplus β†’ 0 ⟹ Ξ± β†’ 1 (lossless retention) |
| Full forgetting | WΒ·x β†’ +∞ ⟹ softplus β†’ ∞ ⟹ Ξ± β†’ 0 (complete reset) |
| Stability | Ξ± ≀ 1 by construction β€” no divergence regardless of input magnitude |
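
A quick numerical check of the table, using only the formula above (the pre-activation values are made up for illustration):

import torch
import torch.nn.functional as F

pre = torch.tensor([-20.0, 0.0, 20.0])   # decay_proj pre-activations W x_t + b (illustrative)
alpha = torch.exp(-F.softplus(pre))
print(alpha)                              # approximately tensor([1.0000, 0.5000, 0.0000])
# large negative pre-activation: alpha -> 1 (perfect memory)
# large positive pre-activation: alpha -> 0 (full forgetting); alpha never exceeds 1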

Attention Mask β€” Padding-Aware Recurrence

The monoid recurrence correctly handles attention_mask for padded batches (e.g., left-padding during generate()). For PAD positions (mask=0):

log_Ξ± = 0    β†’  Ξ± = 1  (preserve state unchanged)
k = 0, v = 0 β†’  kv = 0 (no information injected)

Net effect: S_t = 1Β·S_{t-1} + 0 = S_{t-1} β€” PAD acts as the monoid identity element, completely invisible to the recurrence. This ensures identical outputs whether inputs are padded or not.
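
A small sketch of this identity behaviour with illustrative tensors (not the repository's actual masking code):

import torch

d = 8
S_prev = torch.randn(d, d)          # state entering the PAD position
mask = 0.0                          # attention_mask value at this position

log_alpha = -torch.rand(d) * mask   # gate output zeroed at PAD -> alpha = 1 (preserve state)
k = torch.randn(d) * mask           # key zeroed at PAD -> nothing written
v = torch.randn(d) * mask           # value zeroed at PAD -> nothing written

S_new = torch.exp(log_alpha)[:, None] * S_prev + torch.outer(k, v)
assert torch.allclose(S_new, S_prev)  # PAD is the monoid identity: invisible to the recurrence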

Design Choices

  • SiLU-activated keys: k = SiLU(k_proj(x)) ensures non-negative keys, making the state matrix S positive semi-definite (PSD). This prevents "feature erasure" where one token's contribution cancels another's
  • QK-Norm: RMSNorm on both q and k before readout, stabilizing the scale of qΒ·S when the state matrix accumulates many outer products
  • Log-space decay: Working in log-space log(Ξ±) avoids numerical underflow when Ξ±^T β†’ 0 for long sequences
  • Learnable h0: The initial state Sβ‚€ = h0 is a learnable parameter (zero-initialized), acting as a compressed "system prompt"
  • Negative Softplus gate: Ensures Ξ± ∈ (0, 1] by construction β€” allows perfect memory (Ξ±=1) while preventing state explosion (Ξ±>1)

Three Forward Paths

| Path | Condition | Complexity | Description |
|---|---|---|---|
| Training | use_cache=False | O(T) parallel scan | Vectorized outer products β†’ parallel prefix scan β†’ vectorized readout |
| Inference prefill | use_cache=True, T>1 | O(T) parallel scan | Same as training, plus extracts the final state S_T for the cache |
| Inference decode | use_cache=True, T=1 | O(1) monoid_op | Single monoid_op folds the new token into the state β†’ one matmul readout |
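
A hedged sketch of how such a dispatch could look in plain PyTorch. The function signature and tensor layout are assumptions, and the O(T) branch is written as an explicit loop purely to make the recurrence visible; the actual implementation uses the parallel prefix scan described below.

import torch

def monoid_forward(log_alpha, kv, q, state=None, use_cache=False):
    """Illustrative dispatch over the three paths (not the repo's actual forward signature).
    log_alpha: (B, H, T, d, 1) per-row decay in log space
    kv:        (B, H, T, d, d) outer products k_t outer v_t
    q:         (B, H, T, d)    queries
    state:     (B, H, d, d)    cached state from a previous call, if any
    """
    B, H, T = q.shape[:3]
    if state is None:
        state = kv.new_zeros(B, H, kv.shape[-2], kv.shape[-1])   # S_0 (learnable h0 in the model)

    if use_cache and T == 1:
        # Decode path: O(1), fold the single new token into the cached state, one readout.
        state = torch.exp(log_alpha[:, :, 0]) * state + kv[:, :, 0]
        out = torch.einsum("bhd,bhde->bhe", q[:, :, 0], state).unsqueeze(2)
        return out, state

    # Training / prefill path: O(T). Written as a loop here for clarity only.
    states = []
    for t in range(T):
        state = torch.exp(log_alpha[:, :, t]) * state + kv[:, :, t]
        states.append(state)
    S = torch.stack(states, dim=2)                                # (B, H, T, d, d)
    out = torch.einsum("bhtd,bhtde->bhte", q, S)
    return out, (state if use_cache else None)                    # prefill caches the final S_T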

Model Details

| Parameter | Value |
|---|---|
| Model | NoesisLab/Spartacus-1B-Instruct |
| Architecture | MonoidForCausalLM |
| Parameters | ~1.34B (tied embeddings) |
| Hidden size | 2048 |
| Intermediate size (MLP) | 8192 |
| Layers | 16 |
| Attention heads | 32 |
| Head dimension | 64 |
| Decay gate | Vector decay, d=64 per head |
| State matrix per head | 64 Γ— 64 = 4,096 floats |
| Vocabulary | 128,256 (Llama-3.2 tokenizer) |
| Precision | bfloat16 |

Benchmarks (0-shot)

| Task | Metric | Value | Stderr |
|---|---|---|---|
| ARC-Challenge | acc_norm | 0.3063 | Β±0.0135 |
| ARC-Easy | acc | 0.5518 | Β±0.0102 |
| HellaSwag | acc_norm | 0.4610 | Β±0.0050 |
| PIQA | acc_norm | 0.6915 | Β±0.0108 |
| WinoGrande | acc | 0.5225 | Β±0.0140 |

Comparison with ~1B Baselines (0-shot, metrics as in the table above)

| Task | Spartacus-1B | TinyLlama-1.1B | Llama 3.2-1B | Mamba-1.4B | RWKV-6-1.6B |
|---|---|---|---|---|---|
| ARC-C | 0.3063 | 0.3268 | ~0.359 | 0.284 | ~0.301 |
| ARC-E | 0.5518 | 0.5547 | ~0.752 | 0.512 | ~0.530 |
| HellaSwag | 0.4610 | 0.4670 | ~0.546 | 0.435 | ~0.450 |
| PIQA | 0.6915 | 0.7210 | ~0.740 | 0.655 | ~0.670 |
| WinoGrande | 0.5225 | 0.5040 | ~0.592 | 0.510 | ~0.515 |

Spartacus achieves competitive performance with sub-quadratic models (Mamba, RWKV) while maintaining O(1) inference time and memory per token. Scores marked with ~ are approximate community-reported values.

Parallel Scan Implementation

The monoid_scan_cuda.py module provides a Triton JIT-compiled parallel prefix scan for the vector-decay monoid:

  • Grid: (B*H*D_k, ceil(D_v/BLOCK_DV)) β€” one program per state matrix row
  • Forward: Sequential scan along T per row, parallelized across all (batch, head, d_k) dimensions
  • Backward: Reverse-order adjoint scan with per-row D_v reduction (minimal atomic_add)
  • Fallback: Pure PyTorch sequential scan for CPU/MPS
  • Auto-dispatch: CUDA β†’ Triton kernel, otherwise β†’ PyTorch fallback

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "NoesisLab/Spartacus-1B-Instruct",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Spartacus-1B-Instruct")

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

File Structure

MonoidForCausalLM.py       # Model architecture (MonoidConfig, MonoidAttention, MonoidForCausalLM)
monoid_scan_cuda.py        # Triton JIT parallel prefix scan (vector decay) + PyTorch fallback
model.safetensors          # Model weights (bfloat16)
config.json                # Model configuration
tokenizer.json             # Llama-3.2 tokenizer

Citation

@software{spartacus2025,
  title={Spartacus: Causal Monoid Language Model with O(1) Inference},
  author={NoesisLab},
  year={2025},
  url={https://huggingface.co/NoesisLab/Spartacus-1B-Instruct},
  description={Replaces softmax attention with vector-decay monoid state compression for constant-time, constant-memory autoregressive generation}
}

License

Apache 2.0
