---
library_name: transformers
license: apache-2.0
language:
- en
tags:
- monoid
- causal-lm
- linear-attention
- state-space
- O(1)-inference
- vector-decay
- reasoning
pipeline_tag: text-generation
model-index:
- name: Spartacus-1B-Instruct
  results: []
---

# Spartacus-1B-Instruct: Causal Monoid Language Model
A 1.3B-parameter language model that replaces softmax attention with causal monoid state compression, achieving O(1) time per token and O(1) memory at inference, regardless of sequence length.
## Monoid Attention: Internal Structure
```
MonoidAttention (per layer, per head)

  x_t ∈ R^{2048}
   │
   ├──> q_proj ──> RMSNorm ─────────────> q_t ∈ R^d       (query, scaled 1/√d)
   │
   ├──> k_proj ──> RMSNorm ──> SiLU ────> k_t ∈ R^d       (key, non-negative)
   │
   ├──> v_proj ─────────────────────────> v_t ∈ R^d       (value)
   │
   └──> decay_proj ──> -Softplus ───────> log α_t ∈ R^d   (vector decay gate)

        k_t ⊗ v_t
            │        ┌─────────────────────────────────────────┐
            │        │  State matrix S_t ∈ R^{d×d}             │
            └──────> │  "Compressed causal history"            │
                     │  S_t = diag(α_t) · S_{t-1} + k_t ⊗ v_t  │
                     │  α_t ∈ (0, 1]^d per dimension           │
                     └─────────────────────────────────────────┘
                                         │
                                         v
                     o_t = q_t · S_t ──> o_proj ──> output
```
## Key Properties
| Property | Transformer (Llama) | Spartacus (Monoid) |
|---|---|---|
| Inference time per token | O(T): scans the full KV-cache | O(1): single state update |
| Inference memory per layer | O(T): stores all past K, V | O(1): fixed d×d state matrix |
| Sequence length extrapolation | Degrades beyond training length | Unlimited: state size is constant |
| Causality | Imposed via attention mask | Built into the recurrence |
| Training complexity | O(T²) | O(T) via parallel prefix scan |
## The Monoid Recurrence
Standard attention computes:
o_t = Σ_{i≤t} softmax(q_t · k_i) v_i    (requires an O(T) KV-cache)
Monoid attention compresses the entire causal history into a fixed-size state matrix S_t per head:
S_t = diag(α_t) · S_{t-1} + k_t ⊗ v_t    (vector-decay monoid recurrence)
o_t = q_t · S_t                          (state readout)
This is a monoid because the binary operator (log α, S) ∘ (log β, X) = (log α + log β, exp(log β)·S + X) is associative, enabling an O(T) parallel prefix scan for training and an O(1) sequential update for inference.
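The operator above can be sketched in a few lines of PyTorch. The snippet below is illustrative only (the names `monoid_op`, `token_element`, and `decode_step` are placeholders, not necessarily the repository's API): it implements the vector-decay combine step, checks its associativity numerically, and shows the O(1) decode update.

```python
# Illustrative sketch of the vector-decay monoid; names are placeholders, not the repo's API.
import torch

d = 4  # toy head dimension

def monoid_op(a, b):
    """Combine two (log_alpha, S) segments; the later segment's decay scales the earlier state."""
    log_a, S_a = a
    log_b, S_b = b
    return log_a + log_b, torch.exp(log_b).unsqueeze(-1) * S_a + S_b

def token_element(log_alpha_t, k_t, v_t):
    """Lift one token into the monoid: (log alpha_t, k_t ⊗ v_t)."""
    return log_alpha_t, torch.outer(k_t, v_t)

def decode_step(S_prev, log_alpha_t, k_t, v_t, q_t):
    """O(1) update: fold one token into the running state, then read out o_t = q_t · S_t."""
    _, S_t = monoid_op((torch.zeros(d), S_prev), token_element(log_alpha_t, k_t, v_t))
    return q_t @ S_t, S_t

# Associativity check: (a ∘ b) ∘ c == a ∘ (b ∘ c), which is what licenses the parallel scan.
a, b, c = [token_element(-torch.rand(d), torch.rand(d), torch.randn(d)) for _ in range(3)]
lhs = monoid_op(monoid_op(a, b), c)
rhs = monoid_op(a, monoid_op(b, c))
assert torch.allclose(lhs[0], rhs[0]) and torch.allclose(lhs[1], rhs[1])
```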
## Vector Decay: Per-Dimension Memory Lifetimes
Unlike scalar decay (one α per head), Spartacus uses vector decay: each dimension of the d-vector has its own independent decay rate α_t[i] ∈ (0, 1]:
S_t[i,j] = α_t[i] · S_{t-1}[i,j] + k_t[i] · v_t[j]
This allows different feature dimensions to specialize:
- Fast-decaying dimensions (α → 0): local syntax, punctuation, function words
- Slow-decaying dimensions (α → 1): entity memory, topic tracking, long-range facts
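As a rough sense of scale (toy numbers, not learned decay values), a dimension with decay α retains a fraction α^n of a token's contribution after n further steps, so its memory half-life is log(0.5)/log(α) tokens:

```python
import math

# Toy numbers, not learned values. Retention after n steps is alpha**n;
# the half-life solves alpha**n = 0.5.
for alpha in (0.5, 0.9, 0.999):
    half_life = math.log(0.5) / math.log(alpha)
    print(f"alpha={alpha:<6} retention after 100 tokens: {alpha**100:.3g}, "
          f"half-life ≈ {half_life:.0f} tokens")
```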
The decay gate uses a negative softplus activation:
log α_t = -softplus(W·x_t + b)
| Property | Value |
|---|---|
| Range | α ∈ (0, 1]: bounded, no explosion |
| Perfect memory | W·x → -∞ ⟹ softplus → 0 ⟹ α → 1 (lossless retention) |
| Full forgetting | W·x → +∞ ⟹ softplus → ∞ ⟹ α → 0 (complete reset) |
| Stability | α ≤ 1 by construction: no divergence regardless of input magnitude |
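A minimal sketch of this gate, assuming a plain linear projection (the weights below are random stand-ins, not the trained decay projection):

```python
import torch
import torch.nn.functional as F

hidden, d = 2048, 64
decay_proj = torch.nn.Linear(hidden, d)   # stand-in projection, random weights (not trained parameters)

x_t = torch.randn(hidden)
log_alpha = -F.softplus(decay_proj(x_t))  # log α_t = -softplus(W·x_t + b) ≤ 0
alpha = log_alpha.exp()                   # hence α_t ∈ (0, 1] by construction

assert (alpha > 0).all() and (alpha <= 1).all()
```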
## Attention Mask: Padding-Aware Recurrence
The monoid recurrence correctly handles `attention_mask` for padded batches (e.g., left-padding during `generate()`). For PAD positions (mask = 0):
log α = 0 ⟹ α = 1 (state preserved unchanged)
k = 0, v = 0 ⟹ k ⊗ v = 0 (no information injected)
Net effect: S_t = 1·S_{t-1} + 0 = S_{t-1}. PAD acts as the monoid identity element, completely invisible to the recurrence, which ensures identical outputs whether inputs are padded or not.
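A toy check of this identity property (a sketch of the masking logic described above, not the model's code):

```python
# Illustrative check; not the model's masking code.
import torch

d = 8
S_prev = torch.randn(d, d)

alpha_pad = torch.ones(d)        # log α = 0  ⟹  α = 1 at a PAD position
k_pad = torch.zeros(d)
v_pad = torch.zeros(d)

# S_t = diag(α) · S_{t-1} + k ⊗ v, written as row-scaling plus an outer product.
S_t = alpha_pad.unsqueeze(-1) * S_prev + torch.outer(k_pad, v_pad)
assert torch.allclose(S_t, S_prev)   # the PAD token is invisible to the state
```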
## Design Choices
- SiLU-activated keys: `k = SiLU(k_proj(x))` ensures non-negative keys, making the state matrix S positive semi-definite (PSD). This prevents "feature erasure", where one token's contribution cancels another's.
- QK-Norm: RMSNorm on both q and k before readout, stabilizing the scale of q·S when the state matrix accumulates many outer products.
- Log-space decay: working in log-space with log(α) avoids numerical underflow when α^T → 0 for long sequences (see the sketch after this list).
- Learnable h0: the initial state S₀ = h0 is a learnable parameter (zero-initialized), acting as a compressed "system prompt".
- Negative softplus gate: ensures α ∈ (0, 1] by construction, allowing perfect memory (α = 1) while preventing state explosion (α > 1).
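To illustrate the log-space point (an illustrative sketch, not the model's code): forming α^T by repeated multiplication underflows float32 for long sequences, while accumulating log α keeps the quantity representable:

```python
import torch

T = 20_000
alpha = torch.tensor(0.99)

# Forming α^T by repeated multiplication underflows float32 long before T steps:
direct = torch.cumprod(alpha.repeat(T), dim=0)
print(direct[-1])                 # tensor(0.) -- α^T has underflowed

# Accumulating log α instead keeps the quantity representable:
log_total = T * torch.log(alpha)
print(log_total)                  # ≈ -201.0, an ordinary float
```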
## Three Forward Paths
| Path | Condition | Complexity | Description |
|---|---|---|---|
| Training | `use_cache=False` | O(T) parallel scan | Vectorized outer products → parallel prefix scan → vectorized readout |
| Inference prefill | `use_cache=True`, T > 1 | O(T) parallel scan | Same as training, plus extraction of the final state S_T for the cache |
| Inference decode | `use_cache=True`, T = 1 | O(1) monoid_op | Single monoid_op folds the new token into the state, followed by one matmul readout |
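Schematically, path selection depends only on `use_cache` and the number of incoming tokens; the sketch below mirrors the table (a hypothetical helper, not the model's actual forward logic):

```python
def select_forward_path(T: int, use_cache: bool) -> str:
    """Mirror the dispatch in the table above (illustrative, not the model's code)."""
    if not use_cache:
        return "training: O(T) parallel prefix scan over all T tokens"
    if T > 1:
        return "prefill: O(T) parallel scan, cache the final state S_T"
    return "decode: O(1) monoid_op folding one token into the cached state"

for T, use_cache in [(1024, False), (1024, True), (1, True)]:
    print(f"T={T:<5} use_cache={use_cache} -> {select_forward_path(T, use_cache)}")
```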
## Model Details
| Parameter | Value |
|---|---|
| Model | NoesisLab/Spartacus-1B-Instruct |
| Architecture | MonoidForCausalLM |
| Parameters | ~1.34B (tied embeddings) |
| Hidden size | 2048 |
| Intermediate size (MLP) | 8192 |
| Layers | 16 |
| Attention heads | 32 |
| Head dimension | 64 |
| Decay gate | Vector decay, d=64 per head |
| State matrix per head | 64 × 64 = 4,096 floats |
| Vocabulary | 128,256 (Llama-3.2 tokenizer) |
| Precision | bfloat16 |
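A back-of-the-envelope check of the constant recurrent state implied by the table above (assuming the state matrices are stored in bfloat16):

```python
# Values taken from the table above; bf16 state storage is an assumption.
layers, heads, d = 16, 32, 64
bytes_per_bf16 = 2

state_bytes = layers * heads * d * d * bytes_per_bf16
print(f"{state_bytes / 2**20:.1f} MiB of recurrent state, independent of sequence length")
# 16 * 32 * 4096 * 2 bytes = 4.0 MiB
```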
## Benchmarks (0-shot)
| Task | Metric | Value | Stderr |
|---|---|---|---|
| ARC-Challenge | acc_norm | 0.3063 | ±0.0135 |
| ARC-Easy | acc | 0.5518 | ±0.0102 |
| HellaSwag | acc_norm | 0.4610 | ±0.0050 |
| PIQA | acc_norm | 0.6915 | ±0.0108 |
| WinoGrande | acc | 0.5225 | ±0.0140 |
## Comparison with ~1B Baselines (acc_norm, 0-shot)
| Task | Spartacus-1B | TinyLlama-1.1B | Llama 3.2-1B | Mamba-1.4B | RWKV-6-1.6B |
|---|---|---|---|---|---|
| ARC-C | 0.3063 | 0.3268 | ~0.359 | 0.284 | ~0.301 |
| ARC-E | 0.5518 | 0.5547 | ~0.752 | 0.512 | ~0.530 |
| HellaSwag | 0.4610 | 0.4670 | ~0.546 | 0.435 | ~0.450 |
| PIQA | 0.6915 | 0.7210 | ~0.740 | 0.655 | ~0.670 |
| WinoGrande | 0.5225 | 0.5040 | ~0.592 | 0.510 | ~0.515 |
Spartacus achieves performance competitive with other sub-quadratic models (Mamba, RWKV) while maintaining O(1) inference time and memory per token. Scores marked with ~ are approximate community-reported values.
## Parallel Scan Implementation
The monoid_scan_cuda.py module provides a Triton JIT-compiled parallel prefix scan for the vector-decay monoid:
- Grid: `(B*H*D_k, ceil(D_v/BLOCK_DV))`, one program per state-matrix row
- Forward: sequential scan along T per row, parallelized across all (batch, head, d_k) dimensions
- Backward: reverse-order adjoint scan with per-row D_v reduction (minimal atomic_add)
- Fallback: pure PyTorch sequential scan for CPU/MPS
- Auto-dispatch: CUDA → Triton kernel; otherwise → PyTorch fallback
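For reference, a minimal sequential scan equivalent to the recurrence (a sketch analogous to, but not copied from, the PyTorch fallback; shapes and names are our own assumptions):

```python
import torch

def sequential_monoid_scan(log_alpha, k, v, S0=None):
    """Reference sketch (not monoid_scan_cuda.py): S_t = diag(α_t)·S_{t-1} + k_t ⊗ v_t.

    log_alpha, k: (B, H, T, d_k); v: (B, H, T, d_v). Returns states (B, H, T, d_k, d_v).
    """
    B, H, T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(B, H, d_k, d_v, dtype=k.dtype, device=k.device) if S0 is None else S0
    states = []
    for t in range(T):
        alpha_t = log_alpha[:, :, t].exp().unsqueeze(-1)             # (B, H, d_k, 1)
        kv_t = k[:, :, t].unsqueeze(-1) * v[:, :, t].unsqueeze(-2)   # outer product k_t ⊗ v_t
        S = alpha_t * S + kv_t
        states.append(S)
    return torch.stack(states, dim=2)

# Readout o_t = q_t · S_t at every position.
B, H, T, d = 1, 2, 5, 4
q, k, v = torch.randn(B, H, T, d), torch.rand(B, H, T, d), torch.randn(B, H, T, d)
log_alpha = -torch.rand(B, H, T, d)
S = sequential_monoid_scan(log_alpha, k, v)
o = torch.einsum("bhtk,bhtkv->bhtv", q, S)
```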
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "NoesisLab/Spartacus-1B-Instruct",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Spartacus-1B-Instruct")

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## File Structure
```
MonoidForCausalLM.py    # Model architecture (MonoidConfig, MonoidAttention, MonoidForCausalLM)
monoid_scan_cuda.py     # Triton JIT parallel prefix scan (vector decay) + PyTorch fallback
model.safetensors       # Model weights (bfloat16)
config.json             # Model configuration
tokenizer.json          # Llama-3.2 tokenizer
```
## Citation
```bibtex
@software{spartacus2025,
  title       = {Spartacus: Causal Monoid Language Model with O(1) Inference},
  author      = {NoesisLab},
  year        = {2025},
  url         = {https://huggingface.co/NoesisLab/Spartacus-1B-Instruct},
  description = {Replaces softmax attention with vector-decay monoid state compression for constant-time, constant-memory autoregressive generation}
}
```
## License
Apache 2.0