PRISM - Partitioning Residue Identity in Somatic Maturation

PRISM is an antibody language model that jointly predicts amino acid identity and germline/non-germline (GL/NGL) position classification, enabling developability-aware antibody sequence modeling.

Paper: Explicit representation of germline and non-germline residues improves antibody language modeling

Quick Start

pip install prism-antibody

Inference

import prism

# Auto-downloads from HF Hub (cached after first use)
model = prism.pretrained("RomeroLab-Duke/prism-antibody")

# Extract germline log-probabilities [L, 20]
gl = model.extract_GL_logit("EVQLVESGGGLVQPGGSLRLSCAASGFTFS...")

# Extract non-germline log-probabilities [L, 20]
ngl = model.extract_NGL_logit("EVQLVESGGGLVQPGGSLRLSCAASGFTFS...")

# Extract marginalized log-probabilities (logsumexp over the GL and NGL values) [L, 20]
marg = model.extract_marginalized_logit("EVQLVESGGGLVQPGGSLRLSCAASGFTFS...")

# Alpha gating values (GL/NGL mixture weights) [L]
alpha = model.extract_alpha("EVQLVESGGGLVQPGGSLRLSCAASGFTFS...")

# Mean-pooled embedding [H]
emb = model.extract_embedding("EVQLVESGGGLVQPGGSLRLSCAASGFTFS...")

# Perplexity (scalar)
ppl = model.perplexity("EVQLVESGGGLVQPGGSLRLSCAASGFTFS...")

# Score mutations (log-likelihood ratio)
score = model.score_mutations("EVQLVESGGGLVQ...", "EVQLVDSGGGLVQ...")

All methods accept a single string or a list of strings.
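
The marginalization and mutation scoring listed above can be illustrated with plain NumPy. This is a minimal sketch under stated assumptions, not PRISM's actual implementation: it assumes the GL and NGL arrays hold log-probabilities that combine by an elementwise logsumexp (as the comment on extract_marginalized_logit suggests), and that a mutation score is the summed log-likelihood ratio over substituted positions. The helper names marginalize and llr_score are hypothetical, and the arrays are toy stand-ins for model outputs.

```python
import numpy as np

AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"  # alphabetical column order, as listed in this README


def marginalize(gl_logp, ngl_logp):
    """Elementwise log(exp(gl) + exp(ngl)) over two [L, 20] arrays."""
    return np.logaddexp(gl_logp, ngl_logp)


def llr_score(logp, wild_type, mutant):
    """Sum log p(mutant residue) - log p(wild-type residue) over substituted positions."""
    if len(wild_type) != len(mutant):
        raise ValueError("this sketch handles substitutions only; lengths must match")
    score = 0.0
    for pos, (wt, mt) in enumerate(zip(wild_type, mutant)):
        if wt != mt:
            score += logp[pos, AA_ORDER.index(mt)] - logp[pos, AA_ORDER.index(wt)]
    return float(score)


# Toy stand-ins for model outputs (assumption): uniform joint log-probabilities
# over 20 residues and 2 origins, so every entry is log(1/40).
L = 5
gl = np.full((L, 20), np.log(1 / 40))
ngl = np.full((L, 20), np.log(1 / 40))
marg = marginalize(gl, ngl)  # every entry becomes log(1/20)

# With uniform toy probabilities, any substitution has a log-likelihood ratio of zero.
print(llr_score(marg, "EVQLV", "EVQLD"))
```

With real outputs, a positive score would mean the mutant residues are more likely than the wild-type residues under whichever log-probability array is passed in. Whether model.score_mutations uses the marginalized array or another scheme is not stated here.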

Finetuning on Your Data

import prism

model = prism.pretrained("RomeroLab-Duke/prism-antibody")

# Finetune on your antibody data (parquet, pickle, or csv)
best_ckpt = model.finetune(
    data_path="my_antibodies.parquet",
    output_dir="outputs/my_finetune",
    max_steps=5000,
    learning_rate=1e-4,
    batch_size=32,
)

# The model is updated in place, so it can be used immediately
gl = model.extract_GL_logit("EVQLVESGGGLVQ...")

Your data file needs HEAVY_CHAIN_AA_SEQUENCE and/or LIGHT_CHAIN_AA_SEQUENCE columns. If no split column is present, a random 90/5/5 train/valid/test split is created automatically.
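
As an illustration of the expected input, the sketch below writes a CSV file with the two sequence columns using only the Python standard library, and optionally adds a 90/5/5 split. The column name "split" and the labels "train", "valid", and "test" are assumptions made for this example; the exact name and values that finetune looks for are not stated here. The sequences are placeholder fragments, not real full-length chains.

```python
import csv
import random

HEAVY = "HEAVY_CHAIN_AA_SEQUENCE"
LIGHT = "LIGHT_CHAIN_AA_SEQUENCE"


def assign_splits(n_rows, seed=0):
    """Return shuffled 'train'/'valid'/'test' labels in a 90/5/5 proportion."""
    n_train = n_rows * 90 // 100
    n_valid = n_rows * 5 // 100
    labels = ["train"] * n_train + ["valid"] * n_valid
    labels += ["test"] * (n_rows - len(labels))  # the remainder goes to test
    random.Random(seed).shuffle(labels)
    return labels


def write_dataset(path, pairs, with_split=False):
    """Write (heavy, light) pairs to CSV using the column names given in this README."""
    # "split" is a hypothetical column name chosen for this sketch.
    fieldnames = [HEAVY, LIGHT] + (["split"] if with_split else [])
    labels = assign_splits(len(pairs)) if with_split else [None] * len(pairs)
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for (heavy, light), label in zip(pairs, labels):
            row = {HEAVY: heavy, LIGHT: light}
            if with_split:
                row["split"] = label
            writer.writerow(row)


# Placeholder fragments repeated 100 times, purely to exercise the code.
pairs = [("EVQLVESGGGLVQPGGSLRLSCAASGFTFS", "DIQMTQSPSSLSASVGDRVTITC")] * 100
write_dataset("my_antibodies.csv", pairs, with_split=True)
```

If the automatic split is acceptable, the file can be written with with_split=False and the resulting path passed as data_path.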

Column Order

extract_GL_logit, extract_NGL_logit, and extract_marginalized_logit each return an array of shape [L, 20], with one row per sequence position. The 20 columns correspond to amino acids in alphabetical order of their one-letter codes:

A C D E F G H I K L M N P Q R S T V W Y

This order is also available as model.AA_ORDER.
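
The column order matters whenever a value has to be looked up by residue. The sketch below shows one way to index an [L, 20] array by amino acid, assuming NumPy-style arrays. It defines its own AA_ORDER constant rather than relying on the type of model.AA_ORDER, and the helper names observed_values and top_residues are hypothetical.

```python
import numpy as np

AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"  # same order as the table above
AA_TO_COL = {aa: col for col, aa in enumerate(AA_ORDER)}


def observed_values(scores, sequence):
    """For each position, pick the entry in the column of the residue actually present."""
    cols = np.array([AA_TO_COL[aa] for aa in sequence])
    return scores[np.arange(len(sequence)), cols]


def top_residues(scores):
    """Return the highest-scoring residue at each position as a string."""
    return "".join(AA_ORDER[col] for col in scores.argmax(axis=1))


# Toy [3, 20] array standing in for a model output (assumption, not real data).
scores = np.arange(60, dtype=float).reshape(3, 20)

print(observed_values(scores, "EVQ"))  # entries at columns 3 (E), 17 (V), 13 (Q)
print(top_residues(scores))            # the last column (Y) is largest in every toy row
```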

Model Architecture

  • Base: ESM2 (35M parameters)
  • Multi-head: AA head + Origin head + Alpha gating + Final head
  • Training: Two-stage (pretraining on 66M unpaired OAS sequences, then finetuning on paired antibodies)
  • Custom tokens: Lowercase amino acids for non-germline residues
  • Gene conditioning: V/J gene embeddings + region embeddings
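
The custom-token idea in the list above can be pictured as a case-based encoding: uppercase letters for germline residues, lowercase for non-germline ones. The sketch below is only an illustration of that convention. It assumes a per-position non-germline mask is already available, and the function names encode_origin and decode_origin are hypothetical; how PRISM derives the labels and builds its tokenizer is not described here.

```python
def encode_origin(sequence, is_ngl):
    """Lowercase the residues flagged as non-germline; keep the rest uppercase."""
    if len(sequence) != len(is_ngl):
        raise ValueError("sequence and mask must have the same length")
    return "".join(
        aa.lower() if ngl else aa.upper() for aa, ngl in zip(sequence, is_ngl)
    )


def decode_origin(tokens):
    """Recover the plain sequence and the non-germline mask from a case-encoded string."""
    return tokens.upper(), [ch.islower() for ch in tokens]


# Toy example: only the last residue is flagged as non-germline.
encoded = encode_origin("EVQLVD", [False, False, False, False, False, True])
print(encoded)  # EVQLVd

sequence, mask = decode_origin(encoded)
print(sequence, mask)
```

One property of this encoding is that residue identity and GL/NGL origin travel in a single token stream and can be separated again without loss.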

Citation

@article{prism2025,
  title={Explicit representation of germline and non-germline residues improves antibody language modeling},
  author={...},
  year={2025}
}