PRISM - Partitioning Residue Identity in Somatic Maturation
PRISM is an antibody language model that jointly predicts amino acid identity and germline/non-germline (GL/NGL) position classification, enabling developability-aware antibody sequence modeling.
Paper: Explicit representation of germline and non-germline residues improves antibody language modeling
Quick Start
pip install prism-antibody
Inference
import prism
# Auto-downloads from HF Hub (cached after first use)
model = prism.pretrained("RomeroLab-Duke/prism-antibody")
# Extract germline log-probabilities [L, 20]
gl = model.extract_GL_logit("EVQLVESGGGLVQPGGSLRLSCAASGFTFS...")
# Extract non-germline log-probabilities [L, 20]
ngl = model.extract_NGL_logit("EVQLVESGGGLVQPGGSLRLSCAASGFTFS...")
# Extract marginalized log-probabilities (logsumexp over GL and NGL) [L, 20]
marg = model.extract_marginalized_logit("EVQLVESGGGLVQPGGSLRLSCAASGFTFS...")
# Alpha gating values (GL/NGL mixture weights) [L]
alpha = model.extract_alpha("EVQLVESGGGLVQPGGSLRLSCAASGFTFS...")
# Mean-pooled embedding [H]
emb = model.extract_embedding("EVQLVESGGGLVQPGGSLRLSCAASGFTFS...")
# Perplexity (scalar)
ppl = model.perplexity("EVQLVESGGGLVQPGGSLRLSCAASGFTFS...")
# Score mutations (log-likelihood ratio)
score = model.score_mutations("EVQLVESGGGLVQ...", "EVQLVDSGGGLVQ...")
All methods accept a single string or a list of strings.
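The marginalized output is described above as a logsumexp over the GL and NGL outputs. A minimal sketch of that operation on stand-in data follows. It assumes, without confirmation from the model code, that the two [L, 20] arrays hold joint log-probabilities log p(aa, origin), so that adding them in probability space sums out the origin. The helpers logaddexp and marginalize and the toy mixture weight are hypothetical and not part of the prism API.

```python
import math

def logaddexp(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b))."""
    hi, lo = max(a, b), min(a, b)
    if hi == -math.inf:
        return -math.inf
    return hi + math.log1p(math.exp(lo - hi))

def marginalize(gl, ngl):
    """Element-wise logsumexp of two [L, 20] nested lists."""
    return [
        [logaddexp(g, n) for g, n in zip(g_row, n_row)]
        for g_row, n_row in zip(gl, ngl)
    ]

# Toy stand-in for a single position. ASSUMPTION: joint log-probs built from
# a hypothetical mixture weight and uniform per-origin distributions.
alpha = 0.8
gl_row = [math.log(alpha / 20)] * 20          # log p(aa, GL)
ngl_row = [math.log((1 - alpha) / 20)] * 20   # log p(aa, NGL)

marg = marginalize([gl_row], [ngl_row])

# Under the joint-probability assumption, each marginalized row sums to 1
# in probability space.
total = sum(math.exp(x) for x in marg[0])
print(round(total, 6))
```

If the GL and NGL outputs were instead normalized separately per origin, the same operation would give rows that sum to 2, which is a quick way to check which convention the real outputs follow.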
Finetuning on Your Data
import prism
model = prism.pretrained("RomeroLab-Duke/prism-antibody")
# Finetune on your antibody data (parquet, pickle, or csv)
best_ckpt = model.finetune(
data_path="my_antibodies.parquet",
output_dir="outputs/my_finetune",
max_steps=5000,
learning_rate=1e-4,
batch_size=32,
)
# Model is updated in place; use it immediately
gl = model.extract_GL_logit("EVQLVESGGGLVQ...")
Your data file needs HEAVY_CHAIN_AA_SEQUENCE and/or LIGHT_CHAIN_AA_SEQUENCE columns.
If no split column is present, a random 90/5/5 train/valid/test split is created automatically.
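A minimal sketch of building a compatible CSV with the standard library. The two sequence column names come from the requirement above. The split column name `split` and its labels `train`, `valid`, and `test` are assumptions, since the exact name and values expected by finetune are not stated here. The sequences are truncated toy fragments for illustration only.

```python
import csv
import io
import random

HEAVY = "HEAVY_CHAIN_AA_SEQUENCE"
LIGHT = "LIGHT_CHAIN_AA_SEQUENCE"
SPLIT = "split"  # ASSUMPTION: hypothetical column name

def assign_splits(n, seed=0):
    """Return n shuffled labels in 90/5/5 proportions (remainder goes to test)."""
    n_train = n * 90 // 100
    n_valid = n * 5 // 100
    labels = (
        ["train"] * n_train
        + ["valid"] * n_valid
        + ["test"] * (n - n_train - n_valid)
    )
    random.Random(seed).shuffle(labels)
    return labels

def write_dataset(rows, handle, seed=0):
    """Write heavy/light sequence rows plus an explicit split column as CSV."""
    writer = csv.DictWriter(handle, fieldnames=[HEAVY, LIGHT, SPLIT])
    writer.writeheader()
    for row, label in zip(rows, assign_splits(len(rows), seed)):
        writer.writerow({**row, SPLIT: label})

# Toy rows: truncated fragments, not full variable domains.
rows = [
    {HEAVY: "EVQLVESGGGLVQPGGSLRLSCAASGFTFS", LIGHT: "DIQMTQSPSSLSASVGDRVTITC"}
    for _ in range(20)
]

# An in-memory buffer keeps the sketch side-effect free; to write a real file,
# pass open("my_antibodies.csv", "w", newline="") instead.
buffer = io.StringIO()
write_dataset(rows, buffer)
print(buffer.getvalue().splitlines()[0])
```

Omitting the split column entirely is also valid according to the text above, in which case the random 90/5/5 split is created automatically.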
Column Order
extract_GL_logit, extract_NGL_logit, and extract_marginalized_logit
return [L, 20] arrays, where L is the sequence length. The 20 columns
correspond to the standard amino acids in alphabetical order by one-letter code:
A C D E F G H I K L M N P Q R S T V W Y
The same order is available as model.AA_ORDER.
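A short sketch of using this column order to index an [L, 20] output. The matrix here is a hand-made stand-in (toy values, not normalized), and the helper names are hypothetical. The substitution score shown is a common per-position heuristic and may differ from what model.score_mutations computes internally.

```python
import math

AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"  # mirrors the column order listed above
AA_INDEX = {aa: i for i, aa in enumerate(AA_ORDER)}

def residue_logprobs(logp, seq):
    """For each position, pick the log-probability of the observed residue."""
    return [row[AA_INDEX[aa]] for row, aa in zip(logp, seq)]

def substitution_llr(logp, wt_seq, mut_seq):
    """Sum of logp[mutant] - logp[wild type] over positions that differ."""
    if len(wt_seq) != len(mut_seq):
        raise ValueError("sequences must have equal length")
    return sum(
        row[AA_INDEX[m]] - row[AA_INDEX[w]]
        for row, w, m in zip(logp, wt_seq, mut_seq)
        if w != m
    )

# Stand-in [L, 20] matrix for a 3-residue sequence: uniform everywhere,
# except that position 1 favours E over D.
uniform = [math.log(1 / 20)] * 20
logp = [list(uniform) for _ in range(3)]
logp[1][AA_INDEX["E"]] = math.log(0.5)
logp[1][AA_INDEX["D"]] = math.log(0.01)

print(residue_logprobs(logp, "VES"))
print(substitution_llr(logp, "VES", "VDS"))  # negative: D scores below E here
```

The same row[index] lookup works whether the output is a nested list or an array type that supports integer indexing.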
Model Architecture
- Base: ESM2 (35M parameters)
- Multi-head: AA head + Origin head + Alpha gating + Final head
- Training: Two-stage (pretrain on 66M unpaired OAS sequences, then finetune on paired antibodies)
- Custom tokens: Lowercase amino acids for non-germline residues
- Gene conditioning: V/J gene embeddings + region embeddings
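The lowercase-token convention in the list above can be illustrated with a small sketch: residues flagged as non-germline are written in lowercase, germline residues stay uppercase. How PRISM's tokenizer handles these tokens internally is not specified here, so the function names and the example mask are hypothetical.

```python
def encode_origin(seq, is_ngl):
    """Lowercase residues flagged as non-germline; keep germline uppercase."""
    if len(seq) != len(is_ngl):
        raise ValueError("sequence and mask must have equal length")
    return "".join(
        aa.lower() if flag else aa.upper()
        for aa, flag in zip(seq, is_ngl)
    )

def decode_origin(tagged):
    """Recover the plain sequence and the per-position NGL mask."""
    return tagged.upper(), [ch.islower() for ch in tagged]

seq = "EVQLVESGGG"
mask = [False] * len(seq)
mask[5] = True  # hypothetical: treat position 5 as a somatic mutation

tagged = encode_origin(seq, mask)
print(tagged)  # EVQLVeSGGG

plain, recovered = decode_origin(tagged)
```

Encoding origin in letter case keeps the sequence a single string, so residue identity and GL/NGL status travel together and the round trip back to a plain sequence plus mask is lossless.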
Citation
@article{prism2025,
title={Explicit representation of germline and non-germline residues improves antibody language modeling},
author={...},
year={2025}
}