ProtT5-XL-UniRef50

Short description

Pretrained model on protein sequences using a masked language modeling (MLM) objective. The model can be used for protein feature extraction or fine-tuned on downstream biological prediction tasks. It was developed by Ahmed Elnaggar et al.; more information can be found in the GitHub repository and in the accompanying paper. This repository is a fork of their Hugging Face repository. The model was trained on uppercase amino acids and only works with capital-letter amino acid sequences.

Model versions

ProtT5-XL-UniRef50: Based on the t5-3b architecture and pretrained on UniRef50, a dataset of ~45 million protein sequences.

Long description

ProtT5-XL-UniRef50 is based on the t5-3b model and was pretrained on a large corpus of protein sequences in a self-supervised fashion. This means it was pretrained on raw protein sequences only, with no human labelling of any kind (which is why it can use large amounts of publicly available data), using an automatic process to generate inputs and labels from those sequences.

One important difference between this T5 model and the original T5 version is the denoising objective. The original T5-3B model was pretrained using a span denoising objective, while this model was pretrained with a BART-like MLM denoising objective. The masking probability is consistent with the original T5 training: 15% of the amino acids in the input are randomly masked.
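
For illustration, the sketch below randomly masks 15% of the residues in a toy sequence. The mask token and the exact corruption/reconstruction procedure used during pretraining are assumptions here; see the ProtTrans paper and repository for the precise setup.

import random

def mask_residues(sequence, mask_prob=0.15, mask_token="<extra_id_0>"):
    """Randomly replace ~15% of residues with a mask token (illustrative only)."""
    return [mask_token if random.random() < mask_prob else aa for aa in sequence]

print(mask_residues("PRTEINOSEQWENCE"))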

It has been shown that the features extracted from this self-supervised model (LM embeddings) capture important biophysical properties governing protein shape. This implies that the model has learned some of the grammar of the language of life as realized in protein sequences.

Intended uses & limitations

The model can be used for protein feature extraction or fine-tuned on downstream tasks. We have noticed that for some tasks one can gain more accuracy by fine-tuning the model rather than using it as a feature extractor. We have also noticed that for feature extraction it is better to use features extracted from the encoder rather than from the decoder.
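
As a hypothetical illustration of the feature-extraction route, the sketch below trains a simple scikit-learn classifier on per-protein embeddings. The embeddings X and labels y are random placeholders and are not provided by this repository.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: per-protein embeddings of shape [n, 1024] and binary labels
X = np.random.rand(100, 1024)
y = np.random.randint(0, 2, size=100)

# Any standard classifier can be trained on the extracted features
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(f"Training accuracy on placeholder data: {clf.score(X, y):.2f}")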

ProtT5-XL-UniRef50 is part of the ProtTrans family of models developed by Ahmed Elnaggar et al. and builds upon the ProtT5-XL-BFD checkpoint. Further information can be found in the original publication: IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).

Metadata

Input

  • Description: List of uppercase protein sequences, each with variable length m.
  • Input format:
    • Shape: [n], where n is the number of sequences
    • Data format: [str]
  • Example:
    sequence_examples = ["PRTEINO", "SEQWENCE"]
    
  • Preprocessing (a minimal sketch follows this list):
    • Uppercase normalization.
    • Rare or undetermined amino acids "U, Z, O, B" are mapped to "X".
    • Tokenization: spaces are inserted between amino acids, and special tokens are added.
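
These preprocessing steps can be reproduced with a few lines of standard Python (a minimal sketch; the full tokenization, including special tokens and padding, is handled by the tokenizer in the feature extraction example further below):

import re

sequence_examples = ["PRTEINO", "seqwence"]

# Uppercase normalization, map rare/ambiguous amino acids (U, Z, O, B) to X,
# and insert whitespace between residues so each amino acid becomes a token
processed = [" ".join(re.sub(r"[UZOB]", "X", seq.upper())) for seq in sequence_examples]
print(processed)  # ['P R T E I N X', 'S E Q W E N C E']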

Output

  • Description: Each amino acid is represented by a 1024-dimensional vector.
  • Output format: tensor
    • Shape: [n, max_seq_len, 1024], where max_seq_len is the length of the longest sequence in the batch (max of m)
    • Data format: float
  • Postprocessing:
    • Per-protein embeddings are obtained by averaging the residue embeddings across the sequence (a minimal pooling sketch follows this list).
    • Averaged embedding size: [n, 1024]
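
A minimal, vectorized sketch of this mean pooling, assuming a last_hidden_state tensor of shape [n, max_seq_len, 1024] and the corresponding attention_mask as produced in the feature extraction example below (note that the mask also covers the trailing special token unless it is excluded explicitly):

import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average residue embeddings over non-padded positions -> shape [n, 1024]."""
    mask = attention_mask.unsqueeze(-1).float()      # [n, max_seq_len, 1]
    summed = (last_hidden_state * mask).sum(dim=1)   # [n, 1024]
    counts = mask.sum(dim=1).clamp(min=1.0)          # [n, 1]
    return summed / counts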

Model

  • Modality: Protein sequences
  • Scale: Per protein sequence
  • Description:
    • The model maps amino acid sequences to continuous embeddings (1024 dimensions per residue).
    • The encoder representations can be averaged to obtain a per-protein embedding.
    • Embeddings capture biophysical and structural information such as secondary structure, localization, and membrane properties.
  • Training data:
    • Pretrained on UniRef50, a non-redundant dataset of ~45 million protein sequences.
  • Publication: https://ieeexplore.ieee.org/document/9477085

Installation

Install the conda environment with all dependencies:

# Create the conda environment called virtual-human-chc-prottrans
conda env create -f environment.yaml

# Activate the environment
conda activate virtual-human-chc-prottrans

Example

Feature extraction example

from transformers import T5Tokenizer, T5EncoderModel
import torch
import re

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('virtual-human-chc/prot_t5_xl_uniref50', do_lower_case=False)

# Load the model
model = T5EncoderModel.from_pretrained('virtual-human-chc/prot_t5_xl_uniref50').to(device)

# half-precision is currently only supported on GPU; when running on CPU, make sure the model is in full precision (not recommended, much slower)
if device == torch.device("cpu"):
    model.to(torch.float32)

sequence_examples = ["PRTEINO", "SEQWENCE"]
# this will replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# extract embeddings for the first ([0,:]) sequence in the batch while removing padded & special tokens ([0,:7]) 
emb_0 = embedding_repr.last_hidden_state[0,:7] # shape (7 x 1024)
print(f"Shape of per-residue embedding of first sequences: {emb_0.shape}")

# do the same for the second ([1,:]) sequence in the batch while taking into account different sequence lengths ([1,:8])
emb_1 = embedding_repr.last_hidden_state[1,:8] # shape (8 x 1024)

# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0) # shape (1024)

print(f"Shape of per-protein embedding of first sequences: {emb_0_per_protein.shape}")

References

  1. Ahmed Elnaggar et al., ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing, IEEE TPAMI (2021).
  2. Hugging Face repository: https://huggingface.co/Rostlab/prot_t5_xl_uniref50
  3. Hugging Face Hub (fork): https://huggingface.co/virtual-human-chc/prot_t5_xl_uniref50
  4. GitHub repository: https://github.com/agemagician/ProtTrans
  5. Training dataset: UniRef50

Copyright

Code derived from https://github.com/agemagician/ProtTrans is licensed under the MIT License, © 2025 Ahmed Elnaggar. The ProtTrans pretrained models are released under the Academic Free License v3.0, © 2025 Ahmed Elnaggar. Additional code © 2025 Maksim Pavlov, licensed under MIT.
