
The tokenize_for_model.py script is designed to tokenize gene expression data for use in our models. It takes in processed gene expression data (an AnnData file), applies various tokenization techniques, and prepares it for training or inference.

General Workflow

The script follows these main steps (sketched in code after the list):

  1. Load Tokenization Arguments: The script starts by loading the tokenization arguments from a configuration file or dictionary.
  2. Load Gene Tokenizer: It loads a pre-trained gene tokenizer based on the provided tokenization arguments.
  3. Load AnnData: The script reads the gene expression data from an AnnData file.
  4. Check Genes in Tokenizer: It verifies that the genes in the dataset are present in the tokenizer's vocabulary.
  5. Build Token Array: The script constructs a token array for the genes in the dataset.
  6. Convert Processed Layer to Dense: It converts the processed layer of the AnnData object to a dense matrix.
  7. Tokenize in Batches: The script processes the data in batches, applying tokenization and optional binning or ranking.
  8. Save Tokenized Data: Finally, the script saves the tokenized data to disk.
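
The following sketch shows how these steps might fit together. The file names, the "processed" layer name, and the argument keys are assumptions for illustration, not the script's actual API.

```python
# Sketch of the workflow above; names and file paths are assumptions.
import json

import anndata as ad
import numpy as np
import scipy.sparse as sp

# Steps 1-2: load tokenization arguments and the gene tokenizer's vocabulary
with open("tokenization_args.json") as f:
    args = json.load(f)
with open("gene_vocab.json") as f:
    vocab = json.load(f)  # gene ID -> token ID

# Step 3: load the gene expression data
adata = ad.read_h5ad("data.h5ad")

# Step 4: check that every gene in the dataset is in the vocabulary
genes = adata.var[args["gene_id_column"]].to_numpy()
missing = [g for g in genes if g not in vocab]
if missing:
    raise ValueError(f"{len(missing)} genes missing from tokenizer vocabulary")

# Step 5: build a token array aligned with the dataset's gene axis
gene_tokens = np.array([vocab[g] for g in genes], dtype=np.int64)

# Step 6: densify the processed expression layer
X = adata.layers["processed"]
if sp.issparse(X):
    X = X.toarray()

# Steps 7-8: tokenize in batches and save shards
# (see the sketches in the Tokenization Arguments section below)
```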

Tokenization Arguments

The script uses several tokenization arguments to control its behavior. Each argument and the steps it influences are explained below.
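
For orientation, a configuration using these arguments might look like the sketch below; the values shown are illustrative assumptions, and the keys simply mirror the descriptions that follow.

```python
# Hypothetical tokenization arguments; all values are examples only.
tokenization_args = {
    "max_seq_len": 2048,
    "add_cls": True,
    "cls_token_id": 0,
    "random_genes": False,
    "include_zero_genes": False,
    "bins": None,              # TEDDY-X only; e.g. 51 to enable binning
    "continuous_rank": False,  # TEDDY-X only
    "gene_seed": 42,
    "gene_id_column": "gene_symbol",
    "label_column": "disease",
    "bio_annotations": True,
    "disease_mapping": "mappings/disease.json",
    "tissue_mapping": "mappings/tissue.json",
    "cell_mapping": "mappings/cell_type.json",
    "sex_mapping": "mappings/sex.json",
    "add_disease_annotation": False,
    "max_shard_samples": 100000,
}
```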

  • max_seq_len
    • Description: Specifies the maximum sequence length for the tokenized data.
    • Impact: Determines the number of genes to include in each tokenized sequence (cell). If add_cls is enabled, the sequence length is reduced by one to accommodate the CLS token (see the tokenization sketch after this list).
  • add_cls
    • Description: Indicates whether to prepend a CLS token to each sequence.
    • Impact: If enabled, a CLS token is added to the beginning of each sequence, and the sequence length is adjusted accordingly.
  • cls_token_id
    • Description: The token ID to use for the CLS token.
    • Impact: If add_cls is enabled, this token ID is used for the CLS token.
  • random_genes
    • Description: Specifies whether to select a random subset of genes before applying top-k selection.
    • Impact: If enabled, a random subset of genes is selected for each batch, and then the top-k values are determined from this subset.
  • include_zero_genes
    • Description: Indicates whether to include zero-expression genes in the tokenized data.
    • Impact: If enabled, zero-expression genes are included in the tokenized sequences. Otherwise, they are filtered out.
  • bins
    • Description: Specifies the number of bins to use for binning expression values.
    • Impact: If set, the script bins the expression values into the specified number of bins (see the binning/ranking sketch after this list). This argument is only relevant for TEDDY-X.
  • continuous_rank
    • Description: Indicates whether to rank expression values continuously.
    • Impact: If enabled, the script maps the expression values to ranks scaled into the range [-1, 1]. This argument is only relevant for TEDDY-X.
  • gene_seed
    • Description: A random seed for reproducibility.
    • Impact: If set, the script uses this seed to ensure reproducible random operations.
  • gene_id_column
    • Description: The column name in the AnnData object that contains gene IDs.
    • Impact: The script uses this column to match the genes in the dataset against the tokenizer's vocabulary.
  • label_column
    • Description: The column name in the AnnData object that contains classification labels.
    • Impact: If set, the script adds these labels to the tokenized data.
  • bio_annotations
    • Description: Indicates whether to add biological annotations to the tokenized data.
    • Impact: If enabled, the script adds annotations such as disease, tissue, cell type, and sex to the tokenized data.
  • disease_mapping, tissue_mapping, cell_mapping, sex_mapping
    • Description: File paths to JSON files containing mappings for biological annotations.
    • Impact: The script uses these mappings to convert biological annotations to token IDs (see the annotation sketch after this list).
  • add_disease_annotation
    • Description: Indicates whether to override labels with disease annotations.
    • Impact: If enabled, the tokenized labels are taken from the disease annotation instead of label_column.
  • max_shard_samples
    • Description: The maximum number of samples per shard when saving the tokenized data.
    • Impact: The script splits the tokenized data into shards with at most this many samples each (see the sharding sketch after this list).
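
To make the interplay of max_seq_len, add_cls, random_genes, and include_zero_genes concrete, here is a per-cell tokenization sketch, reusing the names from the workflow sketch above; the subset size used for random_genes and the overall function shape are assumptions, not the script's actual logic.

```python
import numpy as np

def tokenize_cell(row, gene_tokens, args, rng):
    """Sketch: select genes for one cell and map them to token IDs."""
    k = args["max_seq_len"] - (1 if args["add_cls"] else 0)
    idx = np.arange(row.shape[0])
    if not args["include_zero_genes"]:
        idx = idx[row[idx] > 0]  # filter out zero-expression genes
    if args["random_genes"]:
        # Take a random subset before top-k selection; the subset size
        # (2 * k here) is an assumption for illustration.
        idx = rng.choice(idx, size=min(idx.shape[0], 2 * k), replace=False)
    idx = idx[np.argsort(row[idx])[::-1][:k]]  # top-k by expression value
    tokens = gene_tokens[idx]
    if args["add_cls"]:
        # Prepend the CLS token; it carries no expression value of its own.
        tokens = np.concatenate(([args["cls_token_id"]], tokens))
    return tokens, row[idx]

# gene_seed makes the random subset reproducible across runs.
rng = np.random.default_rng(args.get("gene_seed"))
tokens, values = tokenize_cell(X[0], gene_tokens, args, rng)
```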
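
bins and continuous_rank transform the selected expression values themselves (TEDDY-X only). The following is a plausible reading of the descriptions above; in particular, equal-width binning is an assumption, and the script may bin differently.

```python
import numpy as np

def bin_values(values, n_bins):
    # Equal-width binning of expression values into integer bins 0..n_bins-1.
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    return np.digitize(values, edges[1:-1])

def rank_values(values):
    # Map expression values to ranks rescaled into [-1, 1].
    ranks = np.argsort(np.argsort(values)).astype(float)
    if values.shape[0] < 2:
        return np.zeros_like(ranks)
    return 2.0 * ranks / (values.shape[0] - 1) - 1.0
```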
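
When bio_annotations is enabled, each annotation (disease, tissue, cell type, sex) is converted to a token ID through its JSON mapping file. A minimal sketch; the obs column name used here is an assumption.

```python
import json

def annotation_token_ids(adata, mapping_path, obs_column):
    # Convert a categorical obs column to token IDs via a JSON mapping file.
    with open(mapping_path) as f:
        mapping = json.load(f)
    return [mapping[str(v)] for v in adata.obs[obs_column]]

# Example: map the (assumed) "disease" obs column with disease_mapping.
disease_ids = annotation_token_ids(adata, args["disease_mapping"], "disease")
```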
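
Finally, max_shard_samples bounds how many samples land in each saved shard. A minimal sketch, assuming the tokenized examples are held as a list of dicts and written with the Hugging Face datasets library; the actual on-disk format used by the script is an assumption.

```python
from datasets import Dataset

def save_shards(examples, max_shard_samples, out_dir):
    # examples: list of dicts, e.g. {"input_ids": [...], "label": 3}.
    # Each shard holds at most max_shard_samples examples.
    for i in range(0, len(examples), max_shard_samples):
        shard = Dataset.from_list(examples[i : i + max_shard_samples])
        shard.save_to_disk(f"{out_dir}/shard_{i // max_shard_samples:05d}")
```

Capping shard size keeps individual files small enough to load or stream independently during training.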