The `tokenize_for_model.py` script tokenizes gene expression data for use in our models. It takes in processed expression data, applies various tokenization techniques, and prepares it for training or inference.

# General Workflow
The script follows these main steps (sketched in code after the list):
1. **Load Tokenization Arguments**: The script starts by loading the tokenization arguments from a configuration file or dictionary.
2. **Load Gene Tokenizer**: It loads a pre-trained gene tokenizer based on the provided tokenization arguments.
3. **Load AnnData**: The script reads the gene expression data from an AnnData file.
4. **Check Genes in Tokenizer**: It verifies that the genes in the dataset are present in the tokenizer's vocabulary.
5. **Build Token Array**: The script constructs a token array for the genes in the dataset.
6. **Convert Processed Layer to Dense**: It converts the processed layer of the AnnData object to a dense matrix.
7. **Tokenize in Batches**: The script processes the data in batches, applying tokenization and optional binning or ranking.
8. **Save Tokenized Data**: Finally, the script saves the tokenized data to disk.
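Here is a minimal sketch of these steps, assuming a JSON vocabulary file mapping gene IDs to token IDs, a `gene_id` column in `adata.var`, and a `processed` layer; all file and column names are illustrative, not the script's actual interface:

```python
import json

import anndata as ad
import numpy as np
import scipy.sparse as sp

# Steps 1-2: load the arguments and the tokenizer vocabulary. A plain
# gene-ID -> token-ID JSON map is assumed here.
with open("gene_vocab.json") as f:
    vocab = json.load(f)

# Step 3: read the gene expression data.
adata = ad.read_h5ad("processed.h5ad")
gene_ids = adata.var["gene_id"].to_numpy()  # `gene_id_column`

# Step 4: verify the dataset's genes are present in the vocabulary.
in_vocab = np.array([g in vocab for g in gene_ids])
assert in_vocab.all(), f"{(~in_vocab).sum()} genes missing from the vocabulary"

# Step 5: one token ID per gene column, aligned with the expression matrix.
token_array = np.array([vocab[g] for g in gene_ids], dtype=np.int64)

# Step 6: densify the processed layer so per-cell slicing is cheap.
layer = adata.layers["processed"]
dense = layer.toarray() if sp.issparse(layer) else np.asarray(layer)

# Step 7: tokenize in batches, keeping the highest-expression genes first
# (CLS handling, binning, and ranking are shown in later sketches).
max_seq_len, batch_size, sequences = 2048, 1024, []
for start in range(0, dense.shape[0], batch_size):
    for row in dense[start:start + batch_size]:
        top = np.argsort(row)[::-1][:max_seq_len]
        sequences.append(token_array[top])

# Step 8: persist the result (sharding is shown in a later sketch).
np.save("tokenized.npy", np.stack(sequences))
```

Densifying up front (step 6) trades memory for fast per-cell slicing inside the batch loop.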

# Tokenization Arguments
The script uses several tokenization arguments to control its behavior. Here is an explanation of each argument and the steps it influences:

- `max_seq_len`
    - Description: Specifies the maximum sequence length for the tokenized data.
    - Impact: Determines the number of genes to include in each tokenized sequence (cell). If `add_cls` is enabled, the gene count is reduced by one to accommodate the CLS token (see the sequence-construction sketch after this list).
- `add_cls`
    - Description: Indicates whether to prepend a CLS token to each sequence.
    - Impact: If enabled, a CLS token is added to the beginning of each sequence, and the sequence length is adjusted accordingly.
- `cls_token_id`
    - Description: The token ID to use for the CLS token.
    - Impact: If `add_cls` is enabled, this token ID is used for the CLS token.
- `random_genes`
    - Description: Specifies whether to select a random subset of genes before applying top-k selection.
    - Impact: If enabled, a random subset of genes is selected for each batch, and then the top-k values are determined from this subset.
- `include_zero_genes`
    - Description: Indicates whether to include zero-expression genes in the tokenized data.
    - Impact: If enabled, zero-expression genes are included in the tokenized sequences. Otherwise, they are filtered out.
- `bins`
    - Description: Specifies the number of bins to use for binning expression values.
    - Impact: If set, the script bins the expression values into the specified number of bins (see the binning and ranking sketch after this list). This argument is only relevant for TEDDY-X.
- `continuous_rank`
    - Description: Indicates whether to rank expression values continuously.
    - Impact: If enabled, the script ranks the expression values and scales the ranks into the range [-1, 1]. This argument is only relevant for TEDDY-X.
- `gene_seed`
    - Description: A random seed for reproducibility.
    - Impact: If set, the script uses this seed to ensure reproducible random operations.
- `gene_id_column`
    - Description: The column name in the AnnData object that contains gene IDs.
    - Impact: The script uses this column to match the dataset's genes against the tokenizer's vocabulary.
- `label_column`
    - Description: The column name in the AnnData object that contains classification labels.
    - Impact: If set, the script adds these labels to the tokenized data.
- `bio_annotations`
    - Description: Indicates whether to add biological annotations to the tokenized data.
    - Impact: If enabled, the script adds annotations such as disease, tissue, cell type, and sex to the tokenized data.
- `disease_mapping`, `tissue_mapping`, `cell_mapping`, `sex_mapping`
    - Description: File paths to JSON files containing mappings for biological annotations.
    - Impact: The script uses these mappings to convert biological annotations to token IDs (see the annotation-mapping sketch after this list).
- `add_disease_annotation`
    - Description: Indicates whether to override labels with disease annotations.
    - Impact: If enabled, the disease annotations replace the labels drawn from `label_column`.
- `max_shard_samples`
    - Description: The maximum number of samples per shard when saving the tokenized data.
    - Impact: The script splits the tokenized data into shards with at most the specified number of samples each (see the sharding sketch after this list).
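
The interaction of `max_seq_len`, `add_cls`, `cls_token_id`, `random_genes`, `include_zero_genes`, and `gene_seed` can be sketched for a single cell as follows. The CLS token ID of 0, the random-subset size, and the helper name are assumptions, not the script's defaults:

```python
import numpy as np

def build_sequence(expr, token_array, max_seq_len=2048, add_cls=True,
                   cls_token_id=0, random_genes=False,
                   include_zero_genes=False, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)  # `gene_seed` would seed this generator
    idx = np.arange(expr.shape[0])
    if not include_zero_genes:
        idx = idx[expr[idx] > 0]        # drop zero-expression genes
    k = max_seq_len - 1 if add_cls else max_seq_len  # reserve a CLS slot
    if random_genes:
        # Random subset first, then top-k from it (subset size is an assumption).
        idx = rng.choice(idx, size=min(idx.size, 4 * k), replace=False)
    top = idx[np.argsort(expr[idx])[::-1][:k]]       # top-k by expression
    tokens = token_array[top]
    if add_cls:
        tokens = np.concatenate(([cls_token_id], tokens))
    return tokens
```

For example, `build_sequence(dense[0], token_array)` returns at most `max_seq_len` token IDs, with the CLS token in position zero.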
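For TEDDY-X, `bins` and `continuous_rank` transform the expression values themselves. A hedged sketch, assuming equal-width bins and a linear rank scaling (the script's exact scheme may differ):

```python
import numpy as np

def bin_values(expr, n_bins):
    # Equal-width bins over the observed range; the script may instead use
    # quantile bins, so treat this as an assumption.
    edges = np.linspace(expr.min(), expr.max(), n_bins + 1)
    return np.digitize(expr, edges[1:-1])   # bin indices in [0, n_bins - 1]

def rank_values(expr):
    # Rank each value, then scale the ranks linearly into [-1, 1].
    ranks = np.argsort(np.argsort(expr))
    return 2.0 * ranks / max(expr.size - 1, 1) - 1.0
```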
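For `bio_annotations` and the mapping files, here is a minimal sketch of how one mapping could be applied, assuming the JSON files map annotation strings to token IDs and the annotations live in `adata.obs` columns with matching names:

```python
import json

import anndata as ad

adata = ad.read_h5ad("processed.h5ad")

# Assumed mapping format: {"annotation value": token_id}. The obs column
# name ("disease") and file name are illustrative.
with open("disease_mapping.json") as f:
    disease_map = json.load(f)

disease_tokens = [disease_map[d] for d in adata.obs["disease"]]
# With `add_disease_annotation` enabled, these token IDs would replace the
# labels drawn from `label_column`.
```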
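Finally, `max_shard_samples` controls how the output is split on disk. A sketch, assuming one NumPy file per shard (the script's actual on-disk format may differ):

```python
import numpy as np

def save_shards(sequences, max_shard_samples, prefix="tokenized"):
    # Write at most `max_shard_samples` sequences per shard file.
    for i, start in enumerate(range(0, len(sequences), max_shard_samples)):
        shard = np.stack(sequences[start:start + max_shard_samples])
        np.save(f"{prefix}_shard{i:05d}.npy", shard)
```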