| # PreprocessReadMe.md | |
| The `preprocess.py` script preprocesses gene expression data for use in our models. It reads counts from `data.raw.X` or `data.X` of an AnnData object, applies the filtering, normalization, and transformation steps listed under General Workflow, and prepares the result for training or inference. | |
| # General Workflow | |
| The script follows these main steps: | |
| 0. **Load Data and Metadata**: The script starts by loading the gene expression data from an AnnData file and metadata from a JSON file. | |
| 1. **Set Raw Layer**: It checks whether `data.raw.X` is set in the AnnData object. If not, it sets it from the integer counts in `data.X`. | |
| 2. **Initialize Processed Layer**: It initializes `data.layers['processed']` in the AnnData object, which is the layer that all subsequent preprocessing steps modify. | |
| 3. **Filter Genes by Reference ID**: It filters genes based on reference IDs if specified in the hyperparameters. | |
| 4. **Remove Assays**: It removes specified assays from the data. | |
| 5. **Filter Cells by Gene Counts**: It filters out cells with gene counts below a specified threshold. | |
| 6. **Filter Cells by Mitochondrial Fraction**: It removes cells with a high mitochondrial gene fraction. | |
| 7. **Filter Highly Variable Genes**: It filters genes to retain only highly variable ones using specified methods. | |
| 8. **Normalize Data**: It normalizes the data row by row (cell level, since AnnData rows are cells), scaling each cell's total expression to the value given by `normalized_total`. | |
| 9. **Scale Columns by Median**: It scales columns based on median values from a specified dictionary. | |
| 10. **Log Transform**: It applies a `log(1 + x)` (`log1p`) transformation to the data. | |
| 11. **Compute Medians**: It computes and saves medians of the processed data if specified. | |
| 12. **Update Metadata**: It updates the metadata with cell counts and processing arguments. | |
| 13. **Save and Cleanup**: It saves the processed data and metadata to disk and performs garbage collection. | |
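
The cell filtering, normalization, and log steps above can be illustrated with a minimal, dependency-free sketch. It is not the code in `preprocess.py`: the function name `preprocess_counts`, the use of plain Python lists instead of an AnnData object, the `MT-` prefix for mitochondrial genes, and the reading of "gene counts" as the number of genes detected per cell are all assumptions made for illustration. Gene filtering and median scaling are omitted.

```python
import math


def preprocess_counts(counts, gene_names, min_gene_counts, max_mito_prop,
                      normalized_total, log1p=True):
    """Filter cells, normalize each cell's total, and optionally apply log1p.

    counts: list of rows, one row per cell and one column per gene.
    Returns the processed rows and the indices of the cells that were kept.
    """
    # Assumption: mitochondrial genes are identified by an "MT-" name prefix.
    mito_cols = [j for j, name in enumerate(gene_names)
                 if name.upper().startswith("MT-")]

    processed, kept = [], []
    for i, row in enumerate(counts):
        total = sum(row)

        # Assumption: "gene counts" means the number of genes with a
        # non-zero count in this cell.
        n_genes = sum(1 for value in row if value > 0)
        if n_genes < min_gene_counts:
            continue

        # Drop cells whose mitochondrial fraction exceeds the threshold.
        mito_frac = sum(row[j] for j in mito_cols) / total if total > 0 else 0.0
        if mito_frac > max_mito_prop:
            continue

        # Row-level normalization: rescale the cell so its total equals
        # normalized_total (cells with no counts stay all-zero).
        scale = normalized_total / total if total > 0 else 0.0
        values = [value * scale for value in row]

        if log1p:
            values = [math.log1p(value) for value in values]

        processed.append(values)
        kept.append(i)
    return processed, kept


genes = ["GENE_A", "GENE_B", "MT-CO1"]
counts = [
    [5, 3, 2],  # kept: 3 genes detected, mitochondrial fraction 0.2
    [0, 1, 9],  # dropped: mitochondrial fraction 0.9
    [4, 0, 0],  # dropped: only 1 gene detected
]
processed, kept = preprocess_counts(
    counts, genes,
    min_gene_counts=2, max_mito_prop=0.5, normalized_total=10.0,
)
```

Filtering runs before normalization here, matching the order of the steps listed above, so the thresholds are applied to the original counts rather than to rescaled values.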
| # Preprocessing Arguments | |
| The script uses several preprocessing arguments to control its behavior. Here is an explanation of each argument and the steps they influence: | |
| - `reference_id_only` | |
| - Description: Specifies whether to filter genes by reference ID. | |
| - Impact: If enabled, the script filters genes based on reference IDs. | |
| - `remove_assays` | |
| - Description: List of assays to remove from the data. | |
| - Impact: The script removes specified assays from the data. | |
| - `min_gene_counts` | |
| - Description: Minimum gene counts required for cells to be retained. | |
| - Impact: The script filters out cells with gene counts below this threshold. | |
| - `max_mitochondrial_prop` | |
| - Description: Maximum mitochondrial gene fraction allowed for cells. | |
| - Impact: The script removes cells with a mitochondrial gene fraction above this threshold. | |
| - `hvg_method` | |
| - Description: Method to use for filtering highly variable genes. | |
| - Impact: The script filters genes to retain only highly variable ones using the specified method. | |
| - `normalized_total` | |
| - Description: Target value to which each cell's total gene expression is normalized. | |
| - Impact: The script normalizes the data row by row (cell level), scaling each cell's total expression to this value. | |
| - `median_dict` | |
| - Description: Path to a JSON file containing median values for scaling columns. | |
| - Impact: The script scales columns based on median values from the specified dictionary. | |
| - `median_column` | |
| - Description: Column name to use for looking up median values. | |
| - Impact: The script uses this column to look up median values for scaling. | |
| - `log1p` | |
| - Description: Indicates whether to apply a `log(1 + x)` transformation to the data. | |
| - Impact: If enabled, the script applies a `log(1 + x)` (`log1p`) transformation to the data. | |
| - `compute_medians` | |
| - Description: Indicates whether to compute and save medians of the processed data. | |
| - Impact: If enabled, the script computes and saves medians of the processed data. | |