	# PreprocessReadMe.md

	The `preprocess.py` script preprocesses gene expression data for use in our models. It reads counts from `data.raw.X` or `data.X`, applies the filtering, normalization, and transformation steps described below, and prepares the result for training or inference.

	# General Workflow
	The script follows these main steps:
	0. Load Data and Metadata: The script starts by loading the gene expression data from an AnnData file and metadata from a JSON file.
	1. Set Raw Layer: It checks whether `data.raw.X` is set in the AnnData object. If not, it sets it from the integer counts in `data.X`.
	2. Initialize Processed Layer: It initializes `data.layers['processed']` in the AnnData object, which is the layer that all subsequent preprocessing steps modify.
	3. Filter Genes by Reference ID: It filters genes based on reference IDs if specified in the hyperparameters.
	4. Remove Assays: It removes specified assays from the data.
	5. Filter Cells by Gene Counts: It filters out cells with gene counts below a specified threshold.
	6. Filter Cells by Mitochondrial Fraction: It removes cells with a high mitochondrial gene fraction.
	7. Filter Highly Variable Genes: It filters genes to retain only highly variable ones using specified methods.
	8. Normalize Data: It normalizes the data by applying row (cell level) normalization, scaling each cell's total expression to a specified value.
	9. Scale Columns by Median: It scales columns based on median values from a specified dictionary.
	10. Log Transform: It applies a `log1p` transformation, log(1 + x), to the data.
	11. Compute Medians: It computes and saves medians of the processed data if specified.
	12. Update Metadata: It updates the metadata with cell counts and processing arguments.
	13. Save and Cleanup: It saves the processed data and metadata to disk and performs garbage collection.
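
Steps 8 to 10 above can be illustrated with a minimal, dependency-free sketch. It is not the code in `preprocess.py`: the function names are hypothetical, the matrix is a plain list of rows (cells) by columns (genes), and it assumes that "scaling by median" means dividing each gene column by its reference median.

```python
import math

def normalize_rows(matrix, target_total):
    """Scale each row (cell) so that its values sum to target_total."""
    out = []
    for row in matrix:
        total = sum(row)
        # All-zero rows stay all-zero; this also avoids division by zero.
        factor = target_total / total if total > 0 else 0.0
        out.append([v * factor for v in row])
    return out

def scale_columns(matrix, medians):
    """Divide each column (gene) by its reference median (assumed behavior)."""
    return [[v / m for v, m in zip(row, medians)] for row in matrix]

def log1p_transform(matrix):
    """Apply log(1 + x) element-wise."""
    return [[math.log1p(v) for v in row] for row in matrix]

# Toy data: 2 cells x 3 genes, with illustrative medians per gene.
counts = [[1, 3, 0], [2, 2, 4]]
normalized = normalize_rows(counts, target_total=8.0)
scaled = scale_columns(normalized, medians=[2.0, 4.0, 1.0])
processed = log1p_transform(scaled)
```

Because zeros map to zero under `log1p`, sparse count data stays sparse through the transformation.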


	# Preprocessing Arguments
	The script uses several preprocessing arguments to control its behavior. Here is an explanation of each argument and the steps they influence:

	- `reference_id_only`
	  - Description: Specifies whether to filter genes by reference ID.
	  - Impact: If enabled, the script keeps only genes that match the reference IDs.
	- `remove_assays`
	  - Description: List of assays to remove from the data.
	  - Impact: The script removes the specified assays from the data.
	- `min_gene_counts`
	  - Description: Minimum gene counts required for cells to be retained.
	  - Impact: The script filters out cells with gene counts below this threshold.
	- `max_mitochondrial_prop`
	  - Description: Maximum mitochondrial gene fraction allowed for cells.
	  - Impact: The script removes cells with a mitochondrial gene fraction above this threshold.
	- `hvg_method`
	  - Description: Method to use for filtering highly variable genes.
	  - Impact: The script retains only the highly variable genes selected by the specified method.
	- `normalized_total`
	  - Description: Value to normalize the total gene expression to.
	  - Impact: The script applies row (cell level) normalization so that each cell's total expression equals this value.
	- `median_dict`
	  - Description: Path to a JSON file containing median values for scaling columns.
	  - Impact: The script scales columns based on median values from the specified dictionary.
	- `median_column`
	  - Description: Column name to use for looking up median values.
	  - Impact: The script uses this column to look up median values for scaling.
	- `log1p`
	  - Description: Indicates whether to apply a `log1p` transformation, log(1 + x), to the data.
	  - Impact: If enabled, the script applies the `log1p` transformation to the data.
	- `compute_medians`
	  - Description: Indicates whether to compute and save medians of the processed data.
	  - Impact: If enabled, the script computes and saves medians of the processed data.
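
As an illustration of how the two cell-level thresholds might combine, here is a hedged sketch. The argument names come from the list above, but the values, the `keep_cell` helper, and the toy data are hypothetical, and "gene counts" is assumed to mean the total counts per cell; the actual script may define it differently.

```python
# Illustrative values only; not defaults taken from preprocess.py.
preprocess_args = {
    "min_gene_counts": 3,
    "max_mitochondrial_prop": 0.2,
}

def keep_cell(gene_counts, mito_flags, args):
    """Return True if a cell passes both cell-level filters (hypothetical helper)."""
    total = sum(gene_counts)
    # Assumption: the threshold applies to the total counts of the cell.
    if total == 0 or total < args["min_gene_counts"]:
        return False
    mito = sum(c for c, is_mito in zip(gene_counts, mito_flags) if is_mito)
    return mito / total <= args["max_mitochondrial_prop"]

# Toy data: 3 cells x 3 genes; the last gene is flagged as mitochondrial.
mito_flags = [False, False, True]
cells = [[5, 3, 1], [1, 0, 1], [2, 1, 7]]
kept = [cell for cell in cells if keep_cell(cell, mito_flags, preprocess_args)]
```

In this toy example the second cell fails the count threshold and the third exceeds the mitochondrial fraction, so only the first cell is retained.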