Model - GvEM (Genomic Variant Embedding Model)
GvEM is a PyTorch-based deep learning model designed to embed and model genomic mutation data from VCF (Variant Call Format) files using a biologically informed hierarchy: Pathway → Chromosome → Gene → Mutations.
Hierarchy of input data
example_data = {
    'sample1': {
        'pathway1': {
            'chr1': {
                'gene1': [
                    {'impact': 'HIGH', 'reference': 'A', 'alternate': 'T'}
                ]
            }
        }
    }
}
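As a rough illustration of how such a structure could be assembled, the sketch below groups annotated VCF records into the same nesting. It is not the GvEM parser: the `parse_annotated_vcf` function, the `GENE_TO_PATHWAY` lookup, and the sample records are hypothetical, and it assumes a SnpEff-style `ANN=Allele|Annotation|Impact|Gene_Name|...` entry in the INFO column.

```python
from collections import defaultdict

# Hypothetical gene-to-pathway lookup; the mapping GvEM actually uses is not specified.
GENE_TO_PATHWAY = {"gene1": "pathway1", "gene2": "pathway1"}


def parse_annotated_vcf(lines, sample_id, gene_to_pathway=GENE_TO_PATHWAY):
    """Group records as sample -> pathway -> chromosome -> gene -> [mutations].

    Assumes one ALT allele per record and a SnpEff-style ANN entry in INFO.
    """
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, _pos, _id, ref, alt, _qual, _filter, info = line.rstrip("\n").split("\t")[:8]
        ann = next((f[4:] for f in info.split(";") if f.startswith("ANN=")), None)
        if ann is None:
            continue  # unannotated record: cannot be placed in the hierarchy
        fields = ann.split(",")[0].split("|")  # use the first annotation only
        impact, gene = fields[2], fields[3]
        pathway = gene_to_pathway.get(gene, "unknown_pathway")
        tree[pathway][chrom][gene].append(
            {"impact": impact, "reference": ref, "alternate": alt}
        )
    # Convert the nested defaultdicts into plain dicts.
    return {
        sample_id: {
            pathway: {chrom: dict(genes) for chrom, genes in chroms.items()}
            for pathway, chroms in tree.items()
        }
    }


# Made-up records for illustration only.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tT\t50\tPASS\tANN=T|stop_gained|HIGH|gene1|gene1_id",
    "chr1\t22345\t.\tG\tC\t60\tPASS\tDP=10;ANN=C|missense_variant|MODERATE|gene2|gene2_id",
]
parsed = parse_annotated_vcf(vcf_lines, "sample1")
```

Records without an `ANN` entry are skipped here, which mirrors the requirement that variants be annotated before use.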
Features
- VCF Parser: Converts standard VCF files into a hierarchical JSON-like structure.
- MutationEmbedder: Learns embeddings for categorical mutation features (scalable).
- GeneEncoder: Processes lists of mutations using a Transformer and hierarchical attention to get gene-level representations.
- ChromosomeEncoder: Aggregates gene encodings.
- PathwayEncoder: Aggregates chromosome encodings to yield final sample representation.
- Scalable: Easily extensible to new fields or biological groupings.
- HuggingFace Compatible: Designed for sharing and experimentation on the 🤗 Hub.
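The way these components could fit together is sketched below in PyTorch. This is a minimal illustration under stated assumptions, not the GvEM implementation: the vocabularies, the `AttentionPool` helper, the `HierarchicalSketch` wrapper, the layer sizes, and the mean over pathways are all hypothetical choices, and every gene is assumed to carry at least one mutation.

```python
import torch
import torch.nn as nn

# Hypothetical vocabularies; index 0 is reserved for unknown values.
IMPACT = {"<unk>": 0, "HIGH": 1, "MODERATE": 2, "LOW": 3, "MODIFIER": 4}
BASE = {"<unk>": 0, "A": 1, "C": 2, "G": 3, "T": 4}


class MutationEmbedder(nn.Module):
    """One embedding table per categorical field, summed into one vector."""

    def __init__(self, dim):
        super().__init__()
        self.impact = nn.Embedding(len(IMPACT), dim)
        self.ref = nn.Embedding(len(BASE), dim)
        self.alt = nn.Embedding(len(BASE), dim)

    @staticmethod
    def _indices(vocab, mutations, key):
        return torch.tensor([vocab.get(m[key], 0) for m in mutations], dtype=torch.long)

    def forward(self, mutations):  # list of mutation dicts
        return (
            self.impact(self._indices(IMPACT, mutations, "impact"))
            + self.ref(self._indices(BASE, mutations, "reference"))
            + self.alt(self._indices(BASE, mutations, "alternate"))
        )  # (n_mutations, dim)


class AttentionPool(nn.Module):
    """Softmax-weighted average over a set of vectors (assumed aggregation)."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):  # x: (n, dim)
        weights = torch.softmax(self.score(x), dim=0)  # (n, 1)
        return (weights * x).sum(dim=0)  # (dim,)


class GeneEncoder(nn.Module):
    """Transformer over a gene's mutations, then attention pooling."""

    def __init__(self, dim, heads=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=1)
        self.pool = AttentionPool(dim)

    def forward(self, mutation_vectors):  # (n_mutations, dim)
        hidden = self.transformer(mutation_vectors.unsqueeze(0)).squeeze(0)
        return self.pool(hidden)


class HierarchicalSketch(nn.Module):
    """Mutations -> gene -> chromosome -> pathway -> sample vector."""

    def __init__(self, dim=16):
        super().__init__()
        self.embed = MutationEmbedder(dim)
        self.gene = GeneEncoder(dim)
        self.chromosome_pool = AttentionPool(dim)  # aggregates gene encodings
        self.pathway_pool = AttentionPool(dim)  # aggregates chromosome encodings

    def forward(self, sample):  # pathway -> chromosome -> gene -> [mutations]
        pathway_vecs = []
        for chroms in sample.values():
            chrom_vecs = []
            for genes in chroms.values():
                gene_vecs = torch.stack([self.gene(self.embed(m)) for m in genes.values()])
                chrom_vecs.append(self.chromosome_pool(gene_vecs))
            pathway_vecs.append(self.pathway_pool(torch.stack(chrom_vecs)))
        # Assumption: pathway vectors are averaged to give the sample vector.
        return torch.stack(pathway_vecs).mean(dim=0)


example_data = {"sample1": {"pathway1": {"chr1": {"gene1": [
    {"impact": "HIGH", "reference": "A", "alternate": "T"}]}}}}
model = HierarchicalSketch(dim=16).eval()
with torch.no_grad():
    embedding = model(example_data["sample1"])
print(embedding.shape)  # torch.Size([16])
```

Summing the per-field embeddings keeps the vector size fixed when a new categorical field is added, which is one simple way to read the "scalable" claim; concatenation followed by a projection would be an equally reasonable choice.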
Uses
Direct Use:
- Obtain sample-level embeddings
- Mutation pattern learning
- Transfer learning across genomic datasets
Downstream Use:
- Variant-based disease prediction (e.g., cancer, rare diseases, ASD)
- Multi-omics fusion models (tabular + image + VCF)
- Cohort-level mutation analysis
- Fine-tuning for prognosis, drug response prediction, or variant effect interpretation.
Limitations
- Not intended for clinical decision-making without expert oversight.
- Input variants must already be annotated.
- Not intended for non-human genomes unless explicitly fine-tuned for those organisms.
- High-resolution functional variant prediction is not yet supported and is left for future development.
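Because the model expects annotated variants, it can help to check the input before embedding it. The sketch below walks the nested structure and reports mutations with missing fields. The `find_unannotated` helper and the `REQUIRED_FIELDS` tuple are hypothetical and only mirror the three fields shown in the example data.

```python
# Assumed required fields, taken from the example structure in this card.
REQUIRED_FIELDS = ("impact", "reference", "alternate")


def find_unannotated(data, required=REQUIRED_FIELDS):
    """Return (sample, pathway, chromosome, gene, index, missing) per incomplete mutation."""
    problems = []
    for sample, pathways in data.items():
        for pathway, chroms in pathways.items():
            for chrom, genes in chroms.items():
                for gene, mutations in genes.items():
                    for i, mutation in enumerate(mutations):
                        # A field counts as missing if absent or empty.
                        missing = [f for f in required if not mutation.get(f)]
                        if missing:
                            problems.append((sample, pathway, chrom, gene, i, missing))
    return problems


candidate = {
    "sample1": {"pathway1": {"chr1": {"gene1": [
        {"impact": "HIGH", "reference": "A", "alternate": "T"},
        {"reference": "G", "alternate": "C"},  # no impact annotation
    ]}}}
}
issues = find_unannotated(candidate)
print(issues)
```

An empty result means every mutation carries the fields the embedder would look up.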