# Sybil Embedding Extraction Pipeline

This script extracts 512-dimensional embeddings from chest CT DICOM scans using the Sybil lung cancer risk prediction model. It is designed for **federated learning** deployments where sites need to generate embeddings locally without sharing raw medical images.

## Features

- ✅ **Automatic Model Download**: Downloads the Sybil model from HuggingFace automatically
- ✅ **Multi-GPU Support**: Processes scans in parallel across multiple GPUs
- ✅ **Smart Filtering**: Automatically filters out localizer/scout scans
- ✅ **PID-Based Extraction**: Extracts embeddings for specific patient cohorts
- ✅ **Checkpoint System**: Saves progress every N scans to prevent data loss
- ✅ **Timepoint Detection**: Automatically detects T0, T1, T2... from scan dates
- ✅ **Directory Caching**: Caches directory scans for 100x faster reruns

## Quick Start

### Installation

```bash
# Install required packages
pip install huggingface_hub torch numpy pandas pydicom
```

### Basic Usage

```bash
# Extract embeddings from all scans
python extract-embeddings.py \
    --root-dir /path/to/NLST/data \
    --output-dir embeddings_output
```

### Extract Specific Patient Cohort

```bash
# Extract only patients listed in a CSV file
python extract-embeddings.py \
    --root-dir /path/to/NLST/data \
    --pid-csv subsets/train_pids.csv \
    --output-dir embeddings_train
```
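### Creating a PID CSV

The file passed to `--pid-csv` only needs a `pid` column. Below is a minimal sketch for producing train/test PID files; the source file `all_patients.csv` and the 80/20 split are illustrative assumptions, not part of the pipeline.

```python
import pandas as pd

# Hypothetical input: any table with one patient ID per row in a "pid" column
patient_ids = pd.read_csv('all_patients.csv')['pid']

# Illustrative 80/20 split; only the "pid" column name matters to --pid-csv
train = patient_ids.sample(frac=0.8, random_state=42)
test = patient_ids.drop(train.index)

train.to_frame(name='pid').to_csv('train_pids.csv', index=False)
test.to_frame(name='pid').to_csv('test_pids.csv', index=False)
```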
## Command Line Arguments

### Required

- `--root-dir`: Root directory containing DICOM files (e.g., `/data/NLST`)

### Optional - Data Selection

- `--pid-csv`: CSV file with a "pid" column to filter specific patients
- `--max-subjects`: Limit to N subjects (useful for testing)
- `--output-dir`: Output directory (default: `embeddings_output`)

### Optional - Performance Tuning

- `--num-gpus`: Number of GPUs to use (default: 1)
- `--num-parallel`: Process N scans simultaneously (default: 1, recommend 1-4)
- `--num-workers`: Parallel workers for directory scanning (default: 4, recommend 4-12)
- `--checkpoint-interval`: Save a checkpoint every N scans (default: 1000)

## Expected Directory Structure

Your DICOM data should follow this structure:

```
/path/to/NLST/
├── NLST/
│   ├── <pid>/
│   │   ├── MM-DD-YYYY-NLST-LSS-<...>/
│   │   │   ├── <series>/
│   │   │   │   ├── *.dcm
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── ...
│   ├── <pid>/
│   └── ...
```

## Output Format

### Embeddings File: `all_embeddings.parquet`

Parquet file with the following columns:

- `case_number`: Patient ID (PID)
- `subject_id`: Same as `case_number`
- `scan_id`: Unique scan identifier
- `timepoint`: T0, T1, T2... (year-based, e.g., 1999→T0, 2000→T1)
- `dicom_directory`: Full path to the scan directory
- `num_dicom_files`: Number of DICOM slices
- `embedding_index`: Index into the embedding array
- `embedding`: 512-dimensional embedding array

### Metadata File: `dataset_metadata.json`

Complete metadata including:

- Dataset info (total scans, embedding dimensions)
- Model info (Sybil ensemble, extraction layer)
- Per-scan metadata (paths, statistics)
- Failed scans with error messages

## Performance Tips

### For Large Datasets (>10K scans)

```bash
# Use cached directory list and multi-GPU processing
python extract-embeddings.py \
    --root-dir /data/NLST \
    --num-gpus 4 \
    --num-parallel 4 \
    --num-workers 12 \
    --checkpoint-interval 500
```

**Memory Requirements**: ~10GB VRAM per parallel scan

- `--num-parallel 1`: Safe for 16GB GPUs
- `--num-parallel 2`: Safe for 24GB GPUs
- `--num-parallel 4`: Requires 40GB+ GPUs

### For Subset Extraction (Train/Test Split)

```bash
# Extract training set
python extract-embeddings.py \
    --root-dir /data/NLST \
    --pid-csv train_pids.csv \
    --output-dir embeddings_train \
    --num-workers 12

# Extract test set
python extract-embeddings.py \
    --root-dir /data/NLST \
    --pid-csv test_pids.csv \
    --output-dir embeddings_test \
    --num-workers 12
```

**Speed**: With PID filtering, scanning 100K subjects for 100 PIDs takes ~5 seconds (a 100x speedup).

## Loading Embeddings for Training

```python
import pandas as pd
import numpy as np

# Load embeddings
df = pd.read_parquet('embeddings_output/all_embeddings.parquet')

# Extract embedding array
embeddings = np.stack(df['embedding'].values)  # Shape: (num_scans, 512)

# Access metadata
pids = df['case_number'].values
timepoints = df['timepoint'].values
```

## Troubleshooting

### Out of Memory (OOM) Errors

- Reduce `--num-parallel` to 1 or 2
- Use fewer GPUs with `--num-gpus 1`

### Slow Directory Scanning

- Increase `--num-workers` (try 8-12 for fast storage)
- Use `--pid-csv` to filter early (100x speedup)
- Reruns use the cached directory list automatically

### Missing Timepoints

- Timepoints are extracted from the year in the scan path (1999→T0, 2000→T1)
- If `timepoint` is None, no year pattern was found in the path
- You can manually map scans to timepoints using the `dicom_directory` column, as in the sketch below
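The following is a minimal backfill sketch, not part of the script: the regex, `BASE_YEAR`, and the `timepoint_from_path` helper are assumptions keyed to the 1999→T0 convention above, so adjust them to your data.

```python
import re
import pandas as pd

df = pd.read_parquet('embeddings_output/all_embeddings.parquet')

BASE_YEAR = 1999  # assumption: T0 corresponds to 1999

def timepoint_from_path(path):
    """Return 'T<n>' from the first 4-digit year in the path, or None."""
    match = re.search(r'\b(19|20)\d{2}\b', str(path))
    return f'T{int(match.group(0)) - BASE_YEAR}' if match else None

# Fill in only the rows where the pipeline could not detect a timepoint
missing = df['timepoint'].isna()
df.loc[missing, 'timepoint'] = df.loc[missing, 'dicom_directory'].map(timepoint_from_path)
```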
### Failed Scans

- Check `dataset_metadata.json` for the `failed_scans` section
- Common causes: corrupted DICOM files, insufficient slices, invalid metadata

## Federated Learning Integration

This script is designed for **privacy-preserving federated learning**:

1. **Each site runs extraction locally** on its DICOM data
2. **Embeddings are saved** (not raw DICOM images)
3. **Sites share embeddings** with the federated learning system
4. **The central server trains a model** on the embeddings without accessing raw data

### Workflow for Sites

```bash
# 1. Download the extraction script
wget https://huggingface.co/Lab-Rasool/sybil/resolve/main/extract-embeddings.py

# 2. Extract embeddings for train/test splits
python extract-embeddings.py --root-dir /local/NLST --pid-csv train_pids.csv --output-dir train_embeddings
python extract-embeddings.py --root-dir /local/NLST --pid-csv test_pids.csv --output-dir test_embeddings

# 3. Share embeddings with the federated learning system
#    (embeddings are much smaller and preserve privacy better than raw DICOM)
```

## Citation

If you use this extraction pipeline, please cite the Sybil model:

```bibtex
@article{sybil2023,
  title={A Deep Learning Model to Predict Lung Cancer Risk from Chest CT Scans},
  author={...},
  journal={...},
  year={2023}
}
```

## Support

For issues or questions:

- Model issues: https://huggingface.co/Lab-Rasool/sybil
- Federated learning: Contact your FL system administrator