# Sybil Embedding Extraction Pipeline
This script extracts 512-dimensional embeddings from chest CT DICOM scans using the Sybil lung cancer risk prediction model. It's designed for **federated learning** deployments where sites need to generate embeddings locally without sharing raw medical images.
## Features
- ✅ **Automatic Model Download**: Downloads the Sybil model from HuggingFace automatically (see the sketch after this list)
- ✅ **Multi-GPU Support**: Process scans in parallel across multiple GPUs
- ✅ **Smart Filtering**: Automatically filters out localizer/scout scans
- ✅ **PID-Based Extraction**: Extract embeddings for specific patient cohorts
- ✅ **Checkpoint System**: Save progress every N scans to prevent data loss
- ✅ **Timepoint Detection**: Automatically detects T0, T1, T2... from scan dates
- ✅ **Directory Caching**: Cache directory scans for 100x faster reruns
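The script handles the model download itself. For reference only, a manual download with `huggingface_hub` might look like the sketch below; the repo id `Lab-Rasool/sybil` is taken from the links in this README, and the exact file layout inside the repo is an assumption.
```python
# Illustrative only: the extraction script downloads the model automatically.
from huggingface_hub import snapshot_download

# Cache all files from the Lab-Rasool/sybil repo locally.
local_dir = snapshot_download(repo_id="Lab-Rasool/sybil")
print(f"Model files cached at: {local_dir}")
```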
## Quick Start
### Installation
```bash
# Install required packages
pip install huggingface_hub torch numpy pandas pydicom
```
### Basic Usage
```bash
# Extract embeddings from all scans
python extract-embeddings.py \
    --root-dir /path/to/NLST/data \
    --output-dir embeddings_output
```
### Extract Specific Patient Cohort
```bash
# Extract only patients listed in a CSV file
python extract-embeddings.py \
    --root-dir /path/to/NLST/data \
    --pid-csv subsets/train_pids.csv \
    --output-dir embeddings_train
```
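The PID file only needs a `pid` column (see the argument reference below). A minimal way to create one with pandas, assuming you already have a list of patient IDs:
```python
import pandas as pd

# Hypothetical patient IDs; replace with your own cohort.
train_pids = ["100012", "100045", "100078"]
pd.DataFrame({"pid": train_pids}).to_csv("train_pids.csv", index=False)
```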
## Command Line Arguments
### Required
- `--root-dir`: Root directory containing DICOM files (e.g., `/data/NLST`)
### Optional - Data Selection
- `--pid-csv`: CSV file with a "pid" column to filter specific patients
- `--max-subjects`: Limit to N subjects (useful for testing)
- `--output-dir`: Output directory (default: `embeddings_output`)
### Optional - Performance Tuning
- `--num-gpus`: Number of GPUs to use (default: 1)
- `--num-parallel`: Number of scans to process simultaneously (default: 1, recommended: 1-4)
- `--num-workers`: Parallel workers for directory scanning (default: 4, recommended: 4-12)
- `--checkpoint-interval`: Save a checkpoint every N scans (default: 1000)
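The checkpoint behavior controlled by `--checkpoint-interval` follows the usual periodic-save pattern. The sketch below illustrates that pattern only; it is not the script's actual code, and the helper and file names are made up.
```python
# Illustrative checkpointing pattern, not the script's implementation.
import numpy as np
import pandas as pd

def extract_embedding(scan_id):
    """Stand-in for the real per-scan embedding extraction."""
    return {"scan_id": scan_id, "embedding": np.zeros(512).tolist()}

def process_all(scan_ids, checkpoint_interval=1000):
    results = []
    for i, scan_id in enumerate(scan_ids, start=1):
        results.append(extract_embedding(scan_id))
        if i % checkpoint_interval == 0:
            # Persist partial results so an interrupted run loses at most N scans.
            pd.DataFrame(results).to_parquet(f"checkpoint_{i}.parquet")
    return results
```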
## Expected Directory Structure
Your DICOM data should follow this structure:
```
/path/to/NLST/
├── NLST/
│   ├── <PID_1>/
│   │   ├── MM-DD-YYYY-NLST-LSS-<scan_id>/
│   │   │   ├── <series_id>/
│   │   │   │   ├── *.dcm
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── ...
│   ├── <PID_2>/
│   └── ...
```
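A quick way to sanity-check that your data matches this layout is to count DICOM files per series directory. A minimal sketch, with placeholder paths you would adjust to your own storage:
```python
from pathlib import Path

root = Path("/path/to/NLST/NLST")  # the NLST/ level that contains the <PID> folders

# <PID>/<MM-DD-YYYY-...-scan_id>/<series_id>/*.dcm, per the structure above
for series_dir in sorted(root.glob("*/*/*")):
    if series_dir.is_dir():
        n_slices = len(list(series_dir.glob("*.dcm")))
        print(f"{series_dir}: {n_slices} DICOM files")
```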
## Output Format
### Embeddings File: `all_embeddings.parquet`
Parquet file with columns:
- `case_number`: Patient ID (PID)
- `subject_id`: Same as `case_number`
- `scan_id`: Unique scan identifier
- `timepoint`: T0, T1, T2... (year-based, e.g., 1999→T0, 2000→T1)
- `dicom_directory`: Full path to scan directory
- `num_dicom_files`: Number of DICOM slices
- `embedding_index`: Index in embedding array
- `embedding`: 512-dimensional embedding array
### Metadata File: `dataset_metadata.json`
Complete metadata including:
- Dataset info (total scans, embedding dimensions)
- Model info (Sybil ensemble, extraction layer)
- Per-scan metadata (paths, statistics)
- Failed scans with error messages
## Performance Tips
### For Large Datasets (>10K scans)
```bash
# Use cached directory list and multi-GPU processing
python extract-embeddings.py \
    --root-dir /data/NLST \
    --num-gpus 4 \
    --num-parallel 4 \
    --num-workers 12 \
    --checkpoint-interval 500
```
**Memory Requirements**: ~10GB VRAM per parallel scan
- `--num-parallel 1`: Safe for 16GB GPUs
- `--num-parallel 2`: Safe for 24GB GPUs
- `--num-parallel 4`: Requires 40GB+ GPUs
### For Subset Extraction (Train/Test Split)
```bash
# Extract training set
python extract-embeddings.py \
    --root-dir /data/NLST \
    --pid-csv train_pids.csv \
    --output-dir embeddings_train \
    --num-workers 12

# Extract test set
python extract-embeddings.py \
    --root-dir /data/NLST \
    --pid-csv test_pids.csv \
    --output-dir embeddings_test \
    --num-workers 12
```
**Speed**: With PID filtering, scanning 100K subjects for 100 PIDs takes ~5 seconds (a 100x speedup).
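Conceptually, the speedup comes from matching the top-level `<PID>` directories against the CSV before any deep scanning. The sketch below illustrates that idea; it is not the script's actual code, and the paths are placeholders.
```python
import pandas as pd
from pathlib import Path

root = Path("/data/NLST/NLST")  # level containing one directory per PID
wanted = set(pd.read_csv("train_pids.csv")["pid"].astype(str))

# Only PID directories listed in the CSV are scanned further; the rest are skipped.
selected = [d for d in root.iterdir() if d.is_dir() and d.name in wanted]
print(f"Scanning {len(selected)} of {len(list(root.iterdir()))} subject directories")
```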
## Loading Embeddings for Training
```python
import pandas as pd
import numpy as np

# Load embeddings
df = pd.read_parquet('embeddings_output/all_embeddings.parquet')

# Extract embedding array
embeddings = np.stack(df['embedding'].values)  # Shape: (num_scans, 512)

# Access metadata
pids = df['case_number'].values
timepoints = df['timepoint'].values
```
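Building on the loading snippet above, the embeddings can be wrapped in a standard PyTorch dataset for downstream training. A minimal sketch; the labels here are placeholders, since outcome labels come from your own site-specific data:
```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader

df = pd.read_parquet("embeddings_output/all_embeddings.parquet")
embeddings = np.stack(df["embedding"].values).astype(np.float32)  # (num_scans, 512)

# Placeholder labels: replace with outcomes joined on case_number / timepoint.
labels = np.zeros(len(df), dtype=np.float32)

loader = DataLoader(
    TensorDataset(torch.from_numpy(embeddings), torch.from_numpy(labels)),
    batch_size=64,
    shuffle=True,
)
```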
## Troubleshooting
### Out of Memory (OOM) Errors
- Reduce `--num-parallel` to 1 or 2
- Use fewer GPUs with `--num-gpus 1`
### Slow Directory Scanning
- Increase `--num-workers` (try 8-12 for fast storage)
- Use `--pid-csv` to filter early (100x speedup)
- Reruns automatically reuse the cached directory list
### Missing Timepoints
- Timepoints are extracted from the year in the scan path (1999→T0, 2000→T1)
- If `timepoint` is None, no year pattern was found in the path
- You can manually map scans to timepoints using the `dicom_directory` column, as in the sketch below
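One way to recover missing timepoints is to re-derive the year from `dicom_directory` yourself. This sketch assumes the same 1999→T0 convention described above and that a four-digit year appears somewhere in the path:
```python
import re
import pandas as pd

df = pd.read_parquet("embeddings_output/all_embeddings.parquet")

def timepoint_from_path(path, base_year=1999):
    # Assumes the NLST convention above: 1999 -> T0, 2000 -> T1, ...
    match = re.search(r"(19|20)\d{2}", str(path))
    return f"T{int(match.group(0)) - base_year}" if match else None

# Fill only the rows where the script did not detect a timepoint.
df["timepoint"] = df["timepoint"].fillna(df["dicom_directory"].map(timepoint_from_path))
```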
### Failed Scans
- Check `dataset_metadata.json` for the `failed_scans` section
- Common causes: corrupted DICOM files, insufficient slices, invalid metadata
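To review failures programmatically, read the `failed_scans` section of the metadata file directly. The exact shape of each entry may differ, so inspect your own file before relying on specific field names:
```python
import json

with open("embeddings_output/dataset_metadata.json") as f:
    metadata = json.load(f)

# Print each recorded failure; field names inside entries are not assumed here.
for failure in metadata.get("failed_scans", []):
    print(failure)
```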
## Federated Learning Integration
This script is designed for **privacy-preserving federated learning**:
1. **Each site runs extraction locally** on its DICOM data
2. **Embeddings are saved** (not raw DICOM images)
3. **Sites share embeddings** with the federated learning system
4. **Central server trains the model** on embeddings without accessing raw data
### Workflow for Sites
```bash
# 1. Download extraction script
wget https://huggingface.co/Lab-Rasool/sybil/resolve/main/extract-embeddings.py

# 2. Extract embeddings for train/test splits
python extract-embeddings.py --root-dir /local/NLST --pid-csv train_pids.csv --output-dir train_embeddings
python extract-embeddings.py --root-dir /local/NLST --pid-csv test_pids.csv --output-dir test_embeddings

# 3. Share embeddings with the federated learning system
# (embeddings are much smaller and preserve privacy better than raw DICOM)
```
## Citation
If you use this extraction pipeline, please cite the Sybil model:
```bibtex
@article{sybil2023,
  title={A Deep Learning Model to Predict Lung Cancer Risk from Chest CT Scans},
  author={...},
  journal={...},
  year={2023}
}
```
## Support
For issues or questions:
- Model issues: https://huggingface.co/Lab-Rasool/sybil
- Federated learning: Contact your FL system administrator