Sybil Embedding Extraction Pipeline
This script extracts 512-dimensional embeddings from chest CT DICOM scans using the Sybil lung cancer risk prediction model. It's designed for federated learning deployments where sites need to generate embeddings locally without sharing raw medical images.
Features
- Automatic Model Download: Downloads Sybil model from HuggingFace automatically
- Multi-GPU Support: Process scans in parallel across multiple GPUs
- Smart Filtering: Automatically filters out localizer/scout scans
- PID-Based Extraction: Extract embeddings for specific patient cohorts
- Checkpoint System: Save progress every N scans to prevent data loss
- Timepoint Detection: Automatically detects T0, T1, T2... from scan dates
- Directory Caching: Cache directory scans for 100x faster reruns
Quick Start
Installation
# Install required packages
pip install huggingface_hub torch numpy pandas pydicom
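After installing, an optional sanity check like this sketch confirms that the dependencies import and that PyTorch can see a GPU:

```python
# Optional sanity check: confirm the dependencies import and a GPU is visible.
import numpy, pandas, pydicom, torch
import huggingface_hub

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("GPUs visible:", torch.cuda.device_count())
print("pydicom", pydicom.__version__, "| huggingface_hub", huggingface_hub.__version__)
```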
Basic Usage
# Extract embeddings from all scans
python extract-embeddings.py \
--root-dir /path/to/NLST/data \
--output-dir embeddings_output
Extract Specific Patient Cohort
# Extract only patients listed in a CSV file
python extract-embeddings.py \
--root-dir /path/to/NLST/data \
--pid-csv subsets/train_pids.csv \
--output-dir embeddings_train
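The cohort CSV only needs a "pid" column (see `--pid-csv` below). A quick sketch to validate a cohort file before launching a long extraction, using the hypothetical path from the example above:

```python
import pandas as pd

# Hypothetical cohort file; the script only requires a "pid" column.
pids = pd.read_csv("subsets/train_pids.csv")
assert "pid" in pids.columns, "CSV must contain a 'pid' column"
print(f"{pids['pid'].nunique()} unique patient IDs")
```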
Command Line Arguments
Required
- `--root-dir`: Root directory containing DICOM files (e.g., `/data/NLST`)
Optional - Data Selection
- `--pid-csv`: CSV file with a "pid" column to filter specific patients
- `--max-subjects`: Limit to N subjects (useful for testing)
- `--output-dir`: Output directory (default: `embeddings_output`)
Optional - Performance Tuning
- `--num-gpus`: Number of GPUs to use (default: 1)
- `--num-parallel`: Process N scans simultaneously (default: 1, recommend 1-4)
- `--num-workers`: Parallel workers for directory scanning (default: 4, recommend 4-12)
- `--checkpoint-interval`: Save checkpoint every N scans (default: 1000)
Expected Directory Structure
Your DICOM data should follow this structure:
/path/to/NLST/
├── NLST/
│   ├── <PID_1>/
│   │   ├── MM-DD-YYYY-NLST-LSS-<scan_id>/
│   │   │   ├── <series_id>/
│   │   │   │   ├── *.dcm
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── ...
│   ├── <PID_2>/
│   └── ...
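To confirm your data matches this layout before a long run, a small sketch like the following (its path and depth assumptions mirror the tree above) walks the first few patient directories:

```python
from pathlib import Path

# Assumes the layout shown above: <root>/NLST/<PID>/<scan dir>/<series dir>/*.dcm
root = Path("/path/to/NLST") / "NLST"

for pid_dir in sorted(p for p in root.iterdir() if p.is_dir())[:5]:
    scan_dirs = [d for d in pid_dir.iterdir() if d.is_dir()]
    n_dcm = sum(1 for _ in pid_dir.rglob("*.dcm"))
    print(f"{pid_dir.name}: {len(scan_dirs)} scan dirs, {n_dcm} DICOM files")
```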
Output Format
Embeddings File: all_embeddings.parquet
Parquet file with columns:
- `case_number`: Patient ID (PID)
- `subject_id`: Same as `case_number`
- `scan_id`: Unique scan identifier
- `timepoint`: T0, T1, T2... (year-based, e.g., 1999 → T0, 2000 → T1)
- `dicom_directory`: Full path to scan directory
- `num_dicom_files`: Number of DICOM slices
- `embedding_index`: Index in embedding array
- `embedding`: 512-dimensional embedding array
Metadata File: dataset_metadata.json
Complete metadata including:
- Dataset info (total scans, embedding dimensions)
- Model info (Sybil ensemble, extraction layer)
- Per-scan metadata (paths, statistics)
- Failed scans with error messages
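A short sketch for inspecting the metadata after a run; the exact key names are assumptions based on the description above, so adjust them to match your file:

```python
import json

with open("embeddings_output/dataset_metadata.json") as f:
    meta = json.load(f)

# Key names below are assumed from the description above; adjust if they differ.
print("Top-level keys:", list(meta.keys()))
for failure in meta.get("failed_scans", []):
    print("FAILED:", failure)
```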
Performance Tips
For Large Datasets (>10K scans)
# Use cached directory list and multi-GPU processing
python extract-embeddings.py \
--root-dir /data/NLST \
--num-gpus 4 \
--num-parallel 4 \
--num-workers 12 \
--checkpoint-interval 500
Memory Requirements: ~10GB VRAM per parallel scan
- `--num-parallel 1`: Safe for 16GB GPUs
- `--num-parallel 2`: Safe for 24GB GPUs
- `--num-parallel 4`: Requires 40GB+ GPUs
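To pick `--num-parallel` for your hardware, you can apply the ~10GB-per-scan rule of thumb above to the VRAM PyTorch reports. A rough sketch (not a guarantee against OOM):

```python
import torch

VRAM_PER_SCAN_GB = 10  # rule of thumb from the memory requirements above

for i in range(torch.cuda.device_count()):
    total_gb = torch.cuda.get_device_properties(i).total_memory / 1024**3
    suggestion = max(1, int(total_gb // VRAM_PER_SCAN_GB))
    print(f"GPU {i}: {total_gb:.0f} GB VRAM -> try --num-parallel {suggestion}")
```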
For Subset Extraction (Train/Test Split)
# Extract training set
python extract-embeddings.py \
--root-dir /data/NLST \
--pid-csv train_pids.csv \
--output-dir embeddings_train \
--num-workers 12
# Extract test set
python extract-embeddings.py \
--root-dir /data/NLST \
--pid-csv test_pids.csv \
--output-dir embeddings_test \
--num-workers 12
Speed: With PID filtering, scanning 100K subjects for 100 PIDs takes ~5 seconds (100x speedup)
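The train/test CSVs can come from any source. As one sketch, assuming a hypothetical `all_pids.csv` that lists every patient, an 80/20 random split could be written as:

```python
import pandas as pd

# Hypothetical manifest with one "pid" per patient; replace with your own cohort list.
all_pids = pd.read_csv("all_pids.csv")["pid"].drop_duplicates()

train = all_pids.sample(frac=0.8, random_state=42)
test = all_pids[~all_pids.isin(train)]

train.to_frame().to_csv("train_pids.csv", index=False)
test.to_frame().to_csv("test_pids.csv", index=False)
print(f"{len(train)} train PIDs, {len(test)} test PIDs")
```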
Loading Embeddings for Training
import pandas as pd
import numpy as np
# Load embeddings
df = pd.read_parquet('embeddings_output/all_embeddings.parquet')
# Extract embedding array
embeddings = np.stack(df['embedding'].values) # Shape: (num_scans, 512)
# Access metadata
pids = df['case_number'].values
timepoints = df['timepoint'].values
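If you need scans organized longitudinally rather than as a flat matrix, one option (a sketch built only on the columns listed above) is to group embeddings by patient and timepoint:

```python
import numpy as np
import pandas as pd

df = pd.read_parquet('embeddings_output/all_embeddings.parquet')

# Map each patient to {timepoint: 512-d embedding}, e.g. {'T0': array([...]), 'T1': ...}
by_patient = {
    pid: {row.timepoint: np.asarray(row.embedding) for row in group.itertuples()}
    for pid, group in df.groupby('case_number')
}
print(f"{len(by_patient)} patients loaded")
```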
Troubleshooting
Out of Memory (OOM) Errors
- Reduce `--num-parallel` to 1 or 2
- Use fewer GPUs with `--num-gpus 1`
Slow Directory Scanning
- Increase `--num-workers` (try 8-12 for fast storage)
- Use `--pid-csv` to filter early (100x speedup)
- Reruns will use the cached directory list automatically
Missing Timepoints
- Timepoints are extracted from the year in the scan path (1999 → T0, 2000 → T1)
- If `timepoint` is None, the year pattern wasn't found in the path
- You can manually map scans to timepoints using the `dicom_directory` column
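For example, a manual mapping could parse the year from the path. This sketch assumes the MM-DD-YYYY naming shown in the directory structure and 1999 as T0:

```python
import re
import pandas as pd

df = pd.read_parquet('embeddings_output/all_embeddings.parquet')

def timepoint_from_path(path, base_year=1999):
    # Assumes the MM-DD-YYYY-NLST-LSS-<scan_id> naming shown earlier; 1999 -> T0.
    match = re.search(r'\d{2}-\d{2}-(\d{4})', str(path))
    return f"T{int(match.group(1)) - base_year}" if match else None

missing = df['timepoint'].isna()
df.loc[missing, 'timepoint'] = df.loc[missing, 'dicom_directory'].map(timepoint_from_path)
```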
Failed Scans
- Check `dataset_metadata.json` for the `failed_scans` section
- Common causes: corrupted DICOM files, insufficient slices, invalid metadata
Federated Learning Integration
This script is designed for privacy-preserving federated learning:
- Each site runs extraction locally on their DICOM data
- Embeddings are saved (not raw DICOM images)
- Sites share embeddings with federated learning system
- Central server trains model on embeddings without accessing raw data
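What the central trainer looks like depends on your FL framework. As a generic, hypothetical sketch of the central training step above, the shared parquet files load directly into a PyTorch dataset (labels are not part of the extraction output, so the placeholder below must be replaced by your own outcome table):

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset

df = pd.read_parquet('train_embeddings/all_embeddings.parquet')
features = torch.tensor(np.stack(df['embedding'].values), dtype=torch.float32)

# Placeholder labels: join your own outcome table on case_number/timepoint instead.
labels = torch.zeros(len(df), dtype=torch.long)

loader = DataLoader(TensorDataset(features, labels), batch_size=64, shuffle=True)
```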
Workflow for Sites
# 1. Download extraction script
wget https://huggingface.co/Lab-Rasool/sybil/resolve/main/extract-embeddings.py
# 2. Extract embeddings for train/test splits
python extract-embeddings.py --root-dir /local/NLST --pid-csv train_pids.csv --output-dir train_embeddings
python extract-embeddings.py --root-dir /local/NLST --pid-csv test_pids.csv --output-dir test_embeddings
# 3. Share embeddings with federated learning system
# (embeddings are much smaller and preserve privacy better than raw DICOM)
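To quantify the size difference, a small sketch (reusing the paths from the commands above) compares the raw DICOM tree with the embedding outputs:

```python
from pathlib import Path

def dir_size_gb(path):
    # Sum file sizes under a directory, in GB.
    return sum(p.stat().st_size for p in Path(path).rglob('*') if p.is_file()) / 1024**3

raw = dir_size_gb('/local/NLST')
emb = dir_size_gb('train_embeddings') + dir_size_gb('test_embeddings')
print(f"Raw DICOM: {raw:.1f} GB | embeddings: {emb:.2f} GB")
```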
Citation
If you use this extraction pipeline, please cite the Sybil model:
@article{sybil2023,
title={A Deep Learning Model to Predict Lung Cancer Risk from Chest CT Scans},
author={...},
journal={...},
year={2023}
}
Support
For issues or questions:
- Model issues: https://huggingface.co/Lab-Rasool/sybil
- Federated learning: Contact your FL system administrator