# Sybil Embedding Extraction Pipeline
This script extracts 512-dimensional embeddings from chest CT DICOM scans using the Sybil lung cancer risk prediction model. It's designed for **federated learning** deployments where sites need to generate embeddings locally without sharing raw medical images.
## Features
- **Automatic Model Download**: Downloads the Sybil model from HuggingFace automatically
- **Multi-GPU Support**: Process scans in parallel across multiple GPUs
- **Smart Filtering**: Automatically filters out localizer/scout scans
- **PID-Based Extraction**: Extract embeddings for specific patient cohorts
- **Checkpoint System**: Save progress every N scans to prevent data loss
- **Timepoint Detection**: Automatically detects T0, T1, T2... from scan dates
- **Directory Caching**: Cache directory scans for 100x faster reruns
## Quick Start
### Installation
```bash
# Install required packages
pip install huggingface_hub torch numpy pandas pydicom
```
### Basic Usage
```bash
# Extract embeddings from all scans
python extract-embeddings.py \
  --root-dir /path/to/NLST/data \
  --output-dir embeddings_output
```
### Extract Specific Patient Cohort
```bash
# Extract only patients listed in a CSV file
python extract-embeddings.py \
  --root-dir /path/to/NLST/data \
  --pid-csv subsets/train_pids.csv \
  --output-dir embeddings_train
```
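The cohort file is a plain CSV whose only required column is `pid` (see Command Line Arguments below). A minimal sketch for creating one, using hypothetical patient IDs:

```python
import pandas as pd

# Hypothetical NLST patient IDs; replace with your actual cohort.
# The extraction script looks for a column named "pid".
pd.DataFrame({"pid": [100012, 100045, 100107]}).to_csv("train_pids.csv", index=False)
```

Pass the resulting file via `--pid-csv train_pids.csv`.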
## Command Line Arguments
### Required
- `--root-dir`: Root directory containing DICOM files (e.g., `/data/NLST`)
### Optional - Data Selection
- `--pid-csv`: CSV file with "pid" column to filter specific patients
- `--max-subjects`: Limit to N subjects (useful for testing)
- `--output-dir`: Output directory (default: `embeddings_output`)
### Optional - Performance Tuning
- `--num-gpus`: Number of GPUs to use (default: 1)
- `--num-parallel`: Process N scans simultaneously (default: 1, recommend 1-4)
- `--num-workers`: Parallel workers for directory scanning (default: 4, recommend 4-12)
- `--checkpoint-interval`: Save checkpoint every N scans (default: 1000)
## Expected Directory Structure
Your DICOM data should follow this structure:
```
/path/to/NLST/
├── NLST/
│   ├── <PID_1>/
│   │   ├── MM-DD-YYYY-NLST-LSS-<scan_id>/
│   │   │   ├── <series_id>/
│   │   │   │   ├── *.dcm
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── ...
│   ├── <PID_2>/
│   └── ...
```
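Before a long extraction run it can be worth sanity-checking that your data matches this layout. A small sketch (the `summarize_dicom_tree` helper is hypothetical, not part of the extraction script) that counts `.dcm` files per series directory:

```python
from pathlib import Path

def summarize_dicom_tree(root: str) -> dict:
    """Count .dcm files per series directory under root, mirroring the
    expected <PID>/<scan>/<series>/*.dcm layout. Hypothetical helper for
    sanity-checking data before a run."""
    counts = {}
    for dcm in Path(root).rglob("*.dcm"):
        series_dir = str(dcm.parent)
        counts[series_dir] = counts.get(series_dir, 0) + 1
    return counts

# Example: print series with very few slices (likely localizers or broken scans)
# for series, n in summarize_dicom_tree("/path/to/NLST").items():
#     if n < 20:
#         print(series, n)
```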
## Output Format
### Embeddings File: `all_embeddings.parquet`
Parquet file with columns:
- `case_number`: Patient ID (PID)
- `subject_id`: Same as case_number
- `scan_id`: Unique scan identifier
- `timepoint`: T0, T1, T2... (year-based, e.g., 1999→T0, 2000→T1)
- `dicom_directory`: Full path to scan directory
- `num_dicom_files`: Number of DICOM slices
- `embedding_index`: Index in embedding array
- `embedding`: 512-dimensional embedding array
### Metadata File: `dataset_metadata.json`
Complete metadata including:
- Dataset info (total scans, embedding dimensions)
- Model info (Sybil ensemble, extraction layer)
- Per-scan metadata (paths, statistics)
- Failed scans with error messages
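Since the metadata is plain JSON, failed scans can be pulled out programmatically. A minimal sketch, assuming `failed_scans` is a top-level list of records (the exact key names inside each record depend on the schema your run produces):

```python
import json

def report_failures(metadata_path: str) -> list:
    """Return the failed-scan records from dataset_metadata.json.
    Assumes a top-level "failed_scans" list; adjust to the actual
    schema your run produces."""
    with open(metadata_path) as f:
        meta = json.load(f)
    return meta.get("failed_scans", [])
```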
## Performance Tips
### For Large Datasets (>10K scans)
```bash
# Use cached directory list and multi-GPU processing
python extract-embeddings.py \
  --root-dir /data/NLST \
  --num-gpus 4 \
  --num-parallel 4 \
  --num-workers 12 \
  --checkpoint-interval 500
```
**Memory Requirements**: ~10GB VRAM per parallel scan
- `--num-parallel 1`: Safe for 16GB GPUs
- `--num-parallel 2`: Safe for 24GB GPUs
- `--num-parallel 4`: Requires 40GB+ GPUs
### For Subset Extraction (Train/Test Split)
```bash
# Extract training set
# Extract training set
python extract-embeddings.py \
  --root-dir /data/NLST \
  --pid-csv train_pids.csv \
  --output-dir embeddings_train \
  --num-workers 12

# Extract test set
python extract-embeddings.py \
  --root-dir /data/NLST \
  --pid-csv test_pids.csv \
  --output-dir embeddings_test \
  --num-workers 12
```
**Speed**: With PID filtering, scanning 100K subjects for 100 PIDs takes ~5 seconds (100x speedup)
## Loading Embeddings for Training
```python
import pandas as pd
import numpy as np

# Load embeddings
df = pd.read_parquet('embeddings_output/all_embeddings.parquet')

# Extract embedding array
embeddings = np.stack(df['embedding'].values)  # Shape: (num_scans, 512)

# Access metadata
pids = df['case_number'].values
timepoints = df['timepoint'].values
```
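For longitudinal models you may also want embeddings grouped per patient and ordered by timepoint. A minimal sketch under the T0/T1/... convention above (`patient_sequences` is a hypothetical helper, not part of the pipeline):

```python
import numpy as np
import pandas as pd

def patient_sequences(df: pd.DataFrame) -> dict:
    """Group embeddings per patient, ordered by timepoint.
    Returns {pid: array of shape (num_timepoints, embedding_dim)}.
    Lexicographic sort on "T0".."T9" matches numeric order for
    single-digit timepoints, which covers NLST's T0-T2 range."""
    out = {}
    for pid, group in df.groupby("case_number"):
        group = group.sort_values("timepoint")
        out[pid] = np.stack(group["embedding"].to_numpy())
    return out
```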
## Troubleshooting
### Out of Memory (OOM) Errors
- Reduce `--num-parallel` to 1 or 2
- Use fewer GPUs with `--num-gpus 1`
### Slow Directory Scanning
- Increase `--num-workers` (try 8-12 for fast storage)
- Use `--pid-csv` to filter early (100x speedup)
- Reruns automatically reuse the cached directory list
### Missing Timepoints
- Timepoints are extracted from year in scan path (1999→T0, 2000→T1)
- If `timepoint` is None, year pattern wasn't found in path
- You can manually map scans to timepoints using `dicom_directory` column
### Failed Scans
- Check `dataset_metadata.json` for `failed_scans` section
- Common causes: corrupted DICOM files, insufficient slices, invalid metadata
## Federated Learning Integration
This script is designed for **privacy-preserving federated learning**:
1. **Each site runs extraction locally** on their DICOM data
2. **Embeddings are saved** (not raw DICOM images)
3. **Sites share embeddings** with federated learning system
4. **Central server trains model** on embeddings without accessing raw data
### Workflow for Sites
```bash
# 1. Download extraction script
wget https://huggingface.co/Lab-Rasool/sybil/resolve/main/extract-embeddings.py

# 2. Extract embeddings for train/test splits
python extract-embeddings.py --root-dir /local/NLST --pid-csv train_pids.csv --output-dir train_embeddings
python extract-embeddings.py --root-dir /local/NLST --pid-csv test_pids.csv --output-dir test_embeddings

# 3. Share embeddings with federated learning system
# (embeddings are much smaller and preserve privacy better than raw DICOM)
```
## Citation
If you use this extraction pipeline, please cite the Sybil model:
```bibtex
@article{sybil2023,
  title={A Deep Learning Model to Predict Lung Cancer Risk from Chest CT Scans},
  author={...},
  journal={...},
  year={2023}
}
```
## Support
For issues or questions:
- Model issues: https://huggingface.co/Lab-Rasool/sybil
- Federated learning: Contact your FL system administrator