Sybil Embedding Extraction Pipeline
This script extracts 512-dimensional embeddings from chest CT DICOM scans using the Sybil lung cancer risk prediction model. It's designed for federated learning deployments where sites need to generate embeddings locally without sharing raw medical images.
Features
- Automatic Model Download: Downloads Sybil model from HuggingFace automatically
- Multi-GPU Support: Process scans in parallel across multiple GPUs
- Smart Filtering: Automatically filters out localizer/scout scans
- PID-Based Extraction: Extract embeddings for specific patient cohorts
- Checkpoint System: Save progress every N scans to prevent data loss
- Timepoint Detection: Automatically detects T0, T1, T2... from scan dates
- Directory Caching: Cache directory scans for 100x faster reruns
Quick Start
Installation
# Install required packages
pip install huggingface_hub torch numpy pandas pydicom
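After installing, an optional sanity check like this sketch confirms that the dependencies import and that PyTorch can see a GPU:

```python
# Optional sanity check: confirm the dependencies import and a GPU is visible.
import numpy, pandas, pydicom, torch
import huggingface_hub

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("GPUs visible:", torch.cuda.device_count())
print("pydicom", pydicom.__version__, "| huggingface_hub", huggingface_hub.__version__)
```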
Basic Usage
# Extract embeddings from all scans
python extract-embeddings.py \
--root-dir /path/to/NLST/data \
--output-dir embeddings_output
Extract Specific Patient Cohort
# Extract only patients listed in a CSV file
python extract-embeddings.py \
--root-dir /path/to/NLST/data \
--pid-csv subsets/train_pids.csv \
--output-dir embeddings_train
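The cohort CSV only needs a "pid" column (see `--pid-csv` below). A quick sketch to validate a cohort file before launching a long extraction, using the hypothetical path from the example above:

```python
import pandas as pd

# Hypothetical cohort file; the script only requires a "pid" column.
pids = pd.read_csv("subsets/train_pids.csv")
assert "pid" in pids.columns, "CSV must contain a 'pid' column"
print(f"{pids['pid'].nunique()} unique patient IDs")
```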
Command Line Arguments
Required
- `--root-dir`: Root directory containing DICOM files (e.g., `/data/NLST`)
Optional - Data Selection
- `--pid-csv`: CSV file with a "pid" column to filter specific patients
- `--max-subjects`: Limit to N subjects (useful for testing)
- `--output-dir`: Output directory (default: `embeddings_output`)
Optional - Performance Tuning
- `--num-gpus`: Number of GPUs to use (default: 1)
- `--num-parallel`: Process N scans simultaneously (default: 1, recommend 1-4)
- `--num-workers`: Parallel workers for directory scanning (default: 4, recommend 4-12)
- `--checkpoint-interval`: Save checkpoint every N scans (default: 1000)
Expected Directory Structure
Your DICOM data should follow this structure:
/path/to/NLST/
├── NLST/
│   ├── <PID_1>/
│   │   ├── MM-DD-YYYY-NLST-LSS-<scan_id>/
│   │   │   ├── <series_id>/
│   │   │   │   ├── *.dcm
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── ...
│   ├── <PID_2>/
│   └── ...
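To confirm your data matches this layout before a long run, a small sketch like the following (its path and depth assumptions mirror the tree above) walks the first few patient directories:

```python
from pathlib import Path

# Assumes the layout shown above: <root>/NLST/<PID>/<scan dir>/<series dir>/*.dcm
root = Path("/path/to/NLST") / "NLST"

for pid_dir in sorted(p for p in root.iterdir() if p.is_dir())[:5]:
    scan_dirs = [d for d in pid_dir.iterdir() if d.is_dir()]
    n_dcm = sum(1 for _ in pid_dir.rglob("*.dcm"))
    print(f"{pid_dir.name}: {len(scan_dirs)} scan dirs, {n_dcm} DICOM files")
```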
Output Format
Embeddings File: all_embeddings.parquet
Parquet file with columns:
- `case_number`: Patient ID (PID)
- `subject_id`: Same as `case_number`
- `scan_id`: Unique scan identifier
- `timepoint`: T0, T1, T2... (year-based, e.g., 1999 → T0, 2000 → T1)
- `dicom_directory`: Full path to scan directory
- `num_dicom_files`: Number of DICOM slices
- `embedding_index`: Index in embedding array
- `embedding`: 512-dimensional embedding array
Metadata File: dataset_metadata.json
Complete metadata including:
- Dataset info (total scans, embedding dimensions)
- Model info (Sybil ensemble, extraction layer)
- Per-scan metadata (paths, statistics)
- Failed scans with error messages
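A short sketch for inspecting the metadata after a run; the exact key names are assumptions based on the description above, so adjust them to match your file:

```python
import json

with open("embeddings_output/dataset_metadata.json") as f:
    meta = json.load(f)

# Key names below are assumed from the description above; adjust if they differ.
print("Top-level keys:", list(meta.keys()))
for failure in meta.get("failed_scans", []):
    print("FAILED:", failure)
```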
Performance Tips
For Large Datasets (>10K scans)
# Use cached directory list and multi-GPU processing
python extract-embeddings.py \
--root-dir /data/NLST \
--num-gpus 4 \
--num-parallel 4 \
--num-workers 12 \
--checkpoint-interval 500
Memory Requirements: ~10GB VRAM per parallel scan
- `--num-parallel 1`: Safe for 16GB GPUs
- `--num-parallel 2`: Safe for 24GB GPUs
- `--num-parallel 4`: Requires 40GB+ GPUs
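To pick `--num-parallel` for your hardware, you can apply the ~10GB-per-scan rule of thumb above to the VRAM PyTorch reports. A rough sketch (not a guarantee against OOM):

```python
import torch

VRAM_PER_SCAN_GB = 10  # rule of thumb from the memory requirements above

for i in range(torch.cuda.device_count()):
    total_gb = torch.cuda.get_device_properties(i).total_memory / 1024**3
    suggestion = max(1, int(total_gb // VRAM_PER_SCAN_GB))
    print(f"GPU {i}: {total_gb:.0f} GB VRAM -> try --num-parallel {suggestion}")
```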
For Subset Extraction (Train/Test Split)
# Extract training set
python extract-embeddings.py \
--root-dir /data/NLST \
--pid-csv train_pids.csv \
--output-dir embeddings_train \
--num-workers 12
# Extract test set
python extract-embeddings.py \
--root-dir /data/NLST \
--pid-csv test_pids.csv \
--output-dir embeddings_test \
--num-workers 12
Speed: With PID filtering, scanning 100K subjects for 100 PIDs takes ~5 seconds (100x speedup)
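The train/test CSVs can come from any source. As one sketch, assuming a hypothetical `all_pids.csv` that lists every patient, an 80/20 random split could be written as:

```python
import pandas as pd

# Hypothetical manifest with one "pid" per patient; replace with your own cohort list.
all_pids = pd.read_csv("all_pids.csv")["pid"].drop_duplicates()

train = all_pids.sample(frac=0.8, random_state=42)
test = all_pids[~all_pids.isin(train)]

train.to_frame().to_csv("train_pids.csv", index=False)
test.to_frame().to_csv("test_pids.csv", index=False)
print(f"{len(train)} train PIDs, {len(test)} test PIDs")
```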
Loading Embeddings for Training
import pandas as pd
import numpy as np
# Load embeddings
df = pd.read_parquet('embeddings_output/all_embeddings.parquet')
# Extract embedding array
embeddings = np.stack(df['embedding'].values) # Shape: (num_scans, 512)
# Access metadata
pids = df['case_number'].values
timepoints = df['timepoint'].values
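If you need scans organized longitudinally rather than as a flat matrix, one option (a sketch built only on the columns listed above) is to group embeddings by patient and timepoint:

```python
import numpy as np
import pandas as pd

df = pd.read_parquet('embeddings_output/all_embeddings.parquet')

# Map each patient to {timepoint: 512-d embedding}, e.g. {'T0': array([...]), 'T1': ...}
by_patient = {
    pid: {row.timepoint: np.asarray(row.embedding) for row in group.itertuples()}
    for pid, group in df.groupby('case_number')
}
print(f"{len(by_patient)} patients loaded")
```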
Troubleshooting
Out of Memory (OOM) Errors
- Reduce `--num-parallel` to 1 or 2
- Use fewer GPUs with `--num-gpus 1`
Slow Directory Scanning
- Increase `--num-workers` (try 8-12 for fast storage)
- Use `--pid-csv` to filter early (100x speedup)
- Reruns will use the cached directory list automatically
Missing Timepoints
- Timepoints are extracted from the year in the scan path (1999 → T0, 2000 → T1)
- If `timepoint` is None, the year pattern wasn't found in the path
- You can manually map scans to timepoints using the `dicom_directory` column
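For example, a manual mapping could parse the year from the path. This sketch assumes the MM-DD-YYYY naming shown in the directory structure and 1999 as T0:

```python
import re
import pandas as pd

df = pd.read_parquet('embeddings_output/all_embeddings.parquet')

def timepoint_from_path(path, base_year=1999):
    # Assumes the MM-DD-YYYY-NLST-LSS-<scan_id> naming shown earlier; 1999 -> T0.
    match = re.search(r'\d{2}-\d{2}-(\d{4})', str(path))
    return f"T{int(match.group(1)) - base_year}" if match else None

missing = df['timepoint'].isna()
df.loc[missing, 'timepoint'] = df.loc[missing, 'dicom_directory'].map(timepoint_from_path)
```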
Failed Scans
- Check `dataset_metadata.json` for the `failed_scans` section
- Common causes: corrupted DICOM files, insufficient slices, invalid metadata
Federated Learning Integration
This script is designed for privacy-preserving federated learning:
- Each site runs extraction locally on their DICOM data
- Embeddings are saved (not raw DICOM images)
- Sites share embeddings with federated learning system
- Central server trains model on embeddings without accessing raw data
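What the central trainer looks like depends on your FL framework. As a generic, hypothetical sketch of the central training step above, the shared parquet files load directly into a PyTorch dataset (labels are not part of the extraction output, so the placeholder below must be replaced by your own outcome table):

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset

df = pd.read_parquet('train_embeddings/all_embeddings.parquet')
features = torch.tensor(np.stack(df['embedding'].values), dtype=torch.float32)

# Placeholder labels: join your own outcome table on case_number/timepoint instead.
labels = torch.zeros(len(df), dtype=torch.long)

loader = DataLoader(TensorDataset(features, labels), batch_size=64, shuffle=True)
```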
Workflow for Sites
# 1. Download extraction script
wget https://huggingface.co/Lab-Rasool/sybil/resolve/main/extract-embeddings.py
# 2. Extract embeddings for train/test splits
python extract-embeddings.py --root-dir /local/NLST --pid-csv train_pids.csv --output-dir train_embeddings
python extract-embeddings.py --root-dir /local/NLST --pid-csv test_pids.csv --output-dir test_embeddings
# 3. Share embeddings with federated learning system
# (embeddings are much smaller and preserve privacy better than raw DICOM)
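To quantify the size difference, a small sketch (reusing the paths from the commands above) compares the raw DICOM tree with the embedding outputs:

```python
from pathlib import Path

def dir_size_gb(path):
    # Sum file sizes under a directory, in GB.
    return sum(p.stat().st_size for p in Path(path).rglob('*') if p.is_file()) / 1024**3

raw = dir_size_gb('/local/NLST')
emb = dir_size_gb('train_embeddings') + dir_size_gb('test_embeddings')
print(f"Raw DICOM: {raw:.1f} GB | embeddings: {emb:.2f} GB")
```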
Citation
If you use this extraction pipeline, please cite the Sybil model:
@article{sybil2023,
title={A Deep Learning Model to Predict Lung Cancer Risk from Chest CT Scans},
author={...},
journal={...},
year={2023}
}
Support
For issues or questions:
- Model issues: https://huggingface.co/Lab-Rasool/sybil
- Federated learning: Contact your FL system administrator