# Sybil Embedding Extraction Pipeline
This script extracts 512-dimensional embeddings from chest CT DICOM scans using the Sybil lung cancer risk prediction model. It's designed for **federated learning** deployments where sites need to generate embeddings locally without sharing raw medical images.
## Features
- **Automatic Model Download**: Downloads the Sybil model from HuggingFace automatically
- **Multi-GPU Support**: Process scans in parallel across multiple GPUs
- **Smart Filtering**: Automatically filters out localizer/scout scans
- **PID-Based Extraction**: Extract embeddings for specific patient cohorts
- **Checkpoint System**: Save progress every N scans to prevent data loss
- **Timepoint Detection**: Automatically detects T0, T1, T2... from scan dates
- **Directory Caching**: Cache directory scans for 100x faster reruns
## Quick Start
### Installation
```bash
# Install required packages
pip install huggingface_hub torch numpy pandas pydicom
```
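Before launching a long extraction run, it can help to confirm the environment sees the GPU and the required packages. A minimal sanity-check sketch (it does not pin any specific versions):
```python
# Quick environment check before a long extraction run.
import torch
import pydicom
import huggingface_hub

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"GPUs visible: {torch.cuda.device_count()}")
print(f"pydicom {pydicom.__version__}, huggingface_hub {huggingface_hub.__version__}")
```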
### Basic Usage
```bash
# Extract embeddings from all scans
python extract-embeddings.py \
--root-dir /path/to/NLST/data \
--output-dir embeddings_output
```
### Extract Specific Patient Cohort
```bash
# Extract only patients listed in a CSV file
python extract-embeddings.py \
--root-dir /path/to/NLST/data \
--pid-csv subsets/train_pids.csv \
--output-dir embeddings_train
```
## Command Line Arguments
### Required
- `--root-dir`: Root directory containing DICOM files (e.g., `/data/NLST`)
### Optional - Data Selection
- `--pid-csv`: CSV file with a "pid" column to filter specific patients (format sketched below)
- `--max-subjects`: Limit to N subjects (useful for testing)
- `--output-dir`: Output directory (default: `embeddings_output`)
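The file passed to `--pid-csv` only needs a single `pid` column with one patient ID per row. A minimal sketch of writing one with pandas (the IDs shown are placeholders):
```python
import pandas as pd

# Placeholder patient IDs; replace with the PIDs in your cohort.
pd.DataFrame({"pid": [100012, 100034, 100056]}).to_csv("train_pids.csv", index=False)
```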
### Optional - Performance Tuning
- `--num-gpus`: Number of GPUs to use (default: 1)
- `--num-parallel`: Process N scans simultaneously (default: 1, recommend 1-4)
- `--num-workers`: Parallel workers for directory scanning (default: 4, recommend 4-12)
- `--checkpoint-interval`: Save checkpoint every N scans (default: 1000)
## Expected Directory Structure
Your DICOM data should follow this structure:
```
/path/to/NLST/
└── NLST/
    ├── <PID_1>/
    │   ├── MM-DD-YYYY-NLST-LSS-<scan_id>/
    │   │   ├── <series_id>/
    │   │   │   ├── *.dcm
    │   │   │   └── ...
    │   │   └── ...
    │   └── ...
    ├── <PID_2>/
    └── ...
```
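To verify that your data matches this layout before a full run, a small directory walk like the one below can help. This is a sketch that assumes the `<PID>/<scan-date>/<series>` nesting shown above and counts series folders containing `.dcm` files:
```python
from pathlib import Path

# Adjust to your top-level folder that contains one subdirectory per PID.
root = Path("/path/to/NLST/NLST")

num_pids = 0
num_series = 0
for pid_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    num_pids += 1
    # <PID>/<scan-date folder>/<series folder>/*.dcm
    for series_dir in pid_dir.glob("*/*"):
        if series_dir.is_dir() and any(series_dir.glob("*.dcm")):
            num_series += 1

print(f"Found {num_pids} PID folders and {num_series} series with .dcm files")
```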
## Output Format
### Embeddings File: `all_embeddings.parquet`
Parquet file with columns:
- `case_number`: Patient ID (PID)
- `subject_id`: Same as `case_number`
- `scan_id`: Unique scan identifier
- `timepoint`: T0, T1, T2... (year-based, e.g., 1999 → T0, 2000 → T1)
- `dicom_directory`: Full path to scan directory
- `num_dicom_files`: Number of DICOM slices
- `embedding_index`: Index in embedding array
- `embedding`: 512-dimensional embedding array
### Metadata File: `dataset_metadata.json`
Complete metadata including:
- Dataset info (total scans, embedding dimensions)
- Model info (Sybil ensemble, extraction layer)
- Per-scan metadata (paths, statistics)
- Failed scans with error messages
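A quick way to inspect this file after a run (a sketch; apart from `failed_scans`, which the troubleshooting section below refers to, the exact key names may differ from what is assumed here):
```python
import json

with open("embeddings_output/dataset_metadata.json") as f:
    metadata = json.load(f)

# Print the top-level sections recorded for this run.
print("Metadata sections:", list(metadata.keys()))

# List scans that could not be processed, with their error messages.
failed = metadata.get("failed_scans", [])
print(f"{len(failed)} scans failed")
for entry in failed[:10]:
    print(entry)
```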
## Performance Tips
### For Large Datasets (>10K scans)
```bash
# Use cached directory list and multi-GPU processing
python extract-embeddings.py \
--root-dir /data/NLST \
--num-gpus 4 \
--num-parallel 4 \
--num-workers 12 \
--checkpoint-interval 500
```
**Memory Requirements**: ~10GB VRAM per parallel scan
- `--num-parallel 1`: Safe for 16GB GPUs
- `--num-parallel 2`: Safe for 24GB GPUs
- `--num-parallel 4`: Requires 40GB+ GPUs
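As a rough rule of thumb based on the ~10GB-per-scan figure above, the available VRAM can be used to pick `--num-parallel`. A minimal sketch:
```python
import torch

# Heuristic from the table above: roughly 10 GB of VRAM per concurrently processed scan.
VRAM_GB_PER_SCAN = 10

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    suggested = max(1, min(4, int(total_gb // VRAM_GB_PER_SCAN)))  # script recommends 1-4
    print(f"GPU 0 has ~{total_gb:.0f} GB of VRAM; try --num-parallel {suggested}")
else:
    print("No CUDA GPU detected")
```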
### For Subset Extraction (Train/Test Split)
```bash
# Extract training set
python extract-embeddings.py \
--root-dir /data/NLST \
--pid-csv train_pids.csv \
--output-dir embeddings_train \
--num-workers 12
# Extract test set
python extract-embeddings.py \
--root-dir /data/NLST \
--pid-csv test_pids.csv \
--output-dir embeddings_test \
--num-workers 12
```
**Speed**: With PID filtering, scanning 100K subjects for 100 PIDs takes ~5 seconds (100x speedup)
## Loading Embeddings for Training
```python
import pandas as pd
import numpy as np
# Load embeddings
df = pd.read_parquet('embeddings_output/all_embeddings.parquet')
# Extract embedding array
embeddings = np.stack(df['embedding'].values) # Shape: (num_scans, 512)
# Access metadata
pids = df['case_number'].values
timepoints = df['timepoint'].values
```
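Since a patient can have several timepoints (T0, T1, ...), it is often convenient to group rows per patient before training. Continuing from the snippet above, one way to do that (a sketch, not part of the extraction script):
```python
# Group embeddings per patient, ordered by timepoint (rows with a missing
# timepoint sort last; see the troubleshooting notes below).
df_sorted = df.sort_values(["case_number", "timepoint"])
per_patient = {
    pid: np.stack(group["embedding"].values)  # shape: (num_scans_for_pid, 512)
    for pid, group in df_sorted.groupby("case_number")
}
example = next(iter(per_patient.values()))
print(f"{len(per_patient)} patients; example array shape: {example.shape}")
```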
## Troubleshooting
### Out of Memory (OOM) Errors
- Reduce `--num-parallel` to 1 or 2
- Use fewer GPUs with `--num-gpus 1`
### Slow Directory Scanning
- Increase `--num-workers` (try 8-12 for fast storage)
- Use `--pid-csv` to filter early (100x speedup)
- Rerun will use cached directory list automatically
### Missing Timepoints
- Timepoints are extracted from the year in the scan path (1999 → T0, 2000 → T1)
- If `timepoint` is None, no year pattern was found in the path
- You can manually map scans to timepoints using the `dicom_directory` column, as sketched below
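A minimal sketch of such a manual mapping, assuming the `MM-DD-YYYY-...` scan-date folder shown in the directory layout and the 1999 → T0 convention above:
```python
import re
import pandas as pd

df = pd.read_parquet("embeddings_output/all_embeddings.parquet")

def year_from_path(path: str):
    # Match the MM-DD-YYYY prefix of the scan-date folder in the path.
    match = re.search(r"\d{2}-\d{2}-((?:19|20)\d{2})", path)
    return int(match.group(1)) if match else None

df["scan_year"] = df["dicom_directory"].map(year_from_path)
# 1999 -> T0, 2000 -> T1, ...; rows without a detected year stay None.
df["timepoint_manual"] = df["scan_year"].map(
    lambda y: f"T{int(y) - 1999}" if pd.notna(y) else None
)
```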
### Failed Scans
- Check `dataset_metadata.json` for `failed_scans` section
- Common causes: corrupted DICOM files, insufficient slices, invalid metadata
## Federated Learning Integration
This script is designed for **privacy-preserving federated learning**:
1. **Each site runs extraction locally** on their DICOM data
2. **Embeddings are saved** (not raw DICOM images)
3. **Sites share embeddings** with federated learning system
4. **Central server trains model** on embeddings without accessing raw data
### Workflow for Sites
```bash
# 1. Download extraction script
wget https://huggingface.co/Lab-Rasool/sybil/resolve/main/extract-embeddings.py
# 2. Extract embeddings for train/test splits
python extract-embeddings.py --root-dir /local/NLST --pid-csv train_pids.csv --output-dir train_embeddings
python extract-embeddings.py --root-dir /local/NLST --pid-csv test_pids.csv --output-dir test_embeddings
# 3. Share embeddings with federated learning system
# (embeddings are much smaller and preserve privacy better than raw DICOM)
```
## Citation
If you use this extraction pipeline, please cite the Sybil model:
```bibtex
@article{sybil2023,
title={A Deep Learning Model to Predict Lung Cancer Risk from Chest CT Scans},
author={...},
journal={...},
year={2023}
}
```
## Support
For issues or questions:
- Model issues: https://huggingface.co/Lab-Rasool/sybil
- Federated learning: Contact your FL system administrator