# Sybil Embedding Extraction Pipeline

This script extracts 512-dimensional embeddings from chest CT DICOM scans using the Sybil lung cancer risk prediction model. It is designed for **federated learning** deployments where sites need to generate embeddings locally without sharing raw medical images.

## Features

- ✅ **Automatic Model Download**: Downloads the Sybil model from HuggingFace automatically
- ✅ **Multi-GPU Support**: Processes scans in parallel across multiple GPUs
- ✅ **Smart Filtering**: Automatically filters out localizer/scout scans
- ✅ **PID-Based Extraction**: Extracts embeddings for specific patient cohorts
- ✅ **Checkpoint System**: Saves progress every N scans to prevent data loss
- ✅ **Timepoint Detection**: Automatically detects T0, T1, T2... from scan dates
- ✅ **Directory Caching**: Caches directory scans for 100x faster reruns

## Quick Start

### Installation

```bash
# Install required packages
pip install huggingface_hub torch numpy pandas pydicom
```

### Basic Usage

```bash
# Extract embeddings from all scans
python extract-embeddings.py \
    --root-dir /path/to/NLST/data \
    --output-dir embeddings_output
```

### Extract Specific Patient Cohort

```bash
# Extract only patients listed in a CSV file
python extract-embeddings.py \
    --root-dir /path/to/NLST/data \
    --pid-csv subsets/train_pids.csv \
    --output-dir embeddings_train
```
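### Creating a PID CSV

The file passed to `--pid-csv` only needs a `pid` column. Below is a minimal sketch for producing train/test PID files; the source file `all_patients.csv` and the 80/20 split are illustrative assumptions, not part of the pipeline.

```python
import pandas as pd

# Hypothetical input: any table with one patient ID per row in a "pid" column
patient_ids = pd.read_csv('all_patients.csv')['pid']

# Illustrative 80/20 split; only the "pid" column name matters to --pid-csv
train = patient_ids.sample(frac=0.8, random_state=42)
test = patient_ids.drop(train.index)

train.to_frame(name='pid').to_csv('train_pids.csv', index=False)
test.to_frame(name='pid').to_csv('test_pids.csv', index=False)
```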
## Command Line Arguments

### Required

- `--root-dir`: Root directory containing DICOM files (e.g., `/data/NLST`)

### Optional - Data Selection

- `--pid-csv`: CSV file with a "pid" column to filter specific patients
- `--max-subjects`: Limit to N subjects (useful for testing)
- `--output-dir`: Output directory (default: `embeddings_output`)

### Optional - Performance Tuning

- `--num-gpus`: Number of GPUs to use (default: 1)
- `--num-parallel`: Process N scans simultaneously (default: 1, recommend 1-4)
- `--num-workers`: Parallel workers for directory scanning (default: 4, recommend 4-12)
- `--checkpoint-interval`: Save a checkpoint every N scans (default: 1000)

## Expected Directory Structure

Your DICOM data should follow this structure:

```
/path/to/NLST/
├── NLST/
│   ├── <pid>/
│   │   ├── MM-DD-YYYY-NLST-LSS-<...>/
│   │   │   ├── <series>/
│   │   │   │   ├── *.dcm
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── ...
│   ├── <pid>/
│   └── ...
```

## Output Format

### Embeddings File: `all_embeddings.parquet`

Parquet file with the following columns:

- `case_number`: Patient ID (PID)
- `subject_id`: Same as `case_number`
- `scan_id`: Unique scan identifier
- `timepoint`: T0, T1, T2... (year-based, e.g., 1999→T0, 2000→T1)
- `dicom_directory`: Full path to the scan directory
- `num_dicom_files`: Number of DICOM slices
- `embedding_index`: Index into the embedding array
- `embedding`: 512-dimensional embedding array

### Metadata File: `dataset_metadata.json`

Complete metadata including:

- Dataset info (total scans, embedding dimensions)
- Model info (Sybil ensemble, extraction layer)
- Per-scan metadata (paths, statistics)
- Failed scans with error messages

## Performance Tips

### For Large Datasets (>10K scans)

```bash
# Use cached directory list and multi-GPU processing
python extract-embeddings.py \
    --root-dir /data/NLST \
    --num-gpus 4 \
    --num-parallel 4 \
    --num-workers 12 \
    --checkpoint-interval 500
```

**Memory Requirements**: ~10GB VRAM per parallel scan

- `--num-parallel 1`: Safe for 16GB GPUs
- `--num-parallel 2`: Safe for 24GB GPUs
- `--num-parallel 4`: Requires 40GB+ GPUs

### For Subset Extraction (Train/Test Split)

```bash
# Extract training set
python extract-embeddings.py \
    --root-dir /data/NLST \
    --pid-csv train_pids.csv \
    --output-dir embeddings_train \
    --num-workers 12

# Extract test set
python extract-embeddings.py \
    --root-dir /data/NLST \
    --pid-csv test_pids.csv \
    --output-dir embeddings_test \
    --num-workers 12
```

**Speed**: With PID filtering, scanning 100K subjects for 100 PIDs takes ~5 seconds (a 100x speedup).

## Loading Embeddings for Training

```python
import pandas as pd
import numpy as np

# Load embeddings
df = pd.read_parquet('embeddings_output/all_embeddings.parquet')

# Extract embedding array
embeddings = np.stack(df['embedding'].values)  # Shape: (num_scans, 512)

# Access metadata
pids = df['case_number'].values
timepoints = df['timepoint'].values
```

## Troubleshooting

### Out of Memory (OOM) Errors

- Reduce `--num-parallel` to 1 or 2
- Use fewer GPUs with `--num-gpus 1`

### Slow Directory Scanning

- Increase `--num-workers` (try 8-12 for fast storage)
- Use `--pid-csv` to filter early (100x speedup)
- Reruns use the cached directory list automatically

### Missing Timepoints

- Timepoints are extracted from the year in the scan path (1999→T0, 2000→T1)
- If `timepoint` is None, no year pattern was found in the path
- You can manually map scans to timepoints using the `dicom_directory` column, as in the sketch below
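The following is a minimal backfill sketch, not part of the script: the regex, `BASE_YEAR`, and the `timepoint_from_path` helper are assumptions keyed to the 1999→T0 convention above, so adjust them to your data.

```python
import re
import pandas as pd

df = pd.read_parquet('embeddings_output/all_embeddings.parquet')

BASE_YEAR = 1999  # assumption: T0 corresponds to 1999

def timepoint_from_path(path):
    """Return 'T<n>' from the first 4-digit year in the path, or None."""
    match = re.search(r'\b(19|20)\d{2}\b', str(path))
    return f'T{int(match.group(0)) - BASE_YEAR}' if match else None

# Fill in only the rows where the pipeline could not detect a timepoint
missing = df['timepoint'].isna()
df.loc[missing, 'timepoint'] = df.loc[missing, 'dicom_directory'].map(timepoint_from_path)
```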
### Failed Scans

- Check `dataset_metadata.json` for the `failed_scans` section
- Common causes: corrupted DICOM files, insufficient slices, invalid metadata

## Federated Learning Integration

This script is designed for **privacy-preserving federated learning**:

1. **Each site runs extraction locally** on its DICOM data
2. **Embeddings are saved** (not raw DICOM images)
3. **Sites share embeddings** with the federated learning system
4. **The central server trains a model** on the embeddings without accessing raw data

### Workflow for Sites

```bash
# 1. Download the extraction script
wget https://huggingface.co/Lab-Rasool/sybil/resolve/main/extract-embeddings.py

# 2. Extract embeddings for train/test splits
python extract-embeddings.py --root-dir /local/NLST --pid-csv train_pids.csv --output-dir train_embeddings
python extract-embeddings.py --root-dir /local/NLST --pid-csv test_pids.csv --output-dir test_embeddings

# 3. Share embeddings with the federated learning system
#    (embeddings are much smaller and preserve privacy better than raw DICOM)
```

## Citation

If you use this extraction pipeline, please cite the Sybil model:

```bibtex
@article{sybil2023,
  title={A Deep Learning Model to Predict Lung Cancer Risk from Chest CT Scans},
  author={...},
  journal={...},
  year={2023}
}
```

## Support

For issues or questions:

- Model issues: https://huggingface.co/Lab-Rasool/sybil
- Federated learning: Contact your FL system administrator