# Sybil Embedding Extraction Pipeline

This script extracts 512-dimensional embeddings from chest CT DICOM scans using the Sybil lung cancer risk prediction model. It's designed for **federated learning** deployments where sites need to generate embeddings locally without sharing raw medical images.

## Features

- ✅ **Automatic Model Download**: Downloads the Sybil model from HuggingFace automatically
- ✅ **Multi-GPU Support**: Process scans in parallel across multiple GPUs
- ✅ **Smart Filtering**: Automatically filters out localizer/scout scans
- ✅ **PID-Based Extraction**: Extract embeddings for specific patient cohorts
- ✅ **Checkpoint System**: Save progress every N scans to prevent data loss
- ✅ **Timepoint Detection**: Automatically detects T0, T1, T2... from scan dates
- ✅ **Directory Caching**: Cache directory scans for 100x faster reruns

## Quick Start

### Installation

```bash
# Install required packages
pip install huggingface_hub torch numpy pandas pydicom
```

### Basic Usage

```bash
# Extract embeddings from all scans
python extract-embeddings.py \
  --root-dir /path/to/NLST/data \
  --output-dir embeddings_output
```

### Extract Specific Patient Cohort

```bash
# Extract only patients listed in a CSV file
python extract-embeddings.py \
  --root-dir /path/to/NLST/data \
  --pid-csv subsets/train_pids.csv \
  --output-dir embeddings_train
```
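
The `--pid-csv` file needs only a `pid` column. A minimal sketch for producing one (the PID values are hypothetical placeholders for your own cohort list):

```python
import pandas as pd

# Hypothetical cohort; in practice these PIDs come from your own split definition.
train_pids = ["100012", "100047", "100153"]

pd.DataFrame({"pid": train_pids}).to_csv("train_pids.csv", index=False)
# Pass the resulting path to --pid-csv (e.g., --pid-csv train_pids.csv).
```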

## Command Line Arguments

### Required
- `--root-dir`: Root directory containing DICOM files (e.g., `/data/NLST`)

### Optional - Data Selection
- `--pid-csv`: CSV file with "pid" column to filter specific patients
- `--max-subjects`: Limit to N subjects (useful for testing)
- `--output-dir`: Output directory (default: `embeddings_output`)

### Optional - Performance Tuning
- `--num-gpus`: Number of GPUs to use (default: 1)
- `--num-parallel`: Process N scans simultaneously (default: 1; recommended: 1-4)
- `--num-workers`: Parallel workers for directory scanning (default: 4; recommended: 4-12)
- `--checkpoint-interval`: Save checkpoint every N scans (default: 1000)

## Expected Directory Structure

Your DICOM data should follow this structure:
```
/path/to/NLST/
├── NLST/
│   ├── <PID_1>/
│   │   ├── MM-DD-YYYY-NLST-LSS-<scan_id>/
│   │   │   ├── <series_id>/
│   │   │   │   ├── *.dcm
│   │   │   │   └── ...
│   │   │   └── ...
│   │   └── ...
│   ├── <PID_2>/
│   └── ...
```
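
Before a long run, a quick check that your data matches this layout can save time. A sketch using `pathlib` (the root path is a placeholder; adjust it to your installation):

```python
from pathlib import Path

root = Path("/path/to/NLST/NLST")  # placeholder root; matches the layout above

pid_dirs = [p for p in root.iterdir() if p.is_dir()]
print(f"{len(pid_dirs)} patient directories found")

# Spot-check the first patient: count series directories and DICOM files.
if pid_dirs:
    series_dirs = [s for s in pid_dirs[0].glob("*/*") if s.is_dir()]
    n_dcm = sum(1 for _ in pid_dirs[0].rglob("*.dcm"))
    print(f"{pid_dirs[0].name}: {len(series_dirs)} series, {n_dcm} DICOM files")
```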

## Output Format

### Embeddings File: `all_embeddings.parquet`

Parquet file with columns:
- `case_number`: Patient ID (PID)
- `subject_id`: Same as case_number
- `scan_id`: Unique scan identifier
- `timepoint`: T0, T1, T2... (year-based, e.g., 1999 → T0, 2000 → T1)
- `dicom_directory`: Full path to scan directory
- `num_dicom_files`: Number of DICOM slices
- `embedding_index`: Index in embedding array
- `embedding`: 512-dimensional embedding array

### Metadata File: `dataset_metadata.json`

Complete metadata including:
- Dataset info (total scans, embedding dimensions)
- Model info (Sybil ensemble, extraction layer)
- Per-scan metadata (paths, statistics)
- Failed scans with error messages

## Performance Tips

### For Large Datasets (>10K scans)

```bash
# Use cached directory list and multi-GPU processing
python extract-embeddings.py \
  --root-dir /data/NLST \
  --num-gpus 4 \
  --num-parallel 4 \
  --num-workers 12 \
  --checkpoint-interval 500
```

**Memory Requirements**: ~10GB VRAM per parallel scan (see the GPU check after this list)
- `--num-parallel 1`: Safe for 16GB GPUs
- `--num-parallel 2`: Safe for 24GB GPUs  
- `--num-parallel 4`: Requires 40GB+ GPUs
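
If you are unsure which setting fits your hardware, you can list the visible GPUs and their memory with PyTorch (already required by the pipeline); a minimal sketch:

```python
import torch

# Print name and total memory of each visible GPU to help pick --num-parallel.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1e9:.1f} GB")
```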

### For Subset Extraction (Train/Test Split)

```bash
# Extract training set
python extract-embeddings.py \
  --root-dir /data/NLST \
  --pid-csv train_pids.csv \
  --output-dir embeddings_train \
  --num-workers 12

# Extract test set
python extract-embeddings.py \
  --root-dir /data/NLST \
  --pid-csv test_pids.csv \
  --output-dir embeddings_test \
  --num-workers 12
```

**Speed**: With PID filtering, scanning 100K subjects for 100 PIDs takes ~5 seconds (100x speedup)

## Loading Embeddings for Training

```python
import pandas as pd
import numpy as np

# Load embeddings
df = pd.read_parquet('embeddings_output/all_embeddings.parquet')

# Extract embedding array
embeddings = np.stack(df['embedding'].values)  # Shape: (num_scans, 512)

# Access metadata
pids = df['case_number'].values
timepoints = df['timepoint'].values
```
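
A common follow-up is to reduce to one scan per patient, for example the baseline (T0) timepoint. Continuing from the block above (whether every patient has exactly one T0 scan depends on your data):

```python
# Keep baseline (T0) scans only, one row per patient.
t0 = df[df['timepoint'] == 'T0'].drop_duplicates(subset='case_number')

X = np.stack(t0['embedding'].values)       # Shape: (num_patients, 512)
patient_ids = t0['case_number'].values
```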

## Troubleshooting

### Out of Memory (OOM) Errors
- Reduce `--num-parallel` to 1 or 2
- Use fewer GPUs with `--num-gpus 1`

### Slow Directory Scanning
- Increase `--num-workers` (try 8-12 for fast storage)
- Use `--pid-csv` to filter early (100x speedup)
- Rerun will use cached directory list automatically

### Missing Timepoints
- Timepoints are extracted from the year in the scan path (1999 → T0, 2000 → T1)
- If `timepoint` is None, no year pattern was found in the path
- You can manually map scans to timepoints using the `dicom_directory` column, as in the sketch below
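
The sketch below fills missing timepoints from the scan path, assuming the directory name contains an `MM-DD-YYYY` date (as in the expected layout) and that 1999 corresponds to T0:

```python
import re

import pandas as pd

df = pd.read_parquet('embeddings_output/all_embeddings.parquet')

def year_to_timepoint(path, base_year=1999):
    # Look for an MM-DD-YYYY date anywhere in the scan path.
    match = re.search(r'\d{2}-\d{2}-(\d{4})', str(path))
    return f"T{int(match.group(1)) - base_year}" if match else None

# Fill only the rows where timepoint detection failed.
df['timepoint'] = df['timepoint'].fillna(df['dicom_directory'].map(year_to_timepoint))
```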

### Failed Scans
- Check the `failed_scans` section of `dataset_metadata.json` (see the snippet below)
- Common causes: corrupted DICOM files, insufficient slices, invalid metadata
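
To review failures programmatically, a minimal sketch (the per-entry keys inside `failed_scans` are not documented here, so the example simply prints each entry):

```python
import json

with open('embeddings_output/dataset_metadata.json') as f:
    meta = json.load(f)

for scan in meta.get('failed_scans', []):
    print(scan)
```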

## Federated Learning Integration

This script is designed for **privacy-preserving federated learning**:

1. **Each site runs extraction locally** on their DICOM data
2. **Embeddings are saved** (not raw DICOM images)
3. **Sites share embeddings** with federated learning system
4. **Central server trains model** on embeddings without accessing raw data

### Workflow for Sites

```bash
# 1. Download extraction script
wget https://huggingface.co/Lab-Rasool/sybil/resolve/main/extract-embeddings.py

# 2. Extract embeddings for train/test splits
python extract-embeddings.py --root-dir /local/NLST --pid-csv train_pids.csv --output-dir train_embeddings
python extract-embeddings.py --root-dir /local/NLST --pid-csv test_pids.csv --output-dir test_embeddings

# 3. Share embeddings with federated learning system
# (embeddings are much smaller and preserve privacy better than raw DICOM)
```
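
Before handing embeddings to the FL system, a quick local sanity check can catch truncated or malformed output. A sketch assuming the column layout described under Output Format:

```python
import numpy as np
import pandas as pd

df = pd.read_parquet('train_embeddings/all_embeddings.parquet')
emb = np.stack(df['embedding'].values)

assert emb.shape[1] == 512, "unexpected embedding dimension"
print(f"{len(df)} scans from {df['case_number'].nunique()} patients, "
      f"embedding dim {emb.shape[1]}")
```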

## Citation

If you use this extraction pipeline, please cite the Sybil model:

```bibtex
@article{sybil2023,
  title={A Deep Learning Model to Predict Lung Cancer Risk from Chest CT Scans},
  author={...},
  journal={...},
  year={2023}
}
```

## Support

For issues or questions:
- Model issues: https://huggingface.co/Lab-Rasool/sybil
- Federated learning: Contact your FL system administrator