--- license: mit tags: - medical - cancer - ct-scan - risk-prediction - healthcare - pytorch - vision datasets: - NLST metrics: - auc - c-index language: - en library_name: transformers pipeline_tag: image-classification --- # Sybil - Lung Cancer Risk Prediction ## 🎯 Model Description Sybil is a validated deep learning model that predicts future lung cancer risk from a single low-dose chest CT (LDCT) scan. Published in the Journal of Clinical Oncology, this model can assess cancer risk over a 1-6 year timeframe. ### Key Features - **Single Scan Analysis**: Requires only one LDCT scan - **Multi-Year Prediction**: Provides risk scores for years 1-6 - **Validated Performance**: Tested across multiple institutions globally - **Ensemble Approach**: Uses 5 models for robust predictions ## 🚀 Quick Start ### Installation ```bash pip install huggingface-hub torch torchvision pydicom ``` ### Basic Usage ```python from huggingface_hub import snapshot_download import sys import os # Download model model_path = snapshot_download(repo_id="Lab-Rasool/sybil") sys.path.append(model_path) # Import model from modeling_sybil_hf import SybilHFWrapper from configuration_sybil import SybilConfig # Initialize config = SybilConfig() model = SybilHFWrapper(config) dicom_dir = "path/to/volume" dicom_paths = [os.path.join(dicom_dir, f) for f in os.listdir(dicom_dir) if f.endswith('.dcm')] print(f"Found {len(dicom_paths)} DICOM files for prediction.") # Get predictions output = model(dicom_paths=dicom_paths) risk_scores = output.risk_scores.numpy() # Display results print("\nLung Cancer Risk Predictions:") print(f"Risk scores shape: {risk_scores.shape}") # Handle both single and batch predictions if risk_scores.ndim == 2: # Batch predictions - take first sample risk_scores = risk_scores[0] for i, score in enumerate(risk_scores): print(f"Year {i+1}: {float(score)}") ``` ## 📊 Example with Demo Data ```python import requests import zipfile from io import BytesIO import os # Download demo DICOM files def get_demo_data(): cache_dir = os.path.expanduser("~/.sybil_demo") demo_dir = os.path.join(cache_dir, "sybil_demo_data") if not os.path.exists(demo_dir): print("Downloading demo data...") url = "https://www.dropbox.com/scl/fi/covbvo6f547kak4em3cjd/sybil_example.zip?rlkey=7a13nhlc9uwga9x7pmtk1cf1c&dl=1" response = requests.get(url) os.makedirs(cache_dir, exist_ok=True) with zipfile.ZipFile(BytesIO(response.content)) as zf: zf.extractall(cache_dir) # Find DICOM files dicom_files = [] for root, dirs, files in os.walk(cache_dir): for file in files: if file.endswith('.dcm'): dicom_files.append(os.path.join(root, file)) return sorted(dicom_files) # Run demo from huggingface_hub import snapshot_download import sys # Load model model_path = snapshot_download(repo_id="Lab-Rasool/sybil") sys.path.append(model_path) from modeling_sybil_wrapper import SybilHFWrapper from configuration_sybil import SybilConfig # Initialize and predict config = SybilConfig() model = SybilHFWrapper(config) dicom_files = get_demo_data() output = model(dicom_paths=dicom_files) # Show results for i, score in enumerate(output.risk_scores.numpy()): print(f"Year {i+1}: {float(score)}") ``` ## 🔬 Advanced Usage: Embedding Extraction ### Extract Embeddings Before Dropout Layer You can extract 512-dimensional embedding vectors from the layer immediately before the dropout layer. This captures the learned risk features before the final prediction layer. ```python from huggingface_hub import snapshot_download import sys import os import torch import numpy as np # Download and setup model model_path = snapshot_download(repo_id="Lab-Rasool/sybil") sys.path.append(model_path) from modeling_sybil_hf import SybilHFWrapper from configuration_sybil import SybilConfig def extract_embeddings(dicom_paths): """ Extract embeddings from the layer after ReLU, before Dropout. Args: dicom_paths: List of DICOM file paths Returns: numpy array of shape (512,) - averaged embeddings across ensemble """ # Initialize model config = SybilConfig() model = SybilHFWrapper(config) # Set each model in ensemble to eval mode for m in model.models: m.eval() # Storage for embeddings from each model in ensemble all_embeddings = [] # Register hooks on each model in the ensemble for model_idx, ensemble_model in enumerate(model.models): embeddings_buffer = [] def create_hook(buffer): def hook(module, input, output): # Capture the output of ReLU layer (before dropout) buffer.append(output.detach().cpu()) return hook # Register hook on the ReLU layer hook_handle = ensemble_model.relu.register_forward_hook(create_hook(embeddings_buffer)) # Run forward pass with torch.no_grad(): _ = model(dicom_paths=dicom_paths) # Remove hook hook_handle.remove() # Get the embeddings (should be shape [1, 512]) if embeddings_buffer: embedding = embeddings_buffer[0].numpy().squeeze() all_embeddings.append(embedding) print(f"Model {model_idx + 1}: Embedding shape = {embedding.shape}") # Average embeddings across ensemble averaged_embedding = np.mean(all_embeddings, axis=0) return averaged_embedding # Usage dicom_dir = "path/to/volume" dicom_paths = [os.path.join(dicom_dir, f) for f in os.listdir(dicom_dir) if f.endswith('.dcm')] embeddings = extract_embeddings(dicom_paths) print(f"\nEmbedding vector shape: {embeddings.shape}") print(f"Embedding statistics:") print(f" Mean: {np.mean(embeddings):.6f}") print(f" Std: {np.std(embeddings):.6f}") print(f" Min: {np.min(embeddings):.6f}") print(f" Max: {np.max(embeddings):.6f}") ``` ## 🎯 Extracting Embeddings at Other Layers ### Available Extraction Points The Sybil model has several key layers where you can extract intermediate representations: ```python import torch from huggingface_hub import snapshot_download import sys model_path = snapshot_download(repo_id="Lab-Rasool/sybil") sys.path.append(model_path) from modeling_sybil_hf import SybilHFWrapper from configuration_sybil import SybilConfig config = SybilConfig() model = SybilHFWrapper(config) # Get first model from ensemble for demonstration first_model = model.models[0] # Model architecture flow: # Input → image_encoder → pool → relu → dropout → prob_of_failure_layer → Output def extract_layer_output(model, layer_name, dicom_paths): """ Extract output from any layer in the model. Args: model: SybilHFWrapper model layer_name: Name of the layer to extract from dicom_paths: List of DICOM file paths Returns: Extracted features from the specified layer """ features = [] def hook_fn(module, input, output): features.append(output.detach().cpu()) # Register hook on the specified layer for m in model.models: layer = dict(m.named_modules())[layer_name] hook_handle = layer.register_forward_hook(hook_fn) # Run forward pass with torch.no_grad(): _ = model(dicom_paths=dicom_paths) # Remove hook hook_handle.remove() return features # Example 1: Extract from image encoder (3D feature maps) # Shape: (batch, 512, time, height, width) encoder_features = extract_layer_output(model, 'image_encoder', dicom_paths) print(f"Image encoder output shape: {encoder_features[0].shape}") # Example 2: Extract from pooling layer (before ReLU) # Shape: (batch, 512) pool_features = extract_layer_output(model, 'pool', dicom_paths) print(f"Pool layer output shape: {pool_features[0].shape}") # Example 3: Extract from ReLU layer (before dropout) - RECOMMENDED # Shape: (batch, 512) relu_features = extract_layer_output(model, 'relu', dicom_paths) print(f"ReLU layer output shape: {relu_features[0].shape}") # Example 4: Extract from dropout layer (before final prediction) # Shape: (batch, 512) dropout_features = extract_layer_output(model, 'dropout', dicom_paths) print(f"Dropout layer output shape: {dropout_features[0].shape}") ``` ### Custom Layer Extraction Template ```python def extract_custom_layer(dicom_paths, target_layer_name): """ Template for extracting features from any layer. Args: dicom_paths: List of DICOM file paths target_layer_name: Name of target layer (e.g., 'relu', 'pool', 'image_encoder') Returns: Extracted features averaged across ensemble """ from huggingface_hub import snapshot_download import sys import torch import numpy as np model_path = snapshot_download(repo_id="Lab-Rasool/sybil") sys.path.append(model_path) from modeling_sybil_hf import SybilHFWrapper from configuration_sybil import SybilConfig config = SybilConfig() model = SybilHFWrapper(config) all_features = [] for ensemble_model in model.models: ensemble_model.eval() features_buffer = [] # Get the target layer target_layer = dict(ensemble_model.named_modules())[target_layer_name] # Register hook def hook(module, input, output): features_buffer.append(output.detach().cpu()) hook_handle = target_layer.register_forward_hook(hook) # Forward pass with torch.no_grad(): _ = model(dicom_paths=dicom_paths) hook_handle.remove() if features_buffer: all_features.append(features_buffer[0]) # Average across ensemble averaged_features = torch.stack(all_features).mean(dim=0) return averaged_features.numpy() ``` ## 🔍 Model Architecture Inspection ### Print Full Model Architecture ```python from huggingface_hub import snapshot_download import sys model_path = snapshot_download(repo_id="Lab-Rasool/sybil") sys.path.append(model_path) from modeling_sybil_hf import SybilHFWrapper from configuration_sybil import SybilConfig config = SybilConfig() model = SybilHFWrapper(config) # Print configuration print("=" * 80) print("MODEL CONFIGURATION:") print("=" * 80) print(config) # Print ensemble information print("\n" + "=" * 80) print("ENSEMBLE INFORMATION:") print("=" * 80) print(f"Number of models in ensemble: {len(model.models)}") print(f"Device: {model.device}") # Print architecture of first model print("\n" + "=" * 80) print("MODEL ARCHITECTURE (First model in ensemble):") print("=" * 80) first_model = model.models[0] print(first_model) ``` ### Count Model Parameters ```python from huggingface_hub import snapshot_download import sys model_path = snapshot_download(repo_id="Lab-Rasool/sybil") sys.path.append(model_path) from modeling_sybil_hf import SybilHFWrapper from configuration_sybil import SybilConfig config = SybilConfig() model = SybilHFWrapper(config) print("=" * 80) print("MODEL PARAMETERS:") print("=" * 80) # Parameters per model in ensemble for i, ensemble_model in enumerate(model.models): total_params = sum(p.numel() for p in ensemble_model.parameters()) trainable_params = sum(p.numel() for p in ensemble_model.parameters() if p.requires_grad) print(f"\nModel {i+1}:") print(f" Total parameters: {total_params:,}") print(f" Trainable parameters: {trainable_params:,}") print(f" Non-trainable parameters: {total_params - trainable_params:,}") # Total ensemble parameters total_ensemble = sum( sum(p.numel() for p in m.parameters()) for m in model.models ) print(f"\nTotal ensemble parameters: {total_ensemble:,}") ``` ### List Model Components ```python from huggingface_hub import snapshot_download import sys model_path = snapshot_download(repo_id="Lab-Rasool/sybil") sys.path.append(model_path) from modeling_sybil_hf import SybilHFWrapper from configuration_sybil import SybilConfig config = SybilConfig() model = SybilHFWrapper(config) first_model = model.models[0] print("=" * 80) print("MODEL COMPONENTS:") print("=" * 80) # Print each component with parameter count for name, module in first_model.named_children(): num_params = sum(p.numel() for p in module.parameters()) print(f"{name}: {module.__class__.__name__} ({num_params:,} parameters)") print("\n" + "=" * 80) print("DETAILED LAYER NAMES:") print("=" * 80) # Print all named modules (including nested layers) for name, module in first_model.named_modules(): if name: # Skip the root module print(f" {name}: {module.__class__.__name__}") ``` ### Model Architecture Overview The Sybil model consists of the following key components: ``` Input (3D CT Volume) ↓ image_encoder (R3D-18 backbone) - 3D convolutional neural network - Pretrained on Kinetics-400 - Output: (batch, 512, time, height, width) ↓ pool (MultiAttentionPool) - Attention-based pooling mechanisms - Combines multiple pooling strategies - Output: (batch, 512) ↓ relu (ReLU activation) - Non-linear activation - Output: (batch, 512) ← EMBEDDING EXTRACTION POINT ↓ dropout (Dropout layer) - Regularization (p=0.0 in inference) - Output: (batch, 512) ↓ prob_of_failure_layer (CumulativeProbabilityLayer) - Hazard function prediction - Output: (batch, 6) - one score per year ↓ sigmoid (applied post-forward) ↓ Risk Scores (final output) ``` ### Get Layer-by-Layer Summary ```python def print_model_summary(model): """Print a detailed summary of the model architecture.""" from huggingface_hub import snapshot_download import sys model_path = snapshot_download(repo_id="Lab-Rasool/sybil") sys.path.append(model_path) from modeling_sybil_hf import SybilHFWrapper from configuration_sybil import SybilConfig config = SybilConfig() model = SybilHFWrapper(config) first_model = model.models[0] print(f"{'Layer Name':<40} {'Type':<30} {'Parameters':>15}") print("=" * 85) total_params = 0 for name, module in first_model.named_modules(): if name: # Skip root num_params = sum(p.numel() for p in module.parameters()) if num_params > 0: print(f"{name:<40} {module.__class__.__name__:<30} {num_params:>15,}") total_params += num_params print("=" * 85) print(f"{'TOTAL':<40} {'':<30} {total_params:>15,}") # Usage print_model_summary(model) ``` ## 📈 Performance Metrics | Dataset | 1-Year AUC | 6-Year AUC | Sample Size | |---------|------------|------------|-------------| | NLST Test | 0.94 | 0.86 | ~15,000 | | MGH | 0.86 | 0.75 | ~12,000 | | CGMH Taiwan | 0.94 | 0.80 | ~8,000 | ## 🏥 Intended Use ### Primary Use Cases - Risk stratification in lung cancer screening programs - Research on lung cancer prediction models - Clinical decision support (with appropriate oversight) ### Users - Healthcare providers - Medical researchers - Screening program coordinators ### Out of Scope - ❌ Diagnosis of existing cancer - ❌ Use with non-LDCT imaging (X-rays, MRI) - ❌ Sole basis for clinical decisions - ❌ Use outside medical supervision ## 📋 Input Requirements - **Format**: DICOM files from chest CT scan - **Type**: Low-dose CT (LDCT) - **Orientation**: Axial view - **Order**: Anatomically ordered (abdomen → clavicles) - **Number of slices**: Typically 100-300 slices - **Resolution**: Automatically handled by model ## ⚠️ Important Considerations ### Medical AI Notice This model should **supplement, not replace**, clinical judgment. Always consider: - Complete patient medical history - Additional risk factors (smoking, family history) - Current clinical guidelines - Need for professional medical oversight ### Limitations - Optimized for screening population (ages 55-80) - Best performance with LDCT scans - Not validated for pediatric use - Performance may vary with different scanner manufacturers ## 📚 Citation If you use this model, please cite the original paper: ```bibtex @article{mikhael2023sybil, title={Sybil: a validated deep learning model to predict future lung cancer risk from a single low-dose chest computed tomography}, author={Mikhael, Peter G and Wohlwend, Jeremy and Yala, Adam and others}, journal={Journal of Clinical Oncology}, volume={41}, number={12}, pages={2191--2200}, year={2023}, publisher={American Society of Clinical Oncology} } ``` ## 🙏 Acknowledgments This Hugging Face implementation is based on the original work by: - **Original Authors**: Peter G. Mikhael & Jeremy Wohlwend - **Institutions**: MIT CSAIL & Massachusetts General Hospital - **Original Repository**: [GitHub](https://github.com/reginabarzilaygroup/Sybil) - **Paper**: [Journal of Clinical Oncology](https://doi.org/10.1200/JCO.22.01345) ## 📄 License MIT License - See [LICENSE](LICENSE) file - Original Model © 2022 Peter Mikhael & Jeremy Wohlwend - HF Adaptation with Embeddings © 2025 [Aakash Tripathi](https://github.com/Aakash-Tripathi) ## 🔧 Troubleshooting ### Common Issues 1. **Import Error**: Make sure to append model path to sys.path ```python sys.path.append(model_path) ``` 2. **Missing Dependencies**: Install all requirements ```bash pip install torch torchvision pydicom sybil huggingface-hub ``` 3. **DICOM Loading Error**: Ensure DICOM files are valid CT scans ```python import pydicom dcm = pydicom.dcmread("your_file.dcm") # Test single file ``` 4. **Memory Issues**: Model requires ~4GB GPU memory ```python import torch device = 'cuda' if torch.cuda.is_available() else 'cpu' ``` ## 📬 Support - **HF Model Issues**: Open issue on this repository - **Original Model**: [GitHub Issues](https://github.com/reginabarzilaygroup/Sybil/issues) - **Medical Questions**: Consult healthcare professionals ## 🔍 Additional Resources - [Original GitHub Repository](https://github.com/reginabarzilaygroup/Sybil) - [Paper (Open Access)](https://doi.org/10.1200/JCO.22.01345) - [NLST Dataset Information](https://cdas.cancer.gov/nlst/) - [Demo Data](https://github.com/reginabarzilaygroup/Sybil/releases) --- **Note**: This is a research model. Always consult qualified healthcare professionals for medical decisions.