---
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
tags:
- scaling-laws
- neural-scaling
- performance-prediction
- configuration-to-performance
- pytorch
library_name: transformers
---

# NCPL-intermediate: Neural Configuration to Performance Scaling Law

This model predicts the performance of neural network configurations using scaling laws. It is trained on the Marin and StepLaw datasets to forecast performance metrics based on model configurations.

## Model Description

**NCPL-intermediate** (Neural Configuration to Performance Scaling Law - Intermediate) is a specialized forecasting model that:

- Takes pretraining configurations as input
- Predicts intermediate performance metrics using learned scaling-law patterns
- Combines text embeddings from a base transformer with numeric values processed by a dedicated MLP
- Supports multiple scaling-law formulations (Marin, StepLaw)

|
| | ### Architecture |
| |
|
| | The model consists of: |
| |
|
| | 1. **Base Model**: Qwen/Qwen3-1.7B |
| | - Provides contextual embeddings for text tokens |
| |
|
| | 2. **Numeric MLP**: |
| | - Processes numeric values (performance metrics, configuration parameters) |
| | - Projects numeric inputs to the same hidden dimension as text embeddings |
| | - Architecture: Linear(1 → 2*hidden_size) → ReLU → Linear(2*hidden_size → hidden_size) |
| |
|
| | 3. **Prediction Head**: |
| | - Linear layer mapping from hidden_size to scalar predictions |
| | - Outputs performance forecasts for each token position |
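
The three components above can be sketched in a few lines of PyTorch. This is a hypothetical re-implementation from the description only, not the authors' code (the real class lives in the linked repository); the name `NumericMLP` and the fusion-by-masking step are assumptions.

```python
import torch
import torch.nn as nn

class NumericMLP(nn.Module):
    """Projects scalar numeric values into the text embedding space:
    Linear(1 -> 2*hidden) -> ReLU -> Linear(2*hidden -> hidden)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, 2 * hidden_size),
            nn.ReLU(),
            nn.Linear(2 * hidden_size, hidden_size),
        )

    def forward(self, values: torch.Tensor) -> torch.Tensor:
        # values: (batch, seq_len) scalars -> (batch, seq_len, hidden)
        return self.net(values.unsqueeze(-1))

hidden = 8
mlp = NumericMLP(hidden)
head = nn.Linear(hidden, 1)  # prediction head: hidden -> scalar per position

text_emb = torch.randn(2, 5, hidden)        # stand-in for base-model embeddings
is_number_mask = torch.zeros(2, 5, dtype=torch.bool)
is_number_mask[:, 2] = True                 # position 2 holds a number
number_values = torch.zeros(2, 5)
number_values[:, 2] = 3.14

# Replace text embeddings with numeric embeddings at masked positions,
# then forecast a scalar at every token position.
num_emb = mlp(number_values)
fused = torch.where(is_number_mask.unsqueeze(-1), num_emb, text_emb)
predictions = head(fused).squeeze(-1)       # shape (batch, seq_len)
print(predictions.shape)
```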

## Training Data

The model was trained on:

- **Datasets**: Marin and StepLaw scaling-law datasets
- **Training configuration**:
  - Stage 1: 10 epochs with learning rate 5e-5 (frozen base model)
  - Stage 2: 400 epochs with learning rate 1e-5 (full fine-tuning)
  - Batch size: 480 (across 8 GPUs)
  - Weight decay: 0.01
  - Loss: MSE (Mean Squared Error)

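The two-stage recipe (frozen base first, then full fine-tuning) can be illustrated with a toy loop. This is a sketch under stated assumptions only: `base` and `head` stand in for the transformer and the trainable components, the data is random, and the epoch counts are shortened; the actual pipeline is in the authors' repository.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real model's components.
base = nn.Linear(4, 4)   # plays the role of the base transformer
head = nn.Linear(4, 1)   # plays the role of the numeric MLP + prediction head
model = nn.Sequential(base, head)
loss_fn = nn.MSELoss()
x, y = torch.randn(16, 4), torch.randn(16, 1)

def run_stage(epochs: int, lr: float, freeze_base: bool) -> float:
    # Freeze or unfreeze the base, then optimize only trainable params.
    for p in base.parameters():
        p.requires_grad = not freeze_base
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(params, lr=lr, weight_decay=0.01)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

run_stage(epochs=10, lr=5e-5, freeze_base=True)            # Stage 1
final = run_stage(epochs=40, lr=1e-5, freeze_base=False)   # Stage 2 (shortened)
```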
## Usage

The `ScalingLawForecaster` class can be found in the [GitHub repository](https://github.com/zhqwqwq/Configuration-to-Performance-Scaling-Law).

```python
import torch
from transformers import AutoTokenizer
# Get ScalingLawForecaster from: https://github.com/zhqwqwq/Configuration-to-Performance-Scaling-Law
from model import ScalingLawForecaster

# Load model
model = ScalingLawForecaster(
    base_model_name="Qwen/Qwen3-1.7B",
    init_from_pretrained=True,
    force_fp32=True
)

# Load checkpoint (map_location avoids errors on CPU-only machines)
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

# Prepare inputs
# input_ids: tokenized text sequence
# is_number_mask: boolean mask indicating which tokens are numeric
# number_values_filled: actual numeric values (0 for non-numeric tokens)

with torch.no_grad():
    predictions = model(
        input_ids=input_ids,
        is_number_mask=is_number_mask,
        number_values_filled=number_values_filled,
        attention_mask=attention_mask
    )
```

## Input Format

The model expects three key inputs:

1. **input_ids** (torch.LongTensor): Tokenized sequence with special numeric tokens
2. **is_number_mask** (torch.BoolTensor): Boolean mask marking numeric token positions
3. **number_values_filled** (torch.FloatTensor): Actual numeric values at the marked positions

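A minimal sketch of how these inputs (plus an attention mask) might be constructed. The `<NUM>` placeholder and the whitespace "tokenizer" here are illustrative assumptions; the actual numeric-token scheme is defined in the authors' repository.

```python
import re
import torch

NUM_TOKEN = "<NUM>"  # hypothetical placeholder for numeric tokens
text = "hidden_size 2048 learning_rate 0.0003"

# Split on whitespace and flag which tokens are numbers.
tokens = text.split()
is_number = [bool(re.fullmatch(r"\d+\.?\d*", t)) for t in tokens]

# Build a tiny vocabulary, mapping every number to the same placeholder id.
vocab = {t: i for i, t in enumerate(dict.fromkeys(
    NUM_TOKEN if n else t for t, n in zip(tokens, is_number)))}

input_ids = torch.tensor(
    [[vocab[NUM_TOKEN if n else t] for t, n in zip(tokens, is_number)]])
is_number_mask = torch.tensor([is_number])
number_values_filled = torch.tensor(
    [[float(t) if n else 0.0 for t, n in zip(tokens, is_number)]])
attention_mask = torch.ones_like(input_ids)

print(is_number_mask)        # True at the two numeric positions
print(number_values_filled)  # 2048.0 and 0.0003 at those positions, 0 elsewhere
```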
## Intended Use

This model is designed for:

- **Scaling law research**: Understanding how neural network performance scales with configuration
- **Performance forecasting**: Predicting model performance before full training
- **Configuration optimization**: Finding optimal hyperparameters based on scaling patterns
- **Resource planning**: Estimating computational requirements for different model sizes

## Limitations

- Trained specifically on the Marin and StepLaw datasets; generalization to other settings likely requires at least fine-tuning
- Requires properly formatted inputs with numeric tokens replaced and masked

## Citation

If you use this model in your research, please cite:

```bibtex
@article{ncpl2026,
  title   = {Neural Configuration to Performance Scaling Law},
  author  = {Huaqing Zhang and Kaiyue Wen and Tengyu Ma},
  journal = {arXiv preprint arXiv:2602.10300},
  year    = {2026},
  url     = {https://www.arxiv.org/abs/2602.10300}
}
```