# Model Card

- Source: [https://arxiv.org/abs/2509.02046](https://arxiv.org/abs/2509.02046)
- Optimizer: `kron`
- Model size: `520m`
- Data size: `42B`

## Best configuration

| Hyperparameter | Value |
|---|---|
| beta1 | `0.95` |
| block_size | `256` |
| learning_rate | `0.0005` |
| max_grad_norm | `1` |
| min_lr_ratio | `0` |
| normalize_grads | `True` |
| partition_grads_into_blocks | `True` |
| preconditioner_init_scale | `1` |
| preconditioner_lr | `0.2` |
| preconditioner_update_probability | `0.1` |
| train_batch_size | `128` |
| update_prob_flat_start | `2000` |
| warmup | `1000` |
| weight_decay | `0.5` |
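
The configuration above can be captured as a plain Python object for scripting or logging. The sketch below is illustrative only: `KronConfig` and `lr_at` are hypothetical names, not part of any library, and the card does not state the schedule shape. The code assumes `warmup` is measured in steps, that warmup is linear, and that the rate then decays linearly to `min_lr_ratio * learning_rate`. The total step count is not given in the card, so it is left as an argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KronConfig:
    """Hypothetical container mirroring the 'Best configuration' table."""
    beta1: float = 0.95
    block_size: int = 256
    learning_rate: float = 5e-4
    max_grad_norm: float = 1.0
    min_lr_ratio: float = 0.0
    normalize_grads: bool = True
    partition_grads_into_blocks: bool = True
    preconditioner_init_scale: float = 1.0
    preconditioner_lr: float = 0.2
    preconditioner_update_probability: float = 0.1
    train_batch_size: int = 128
    update_prob_flat_start: int = 2000
    warmup: int = 1000
    weight_decay: float = 0.5


def lr_at(step: int, total_steps: int, cfg: KronConfig = KronConfig()) -> float:
    """Learning rate at a 0-indexed step.

    ASSUMPTION: linear warmup over `cfg.warmup` steps, then linear decay to
    `cfg.min_lr_ratio * cfg.learning_rate`. The card does not specify either shape.
    """
    peak = cfg.learning_rate
    floor = cfg.min_lr_ratio * peak
    if step < cfg.warmup:
        # Ramp up so that the last warmup step reaches the peak rate.
        return peak * (step + 1) / cfg.warmup
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    frac = (step - cfg.warmup) / max(1, total_steps - cfg.warmup)
    return peak - (peak - floor) * min(1.0, frac)


if __name__ == "__main__":
    # `total_steps=10_000` is a placeholder value, not taken from the card.
    for s in (0, 999, 1000, 5500, 10_000):
        print(s, lr_at(s, total_steps=10_000))
```

With `min_lr_ratio` set to `0`, any schedule of this kind ends at a learning rate of zero. Swapping in a cosine or other decay only changes the last line of `lr_at`.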