---
license: mit
tags:
- tabular-regression
- sklearn
- xgboost
- random-forest
- motorsport
- lap-time-prediction
datasets:
- Haxxsh/gdgc-datathon-data
language:
- en
pipeline_tag: tabular-regression
---

# GDGC Datathon 2025 - Lap Time Prediction Models

Trained models for predicting Formula racing lap times from the GDGC Datathon 2025 competition.

## Model Description

This repository contains ensemble models trained to predict `Lap_Time_Seconds` for Formula racing events. The models combine Random Forest and XGBoost regressors, tuned and evaluated with cross-validation.

### Models Included

| File | Description | Size |
|------|-------------|------|
| `rf_final.pkl` | Final Random Forest model | 158 MB |
| `xgb_final.pkl` | Final XGBoost model | 2.6 MB |
| `rf_cv_models.pkl` | Random Forest CV fold models | 13.4 GB |
| `xgb_cv_models.pkl` | XGBoost CV fold models | 103 MB |
| `rf_model.pkl` | Base Random Forest model | 95 MB |
| `xgb_model.pkl` | Base XGBoost model | 2 MB |
| `feature_engineer.pkl` | Feature preprocessing pipeline | 6 KB |
| `best_params.json` | Optimal hyperparameters | 1 KB |
| `cv_results.json` | Cross-validation results | 1 KB |

## Training Data

The models were trained on the [GDGC Datathon 2025 dataset](https://huggingface.co/datasets/Haxxsh/gdgc-datathon-data):

- **Training samples:** 734,002
- **Target variable:** `Lap_Time_Seconds` (continuous)
- **Target range:** 70.001s to 109.999s
- **Target distribution:** Nearly symmetric (mean ≈ 90s, std ≈ 11.5s)

### Features

The dataset includes features such as:

- Circuit characteristics (length, corners, laps)
- Weather conditions (temperature, humidity, track condition)
- Rider/driver information (championship points, position, history)
- Tire compounds and degradation factors
- Pit stop durations

## Usage

### Loading the Models

```python
import pickle

# Load the final models
with open("rf_final.pkl", "rb") as f:
    rf_model = pickle.load(f)

with open("xgb_final.pkl", "rb") as f:
    xgb_model = pickle.load(f)

# Load the feature engineering pipeline
with open("feature_engineer.pkl", "rb") as f:
    feature_engineer = pickle.load(f)
```

### Making Predictions

```python
import pandas as pd

# Load test data
test_df = pd.read_csv("test.csv")

# Apply feature engineering
X_test = feature_engineer.transform(test_df)

# Predict with the ensemble (average of RF and XGB)
rf_preds = rf_model.predict(X_test)
xgb_preds = xgb_model.predict(X_test)
ensemble_preds = (rf_preds + xgb_preds) / 2
```

### Download from Hugging Face

```python
import pickle

from huggingface_hub import hf_hub_download

# Download a specific model file
model_path = hf_hub_download(
    repo_id="Haxxsh/gdgc-datathon-models",
    filename="xgb_final.pkl",
)

# Load it
with open(model_path, "rb") as f:
    model = pickle.load(f)
```

## Hyperparameters

Best parameters found via cross-validation (see `best_params.json`):

```json
{
  "random_forest": {
    "n_estimators": 100,
    "max_depth": null,
    "min_samples_split": 2,
    "min_samples_leaf": 1
  },
  "xgboost": {
    "n_estimators": 100,
    "learning_rate": 0.1,
    "max_depth": 6
  }
}
```

## Evaluation

Cross-validation results are stored in `cv_results.json`. The primary metric is **RMSE** (root mean squared error).
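To score the ensemble yourself, you can compute the same metric on any labeled split. A minimal sketch, assuming a held-out file named `validation.csv` containing a `Lap_Time_Seconds` column (the file name is an assumption, as is dropping the target before calling `feature_engineer.transform`):

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

# Assumed file: any labeled split with a Lap_Time_Seconds column works
val_df = pd.read_csv("validation.csv")

with open("feature_engineer.pkl", "rb") as f:
    feature_engineer = pickle.load(f)
with open("rf_final.pkl", "rb") as f:
    rf_model = pickle.load(f)
with open("xgb_final.pkl", "rb") as f:
    xgb_model = pickle.load(f)

# Assumption: the pipeline expects the raw columns minus the target
X_val = feature_engineer.transform(val_df.drop(columns=["Lap_Time_Seconds"]))
y_val = val_df["Lap_Time_Seconds"]

# Average the two regressors, as in the prediction example above
preds = (rf_model.predict(X_val) + xgb_model.predict(X_val)) / 2

# RMSE = sqrt(MSE); taking the sqrt manually keeps the snippet
# compatible with older scikit-learn versions
rmse = np.sqrt(mean_squared_error(y_val, preds))
print(f"Ensemble RMSE: {rmse:.3f} s")
```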
## Training Code

The training code is available on GitHub: [ezylopx5/DATATHON](https://github.com/ezylopx5/DATATHON)

Key files:

- `train.py` - Main training script
- `features.py` - Feature engineering
- `predict.py` - Inference script

## Framework Versions

- Python 3.8+
- scikit-learn
- XGBoost
- pandas
- numpy

Exact dependency versions are not pinned here; see the version-check sketch after the citation below.

## License

MIT License

## Citation

```bibtex
@misc{gdgc-datathon-2025,
  author = {Haxxsh},
  title = {GDGC Datathon 2025 Lap Time Prediction Models},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Haxxsh/gdgc-datathon-models}
}
```
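Because the `.pkl` files above are plain pickles, they are sensitive to the library versions used at training time. A minimal pre-load check, assuming scikit-learn >= 1.3 (which exposes `InconsistentVersionWarning`); escalating the warning to an error is a suggestion, not part of the original training setup:

```python
import pickle
import warnings

import sklearn
import xgboost
from sklearn.exceptions import InconsistentVersionWarning

# Surface the installed versions; the exact training versions are not
# pinned in this repository, so mismatches must be caught at load time
print("scikit-learn:", sklearn.__version__)
print("xgboost:", xgboost.__version__)

with warnings.catch_warnings():
    # Escalate scikit-learn's pickle version-mismatch warning so it
    # cannot be silently ignored
    warnings.simplefilter("error", InconsistentVersionWarning)
    with open("rf_final.pkl", "rb") as f:
        rf_model = pickle.load(f)
```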