---
license: mit
tags:
- tabular-regression
- sklearn
- xgboost
- random-forest
- motorsport
- lap-time-prediction
datasets:
- Haxxsh/gdgc-datathon-data
language:
- en
pipeline_tag: tabular-regression
---

# GDGC Datathon 2025 - Lap Time Prediction Models

Trained models for predicting Formula racing lap times from the GDGC Datathon 2025 competition.

## Model Description

This repository contains ensemble models trained to predict `Lap_Time_Seconds` for Formula racing events. The models combine Random Forest and XGBoost regressors, tuned and evaluated with cross-validation.

### Models Included

| File | Description | Size |
|------|-------------|------|
| `rf_final.pkl` | Final Random Forest model | 158 MB |
| `xgb_final.pkl` | Final XGBoost model | 2.6 MB |
| `rf_cv_models.pkl` | Random Forest CV fold models | 13.4 GB |
| `xgb_cv_models.pkl` | XGBoost CV fold models | 103 MB |
| `rf_model.pkl` | Base Random Forest model | 95 MB |
| `xgb_model.pkl` | Base XGBoost model | 2 MB |
| `feature_engineer.pkl` | Feature preprocessing pipeline | 6 KB |
| `best_params.json` | Optimal hyperparameters | 1 KB |
| `cv_results.json` | Cross-validation results | 1 KB |

## Training Data

The models were trained on the [GDGC Datathon 2025 dataset](https://huggingface.co/datasets/Haxxsh/gdgc-datathon-data):

- **Training samples:** 734,002
- **Target variable:** `Lap_Time_Seconds` (continuous)
- **Target range:** 70.001s to 109.999s
- **Target distribution:** Nearly symmetric (mean ≈ 90s, std ≈ 11.5s)

### Features

The dataset includes features such as:

- Circuit characteristics (length, corners, laps)
- Weather conditions (temperature, humidity, track condition)
- Rider/driver information (championship points, position, history)
- Tire compounds and degradation factors
- Pit stop durations

## Usage

### Loading the Models

```python
import pickle

# Load the final models
with open("rf_final.pkl", "rb") as f:
    rf_model = pickle.load(f)

with open("xgb_final.pkl", "rb") as f:
    xgb_model = pickle.load(f)

# Load the feature engineering pipeline
with open("feature_engineer.pkl", "rb") as f:
    feature_engineer = pickle.load(f)
```

### Making Predictions

```python
import pandas as pd

# Load test data
test_df = pd.read_csv("test.csv")

# Apply feature engineering
X_test = feature_engineer.transform(test_df)

# Predict with the ensemble (average of RF and XGB)
rf_preds = rf_model.predict(X_test)
xgb_preds = xgb_model.predict(X_test)
ensemble_preds = (rf_preds + xgb_preds) / 2
```

### Download from Hugging Face

```python
import pickle

from huggingface_hub import hf_hub_download

# Download a specific model file
model_path = hf_hub_download(
    repo_id="Haxxsh/gdgc-datathon-models",
    filename="xgb_final.pkl",
)

# Load it
with open(model_path, "rb") as f:
    model = pickle.load(f)
```

## Hyperparameters

Best parameters found via cross-validation (see `best_params.json`):

```json
{
  "random_forest": {
    "n_estimators": 100,
    "max_depth": null,
    "min_samples_split": 2,
    "min_samples_leaf": 1
  },
  "xgboost": {
    "n_estimators": 100,
    "learning_rate": 0.1,
    "max_depth": 6
  }
}
```

## Evaluation

Cross-validation results are stored in `cv_results.json`. The primary metric is **RMSE** (root mean squared error).
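To score the ensemble yourself, you can compute the same metric on any labeled split. A minimal sketch, assuming a held-out file named `validation.csv` containing a `Lap_Time_Seconds` column (the file name is an assumption, as is dropping the target before calling `feature_engineer.transform`):

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

# Assumed file: any labeled split with a Lap_Time_Seconds column works
val_df = pd.read_csv("validation.csv")

with open("feature_engineer.pkl", "rb") as f:
    feature_engineer = pickle.load(f)
with open("rf_final.pkl", "rb") as f:
    rf_model = pickle.load(f)
with open("xgb_final.pkl", "rb") as f:
    xgb_model = pickle.load(f)

# Assumption: the pipeline expects the raw columns minus the target
X_val = feature_engineer.transform(val_df.drop(columns=["Lap_Time_Seconds"]))
y_val = val_df["Lap_Time_Seconds"]

# Average the two regressors, as in the prediction example above
preds = (rf_model.predict(X_val) + xgb_model.predict(X_val)) / 2

# RMSE = sqrt(MSE); taking the sqrt manually keeps the snippet
# compatible with older scikit-learn versions
rmse = np.sqrt(mean_squared_error(y_val, preds))
print(f"Ensemble RMSE: {rmse:.3f} s")
```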
## Training Code

The training code is available on GitHub: [ezylopx5/DATATHON](https://github.com/ezylopx5/DATATHON)

Key files:

- `train.py` - Main training script
- `features.py` - Feature engineering
- `predict.py` - Inference script

## Framework Versions

- Python 3.8+
- scikit-learn
- XGBoost
- pandas
- numpy

Exact dependency versions are not pinned here; see the version-check sketch after the citation below.

## License

MIT License

## Citation

```bibtex
@misc{gdgc-datathon-2025,
  author = {Haxxsh},
  title = {GDGC Datathon 2025 Lap Time Prediction Models},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Haxxsh/gdgc-datathon-models}
}
```
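Because the `.pkl` files above are plain pickles, they are sensitive to the library versions used at training time. A minimal pre-load check, assuming scikit-learn >= 1.3 (which exposes `InconsistentVersionWarning`); escalating the warning to an error is a suggestion, not part of the original training setup:

```python
import pickle
import warnings

import sklearn
import xgboost
from sklearn.exceptions import InconsistentVersionWarning

# Surface the installed versions; the exact training versions are not
# pinned in this repository, so mismatches must be caught at load time
print("scikit-learn:", sklearn.__version__)
print("xgboost:", xgboost.__version__)

with warnings.catch_warnings():
    # Escalate scikit-learn's pickle version-mismatch warning so it
    # cannot be silently ignored
    warnings.simplefilter("error", InconsistentVersionWarning)
    with open("rf_final.pkl", "rb") as f:
        rf_model = pickle.load(f)
```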