A Framework for Forecasting Professional Tennis Matches Using ATP Statistics and Betting Odds

Description

This project represents the cornerstone of my academic and data science journey: a predictive framework developed as my final Master's thesis at RIT Croatia and later presented at the IEEE ICMLT 2025 conference.
The core objective was to build a machine learning model capable of forecasting professional tennis match outcomes by combining two distinct sources of data: official ATP statistics and pre-match betting odds. Unlike traditional models that rely solely on one domain, my approach demonstrated that feature engineering and domain-specific insights can significantly enhance prediction accuracy when combined with bookmaker-generated probabilities.
The data pipeline was extensive, covering over 50,000 matches across 11 ATP tournaments over the last 6 years, with more than 300 features engineered from raw match statistics. These included metrics related to player performance, recent form, serve and return statistics, surface type, tournament level, and market sentiment derived from odds.
Several models were trained and compared, including XGBoost, LightGBM, Random Forest, and Logistic Regression. Results highlighted the superior performance of XGBoost when combining engineered ATP features and betting odds, achieving a balanced trade-off between interpretability and predictive power. Key takeaways included:
Betting odds alone are highly predictive, but
Engineered features add interpretive richness and capture dynamics not reflected in odds alone.
The combination of both yielded the highest F1-score and AUC, outperforming single-source models.
The project emphasized the importance of domain knowledge, variable interactions, and non-linear relationships in sports forecasting. It also included a thorough correlation analysis, variable selection process, and visualization of feature importance to maintain clarity and transparency.
Ultimately, this work stands out not only for its technical rigor but for its real-world potential in sports analytics and decision-making contexts. It showcased how data science can bridge intuition and algorithmic prediction, and it remains one of the most comprehensive predictive sports analytics models I have developed.

Tools

Python – Data wrangling, feature engineering, model training, and evaluation
Pandas & NumPy – Data manipulation and statistical operations
Matplotlib & Seaborn – Exploratory Data Analysis and visualization
XGBoost, LightGBM, Random Forest, Logistic Regression – Machine learning models for performance comparison
Scikit-learn – Model validation, preprocessing, and metrics (accuracy, F1, AUC)
Jupyter Notebook – Prototyping and iterative experimentation
LaTeX – Research paper writing and formatting for IEEE standards
Overleaf – Collaborative editing and ICMLT submission