As a squash player, I find myself being asked those questions a lot (the first more than the
second). However, I think the answer to the second is interesting, and also tells you a lot
about the culture surrounding the game.
For this project, I gathered a TON of data from the US Squash website, including, but not
limited to:
- Match outcomes
- Match round
- Point differentials
- Players' home cities
- Player rankings and per-match ranking differences
- Player ratings and per-match rating differences
- Per-match difference in account age
The goal of collecting this data was to create a matchup predictor for each match. However,
as with any great data-gathering project, the possible uses of the data often outgrow the
original intention. In this case, I was inclined to take the vast trove of city-name data I
collected, convert the cities to coordinates, and map the results.
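As a rough illustration of that city-to-map step, here is a minimal sketch. It assumes each
player record carries a home-city string; the `CITY_COORDS` lookup table and the `city_points`
helper are hypothetical stand-ins (a real run would get coordinates from a geocoding service),
and scaling dot size with the square root of the player count is my own assumption, not
necessarily what the maps here used.

```python
from collections import Counter

# Hypothetical lookup table standing in for a geocoding service.
CITY_COORDS = {
    "New York, NY": (40.7128, -74.0060),
    "Philadelphia, PA": (39.9526, -75.1652),
    "Boston, MA": (42.3601, -71.0589),
}


def city_points(home_cities, coords=CITY_COORDS, base_size=4.0):
    """Turn a list of home-city strings into one sized map point per city."""
    counts = Counter(home_cities)
    points = []
    for city, n in counts.items():
        if city not in coords:
            continue  # skip cities we cannot place on the map
        lat, lon = coords[city]
        points.append({
            "city": city,
            "lat": lat,
            "lon": lon,
            "players": n,
            # Assumed scaling: size grows with the square root of the count.
            "size": base_size * n ** 0.5,
        })
    # Largest cities first, so smaller dots can be drawn on top of them.
    return sorted(points, key=lambda p: p["players"], reverse=True)


# Made-up sample input, not real US Squash data.
players = ["New York, NY", "Boston, MA", "New York, NY",
           "Philadelphia, PA", "New York, NY", "Atlantis"]
points = city_points(players)
```

Each entry in `points` can then be handed to whatever plotting tool draws the map.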
So without further ado, here are the cities from which each squash player in US Squash's
Zooming in on the US yields the map below. As in the previous map, the dots all represent
cities, and are sized by the number of players from each one.
Without sizing the dots, the world map looks like this.
But what about the original point of collecting the data, the matchup predictor? Turns out
that worked really well too, predicting with 96% correlation. It's not a unique algorithm;
rather, it is a prediction model that continues to improve as data is added. Estimating a
result involves taking existing match data on a number of factors, running a process called
PCA (principal component analysis) to redefine the axes on which the data is graphed so that
the first few axes capture as much of the variation as possible, and then running a
regression across the new axes. Afterwards, you compute the values along those same
PCA-generated axes for the individual match and plug them into the function found by the
regression to get the predicted result. Instead of being a binary predictor of win/loss, the
algorithm actually predicts point differential. Not only does this provide more information,
but point differential is itself a fairly accurate predictor of win/loss: a point
differential of +5 or more corresponds to a 99.79% chance of winning; a point differential
of +1 corresponds to a 95.85% chance of winning.
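For the curious, the PCA-then-regression recipe described above can be sketched in a few
lines of NumPy. This is not the project's actual code: the function names `fit_pcr` and
`predict_pcr` are made up for illustration, the features are standardized before PCA (an
assumption on my part), and the data is synthetic noise standing in for real match features
such as rating, ranking, and account-age differences.

```python
import numpy as np


def fit_pcr(X, y, n_components):
    """PCA on the features, then least-squares regression on the PCA scores."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    Z = (X - mean) / std
    # Rows of Vt are the principal axes, ordered by variance captured.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    components = Vt[:n_components]
    scores = Z @ components.T
    # Regress the target on the scores, with an intercept column.
    A = np.column_stack([np.ones(len(scores)), scores])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return {"mean": mean, "std": std, "components": components, "coef": coef}


def predict_pcr(model, X_new):
    """Project new matches onto the stored axes and apply the regression."""
    Z = (np.atleast_2d(X_new) - model["mean"]) / model["std"]
    scores = Z @ model["components"].T
    return model["coef"][0] + scores @ model["coef"][1:]


# Synthetic stand-in data: three per-match difference features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 4.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=500)

model = fit_pcr(X, y, n_components=3)
pred = predict_pcr(model, X)   # predicted point differentials
predicted_win = pred > 0       # a positive differential means a predicted win
```

Keeping the mean, scale, and axes from the fitting step is what lets a single new match be
placed on the same PCA axes before the regression function is applied.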
Embarrassing but fun fact: gathering and processing this data took 67 different
special-purpose data parsers, 124 spreadsheets, and 6 different scraping tools.
The report can be found here