Projects related to sports analysis and prediction.

"How Popular Is Baseball, Really?"
Explaining the popularity of baseball and some common misconceptions about the sport's decline.

This summer at the New York Times, I was lucky enough to take this story from conception to publication. As an avid baseball fan, I hear the argument that "baseball is a dying sport" often enough that I wanted to look into the idea further. I could summarize the arguments again here, or just point you to the article itself, in which my passionate ramblings have been edited, dissected, and presented by the best in the business.

But, if you don't have time to read it, this is my favorite chart.

"How much money is in baseball?"
Visualizations of baseball summary statistics over the past 35 years.

This is a small visualization project I've been working on... more to come.

How much do free plays benefit the Green Bay offense?

Aaron Rodgers is a Super Bowl champion and MVP, a 2-time NFL MVP, and master of the hail mary. Although many of his statistics are remarkable, one of his most distinctive skills is his ability to create opportunities for his offense through free plays, which are generated when the defense commits one of a few penalties. If the defense is called for offsides, lining up in the neutral zone, twelve men on the field, or illegal formation, the offense has a 'free play': it can run a play and then either decline the penalty and keep the yardage gained by the play, or accept the penalty yardage if that provides greater benefit. On such occasions, the offense usually attempts high-risk, high-reward plays because there is no real cost to a mistake; even if the play results in a turnover, for example, that result is negated because the offense can simply accept the penalty and replay the down.
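To make the accept-or-decline logic concrete, here is a minimal sketch. The function name and its inputs are hypothetical, and it deliberately ignores everything except raw yardage and turnovers (first downs, field position, and game situation are left out), so treat it as an illustration rather than the analysis behind the charts.

```python
def resolve_free_play(play_yards, penalty_yards, turnover):
    """Pick the better outcome of a free play for the offense.

    Hypothetical simplification: compares raw yardage only.
    Returns a (decision, yards) tuple.
    """
    # A turnover is wiped out by accepting the penalty and replaying the down.
    if turnover:
        return ("accept", penalty_yards)
    # Otherwise keep whichever result moved the ball further.
    if play_yards > penalty_yards:
        return ("decline", play_yards)
    return ("accept", penalty_yards)


# A deep shot that connects is kept; an interception costs nothing.
long_completion = resolve_free_play(play_yards=35, penalty_yards=5, turnover=False)
interception = resolve_free_play(play_yards=0, penalty_yards=5, turnover=True)
```

Under this simplification the worst case for the offense is always the penalty yardage, which is why the deep shot is the natural call.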

It's not obvious, though, whether Aaron Rodgers' ability to draw and use free plays contributes in a meaningful way to his offense's results. To find out, I first had to establish whether the Packers under Rodgers actually gain more yards from free plays than other offenses. (Spoiler alert: they do.) The first graph shows the distribution of free-play yards for each offense across the past 5 seasons.

However, this isn't the whole picture. It's possible that although the Packers draw more free-play yardage, this yardage is negligible in the context of an entire game or an entire season. Looking at the data a little further, there are two important things to note: first, the yards gained from free plays are concentrated in just a few games, and second, free plays have the greatest impact in third-and-short situations. The second graph shows the average likelihood of converting on third down at each yardage distance, as well as the likelihood of converting when a free play is drawn.
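For anyone curious how a chart like that can be computed, here is a rough sketch. It assumes each third-down play has been reduced to a (yards_to_go, converted, free_play) tuple; that layout and the function name are my own invention for illustration, not the format of the underlying play-by-play data.

```python
from collections import defaultdict


def conversion_rates(plays):
    """Third-down conversion rate by distance, overall and on free plays.

    plays: iterable of (yards_to_go, converted, free_play) tuples
    (a hypothetical layout chosen for this sketch).
    """
    # distance -> group -> [conversions, attempts]
    counts = defaultdict(lambda: {"all": [0, 0], "free": [0, 0]})
    for yards_to_go, converted, free_play in plays:
        groups = ("all", "free") if free_play else ("all",)
        for group in groups:
            counts[yards_to_go][group][0] += int(converted)
            counts[yards_to_go][group][1] += 1

    rates = {}
    for distance, by_group in sorted(counts.items()):
        rates[distance] = {
            # None marks distances with no attempts in that group.
            group: (made / tries if tries else None)
            for group, (made, tries) in by_group.items()
        }
    return rates
```

Comparing the "all" and "free" rates at each distance gives the two series plotted against yards to go.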

The main conclusion of all of this is that free plays make a significant difference in a few games each season for the Packers. Here are a few games in which free plays made a measurable impact, projecting the score over the course of the game with and without free plays.

"What even is squash?"
And where is it?

As a squash player, I find myself being asked those questions a lot (the first more than the second). However, I think the answer to the second is interesting, and also tells you a lot about the culture surrounding the game.

For this project, I gathered a TON of data from the US Squash website, including, but not limited to:
  • Match outcomes
  • Match round
  • Point differentials
  • Players' home cities
  • Player rankings and per-match ranking differences
  • Player ratings and per-match rating differences
  • Per-match difference in account age
The goal of collecting this data was to create a matchup predictor for each match. However, as with any great data-gathering project, the possible uses of the data often outgrow the original intention. In this case, I was inclined to take the vast trove of city-name data I collected, convert the cities to coordinates, and map the results.

So without further ado, here are the cities from which each squash player in US Squash's database hails.

Zooming in on the US yields the map below. As in the previous map, the dots all represent cities, and are sized by the number of players from each one.

Without sizing the dots, the world map looks like this.

But what about the original point of collecting the data, the matchup predictor? It turns out that worked well too: it predicts results with a 96% correlation. It's not a unique algorithm; rather, it is a prediction model that continues to improve as data is added. Estimating a result involves taking existing match data on a number of factors, running PCA (principal component analysis) to redefine the axes on which the data is graphed so that the leading axes capture as much of the variation as possible, and then running a regression across the new axes. To predict an individual match, you project that match's factors onto the same PCA axes and feed the resulting values into the function found by the regression. Instead of being a binary predictor of win/loss, the algorithm predicts point differential. Not only does this provide more information, but it is also a fairly accurate predictor of win/loss: a point differential of +5 or more corresponds to a 99.79% chance of winning; a point differential of +1 corresponds to a 95.85% chance of winning.
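The PCA-then-regression recipe above can be sketched in a few lines of NumPy. This is not the project's actual code: the function names, the choice to center but not rescale the features, the number of components, and the toy numbers are all assumptions made for illustration.

```python
import numpy as np


def fit_pcr(X, y, n_components):
    """Fit a regression on the leading principal components of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are the principal axes, ordered by variance captured.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T
    # Least-squares fit of y on the component scores plus an intercept.
    design = np.column_stack([np.ones(len(scores)), scores])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return mean, components, coef


def predict_pcr(model, x_new):
    """Project new match factors onto the fitted axes and apply the regression."""
    mean, components, coef = model
    scores = (np.asarray(x_new, dtype=float) - mean) @ components.T
    return coef[0] + scores @ coef[1:]


# Made-up rows: rating difference, ranking difference, account-age difference.
toy_X = [[0.5, -10, 1], [-0.3, 4, -2], [1.2, -25, 0],
         [-0.8, 15, 3], [0.1, -2, -1], [0.0, 1, 2]]
toy_y = [6, -4, 11, -9, 1, 0]  # made-up point differentials
model = fit_pcr(toy_X, toy_y, n_components=2)
predicted_diff = predict_pcr(model, [0.4, -8, 0])
```

Keeping fewer components than there are factors is what separates this from an ordinary regression: correlated inputs such as rating and ranking differences collapse onto shared axes before the fit.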

Embarrassing but fun fact: 67 different specific data parsers were written for this project; 124 spreadsheets were used; 6 different scraping tools were used in the process of gathering and processing this data.

The report can be found here.

"Baseball trade circles"
Coming soon to a jujukin near you