For many people, data is daunting – a vast and seemingly impenetrable sea of numbers. While most sports modellers would acknowledge that there’s useful information to be gleaned from it, many – even professionals – seem averse to using data to its full potential when testing their creations, preferring instead to just eyeball the predicted odds against one or two bookmakers to check they’re not too far out of line.
If your model is designed for use in a bookmaking environment, with plenty of margin applied to mask small pricing errors, then this will probably do. But if you intend to use your model for proprietary trading, to make money betting the market, then such an approach is unlikely to give the edge you’ll need. That’s because market prices are essentially a secondary source of data – you’re trying to fit your model to someone else’s view of what’s going to happen, and your results will therefore be contaminated with not just your errors but theirs too.
A much better approach is to seek primary data to fit your models to – actual results from past events. That might be, for example, goals scored by each team in football or the times achieved by different runners in a horse race. The more data you have, the better. But how, exactly, do you go about exploiting it?
A statistical sports model usually takes the form of a probability distribution, telling you the likelihood of each particular outcome. The shape of the distribution is governed by one or more parameters. For instance, the normal distribution for a single random variable is determined by two parameters – the mean and the variance.
The question is how to fit these parameters to the data that you’ve obtained. Perhaps the simplest approach is the ‘least squares’ method. This amounts to taking the squared error between each data point and the value your model predicts for it, summing these errors over the whole dataset and then adjusting the model parameters until the sum is minimised. But least squares makes an assumption – it assumes that the noise in your data is normally distributed. In other words, that the sport you’re modelling is governed by normally distributed random variables.
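As a minimal sketch of the idea (in Python, with made-up goal counts purely for illustration), fitting the simplest possible model – a single constant – by least squares recovers the sample mean:

```python
# Least-squares fit of a one-parameter model: a single constant m.
# Hypothetical data: goals scored by a team in eight matches (made-up numbers).
goals = [0, 2, 1, 3, 1, 0, 2, 1]

def sse(m):
    """Sum of squared errors between each observation and the model value m."""
    return sum((g - m) ** 2 for g in goals)

# Crude grid search over candidate values of m between 0 and 4.
best_m = min((i / 1000 for i in range(0, 4001)), key=sse)

print(best_m)                   # 1.25
print(sum(goals) / len(goals))  # 1.25 - the sample mean
```

For least squares with a constant model, the minimiser is always the sample mean, which is why the grid search and the direct average agree.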
We know this isn’t always the case. Some models of football goals are based on the Poisson distribution, race times can be gamma distributed, and other more complex sports may obey a distribution that is completely non-standard.
An effective and versatile alternative to least squares is ‘maximum likelihood estimation’ (often abbreviated to MLE). The ‘likelihood’ is the probability of obtaining the observed dataset, given a particular choice of model parameters. The method, then, is to adjust your model parameters until the likelihood of obtaining the data is maximised – hence, maximum likelihood. And these parameters are then said to be the best fit to the data. Crucially, MLE makes no assumption of normality – it works with whatever distribution your model specifies. And when your data are normally distributed then, reassuringly, it reduces to the least squares method.
For example, let’s say you have a biased coin and you want to deduce the probability of it coming up heads when flipped. You watch the coin being flipped 50 times and observe 30 heads and 20 tails. This is the dataset. What we need to determine is p, the probability of obtaining heads on any one flip. The likelihood function for this dataset is p^30(1-p)^20. Because p<1 the powers can produce some quite unwieldy numbers, and so it’s usual to work not with the likelihood function directly, but with its logarithm. Because log is a monotonically increasing function, any value of p that maximises the likelihood will also maximise the log-likelihood. The log-likelihood in this case is: 30 log p + 20 log (1-p). Differentiating and setting to zero (as required for a maximum) gives 30/p = 20/(1-p), which yields p = 0.6 as the best-fit value.
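Here’s a minimal sketch of the same calculation done numerically in Python – rather than differentiating by hand, we simply evaluate the log-likelihood over a fine grid of candidate values of p and take the best:

```python
import math

# Log-likelihood for 30 heads and 20 tails, as a function of p.
def log_likelihood(p):
    return 30 * math.log(p) + 20 * math.log(1 - p)

# Grid search over p in (0, 1); a step of 0.0001 is fine enough here.
candidates = [i / 10000 for i in range(1, 10000)]
p_hat = max(candidates, key=log_likelihood)

print(p_hat)  # 0.6
```

This brute-force approach wouldn’t scale to models with many parameters, but it makes the principle concrete: the maximum likelihood estimate is just the parameter value at which the log-likelihood peaks.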
This is a fairly trivial example just to show how MLE works. A less obvious case might be trying to deduce the expected number of goals scored by two football teams, A and B, given their match scores over a number of past encounters. Let’s say we have ten games on file between these two teams; six played at Team A’s home ground, and the remaining four at Team B’s ground. The results of the first six games are (A-B): 4-1, 2-1, 2-0, 0-2, 1-0, 0-0, while for the remaining four the results are (A-B) 2-0, 0-1, 0-2, 2-1.
Now let’s say we have a crude model in which each team’s raw expected goals are multiplied by a fixed factor that boosts their expected score when they’re playing at home. So, for example, if this home advantage factor was equal to two, and one of the teams was expected to score one goal when playing away, then that team would be expected to score two goals when playing at home.
For the sake of simplicity here, we assume the two teams’ goal totals are Poisson distributed and statistically independent. We want to know, from the data, each team’s raw expected goals (ie, the number they would most likely score when playing away) and the home advantage factor.
The log-likelihood contribution from each data point, assuming a Poisson distribution, is k log E – E – log (k!), where E is the expected number of goals and k is the actual number scored (the log (k!) term doesn’t depend on the parameters, so it can be ignored during maximisation). If A and B denote the raw expected goals of Team A and Team B, respectively, and H denotes the home advantage factor, then partially differentiating the log-likelihood in turn with respect to A, B and H, and setting each expression equal to zero, gives three simultaneous equations which can be solved to yield A, B and H. In this case, they are: A=1, B=0.67 and H=1.5.
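A minimal sketch in Python of solving those three equations numerically, by fixed-point iteration. The update formulas below come from setting each partial derivative to zero for this dataset (six games at A’s ground, four at B’s); the match scores are the ones listed above:

```python
# Scores are (Team A goals, Team B goals).
home = [(4, 1), (2, 1), (2, 0), (0, 2), (1, 0), (0, 0)]  # 6 games at A's ground
away = [(2, 0), (0, 1), (0, 2), (2, 1)]                  # 4 games at B's ground

a_total = sum(a for a, b in home) + sum(a for a, b in away)     # A's goals: 13
b_total = sum(b for a, b in home) + sum(b for a, b in away)     # B's goals: 8
home_total = sum(a for a, b in home) + sum(b for a, b in away)  # home-side goals: 13

n_a, n_b = len(home), len(away)  # games each team played at home

# Iterate the three stationarity conditions until the values settle.
A, B, H = 1.0, 1.0, 1.0
for _ in range(500):
    A = a_total / (n_a * H + n_b)        # from dL/dA = 0
    B = b_total / (n_b * H + n_a)        # from dL/dB = 0
    H = home_total / (n_a * A + n_b * B) # from dL/dH = 0

print(round(A, 3), round(B, 3), round(H, 3))  # 1.0 0.667 1.5
```

You can verify the answer by substitution: with A=1, B=2/3 and H=1.5, all three equations balance exactly.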
In both these examples it’s possible to maximise the log-likelihood by hand, writing down an analytical expression and differentiating it. But, as the datasets get bigger and the models more complicated, these calculations can soon become too complex for manual solution to be feasible. The good news is that there are numerical optimisation plug-ins available for Excel (for example, ‘Solver’) which can do the maximisation for you.
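To give a feel for what such an optimiser does under the hood, here’s a minimal pure-Python sketch of one-dimensional numerical maximisation, applied to the earlier coin-flip log-likelihood using golden-section search (which assumes the function has a single peak on the interval, as it does here):

```python
import math

# Log-likelihood for 30 heads and 20 tails.
def log_likelihood(p):
    return 30 * math.log(p) + 20 * math.log(1 - p)

def golden_maximise(f, lo, hi, tol=1e-9):
    """Golden-section search for the maximum of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b, d = d, c                 # maximum lies in [a, old d]
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d                 # maximum lies in [old c, b]
            d = a + inv_phi * (b - a)
    return (a + b) / 2

p_hat = golden_maximise(log_likelihood, 1e-9, 1 - 1e-9)
print(round(p_hat, 6))  # 0.6
```

Solver and similar tools use far more sophisticated algorithms that handle many parameters at once, but the essential job is the same: repeatedly evaluate the objective cell and home in on the parameter values that maximise it.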
To install Solver, follow the instructions here. The plug-in should then be accessible from the Data tab. To use it, set aside one cell in your spreadsheet for each model parameter, then add a column to your dataset whose formulae calculate the log-likelihood contribution of each row from those parameter cells. Finally, create a cell holding the sum of all the log-likelihood contributions. Click ‘Solve’ and Solver will numerically maximise the contents of this sum cell by altering the parameter cells. The parameter values you’re left with after maximising the sum are the maximum likelihood estimates obtained from your data. For a tutorial on using Solver see here.
For the more intrepid analyst, R and other advanced statistical packages offer even more sophisticated optimisation facilities. I use the ‘optim’ function, part of the ‘stats’ package in R, which is fast and relatively easy to use.
So, there you have it. Fitting models to data isn’t as painful as it might at first seem. Maximum likelihood estimation is a versatile and powerful technique which can extract best-estimate parameter values from a dataset irrespective of the statistical distribution of your model. And best of all you can run it without too much fuss in a simple spreadsheet like Excel.