Logistic regression (or the “logit model”) is a fundamental tool for modelling and predicting the outcome of 0/1 events (win or lose). In this article, I’m going to take things a bit further and explain how the conditional logistic regression model applies to contests with more than two possible outcomes, such as a horse race.
First, I’ll quickly review linear regression, move on to logistic regression, and then finally cover conditional logistic regression. We’ll use a fictitious horse race for all our examples here.
All models start with some assumptions and beliefs about how the world works. These assumptions will become important later, but we’re going to skip them for now. At a basic level, we’re trying to predict something based on data. The thing we’re trying to predict is formally called the “dependent variable”, and the data we’re using to make that prediction are called the “independent variables” or “factors”. In traditional statistical notation, “Y” represents the dependent variable and “X” represents the factors. So, we want to learn how Y is related to X. More formally, we want to know how much of the variance in Y can be explained by the variance in X. Variance is a VERY, VERY important concept, but beyond the scope of this article. I’ll address it in the future. Also, X can represent a single variable or a matrix of hundreds of variables. For this article, in the interest of simplicity, we’ll just use one.
Simple Linear Models
As the name implies, linear models assume that the relationship between Y and X is linear. The notation we use is:
Y = BX + e
The B in this formula represents the weight of X (formally called the “coefficient”). In simple terms, B answers the question, “How much does a change in X cause a change in Y?” We then use some relatively simple math (built into Excel, R, Matlab, etc.) to solve for the best B in our model. This is a common theme in all regression: a model is defined, and then we use some mathematical or computing techniques to find the best weights for the model. Often, different factors, transformations of data, or model structures are tried to find the one that best fits the empirical data. One important note, and more of an advanced topic, is that no model fits the data perfectly. What we are estimating is known as “BLUE”: the Best Linear Unbiased Estimator.
The plot below illustrates this perfectly. The red dots are the data and the black line is the best linear estimator. Even someone with no math background can quickly see that this nicely represents the relationship between X and Y. However, notice that the line doesn’t pass through many of the red points. So, while the line represents the relationship well, it is actually wrong for most individual points. (This is what the e at the end of the equation represents: the wrongness or “noise”.) The amount of wrongness will become a very important factor in predictions of future events, and is something I’ll delve into in a future article.
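To make the idea of “solving for the best B” concrete, here is a minimal sketch in Python with NumPy. The data are simulated, and the intercept term `a` is my own addition (the formula above folds everything into B), so treat the names and numbers as illustrative assumptions rather than anything from a real race.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x. Returns (a, b)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Slope: covariance of x and y divided by the variance of x.
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # Intercept: forces the line through the point of means.
    a = y.mean() - b * x.mean()
    return a, b

# Simulated (hypothetical) data: a true line plus random noise "e".
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 200)

a, b = fit_line(x, y)
residuals = y - (a + b * x)   # the "wrongness" at each individual point
print(f"intercept={a:.3f}, slope={b:.3f}, residual std={residuals.std():.3f}")
```

The fitted slope should land close to the true value of 0.5, yet almost every residual is non-zero, which is the same point the plot makes.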
Linear models are fine when you want to predict something numeric and continuous such as speed, time, or weight. However, they don’t work well when you want to look at phenomena that have a binary outcome such as win/lose, live/die, or complete/fail. With binary outcome events, what we are most interested in is the probability of an event happening given the data. Something called an “inverse logit” can represent this relationship well. I’m going to skip the formal derivation and math here, but a quick Google search will provide more than you want to know. The form of logistic regression, using the same nomenclature as above, is:
P(Y = 1) = 1 / (1 + exp(-BX))
This will give us a smooth curve, demonstrating how the probability of Y happening is a function of X. The plot below demonstrates how the logistic model fits the data. Notice that some points are outside of the curve. That is another example of “wrongness” that all models have.
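As a rough illustration of how such a curve can be fitted, the sketch below simulates win/lose data and learns the weights by plain gradient ascent on the log-likelihood. The intercept `a`, the learning rate, the step count, and the simulated data are all assumptions chosen for the example; real software typically uses faster methods.

```python
import numpy as np

def inv_logit(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, y, lr=0.5, steps=2000):
    """Fit P(Y=1) = inv_logit(a + b*x) by gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        p = inv_logit(a + b * x)
        # The gradient of the log-likelihood is (observed - predicted) times the input.
        a += lr * np.mean(y - p)
        b += lr * np.mean((y - p) * x)
    return a, b

# Simulated (hypothetical) data with known true weights.
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 2000)
true_a, true_b = -0.5, 1.5
y = (rng.uniform(size=x.size) < inv_logit(true_a + true_b * x)).astype(float)

a, b = fit_logistic(x, y)
print(f"a={a:.3f} (true {true_a}), b={b:.3f} (true {true_b})")
```

Every observed outcome is exactly 0 or 1 while the fitted probability sits somewhere in between, so no individual point lies on the curve.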
Conditional Logistic Regression
Finally, we’re at the point of this article. Hopefully, you have a general understanding of regression models by now.
One area I’ve studied a lot is that of horse racing. I’ve modeled horse races using a number of advanced methods, but the fundamental structure remains the same. What we ultimately want to know is the probability of a horse winning a race. If the public has mispriced that horse, then we have a betting opportunity with positive expected value.
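To show what “positive expected value” means in numbers, here is a small sketch. It assumes decimal odds (total payout per unit staked), and the probability and odds are made up; it also ignores track take and other real-world costs.

```python
def expected_value(p_win, decimal_odds, stake=1.0):
    """Expected profit of a win bet.

    With probability p_win we collect (decimal_odds - 1) * stake in profit;
    otherwise we lose the stake.
    """
    return p_win * (decimal_odds - 1.0) * stake - (1.0 - p_win) * stake

# Hypothetical example: the model says 25%, but the public's odds of 5.0
# imply only a 20% chance (1 / 5.0).
model_p = 0.25
public_odds = 5.0
ev = expected_value(model_p, public_odds)
print(ev)  # 0.25 per unit staked
```

If the model’s probability matched the implied 20%, the expected value would be zero, so the edge comes entirely from the gap between our estimate and the public’s.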
A subtle but critical distinction needs to be made. We don’t care about the “probability of the horse winning”; we care about the “probability of the horse winning THIS race”. Of course, this is a horse race, so we have to estimate its probability of winning relative to all the other horses in the race. That probability depends on all the other horses’ performances as well. For example, if I race my neighbor down the street, there is a 90% chance that I’ll win. If I race Usain Bolt, there is a 0.00001% chance that I’ll win. So, winning is relative to the other competitors.
This is where conditional logistic regression (CLR) comes in. The “Conditional” part is that winning probabilities are relative to the competitors in the race. Additionally, to follow the laws of probability, all probabilities for a race must sum to 1.0.
Transforming a list of values so that they sum to 1.0 is mathematically trivial: just divide each value by the sum. For example, 1, 2, 3, 4, 5 sum to 15, so dividing each number by 15 gives 0.067, 0.133, 0.200, 0.267, 0.333. However, this will NOT let us learn the best factor weights. To do that, we need a formal statistical model that we can fit using correct mathematical techniques. The equation is:
P(horse i wins) = exp(B·X_i) / Σ_j exp(B·X_j), where the sum in the denominator runs over every horse j in the same race.
Each horse has a “strength”, represented by the exponential of the linear function (the top half of the fraction). That strength is then divided by the sum of the strengths of all the horses in the same race, including its own (the bottom half of the fraction). Looking closely, it is easy to see that this is similar to the toy example I provided above. The tricky part is learning the weights. There is no closed-form analytical solution for this. An iterative technique, often gradient descent, is used to find the best weights. Some software packages will handle this well for basic models with a reasonable number of factors. Fancier varieties of this model will require custom computer code to be written. (I use C++ and GPU parallel computing to fit this general form with 186 factors over 40,000 races.)
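The following is a minimal sketch of the idea with a single factor, written in Python with NumPy. It is not the C++/GPU code mentioned above; the simulated races, the field size of 8, the learning rate, and the function names are all assumptions made for illustration.

```python
import numpy as np

def race_probabilities(x, race_id, beta):
    """Each horse's strength exp(beta*x) divided by the total strength in its race."""
    strength = np.exp(beta * x)
    totals = np.bincount(race_id, weights=strength)  # sum of strengths per race
    return strength / totals[race_id]

def fit_clr(x, race_id, won, lr=0.5, steps=500):
    """Learn one weight by gradient ascent on the conditional log-likelihood."""
    beta = 0.0
    n_races = race_id.max() + 1
    for _ in range(steps):
        p = race_probabilities(x, race_id, beta)
        # Gradient: the winners' factor values minus the probability-weighted
        # factor values of the whole field, averaged over races.
        grad = np.sum((won - p) * x) / n_races
        beta += lr * grad
    return beta

# Simulated (hypothetical) races: 2000 races with 8 horses each.
rng = np.random.default_rng(1)
n_races, field = 2000, 8
race_id = np.repeat(np.arange(n_races), field)
x = rng.normal(size=n_races * field)
true_beta = 0.8

# Draw one winner per race according to the true model.
p_true = race_probabilities(x, race_id, true_beta)
won = np.zeros_like(x)
for r in range(n_races):
    start = r * field
    winner = rng.choice(field, p=p_true[start:start + field])
    won[start + winner] = 1.0

beta_hat = fit_clr(x, race_id, won)
print(f"estimated weight={beta_hat:.3f} (true {true_beta})")
```

Because each strength is divided by its own race’s total, the probabilities within every race sum to 1.0 automatically. With many factors or large factor values, a more careful version would subtract each race’s maximum before exponentiating to avoid overflow.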
While brief, this article demonstrated the rationale for both logistic regression and conditional logistic regression. The goal was not to create working models or explain model fitting procedures, but to give you a general understanding of the three models and when to apply them. For numeric, continuous outcomes, use linear regression. For events with a binary outcome (it either happens or it doesn’t), use logistic regression. For contests where exactly one of several competitors wins, use conditional logistic regression.
In future articles, I’ll discuss variable screening, transformation, prediction variance, and a host of other tools needed to properly fit a predictive model.