Simulating finishing positions

Justin Worrall shows how a set of Season Points prices can be expanded into a matrix of finishing position probabilities for each team in a league

Once upon a time, not so long ago, you could walk into a betting shop and get a price on any football team you wanted, as long as you wanted to back that team to win. You couldn’t possibly back a team to lose, no sirree, no way, that would be … immoral, and wanting to do so would likely bring a plague down upon your house, involving frogs, locusts and all manner of other unpleasantries.

Or at least so said Hills and Ladbrokes.

Thankfully the High Court was persuaded that in a two-horse race, laying team A was functionally equivalent to backing team B, and the big bookmakers were sent away with their tails between their legs; although they kept coming back for about ten years, so you can’t fault them for persistence. Meanwhile Betfair emerged, grew like a weed, and enabled punters to back and lay to their hearts’ content, until ... well, liquidity dried up, they hired a bookmaker as CEO and decided that being a sportsbook was in fact the way to go after all.

But that’s a sad story for another day. By now, the laying genie was well and truly out of the bottle, and there was to be no stuffing him back in. And whilst Betfair popularised the concept, it’s often forgotten that the idea was pioneered by the spread betting firms, before the exchange was even a gleam in Andrew Black’s eye.

Yes, time was when every bookmaker wanted to be a spread bettor. Ladbrokes Index, William Hill Index, City Index, IG Spread, Spreadfair ... all great firms (ahem), all now gone to the great bookie in the sky at the hands of Betfair’s reaping scythe. The exchange’s genius was to allow laying on fixed-odds products, which were well understood by punters, rather than inventing a new class of product which required a credit account.

And so now the spread betting industry is in a pretty sorry state, with only two firms (Sporting Index and Spreadex) remaining. They do however make some interesting markets, at least from a modelling perspective; stuff which isn’t available elsewhere and which, as we’ll see, is rich in informational content.

Take Season Points – Premier League Season points are available here. The interesting thing about Season Points is that you can get a meaningful, equivalent price for every team in a league. £100 per point on Man City is, risk-wise, precisely equivalent to £100 per point on Liverpool. Where else can you place these kinds of bet? Not in the Winner market. £100 at 2/1 on Man City to win is risk-wise a very different proposition from £100 on Liverpool to win at 10/1; and good luck getting a meaningful price on Crystal Palace.
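
To make the ‘equivalent risk’ point concrete, here is a minimal sketch of how a Season Points spread bet settles. It assumes the usual spread betting convention that profit or loss is the stake per point multiplied by the distance between the final points total and the price you dealt at. The function name spread_pnl and the final totals of 85 and 65 are hypothetical, chosen purely for illustration.

```python
def spread_pnl(stake_per_point, price, final_points, side="buy"):
    """P&L of a spread bet: stake times points difference, signed by side."""
    diff = final_points - price
    if side == "sell":
        # a seller profits when the final total comes in below the price
        diff = -diff
    return stake_per_point * diff

# buy Man City at 79.5 for 100 per point, hypothetical final total of 85
print(spread_pnl(100, 79.5, 85))

# buy Liverpool at 69.0 for 100 per point, hypothetical final total of 65
print(spread_pnl(100, 69.0, 65))
```

In both cases each point of surprise is worth the same £100, whichever team you pick, which is what makes the prices comparable across the league.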

Given the fact that the Season Points market is ‘complete’ (i.e. equivalent prices for every team), we can do some interesting stuff with just a small amount of code and a few distributional assumptions.

Let’s get the data from Sporting Index, who are kind enough to make it available via an (undocumented, natch) JSON feed; although you have to grab the name/id mapping from the web page in a separate call to make sense of it.

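import json, re
import lxml.html
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

try:
    from urllib.request import urlopen # Python 3
except ImportError:
    from urllib import urlopen # Python 2

LivePricingUrl="http://livepricing.sportingindex.com/LivePricing.svc/jsonp/GetLivePricesByMeeting?meetingKey="

def get_market_quotes(url):
    # scrape the market page for the mapping of market key to team name
    doc=lxml.html.fromstring(urlopen(url).read())
    ids=dict([(li.attrib["key"], re.sub(" Points$", "", li.xpath("span[@class='markets']")[0].text))
              for li in doc.xpath("//ul[@class='prices']/li")
              if "key" in li.attrib])
    # the meeting key is the penultimate segment of the page url
    quotes=json.loads(urlopen(LivePricingUrl+url.split("/")[-2]).read())
    # so_far is a (points, games played) tuple
    return [{"name": ids[quote["Key"]],
             "so_far": tuple([int(tok) for tok in quote["SoFar"].split("/")]),
             "bid": quote["Sell"],
             "offer": quote["Buy"]}
            for quote in quotes["Markets"]]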

Let’s see what the data looks like:

MarketQuotes=get_market_quotes("http://www.sportingindex.com/spread-betting/football-domestic/premier-league/mm4.uk.meeting.4191659/premier-league-points-2013-2014")

print(pd.DataFrame(sorted(MarketQuotes, key=lambda row: -(row["bid"]+row["offer"])/2.0), columns=["name", "bid", "offer", "so_far"]))

              name   bid  offer    so_far
0         Man City  78.0   79.5  (25, 13)
1          Arsenal  77.5   79.0  (31, 13)
2          Chelsea  76.5   78.0  (27, 13)
3          Man Utd  72.0   73.5  (22, 13)
4        Liverpool  67.5   69.0  (24, 13)
5        Tottenham  64.5   66.0  (21, 13)
6          Everton  61.0   62.5  (24, 13)
7      Southampton  55.5   57.0  (22, 13)
8        Newcastle  52.0   53.5  (23, 13)
9          Swansea  46.0   47.5  (15, 13)
10     Aston Villa  45.0   46.5  (16, 13)
11       West Brom  44.5   46.0  (15, 13)
12           Stoke  39.5   41.0  (13, 13)
13            Hull  39.5   41.0  (17, 13)
14        West Ham  39.0   40.5  (13, 14)
15         Norwich  38.0   39.5  (14, 13)
16         Cardiff  37.5   39.0  (13, 13)
17          Fulham  34.0   35.5  (10, 13)
18      Sunderland  33.0   34.5   (8, 13)
19  Crystal Palace  27.5   29.0  (10, 14)

Okay, looks pretty sensible. What we want to do now is to set up a simulation process for season points. The mid-market prices from the table above represent expectations of how many points each team is likely to get. But of course that’s just an expectation; it’s entirely possible for each team to get more points than the offer price, or fewer points than the bid price; for each team there’s a distribution of points around this mean expectation.

But what do these points distributions look like?

Well, they are certainly bounded; it’s not possible for a team to get fewer than zero points, nor more than 114 (in the case of the Premier League; 3 points * 38 games). In fact the distributions are bounded more narrowly than this; given we are about a third of the way through the season, a team can’t finish with fewer than its current number of points, nor is it possible for any team to get precisely 114, since every team has already dropped points through a loss or a draw.
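
As a quick illustration of those narrower bounds, the sketch below works out the lowest and highest totals still reachable from a (points, games played) pair like the so_far column above. The helper name points_bounds is hypothetical, and the 38-game season is the Premier League assumption from the text.

```python
def points_bounds(so_far, ngames=38):
    """Return (min, max) final points given (points, games played)."""
    points, played = so_far
    remaining = ngames - played
    # lose every remaining game, or win every remaining game
    return points, points + 3 * remaining

print(points_bounds((31, 13)))   # Arsenal
print(points_bounds((10, 14)))   # Crystal Palace
```

So even the league leaders are capped some way short of 114, while the floor for each team is simply what it has already banked.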

We don’t want to make the assumption that season points are normally distributed however; it’s better to set up a function to simulate the distribution, and then look at the resulting shape.

import random

def simulate_points(quotes, paths, draw_prob=0.3):
    # every path starts from the points each team has already banked
    simpoints=dict([(quote["name"],
                     [quote["so_far"][0]
                      for i in range(paths)])
                    for quote in quotes])
    # double round robin: each team plays every other team home and away
    ngames=2*(len(quotes)-1)
    for quote in quotes:
        midprice=(quote["bid"]+quote["offer"])/float(2)
        currentpoints, played = quote["so_far"]
        toplay=ngames-played
        # points per remaining game implied by the mid-market price
        expectedpoints=(midprice-currentpoints)/float(toplay)
        # expectedpoints = 3*winprob + 1*draw_prob, solved for winprob
        winprob=(expectedpoints-draw_prob)/float(3)
        for i in range(paths):
            for j in range(toplay):
                q=random.random()
                if q < winprob:
                    simpoints[quote["name"]][i]+=3 # win
                elif q < winprob+draw_prob:
                    simpoints[quote["name"]][i]+=1 # draw
                # otherwise a loss, which scores nothing
    return [{"name": key,
             "simulated_points": value}
            for key, value in simpoints.items()]
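
The step that turns a price into a win probability is worth checking by hand. The sketch below repeats that arithmetic for a single team, assuming the same fixed 30% draw probability. The helper name implied_probabilities is hypothetical, and the sanity check on negative probabilities is an addition of this sketch; simulate_points itself does no such validation.

```python
def implied_probabilities(mid, so_far, ngames=38, draw_prob=0.3):
    """Per-game (win, draw, loss) probabilities implied by a mid price."""
    points, played = so_far
    per_game = (mid - points) / float(ngames - played)
    # per_game = 3 * win + 1 * draw_prob, solved for win
    win = (per_game - draw_prob) / 3.0
    loss = 1.0 - win - draw_prob
    if win < 0 or loss < 0:
        # the assumed draw probability cannot be reconciled with this price
        raise ValueError("draw_prob is inconsistent with this price")
    return win, draw_prob, loss

# Man City: mid price 78.75, 25 points from 13 games
win, draw, loss = implied_probabilities(78.75, (25, 13))
print(round(win, 3), round(draw, 3), round(loss, 3))
```

For Man City this comes out at a win probability a little over 60% per game, with losses under 10%. simulate_points applies exactly the same formula to every team in the league.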

Now there’s a lot to quibble about with this function. Specifically, it takes no account of the remaining fixtures and blindly assumes that all future games are played against teams of equal quality. I’ll leave you to think about how to remedy this as homework. In the meantime however it serves as a useful tool with which to explore the distributions; let’s simulate and look at the first and second moments:

SimulatedPoints=simulate_points(MarketQuotes, paths=50000)

MidPrices=dict([(quote["name"], (quote["bid"]+quote["offer"])/float(2)) for quote in MarketQuotes])

for row in SimulatedPoints:
    row["mid"]=MidPrices[row["name"]]
    row["mean"]=np.mean(row["simulated_points"])    
    row["stdev"]=np.std(row["simulated_points"])
    row["error"]=row["mid"]-row["mean"]

print(pd.DataFrame(sorted(SimulatedPoints, key=lambda row: -row["mid"]), columns=["name", "mid", "mean", "stdev", "error"]))

              name    mid      mean     stdev    error
0         Man City  78.75  78.71340  5.542557  0.03660
1          Arsenal  78.25  78.31092  6.133936 -0.06092
2          Chelsea  77.25  77.19956  5.903095  0.05044
3          Man Utd  72.75  72.77466  5.822711 -0.02466
4        Liverpool  68.25  68.25186  6.274254 -0.00186
5        Tottenham  65.25  65.20336  6.292056  0.04664
6          Everton  61.75  61.76346  6.434456 -0.01346
7      Southampton  56.25  56.25740  6.401914 -0.00740
8        Newcastle  52.75  52.78540  6.233348 -0.03540
9          Swansea  46.75  46.82266  6.358935 -0.07266
10     Aston Villa  45.75  45.76336  6.241635 -0.01336
11       West Brom  45.25  45.28530  6.292197 -0.03530
12            Hull  40.25  40.25510  5.770000 -0.00510
13           Stoke  40.25  40.29252  6.104398 -0.04252
14        West Ham  39.75  39.69928  5.977452  0.05072
15         Norwich  38.75  38.78156  5.877997 -0.03156
16         Cardiff  38.25  38.21546  5.909287  0.03454
17          Fulham  34.75  34.70772  5.890475  0.04228
18      Sunderland  33.75  33.76118  5.974816 -0.01118
19  Crystal Palace  28.25  28.22622  5.119481  0.02378

The error column is the difference between the mid-market quote and the mean of the simulated distribution. You’ll notice that all the errors are reasonably small, and could be improved by increasing the number of paths. What’s important here is that the simulation mean is converging on the market mean, which means (no pun intended) that our simulation is consistent with market prices.
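
As a rough guide to how small ‘reasonably small’ should be, the sampling error of a Monte Carlo mean shrinks with the square root of the number of paths. The sketch below assumes independent paths; the helper name standard_error is hypothetical, and 6.0 is just a round-number stand-in for the standard deviations in the table.

```python
import math

def standard_error(stdev, paths):
    """Standard error of a sample mean over independent paths."""
    return stdev / math.sqrt(paths)

# roughly the set-up used above: stdev of about 6 points, 50000 paths
print(round(standard_error(6.0, 50000), 4))

# quadrupling the number of paths only halves the error
print(round(standard_error(6.0, 200000), 4))
```

The first figure is a few hundredths of a point, which is the same order of magnitude as the errors in the table.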

We’ve also generated the standard deviation of each distribution. What’s interesting here is that the numbers for each team are reasonably similar (between 5 and 6.5) but also that they are noticeably higher in the middle of the pack. If you think about how points for win/draw/loss are distributed, this makes sense; a weak team like Crystal Palace will generally be picking up zeros and ones; a mid-ranking team like Swansea will be picking up zeros, ones and threes; whilst a strong team like Man City will typically be picking up threes and ones only.

Of these three categories, the smallest variance is for Crystal Palace (0, 1) whilst the largest is for Swansea (0, 1, 3); Man City are somewhere in the middle (1, 3).
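
Because the simulation treats every remaining game as an independent draw from the same win/draw/loss distribution, these standard deviations can also be worked out directly, without simulating. The sketch below does that under the same assumptions as simulate_points (fixed 30% draw probability, 38-game season); the helper name analytic_stdev is hypothetical.

```python
import math

def analytic_stdev(mid, so_far, ngames=38, draw_prob=0.3):
    """Standard deviation of final points under the simulation's model."""
    points, played = so_far
    toplay = ngames - played
    per_game = (mid - points) / float(toplay)
    win = (per_game - draw_prob) / 3.0
    # E[X^2] for a single game: 3^2 * P(win) + 1^2 * P(draw)
    second_moment = 9.0 * win + draw_prob
    variance = second_moment - per_game ** 2
    # variances of independent games add up
    return math.sqrt(toplay * variance)

for name, mid, so_far in [("Man City", 78.75, (25, 13)),
                          ("Swansea", 46.75, (15, 13)),
                          ("Crystal Palace", 28.25, (10, 14))]:
    print(name, round(analytic_stdev(mid, so_far), 2))
```

The three figures come out at roughly 5.5, 6.3 and 5.1, close to the simulated values in the table and in the same order: Swansea highest, Crystal Palace lowest, Man City in between.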

Now points distributions are interesting, but they are not the real prize. There are very few contracts out there which are direct functions of season points; in fact other than Season Points themselves, I can’t think of any. There are however very many contracts which are functions of finishing positions – think Winner, Promotion, Top 6, Relegation etc – and finishing positions are direct functions of the number of season points a team achieves.

So what we need is a function to convert our season points distributions to finishing position probabilities:

def calc_position_probabilities(simpoints):
    paths=len(simpoints[0]["simulated_points"])
    positionprob=dict([(team["name"], 
                        [0 for i in range(len(simpoints))])
                       for team in simpoints])
    for i in range(paths):
        sortedpoints=sorted([(team["name"], team["simulated_points"][i])
                             for team in simpoints],
                            key=lambda x: -x[-1])
        for j in range(len(simpoints)):
            name=sortedpoints[j][0]
            positionprob[name][j]+=1/float(paths)
    return [{"name": key,
             "position_probabilities": value}
            for key, value in positionprob.items()]

All this function does is loop over each simulation path, rank teams according to the number of points scored, and then create a histogram of the rankings for each team; this histogram is equivalent to a vector of finishing position probabilities.
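
A toy example makes the histogram idea easier to see. The sketch below uses three hypothetical teams and four made-up simulation paths rather than real output, and follows the same rank-and-count logic in a condensed form; toy_position_probabilities is an illustrative name, not part of the code above.

```python
from collections import Counter

# hypothetical simulated points: three teams, four paths each
toy = {"A": [70, 65, 72, 68],
       "B": [66, 69, 60, 67],
       "C": [40, 45, 38, 50]}

def toy_position_probabilities(simpoints):
    paths = len(next(iter(simpoints.values())))
    counts = dict((name, Counter()) for name in simpoints)
    for i in range(paths):
        # rank the teams on path i, highest points first
        ranking = sorted(simpoints, key=lambda name: -simpoints[name][i])
        for position, name in enumerate(ranking, start=1):
            counts[name][position] += 1
    # turn counts into probabilities, one entry per finishing position
    return dict((name, [counts[name][p] / float(paths)
                        for p in range(1, len(simpoints) + 1)])
                for name in simpoints)

print(toy_position_probabilities(toy))
```

Team A tops three of the four paths and comes second in the other, so its vector is 0.75, 0.25, 0.0; each team’s vector sums to one. Note that Python’s sort is stable, so teams level on points keep their input order; a fuller version might break ties on something like goal difference.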

Finally we need a heatmap generation function:

# http://stackoverflow.com/questions/14391959/heatmap-in-matplotlib-with-pcolor

def generate_heatmap(data, size, colourmap, alpha=1.0):
    # order rows by expected finishing position, best team first
    sorted_data=sorted(data, key=lambda row: np.inner(np.arange(len(data)), row["position_probabilities"]))
    df=pd.DataFrame([row["position_probabilities"] for row in sorted_data],
                     index=[row["name"] for row in sorted_data],
                     columns=np.arange(1, len(sorted_data)+1))
    fig, ax = plt.subplots()
    heatmap=ax.pcolor(df.values, cmap=colourmap, alpha=alpha)
    fig.set_size_inches(*size)
    ax.set_frame_on(False)
    # put the tick labels in the centre of each cell
    ax.set_yticks(np.arange(df.shape[0])+0.5, minor=False)
    ax.set_xticks(np.arange(df.shape[1])+0.5, minor=False)
    # best team at the top, finishing positions along the top edge
    ax.invert_yaxis()
    ax.xaxis.tick_top()
    ax.set_xticklabels(df.columns, minor=False)
    ax.set_yticklabels(df.index, minor=False)
    # plt.xticks(rotation=90)
    ax.grid(False)
    # hide the tick marks but keep the labels; tick1On/tick2On are no
    # longer available in recent matplotlib releases
    for t in ax.xaxis.get_major_ticks()+ax.yaxis.get_major_ticks():
        t.tick1line.set_visible(False)
        t.tick2line.set_visible(False)

And we’re ready to go:

PositionProbabilities=calc_position_probabilities(SimulatedPoints)

generate_heatmap(PositionProbabilities, size=(6, 6), colourmap=plt.cm.Reds)

[Figure: heat map of Premier League finishing position probabilities]

Et voilà, a heatmap of finishing position probabilities for the Premier League.

A couple of things stand out – how Crystal Palace are rooted to the bottom, how the Top Six are rapidly splitting into a Top Three group (Man City, Arsenal, Chelsea) plus the rest, and how there’s a lot more uncertainty regarding finishing positions in the bottom half of the table (less intense colours) than there is in the top, where each team seems to be trading in a three-position range.

You could take this analysis a lot further – one obvious thing to do would be to price Winner, Top 6, Relegation bets as functions of finishing position probabilities; I’ll leave that one as an exercise. My main point is simply to demonstrate that there’s a lot more information embedded in some market prices than might initially meet the eye, and that it’s generally possible to extract it with a small amount of code, some distributional assumptions and a little imagination.
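
As a starting point for that exercise, a contract such as Winner, Top 6 or Relegation is just a sum over the relevant finishing positions, and a fair (zero margin) decimal price is the reciprocal of that sum. The sketch below is one way to set it up, not a definitive pricing method; the helper names are hypothetical, and the probability vector is made up for illustration rather than taken from the simulation.

```python
def contract_probability(position_probabilities, positions):
    """Sum finishing position probabilities over 1-based positions."""
    return sum(position_probabilities[p - 1] for p in positions)

def fair_decimal_odds(probability):
    """Zero-margin decimal odds; infinite if the event never happens."""
    return float("inf") if probability == 0 else 1.0 / probability

# hypothetical 20-team probability vector for one strong team
example = [0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02] + [0.0] * 13

winner = contract_probability(example, [1])
top6 = contract_probability(example, range(1, 7))
relegation = contract_probability(example, range(18, 21))  # bottom three

print(winner, fair_decimal_odds(winner))
print(top6, fair_decimal_odds(top6))
print(relegation, fair_decimal_odds(relegation))
```

The same two functions could be pointed at each team’s position_probabilities vector from calc_position_probabilities; any bookmaker margin would need to be layered on top.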

About Justin Worrall

Justin Worrall works at Sportsrisq Capital, focusing on modelling sports-related insurance risks. In a former life he was a derivatives structurer for a US bank.

2 Thoughts on Simulating finishing positions

    Joseph Buchdahl
    7 Dec 2013
    12:09pm

    What a superb analysis. Based on the heat map it would appear that Man Utd (instead of Liverpool) to finish top 4 would be the value bet: http://odds.bestbetting.com/football/england/premier-league/top-4/. It would also seem that Arsenal over Liverpool in the Winner without (Man U, Man C and Chelsea) is a steal, since these 2 teams are nearly 2 standard deviations apart: http://odds.bestbetting.com/football/england/premier-league/winner-without-top-3/

    L. M. Hvattum
    9 Dec 2013
    9:20am

    Interesting to compare with similar “heat maps” based on ratings, regression and simulation, but with no input from betting markets:

    http://i41.tinypic.com/20r8th4.png (after round 15, so Man Utd now lower and L’pool higher).
