Scraping online price data

Sportsrisq’s Justin Worrall discusses the concept of ‘scraping’ prices from online bookmakers – an easy, free way to obtain data for use in trading strategies in any sport.

Categories: All Sports, Data, Execution & Getting On, Latency, Prices, Professional, Statistical models, Technology

One of the great things about the web is the fact that openness was built in from the outset. Many a developer or designer has built their career upon the ‘View Source’ button; the ability to look under the hood and see precisely what’s going on underneath.

It didn’t have to be that way, either; it is simply historical accident that the web was invented by an academic researcher rather than a corporate giant, and you can bet your boots that if Microsoft had had their way, ‘View Source’ would have never seen the light of day.

And whilst the web is now a much more complicated place than it was 20 years ago – websites are no longer simple static HTML pages, but rather a dynamic and unholy mix of HTML, JavaScript, CSS, images, JSON and XML feeds – it’s still generally possible to look inside a website and see how a particular effect has been achieved, or where a piece of data is coming from.

How the bookmakers must curse this. Their sites are possibly some of the most complex on the web, requiring real time price updates and bet placement facilities, but there’s no way for them to keep the underlying mechanics a secret. If they want to pipe prices to one customer’s browser, there’s very little they can do to stop another customer feeding those same prices into a machine for analysis.

With this in mind, it doesn’t take much to discover public, undocumented price resources lying around the web. Try the URL in the code below for size.

It’s the root of the Stan James price/event tree.

How did I find it? Simple – I installed the HTTPFox extension for Firefox and monitored the web traffic as the Stan James site loaded.

And now that we know the source URL, it’s easy to knock up a script to extract the key event information.

[Footnote: these examples are in Python (version 2 syntax), a wonderful language for exploring data. Go to the Python website and follow the installation instructions; they are pretty simple. You will also need the lxml package for parsing XML, and pandas for the tabular output shown below]

from lxml import etree

import pandas as pd
import urllib

UrlPatterns={
    "root": "http://www.stanjames.com/cache/boNavigationList/541/UK/%s.xml"
    }

def get_categories_for_sport(sport):
    # fetch the navigation XML for a sport and pull out the
    # name/id of each top-level category node
    url=UrlPatterns["root"] % sport["id"]
    doc=etree.fromstring(urllib.urlopen(url).read())
    return [{"name": el.xpath("name")[0].text,
             "id": el.xpath("idfwbonavigation")[0].text}
            for el in doc.xpath("//bonavigationnode")]

print pd.DataFrame(get_categories_for_sport({"name": "Football", "id": "58974.2"})[:30])
          id                                     name
0    58974.2                                 Football
1    79765.2     Sunday's UK and Elite League Matches
2    79764.2          Sunday's Rest Of Europe Matches
3    79821.2       Sunday's Rest of The World Matches
4    78201.2                         Monday's Matches
5   121040.2                      Wednesday's Matches
6    94819.2                       Thursday's Matches
7   134890.2                         Friday's Matches
8   121134.2   Saturday's UK and Elite League Matches
9    79760.2        Saturday's Rest of Europe Matches
10   85102.2     Saturday's Rest Of The World Matches
11  125668.2                         *** Promotion***
12  113017.2          ----^----Daily Coupons----^----
13  123582.2                    International Matches
14  114405.2                International U21 Matches
15   94920.2                International U20 Matches
16  112919.2                International U19 Matches
17   94921.2                International U18 Matches
18  126035.2                            *** Promotion
19  113015.2         ----^----Internationals----^----
20  116071.2                         English Football
21  119561.2                   English Premier League
22  119562.2                     English Championship
23  119563.2                         English League 1
24  119564.2                         English League 2
25  139223.2                           English FA Cup
26  113023.2                       English League Cup
27  116070.2                        Scottish Football
28  116742.2                     Scottish Premiership
29  113019.2  ----^----Main Football Coupons----^----

But of course there’s more than root information here; there’s an entire event tree to be parsed. So once you have the root-level categories, you can get the groups associated with a category:

def get_groups_for_category(category):
    # same pattern as above, one level down: category -> market groups
    url=UrlPatterns["root"] % category["id"]
    doc=etree.fromstring(urllib.urlopen(url).read())
    return [{"name": el.xpath("name")[0].text,
             "id": el.xpath("idfwmarketgroup")[0].text}
            for el in doc.xpath("//marketgroup")]

print pd.DataFrame(get_groups_for_category({"name": "English Premier League", "id": "119561.2"})[:20])
           id                                               name
0   1034393.2                                     Premier League
1    807989.2        English Premier League 2013/2014 - Outright
2    870661.2  English Premier League 2013/2014 - Top Goalscorer
3    857628.2       English Premier League 2013/2014 - W/O Big 3
4    847698.2       English Premier League 2013/2014 - W/O Big 6
5    830853.2    English Premier League 2013/2014 - Top 4 Finish
6    834549.2    English Premier League 2013/2014 - Top 6 Finish
7    857626.2    English Premier League 2013/2014 - Top 8 Finish
8    828455.2     English Premier League 2013/2014 -  Relegation
9    838823.2   English Premier League 2013/2014 - Top 10 Finish
10   831047.2      English Premier League 2013/2014 - To Stay Up
11   857671.2   English Premier League 2013/2014 - Top 15 Finish
12   834582.2  English Premier League 2013/2014 - To Finish B...
13   857627.2   English Premier League 2013/2014 - Top 12 Finish
14   858227.2        English Premier League 2013/2014 - Handicap
15   845037.2  English Premier League 2013/2014 - Straight Fo...
16   845038.2   English Premier League 2013/2014 - Dual Forecast

And once you have the groups, you can get selections associated with that group.

[Footnote: the event tree has different depths at different levels. For outright markets there are only three levels – Category, Group/Market and Selection. For match events there are more – Category, Group, Event, Market and Selection – but the principles for extracting the data remain the same; it’s just a question of identifying the important XML attributes]
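Notice, by the way, that the per-level functions all share the same shape – fetch an XML document, run an XPath query, pull out a name and an id. If you do end up handling the deeper match-event levels, you might prefer a single generic extractor; here's a minimal sketch (the node and id element names are parameters precisely because the deeper-level names aren't shown in this article):

def get_nodes(url, node_xpath, id_field):
    # generic version of the extractors above: fetch the XML document,
    # then pull name/id pairs from the nodes matching node_xpath
    doc=etree.fromstring(urllib.urlopen(url).read())
    return [{"name": el.xpath("name")[0].text,
             "id": el.xpath(id_field)[0].text}
            for el in doc.xpath(node_xpath)]

# equivalent to get_categories_for_sport for Football
print pd.DataFrame(get_nodes(UrlPatterns["root"] % "58974.2",
                             "//bonavigationnode",
                             "idfwbonavigation")[:5])

For the rest of this article, though, we'll stick with the explicit per-level functions.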

UrlPatterns["group"]="http://www.stanjames.com/cache/marketgroup/UK/%s.xml"

def get_selections_for_group(group):
    # selections carry prices; currentpriceup/currentpricedown are the
    # numerator and denominator of the fractional price
    url=UrlPatterns["group"] % group["id"]
    doc=etree.fromstring(urllib.urlopen(url).read())
    return [{"name": el.xpath("name")[0].text,
             "id": el.xpath("idfoselection")[0].text,
             "price": "%s/%s" % (el.xpath("currentpriceup")[0].text,
                                 el.xpath("currentpricedown")[0].text)}
            for el in doc.xpath("//selection")]

print pd.DataFrame(get_selections_for_group({"name": "Top 4 Finish", "id": "830853.2"})[:10])

           id         name  price
0  95154554.2     Man City    1/6
1  95154553.2      Chelsea    1/5
2  95154556.2      Arsenal    1/5
3  95154555.2   Man United    1/3
4  95154558.2    Liverpool    1/1
5  95154557.2    Tottenham    9/4
6  95154559.2      Everton   12/1
7  95154562.2  Southampton   20/1
8  95154560.2    Newcastle  200/1
9  95154563.2      Swansea  250/1
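One practical note: the prices come back as fractional strings, whereas a trading model will generally want decimal odds or implied probabilities. A minimal conversion sketch (the helper names are my own, and the 'probabilities' ignore the bookmaker's overround, so they will sum to more than one across a market):

def to_decimal(price):
    # fractional odds of a/b correspond to decimal odds of 1+a/b
    up, down = [float(tok) for tok in price.split("/")]
    return 1+up/down

def to_probability(price):
    # the win probability implied by the quoted price
    return 1/to_decimal(price)

print to_decimal("9/4"), to_probability("9/4")  # 3.25, ~0.308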

All very nice so far, but of course there is a very large number of events/markets in the tree, and we don’t want to have to fetch them individually. It would be better to write a search function which accepts a ‘matching’ expression, and which will crawl the tree looking for events that match that expression.

A sample match expression might look as follows:

XPath="Football~English Premier League 2013/2014~(Top \d+)|(Outright)|(Relegation)|(Bottom)"

This expression is designed to match any Premier League outright market where the payoff is a function of a team’s finishing positions.

Don’t worry too much about the funny syntax (it uses a mini-language called ‘regular expressions’) – the odd-looking bits work as follows (a quick demonstration follows the list):

  • ‘(Top \d+)’ will match any market such as Top 4, Top 6 or Top 10
  • ‘(Outright)|(Relegation)|(Bottom)’ will match any of Outright, Relegation or Bottom
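If you want to convince yourself of what the expression matches, it's easy to test the market part of it against a few of the group names from the tables above:

import re

MarketPattern="(Top \d+)|(Outright)|(Relegation)|(Bottom)"

for name in ["English Premier League 2013/2014 - Outright",
             "English Premier League 2013/2014 - Top 4 Finish",
             "English Premier League 2013/2014 - Top Goalscorer",
             "English Premier League 2013/2014 - To Stay Up"]:
    # True for the first two names, False for the last two
    print name, "->", bool(re.search(MarketPattern, name))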

Now we need a crawler function. This is a little complicated, but it works around the idea of a ‘stack’ of nodes awaiting investigation by the crawler. The crawler takes one node from the front of the stack at a time, fetches the data for that node, adds any matching sub-nodes it finds to the stack, and adds any prices it finds to the results. The process continues until there are no more nodes for the crawler to investigate, at which point the results are returned.

import re, time

SportIds={"Football": "58974.2"}

Handlers=[get_categories_for_sport,
          get_groups_for_category,
          get_selections_for_group]

def crawl_events(path, wait=1):
    # one regex per level of the tree, separated by "~"
    tokens=path.split("~")
    if tokens[0] not in SportIds:
        raise RuntimeError("Sport not found")
    # each stack entry is a (node, path-so-far, depth) tuple
    stack, results = [], []
    stack.append(({"name": tokens[0],
                   "id": SportIds[tokens[0]]},
                  [tokens[0]],
                  0))
    while stack!=[]:
        head, stack = stack[0], stack[1:]
        parent, path, depth = head
        # print "~".join(path)
        handler=Handlers[depth]
        if depth < len(tokens)-1:
            # intermediate level: keep only the children whose names
            # match the regex for the next level down
            stack+=[(result,
                     path+[result["name"]],
                     depth+1)
                    for result in handler(parent)
                    if re.search(tokens[depth+1], result["name"])]
        else:
            # bottom level: collect the selections themselves
            for result in handler(parent):
                result["path"]=path+[result["name"]]
                results.append(result)
        time.sleep(wait) # rate-limit the requests; be polite to the server
    return results

OK, time to run the crawler with our matching expression. When it’s done, we’ll print out the prices associated with some randomly chosen teams.

Selections=crawl_events(XPath)

def dump_selections(teamname):
    # path is [sport, category, group, selection]; the group name's
    # suffix after " - " identifies the market, the last element the team
    rows=[{"name": "%s/%s" % (item["path"][-2].split(" - ")[-1], item["path"][-1]),
           "id": item["id"],
           "price": item["price"]}
          for item in Selections
          if teamname==item["path"][-1]]
    print pd.DataFrame(rows, columns=["id", "name", "price"])

for teamname in ["Arsenal", "Everton", "Southampton", "Sunderland"]:
    print "-- %s --" % teamname
    print
    dump_selections(teamname)
    print
-- Arsenal --

           id                  name price
0  88418751.2      Outright/Arsenal  10/3
1  95154556.2  Top 4 Finish/Arsenal   1/5
2  95480703.2  Top 6 Finish/Arsenal  1/66

-- Everton --

           id                   name  price
0  88418754.2       Outright/Everton  150/1
1  95154559.2   Top 4 Finish/Everton   12/1
2  95480706.2   Top 6 Finish/Everton    9/4
3  96613913.2   Top 8 Finish/Everton    1/7
4  95788006.2  Top 10 Finish/Everton   1/25
5  96614250.2  Top 12 Finish/Everton  1/150

-- Southampton --

           id                          name  price
0  94586817.2          Outright/Southampton  150/1
1  95154562.2      Top 4 Finish/Southampton   20/1
2  95480713.2      Top 6 Finish/Southampton    4/1
3  96613918.2      Top 8 Finish/Southampton   8/15
4  95788011.2     Top 10 Finish/Southampton    1/6
5  96614253.2     Top 12 Finish/Southampton   1/16
6  96615241.2     Top 15 Finish/Southampton  1/250
7  95434367.2  To Finish Bottom/Southampton  150/1

-- Sunderland --

           id                         name  price
0  94962059.2        Relegation/Sunderland  11/10
1  95154567.2      Top 4 Finish/Sunderland  500/1
2  95480712.2      Top 6 Finish/Sunderland  500/1
3  96613921.2      Top 8 Finish/Sunderland   33/1
4  95788016.2     Top 10 Finish/Sunderland   16/1
5  96614258.2     Top 12 Finish/Sunderland    4/1
6  96615246.2     Top 15 Finish/Sunderland    7/4
7  95434371.2  To Finish Bottom/Sunderland    8/1

So there you have it, easy price discovery courtesy of an undocumented event feed.

In many ways it’s easier to get prices from a feed like this than to use the Betfair API – you don’t have to mess with login tokens, you don’t need a SOAP library, and for certain market segments (such as outright markets) you’ll probably find better liquidity with a bookmaker than on the exchange.

[Footnote: the reason you can do this with Stan James is that their backend uses a product called Finsoft Warp, which caches all their prices in static XML files. BetFred use the same product and can be scraped in the same way; as homework, try discovering the BetFred URL structure with HTTPFox (it is slightly different from Stan James’s)]

About Justin Worrall

Justin Worrall works at Sportsrisq Capital, focusing on modelling sports-related insurance risks. In a former life he was a derivatives structurer for a US bank.