Scraping Online Price Data - Sports Trading Network

Columnist

Justin Worrall


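One of the great things about the web is that it has been open from the very beginning. Many developers and designers have built careers on the "View Source" button, on the ability to lift the bonnet and see what is going on underneath.

But it didn't have to be that way. It's no historical accident that the internet was invented by academic researchers rather than corporate giants, and you can be sure that if Microsoft had had its way, "View Source" would never have seen the light of day.

The web is a lot more complex now than it was 20 years ago. Sites are no longer just static HTML pages but a dynamic mix of HTML, JavaScript, CSS, images, JSON and XML. But it's still generally possible to see how a site, and in particular a specific effect, is implemented, or where a given piece of data comes from.

Bookmakers must hate this. Their sites are probably among the most complex on the web, needing real-time price updates and bet placement facilities, yet even so they have no way of keeping the mechanics secret. If they want to send a price to one customer's browser, another customer can capture that information and analyse it on another machine, and there's little the bookmaker can do to stop it.

So it's trivially easy to get hold of public, undocumented price feeds on the web. Let's give it a try: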

http://www.stanjames.com/cache/boNavigationList/541/UK/58974.2.xml

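This is the root of the Stan James price/event tree.

How did I find it? Simple: I installed the HTTPFox extension for Firefox and watched the network traffic generated by the Stan James site.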

Now that we know the source URL, it's easy to knock up a script to extract the key event information.

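[Footnote: these examples are written in Python, a great language for exploring data. Go to the Python site and follow the installation instructions; it's straightforward. You'll also need the lxml package to parse the XML, and pandas (referred to as pd in the code) to print the results as tables.]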

from lxml import etree
import pandas as pd
import urllib.request

# URL templates for the Stan James feed; %s is filled in with a node id.
UrlPatterns = {
    "root": "http://www.stanjames.com/cache/boNavigationList/541/UK/%s.xml"
}

def get_categories_for_sport(sport):
    # Fetch the navigation XML for the sport and list the categories in it.
    url = UrlPatterns["root"] % sport["id"]
    doc = etree.fromstring(urllib.request.urlopen(url).read())
    return [{"name": el.xpath("name")[0].text,
             "id": el.xpath("idfwbonavigation")[0].text}
            for el in doc.xpath("//bonavigationnode")]

print(pd.DataFrame(get_categories_for_sport({"name": "Football", "id": "58974.2"})[:30]))

    id        name
0   58974.2   Football
1   79765.2   Sunday's UK and Elite League Matches
2   79764.2   Sunday's Rest Of Europe Matches
3   79821.2   Sunday's Rest of The World Matches
4   78201.2   Monday's Matches
5   121040.2  Wednesday's Matches
6   94819.2   Thursday's Matches
7   134890.2  Friday's Matches
8   121134.2  Saturday's UK and Elite League Matches
9   79760.2   Saturday's Rest of Europe Matches
10  85102.2   Saturday's Rest Of The World Matches
11  125668.2  *** Promotion***
12  113017.2  —-^—-Daily Coupons—-^—-
13  123582.2  International Matches
14  114405.2  International U21 Matches
15  94920.2   International U20 Matches
16  112919.2  International U19 Matches
17  94921.2   International U18 Matches
18  126035.2  *** Promotion
19  113015.2  —-^—-Internationals—-^—-
20  116071.2  English Football
21  119561.2  English Premier League
22  119562.2  English Championship
23  119563.2  English League 1
24  119564.2  English League 2
25  139223.2  English FA Cup
26  113023.2  English League Cup
27  116070.2  Scottish Football
28  116742.2  Scottish Premiership
29  113019.2  —-^—-Main Football Coupons—-^—-
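As a rough illustration of what get_categories_for_sport is parsing, here is a sketch that runs offline against a made-up sample. Only the element names bonavigationnode, name and idfwbonavigation come from the code above; the bonavigationlist wrapper and the overall layout are assumptions, and the sketch uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <bonavigationlist> wrapper and the layout are
# assumed; only the three inner element names appear in the article's code.
SAMPLE_XML = """
<bonavigationlist>
  <bonavigationnode>
    <idfwbonavigation>119561.2</idfwbonavigation>
    <name>English Premier League</name>
  </bonavigationnode>
  <bonavigationnode>
    <idfwbonavigation>119562.2</idfwbonavigation>
    <name>English Championship</name>
  </bonavigationnode>
</bonavigationlist>
"""

def parse_categories(xml_text):
    """Return one {"name", "id"} dict per <bonavigationnode> element."""
    doc = ET.fromstring(xml_text)
    return [{"name": node.findtext("name"),
             "id": node.findtext("idfwbonavigation")}
            for node in doc.iter("bonavigationnode")]

categories = parse_categories(SAMPLE_XML)
for cat in categories:
    print(cat["id"], cat["name"])
```

Separating the parsing from the fetching like this also makes it possible to check the extraction logic against a saved copy of the feed.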

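But of course there is more here than just the root information; a complete event tree is waiting to be parsed. So once you have the root-level categories, you can get the groups associated with them: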

def get_groups_for_category(category):
    # Fetch the category's XML and list the market groups in it.
    url = UrlPatterns["root"] % category["id"]
    doc = etree.fromstring(urllib.request.urlopen(url).read())
    return [{"name": el.xpath("name")[0].text,
             "id": el.xpath("idfwmarketgroup")[0].text}
            for el in doc.xpath("//marketgroup")]

print(pd.DataFrame(get_groups_for_category({"name": "English Premier League", "id": "119561.2"})[:20]))

    id         name
0   1034393.2  Premier League
1   807989.2   English Premier League 2013/2014 – Outright
2   870661.2   English Premier League 2013/2014 – Top Goalscorer
3   857628.2   English Premier League 2013/2014 – W/O Big 3
4   847698.2   English Premier League 2013/2014 – W/O Big 6
5   830853.2   English Premier League 2013/2014 – Top 4 Finish
6   834549.2   English Premier League 2013/2014 – Top 6 Finish
7   857626.2   English Premier League 2013/2014 – Top 8 Finish
8   828455.2   English Premier League 2013/2014 – Relegation
9   838823.2   English Premier League 2013/2014 – Top 10 Finish
10  831047.2   English Premier League 2013/2014 – To Stay Up
11  857671.2   English Premier League 2013/2014 – Top 15 Finish
12  834582.2   English Premier League 2013/2014 – To Finish B…
13  857627.2   English Premier League 2013/2014 – Top 12 Finish
14  858227.2   English Premier League 2013/2014 – Handicap
15  845037.2   English Premier League 2013/2014 – Straight Fo…
16  845038.2   English Premier League 2013/2014 – Dual Forecast
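Both functions so far repeat the same fetch-and-parse step, so it may be worth pulling it into one place. The sketch below is one possible way to do that, not part of the original scripts: fetch_xml, its opener parameter and the timeout default are all hypothetical names chosen for illustration, and it uses the standard library's xml.etree.ElementTree rather than lxml. The stand-in opener and the marketgroups wrapper element exist only so the example can run without touching the network.

```python
import urllib.request
import xml.etree.ElementTree as ET

def fetch_xml(url, opener=urllib.request.urlopen, timeout=10):
    """Fetch url and return the parsed XML root element."""
    response = opener(url, timeout=timeout)
    try:
        return ET.fromstring(response.read())
    finally:
        response.close()

# Offline demonstration with a stand-in opener, so no request is made.
class FakeResponse:
    def __init__(self, body):
        self.body = body

    def read(self):
        return self.body

    def close(self):
        pass

seen = []

def fake_opener(url, timeout=None):
    # Record how the helper called us, then return canned XML bytes.
    seen.append((url, timeout))
    return FakeResponse(b"<marketgroups><marketgroup>"
                        b"<name>Outright</name>"
                        b"</marketgroup></marketgroups>")

root = fetch_xml("http://example.invalid/feed.xml", opener=fake_opener)
names = [el.findtext("name") for el in root.iter("marketgroup")]
print(names)
```

Passing the opener in as a parameter is a design choice that keeps the helper testable; in normal use the default urllib.request.urlopen would be used and the timeout stops a slow response from hanging the script.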

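Once you have the groups, you can get any selections associated with them:

[Footnote: the event tree has different depths in different places. For outright markets there are only three levels: category, group/market and selection. For match events there are more levels: category, group, event, market and selection. But the principle for extracting the data is the same; you just need to identify the important XML attributes.]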

UrlPatterns["group"] = "http://www.stanjames.com/cache/marketgroup/UK/%s.xml"

def get_selections_for_group(group):
    # Fetch the group's XML and list its selections with fractional prices.
    url = UrlPatterns["group"] % group["id"]
    doc = etree.fromstring(urllib.request.urlopen(url).read())
    return [{"name": el.xpath("name")[0].text,
             "id": el.xpath("idfoselection")[0].text,
             "price": "%s/%s" % (el.xpath("currentpriceup")[0].text,
                                 el.xpath("currentpricedown")[0].text)}
            for el in doc.xpath("//selection")]

print(pd.DataFrame(get_selections_for_group({"name": "Top 4 Finish", "id": "830853.2"})[:10]))

   id          name         price
0  95154554.2  Man City     1/6
1  95154553.2  Chelsea      1/5
2  95154556.2  Arsenal      1/5
3  95154555.2  Man United   1/3
4  95154558.2  Liverpool    1/1
5  95154557.2  Tottenham    9/4
6  95154559.2  Everton      12/1
7  95154562.2  Southampton  20/1
8  95154560.2  Newcastle    200/1
9  95154563.2  Swansea      250/1
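The price column is a fractional price built from currentpriceup and currentpricedown. If you want to compare prices numerically, a small conversion helps. The sketch below is an illustration only: decimal_odds and implied_probability are hypothetical helper names, and the implied probability ignores the bookmaker's margin, so across a whole market the figures will sum to more than 1.

```python
from fractions import Fraction

def decimal_odds(price):
    """Convert a fractional price such as "9/4" to decimal odds (stake included)."""
    up, down = price.split("/")
    return Fraction(int(up), int(down)) + 1

def implied_probability(price):
    """Probability implied by a fractional price, ignoring the bookmaker's margin."""
    return 1 / decimal_odds(price)

# Three prices taken from the Top 4 Finish table above.
for team, price in [("Man City", "1/6"), ("Liverpool", "1/1"), ("Tottenham", "9/4")]:
    print(team, price,
          float(decimal_odds(price)),
          round(float(implied_probability(price)), 3))
```

Using Fraction keeps the arithmetic exact until the final conversion to float for display.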

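So far so good, but of course there are a lot of events/markets in the event tree and we don't want to fetch them one by one. It's better to write a search function that accepts a "matching" expression and looks through the tree for events that match it.

A sample matching expression looks like this: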

XPath = "Football~English Premier League~(Top \\d+)|(Outright)|(Relegation)|(Bottom)"

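The aim of this expression is to match any Premier League outright market whose payoff is a function of a team's final position.

Don't worry too much about the funny syntax (it uses a mini-language called "regular expressions"). The odd-looking parts work as follows:

'(Top \d+)' will match any market such as Top 4, Top 6 or Top 10
'(Outright)|(Relegation)|(Bottom)' will match any of Outright, Relegation or Bottom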
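To see the pattern in action without touching the feed, here is a short sketch that applies it to a handful of market names. The names are the trailing parts of the group names listed earlier; the raw string r"..." is equivalent to writing "\\d" in an ordinary string.

```python
import re

# The final token of the matching expression, written as a raw string.
PATTERN = r"(Top \d+)|(Outright)|(Relegation)|(Bottom)"

market_names = ["Outright", "Top Goalscorer", "Top 4 Finish", "Relegation",
                "To Stay Up", "To Finish Bottom", "Handicap"]

# re.search looks for the pattern anywhere in the name, not just at the start.
matched = [name for name in market_names if re.search(PATTERN, name)]
print(matched)
```

Note that "Top Goalscorer" is rejected because \d+ requires at least one digit after "Top ", which is exactly what separates the finishing-position markets from the goalscorer market.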

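Now we need a crawler function. It's a little involved, but the core idea is a "stack" containing the URLs the crawler still has to investigate. Each time round, the crawler takes a URL from the top of the stack, fetches the data from the site, adds any matching markets it finds to the stack, and adds any prices it finds to the results. This continues until there are no more URLs to investigate, at which point the results are returned.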

import re, time

SportIds = {"Football": "58974.2"}

# One handler per tree level: sport -> categories -> groups -> selections.
Handlers = [get_categories_for_sport,
            get_groups_for_category,
            get_selections_for_group]

def crawl_events(path, wait=1):
    tokens = path.split("~")
    if tokens[0] not in SportIds:
        raise RuntimeError("Sport not found")
    stack, results = [], []
    stack.append(({"name": tokens[0],
                   "id": SportIds[tokens[0]]},
                  [tokens[0]],
                  0))
    while stack:
        # Take the next node to visit from the front of the list.
        head, stack = stack[0], stack[1:]
        parent, path, depth = head
        # print("~".join(path))
        handler = Handlers[depth]
        if depth < len(tokens) - 1:
            # Intermediate level: keep children whose name matches the next token.
            stack += [(result,
                       path + [result["name"]],
                       depth + 1)
                      for result in handler(parent)
                      if re.search(tokens[depth + 1], result["name"])]
        else:
            # Final level: everything the handler returns is a result.
            for result in handler(parent):
                result["path"] = path + [result["name"]]
                results.append(result)
        time.sleep(wait)  # pause between requests
    return results
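To follow the traversal without making any requests, here is a minimal sketch of the same idea run against an in-memory tree. Everything in it is a stand-in: TREE is a hypothetical nested dictionary playing the role of the XML feed, crawl is a simplified version of crawl_events with no handlers or sleeping, and "Team X" is placeholder data. Because items are taken from the front of the list, the sketch uses collections.deque, which makes the first-in, first-out order explicit.

```python
import re
from collections import deque

# Hypothetical in-memory tree standing in for the feed:
# sport -> category -> group -> list of selection names.
TREE = {
    "Football": {
        "English Premier League": {
            "Outright": ["Arsenal", "Everton"],
            "Top 4 Finish": ["Arsenal"],
            "Handicap": ["Arsenal"],
        },
        "English Championship": {
            "Outright": ["Team X"],
        },
    },
}

def crawl(tree, expression):
    """Walk the tree level by level, keeping names that match each token."""
    tokens = expression.split("~")
    if tokens[0] not in tree:
        raise RuntimeError("Sport not found")
    queue = deque([(tree[tokens[0]], [tokens[0]], 0)])
    results = []
    while queue:
        node, path, depth = queue.popleft()
        if depth < len(tokens) - 1:
            # Intermediate level: queue children matching the next token.
            for name, child in node.items():
                if re.search(tokens[depth + 1], name):
                    queue.append((child, path + [name], depth + 1))
        else:
            # Last level: the children are the selections to collect.
            for name in node:
                results.append(path + [name])
    return results

paths = crawl(TREE, r"Football~English Premier League~(Top \d+)|(Outright)")
for p in paths:
    print("~".join(p))
```

The Handicap group and the Championship category are never expanded, which is the point of filtering at each level: the crawler only fetches the branches the expression asks for.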

OK, now we can run the matching expression through the crawler. When it finishes, we print the prices for a few arbitrarily chosen teams.

Selections = crawl_events(XPath)

def dump_selections(teamname):
    # Label each row "<market>/<team>", taking the market from the end of the group name.
    rows = [{"name": "%s/%s" % (item["path"][-2].split(" – ")[-1], item["path"][-1]),
             "id": item["id"],
             "price": item["price"]}
            for item in Selections
            if teamname == item["path"][-1]]
    print(pd.DataFrame(rows, columns=["id", "name", "price"]))

for teamname in ["Arsenal", "Everton", "Southampton", "Sunderland"]:
    print("-- %s --" % teamname)
    dump_selections(teamname)

-- Arsenal --
   id          name                  price
0  88418751.2  Outright/Arsenal      10/3
1  95154556.2  Top 4 Finish/Arsenal  1/5
2  95480703.2  Top 6 Finish/Arsenal  1/66

-- Everton --
   id          name                   price
0  88418754.2  Outright/Everton       150/1
1  95154559.2  Top 4 Finish/Everton   12/1
2  95480706.2  Top 6 Finish/Everton   9/4
3  96613913.2  Top 8 Finish/Everton   1/7
4  95788006.2  Top 10 Finish/Everton  1/25
5  96614250.2  Top 12 Finish/Everton  1/150

-- Southampton --
   id          name                          price
0  94586817.2  Outright/Southampton          150/1
1  95154562.2  Top 4 Finish/Southampton      20/1
2  95480713.2  Top 6 Finish/Southampton      4/1
3  96613918.2  Top 8 Finish/Southampton      8/15
4  95788011.2  Top 10 Finish/Southampton     1/6
5  96614253.2  Top 12 Finish/Southampton     1/16
6  96615241.2  Top 15 Finish/Southampton     1/250
7  95434367.2  To Finish Bottom/Southampton  150/1

-- Sunderland --
   id          name                         price
0  94962059.2  Relegation/Sunderland        11/10
1  95154567.2  Top 4 Finish/Sunderland      500/1
2  95480712.2  Top 6 Finish/Sunderland      500/1
3  96613921.2  Top 8 Finish/Sunderland      33/1
4  95788016.2  Top 10 Finish/Sunderland     16/1
5  96614258.2  Top 12 Finish/Sunderland     4/1
6  96615246.2  Top 15 Finish/Sunderland     7/4
7  95434371.2  To Finish Bottom/Sunderland  8/1

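And there you have it: prices obtained with very little effort from an undocumented event feed.

In many ways, getting prices from a feed is easier than using the Betfair API. You don't have to worry about login tokens, you don't need a SOAP library, and for certain niche markets (such as outrights) you may well find better liquidity at a bookmaker than on the exchange.

[Footnote: the reason this works with Stan James is that their back end uses a product called Finsoft Warp, which makes all prices available in static XML files. Betfred uses the same product, so its prices can be scraped in the same way. Today's homework is to use HTTPFox to work out Betfred's URL structure, which differs slightly from Stan James's.]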
