2015-06-15 59 views
3

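Scraping table data from a web page in Python

I want to scrape a data table from a web page. Every tutorial I have found online is too specific and does not explain what each parameter or element is, so I cannot work out how to apply it to my own example. Any advice on where to find good tutorials for scraping this kind of data would be appreciated.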

import urllib
import requests
from lxml import html

query = urllib.urlencode({'q': company})
page = requests.get('http://www.hoovers.com/company-information/company-search.html?term=company')
tree = html.fromstring(page.text)

table = tree.xpath('//*[@id="shell"]/div/div/div[2]/div[5]/div[1]/div/div[1]')

#Can't get xpath correct 
#This will create a list of companies: 
companies = tree.xpath('//...') 
#This will create a list of locations 
locations = tree.xpath('//....') 
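
One detail in the snippet above is easy to miss: the encoded query is built but never used, and the URL sends the literal word company as the search term. Below is a minimal sketch of building the search URL from a variable. It is written for Python 3, where urlencode lives in urllib.parse. The parameter name term is taken from the URL in the question; the helper name build_search_url and the sample company name are made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.hoovers.com/company-information/company-search.html'


def build_search_url(company):
    """Return the search URL for one company name (hypothetical helper)."""
    # urlencode escapes spaces and punctuation so the name is safe in a URL.
    return BASE_URL + '?' + urlencode({'term': company})


url = build_search_url('Acme & Sons')
print(url)
```

With requests, the same effect should be available by passing params={'term': company} to requests.get and leaving the query string off the URL.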

I have also tried:

hoover = 'http://www.hoovers.com/company-information/company-search.html?term=company' 
req = urllib2.Request(hoover) 
page = urllib2.urlopen(req) 
soup = BeautifulSoup(page) 

table = soup.find("table", { "class" : "clear data-table sortable-header dashed-table-tr alternate-rows" }) 

f = open('output.csv', 'w') 
for row in table.findAll('tr'): 
    f.write(','.join(''.join([str(i).replace(',','') for i in row.findAll('td',text=True) if i[0]!='&']).split('\n')[1;-1])+'\n') 
f.close()  

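But I get an invalid syntax error on the second-to-last line.

Answers

3

Yes, Beautiful Soup can do this. Here is a simple example that gets the names: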

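import urllib2
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4

hoover = 'http://www.hoovers.com/company-information/company-search.html?term=company'
req = urllib2.Request(hoover)
page = urllib2.urlopen(req)
# urlopen returns a file-like object with no .text attribute, so pass it directly
soup = BeautifulSoup(page)
# Find the div wrapping the results table, then collect the table's rows
trs = soup.find("div", attrs={"class": "clear data-table sortable-header dashed-table-tr alternate-rows"}).find("table").findAll("tr")
for tr in trs:
    tds = tr.findAll("td")
    # Skip rows without data cells, such as header rows
    if len(tds) < 1:
        continue
    # The first cell holds the company name
    name = tds[0].text
    print name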
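
The question also asks about writing the rows to a CSV file. The invalid syntax error there comes from the slice [1;-1], which needs a colon: [1:-1]. Below is a sketch of the whole table-to-CSV step, assuming Python 3 and BeautifulSoup 4. The sample HTML and its columns are invented stand-ins for the real page, and rows_from_table is a hypothetical helper name.

```python
import csv
import io

from bs4 import BeautifulSoup

# Invented sample markup; the real page layout may differ.
SAMPLE_HTML = """
<table class="data-table">
  <tr><th>Company</th><th>Location</th></tr>
  <tr><td>Acme, Inc.</td><td>Austin, TX</td></tr>
  <tr><td>Globex</td><td>Springfield</td></tr>
</table>
"""


def rows_from_table(html_text):
    """Return one list of cell strings per data row (hypothetical helper)."""
    soup = BeautifulSoup(html_text, 'html.parser')
    table = soup.find('table', class_='data-table')
    rows = []
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if cells:  # header rows contain only <th>, so they are skipped
            rows.append(cells)
    return rows


rows = rows_from_table(SAMPLE_HTML)

# csv.writer quotes cells that contain commas, so nothing has to be stripped.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, open it with open('output.csv', 'w', newline='') and pass that file object to csv.writer.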
+0

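Thanks! Very helpful, but I am trying to do this from the page source rather than reading in an HTML page, because the goal is to build it into a function: hoovers = 'http://www.hoovers.com/company-information/company-search.html?term=company'; req = urllib2.Request(hoovers); page = urllib2.urlopen(req); soup = BeautifulSoup(page). However, I get an AttributeError when running the third line of the solution!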

+0

The BeautifulSoup constructor accepts a stream or a string, so you should be able to pass page.text or a streamed version. – cfraschetti

+0

Sorry, I don't understand. What do you mean by passing page.text or a stream version?
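
To illustrate the distinction being asked about: the object returned by urllib2.urlopen (urllib.request.urlopen in Python 3) is a file-like stream with a read() method and no text attribute, so page.text would raise an AttributeError on it. A requests response, by contrast, exposes the decoded body as .text. BeautifulSoup accepts either a string or an open file-like object. The sketch below assumes Python 3 and BeautifulSoup 4, and uses io.BytesIO as a stand-in for the network response so that it runs offline; the sample markup is invented.

```python
import io

from bs4 import BeautifulSoup

HTML_BYTES = b'<html><body><table><tr><td>Acme</td></tr></table></body></html>'

# Stand-in for the object urlopen() returns: file-like, with read().
stream = io.BytesIO(HTML_BYTES)
print(hasattr(stream, 'text'))  # prints False: a stream has no .text

# Option 1: pass the stream itself; BeautifulSoup reads it.
soup_from_stream = BeautifulSoup(stream, 'html.parser')

# Option 2: pass a string (what requests exposes as response.text).
soup_from_string = BeautifulSoup(HTML_BYTES.decode('utf-8'), 'html.parser')

print(soup_from_stream.td.get_text())
print(soup_from_string.td.get_text())
```

Both soups contain the same parsed table, so the choice mostly depends on which HTTP library fetched the page.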