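Scraping a series of tables with BeautifulSoup

I'm trying to learn about web scraping and Python (and programming in general), and I came across the BeautifulSoup library, which seems to offer a lot of possibilities.

I'm trying to work out how best to pull the relevant information from this page: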

http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113

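I can go into more detail on this, but basically I want the company name, its description, contact details, and the various company details and statistics.

At this stage I'm looking at how to isolate this data cleanly and scrape it, with a view to putting it all into a CSV or something similar.

I'm confused about how to use BS to grab the different pieces of table data. There are lots of tr and td tags, and I'm not sure how to anchor onto anything unique.

The best I've come up with as a start is the following code: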

from urllib.request import urlopen  # Python 3 replacement for urllib2

from bs4 import BeautifulSoup

html = urlopen("http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113")
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
soupie = soup.prettify()
print(soupie)

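The idea was then to use regular expressions and so on to extract the data from the prettified text.

But surely there must be a better way to do this using the BS tree? Or is this site formatted in a way that keeps BS from being of much more help?

I'm not looking for a full solution, as that's a big ask and I want to learn, but any code snippets to get me on my way would be much appreciated.

Update

Thanks to @ZeroPiraeus below, I'm starting to understand how to parse through the tables. Here is the output from his code: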

=== Personnel === 
bodytext Ms Gail Morgan CEO 
bodytext Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422 
bodytext Lisa Mayoh Sales Manager 
bodytext Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422 Email: [email protected] 

=== Company Details === 
bodytext ACN: 007 350 807 ABN: 71 007 350 807 Australian Owned Annual Turnover: $5M - $10M Number of Employees: 6-10 QA: ISO9001-2008, AS9120B, Export Percentage: 5 % Industry Categories: AerospaceLand (Vehicles, etc)LogisticsMarineProcurement Company Email: [email protected] Company Website: http://www.aerospacematerials.com.au Office: 2/6 Ovata Drive Tullamarine VIC 3043 Post: PO Box 188 TullamarineVIC 3043 Phone: +61.3. 9464 4455 Fax: +61.3. 9464 4422 
paraheading ACN: 
bodytext 007 350 807 
paraheading ABN: 
bodytext 71 007 350 807 
paraheading 
bodytext Australian Owned 
paraheading Annual Turnover: 
bodytext $5M - $10M 
paraheading Number of Employees: 
bodytext 6-10 
paraheading QA: 
bodytext ISO9001-2008, AS9120B, 
paraheading Export Percentage: 
bodytext 5 % 
paraheading Industry Categories: 
bodytext AerospaceLand (Vehicles, etc)LogisticsMarineProcurement 
paraheading Company Email: 
bodytext [email protected] 
paraheading Company Website: 
bodytext http://www.aerospacematerials.com.au 
paraheading Office: 
bodytext 2/6 Ovata Drive Tullamarine VIC 3043 
paraheading Post: 
bodytext PO Box 188 TullamarineVIC 3043 
paraheading Phone: 
bodytext +61.3. 9464 4455 
paraheading Fax: 
bodytext +61.3. 9464 4422

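My next question is: what is the best way to get this data into a CSV that is suitable for importing into a spreadsheet? For example, having things like "ABN", "ACN", "Company Website", etc. as column headings, with the corresponding data as the row values.

Thanks for any help.

Source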

2012-11-12 Fusilli Jerry

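Presumably you'd have one row per scraped page, then?

That would be the idea, yes.

Your code will depend on exactly what you want and how you want to store it, but this snippet should give you an idea of how to get the relevant information out of the page: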

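import requests

from bs4 import BeautifulSoup

url = "http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113"
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly

# Each section title ("Personnel", "Company Details") sits in a td.Feature-Heading
for feature_heading in soup.find_all("td", {"class": "Feature-Heading"}):
    print("\n=== %s ===" % feature_heading.text)
    # The section's content is in the td that follows the heading
    details = feature_heading.find_next_sibling("td")
    for item in details.find_all("td", {"class": ["bodytext", "paraheading"]}):
        # Print the cell's class, then its text with whitespace collapsed
        print("\t".join([item["class"][0], " ".join(item.text.split())]))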

I find requests a more pleasant library than urllib2 (urllib.request in Python 3), but of course that's up to you.
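To try the heading/value pairing without hitting the live site, here is a minimal offline sketch. The sample markup is hypothetical: it only borrows the paraheading and bodytext class names seen in the output, and the real page is likely messier. The helper names clean and extract_pairs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the class names in the scraped output.
SAMPLE_HTML = """
<table>
  <tr><td class="paraheading">ACN:</td><td class="bodytext">007 350 807</td></tr>
  <tr><td class="paraheading">ABN:</td><td class="bodytext">71 007 350 807</td></tr>
  <tr><td class="paraheading">Annual Turnover:</td><td class="bodytext">$5M - $10M</td></tr>
</table>
"""


def clean(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def extract_pairs(html):
    """Return {heading: value} for each paraheading cell and the bodytext cell after it."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = {}
    for heading in soup.find_all("td", class_="paraheading"):
        key = clean(heading.get_text()).rstrip(":")
        value_cell = heading.find_next_sibling("td", class_="bodytext")
        # Skip headings with no text or no matching value cell
        if key and value_cell is not None:
            pairs[key] = clean(value_cell.get_text())
    return pairs


pairs = extract_pairs(SAMPLE_HTML)
print(pairs)
```

Note that this sketch drops any value whose heading cell is empty (the output has one such case, "Australian Owned"), so you may want to give those a key of your own.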

Edit:

In answer to your follow-up question, here's something you could use to write the scraped data to a CSV file:

import csv
import requests

from bs4 import BeautifulSoup

columns = ["ACN", "ABN", "Annual Turnover", "QA"]
urls = ["http://www.aidn.org.au/Industry-ViewCompany.asp?CID=3113", ]  # ... etc.

# newline="" lets the csv module control line endings (Python 3)
with open("data.csv", "w", newline="") as csv_file:
    writer = csv.DictWriter(csv_file, columns)
    writer.writeheader()
    for url in urls:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        row = {}  # one CSV row per company page
        for heading in soup.find_all("td", {"class": "paraheading"}):
            # Normalise the heading text, e.g. "ACN: " -> "ACN"
            key = " ".join(heading.text.split()).rstrip(":")
            if key in columns:
                next_td = heading.find_next_sibling("td", {"class": "bodytext"})
                if next_td is None:  # heading without a value cell
                    continue
                value = " ".join(next_td.text.split())
                row[key] = value
        writer.writerow(row)
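If you would rather not hard-code the column list, one possible variation is to scrape every page into a dict first and derive the columns from whatever headings turned up. The sketch below uses only the standard library and assumes the rows have already been scraped; the second row and the helper names collect_columns and rows_to_csv are hypothetical.

```python
import csv
import io

# Hypothetical scraped rows: one dict per company page.
rows = [
    {"ACN": "007 350 807", "ABN": "71 007 350 807", "QA": "ISO9001-2008, AS9120B,"},
    {"ACN": "000 000 000", "Company Website": "http://example.com"},  # made-up values
]


def collect_columns(rows):
    """Return every key seen in any row, in first-seen order."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    return columns


def rows_to_csv(rows):
    """Serialise the rows to CSV text, leaving missing fields empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=collect_columns(rows), restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

To write to disk instead, pass a file opened with open("data.csv", "w", newline="") to csv.DictWriter in place of the StringIO buffer.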

Source

2012-11-12 19:32:31

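Thanks very much @ZeroPiraeus. This goes a long way toward helping me work out a strategy for using BS. I've edited my question with a further question, if you wouldn't mind taking a look?

You're welcome ... but I think your edit must have got mangled; it now just says "thanks".

Yes, I had to go back to it. It should be there now.

I've been down this road before. The HTML pages I worked with always used the same table format and were internal to the company. We made sure the customer knew that if they changed the page, it would most likely break the program. With that stipulation, it was possible to work out where each value sat by indexing into the lists of tr and td elements. It's a long way from the ideal situation of having XML data, which they couldn't or wouldn't provide, but it has been running for nearly a year now. If anyone knows a better answer, I'd like to hear it too. That was my first and only time using Beautiful Soup; I've never needed it since, but it worked very well.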
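A minimal sketch of that positional approach might look like the following. The sample table, the FIELD_POSITIONS map, and the extract_by_position helper are all hypothetical, not the code that was actually deployed. Raising an error when an index is missing makes a layout change fail loudly rather than silently producing wrong data.

```python
from bs4 import BeautifulSoup

# Hypothetical fixed-layout page: the value is always the second cell of its row.
SAMPLE_HTML = """
<table>
  <tr><td>Name</td><td>Example Pty Ltd</td></tr>
  <tr><td>Phone</td><td>555 0100</td></tr>
  <tr><td>Employees</td><td>6-10</td></tr>
</table>
"""

# Assumed (row index, cell index) of each value in the agreed layout.
FIELD_POSITIONS = {"name": (0, 1), "phone": (1, 1), "employees": (2, 1)}


def extract_by_position(html, positions):
    """Look up each field by its fixed row and cell index."""
    soup = BeautifulSoup(html, "html.parser")
    rows = [tr.find_all("td") for tr in soup.find_all("tr")]
    record = {}
    for field, (row_idx, cell_idx) in positions.items():
        try:
            cell = rows[row_idx][cell_idx]
        except IndexError:
            # The page no longer matches the agreed layout
            raise ValueError(
                "layout changed: no cell at row %d, column %d" % (row_idx, cell_idx)
            )
        record[field] = " ".join(cell.get_text().split())
    return record


record = extract_by_position(SAMPLE_HTML, FIELD_POSITIONS)
print(record)
```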

Source

2012-11-12 17:56:59

Interesting - thanks for the insight @Dave_750
