2014-02-10

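Python + BeautifulSoup: exporting to CSV

I'm having some trouble automatically scraping data from a table in a Wikipedia article. At first I got an encoding error. I specified UTF-8 and the error went away, but many characters in the scraped data still do not display correctly. As you will be able to tell from the code, I'm a complete newbie: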

from bs4 import BeautifulSoup 
import urllib2 

wiki = "http://en.wikipedia.org/wiki/Anderson_Silva" 
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia 
req = urllib2.Request(wiki,headers=header) 
page = urllib2.urlopen(req) 
soup = BeautifulSoup(page) 

Result = "" 
Record = "" 
Opponent = "" 
Method = "" 
Event = "" 
Date = "" 
Round = "" 
Time = "" 
Location = "" 
Notes = "" 

table = soup.find("table", { "class" : "wikitable sortable" }) 

f = open('output.csv', 'w') 

for row in table.findAll("tr"): 
    cells = row.findAll("td") 
    #For each "tr", assign each "td" to a variable. 
    if len(cells) == 10: 
        Result = cells[0].find(text=True)
        Record = cells[1].find(text=True)
        Opponent = cells[2].find(text=True)
        Method = cells[3].find(text=True)
        Event = cells[4].find(text=True)
        Date = cells[5].find(text=True)
        Round = cells[6].find(text=True)
        Time = cells[7].find(text=True)
        Location = cells[8].find(text=True)
        Notes = cells[9].find(text=True)

        write_to_file = Result + "," + Record + "," + Opponent + "," + Method + "," + Event + "," + Date + "," + Round + "," + Time + "," + Location + "\n"
        write_to_unicode = write_to_file.encode('utf-8')
        print write_to_unicode
        f.write(write_to_unicode)

f.close() 

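Have you tried using the csv module (http://docs.python.org/2/library/csv.html)? It handles quoting and the like. The documentation also points you in the right direction for writing out text in different encodings. As for your specific problem, though... what exactly is not displaying correctly as UTF-8? According to the meta tag on that page, the charset is UTF-8. – pswaminathan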
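
To illustrate the quoting point from the comment above, here is a minimal sketch (Python 3, standard library only, so not the Python 2 used in the question) comparing manual comma-joining with `csv.writer`. The sample row is made up for illustration and is not taken from the scraped table.

```python
import csv
import io

# Hypothetical sample row; the last field itself contains commas.
row = ["Win", "33-4", "Some Opponent", "KO (punches)",
       "Las Vegas, Nevada, United States"]

# Manual joining: the embedded commas look exactly like separators.
naive_line = ",".join(row)

# csv.writer quotes any field that contains the delimiter.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
csv_line = buffer.getvalue()

# Parse both lines back and compare the number of fields.
naive_fields = next(csv.reader([naive_line]))
csv_fields = next(csv.reader([csv_line]))

print(len(row), len(naive_fields), len(csv_fields))
```

The manually joined line comes back with more fields than went in, while the line written by `csv.writer` round-trips to the original list.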

Answers

As pswaminathan pointed out, using the csv module will help a great deal. Here is how I would do it:

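import csv

table = soup.find('table', {'class': 'wikitable sortable'})
# Python 2: open in binary mode so the csv module controls line endings
with open('out2.csv', 'wb') as f:
    csvwriter = csv.writer(f)
    for row in table.findAll('tr'):
        # Encode each cell's text as UTF-8 bytes before writing
        cells = [c.text.encode('utf-8') for c in row.findAll('td')]
        # Skip rows that do not have all 10 td cells (e.g. header rows)
        if len(cells) == 10:
            csvwriter.writerow(cells)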
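
The snippet above targets Python 2. As a rough sketch of the same writing step on Python 3 (an assumption on my part; this thread only uses Python 2), the encoding moves from each cell to the `open` call, and `newline=""` is what the csv documentation asks for. The `rows` list and the output path are hypothetical stand-ins for the scraped cells and the real file.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for the text of the td cells in each table row.
rows = [
    ["Win", "João Exemplo", "São Paulo, Brazil"],
    ["Loss", "Some Opponent", "Las Vegas, Nevada, United States"],
]

# Illustrative output path inside a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "out_py3.csv")

# In Python 3 the file object does the encoding; newline="" lets csv
# control the line endings itself.
with open(out_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for cells in rows:
        writer.writerow(cells)

# Read the file back with the same encoding to check the round trip.
with open(out_path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))

print(read_back == rows)
```

If a file written this way still shows garbled characters, it may be worth checking which encoding the viewing program assumes when it opens the file.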

Discussion

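  • Using the csv module, I create a csvwriter object attached to my output file.
  • By using the with statement, I don't have to worry about closing the output file when I'm done: it is closed automatically at the end of the with block.
  • In my code, cells is a list of UTF-8-encoded text extracted from the td tags inside each tr tag.
  • I used c.text, a more concise construct than c.find(text=True).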
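
On the last point, the two constructs are not strictly equivalent, which may matter for cells that contain links. A small sketch, assuming Python 3 with Beautiful Soup 4.4.0 or later (where `string=` is the current spelling of the `text=` argument) and a made-up table cell:

```python
from bs4 import BeautifulSoup

# Made-up table cell: a link followed by plain text.
html = ('<table><tr><td><a href="#">Some Opponent</a> (champion)</td>'
        '</tr></table>')
cell = BeautifulSoup(html, "html.parser").find("td")

# find(string=True) returns only the first text node inside the cell.
first_text = cell.find(string=True)

# .text joins every text node inside the cell.
full_text = cell.text

print(repr(str(first_text)))
print(repr(full_text))
```

So `c.text` keeps the text that follows a link, while `c.find(text=True)` stops at the first text node it encounters.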