2014-04-10 32 views

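Scraping a website and moving the data into multiple CSV columns

I am scraping several categories from a web page into a CSV file. The first category is written out as a column successfully, but the data for the second column is not written to the CSV. The code I am using: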

import urllib2 
import csv 
from bs4 import BeautifulSoup 
url = "http://digitalstorage.journalism.cuny.edu/sandeepjunnarkar/tests/jazz.html" 
page = urllib2.urlopen(url) 
soup_jazz = BeautifulSoup(page) 
all_years = soup_jazz.find_all("td",class_="views-field views-field-year") 
all_category = soup_jazz.find_all("td",class_="views-field views-field-category-code") 
with open("jazz.csv", 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow([u'Year Won', u'Category'])
    for years in all_years:
        year_won = years.string
        if year_won:
            csv_writer.writerow([year_won.encode('utf-8')])
    for categories in all_category:
        category_won = categories.string
        if category_won:
            csv_writer.writerow([category_won.encode('utf-8')])

It writes the column header into the second column, but not the category_won values.

Following your suggestion, I have changed it to read:

with open("jazz.csv", 'w') as f: 
    csv_writer = csv.writer(f) 
    csv_writer.writerow([u'Year Won', u'Category']) 
for years, categories in zip(all_years, all_category): 
    year_won = years.string 
    category_won = categories.string 
    if year_won and category_won: 
     csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')]) 

But now I get the following error:

csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')])
ValueError: I/O operation on closed file

Answer


You can zip() the two lists together, so that each row is written with both values at once:

for years, categories in zip(all_years, all_category): 
    year_won = years.string 
    category_won = categories.string 
    if year_won and category_won: 
     csv_writer.writerow([year_won.encode('utf-8'), category_won.encode('utf-8')]) 
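The pairing idea can be tried without touching the network. The following is only a minimal sketch, written for Python 3 (the code in this thread is Python 2), and it assumes hypothetical sample values in place of the scraped cells and an in-memory buffer in place of jazz.csv. It shows that each writerow() call must receive both columns of one record:

```python
import csv
import io

# Hypothetical sample values standing in for the text of the scraped cells.
years = ["2012", "2012", "1958"]
categories = [
    "Best Improvised Jazz Solo",
    "Best Jazz Vocal Album",
    "Best Jazz Performance, Group",
]

# In-memory stand-in for open("jazz.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Year Won", "Category"])

for year, category in zip(years, categories):
    # One writerow() call per record, with both columns in the same list.
    # Calling writerow([year]) and writerow([category]) separately would
    # put every value in the first column, on rows of its own.
    writer.writerow([year, category])

csv_text = buffer.getvalue()
print(csv_text)
```

Note that zip() stops at the shorter list, so this only lines up correctly when both lists have one entry per record.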

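Unfortunately, the HTML of that page is somewhat broken, so you cannot search by table row the way you would expect.

The next best thing is to look for the year cells and then find the sibling cell of each one: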

soup_jazz = BeautifulSoup(page)
with open("jazz.csv", 'w') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow([u'Year Won', u'Category'])
    for year_cell in soup_jazz.find_all('td', class_='views-field-year'):
        year = year_cell and year_cell.text.strip().encode('utf8')
        if not year:
            continue
        # Take the first following sibling that is a category <td>;
        # the name check skips the text nodes between the cells.
        category = next(
            (e for e in year_cell.next_siblings
             if getattr(e, 'name', None) == 'td' and
             'views-field-category-code' in e.attrs.get('class', [])),
            None)
        category = category and category.text.strip().encode('utf8')
        if year and category:
            csv_writer.writerow([year, category])

This produces:

Year Won,Category 
2012,Best Improvised Jazz Solo 
2012,Best Jazz Vocal Album 
2012,Best Jazz Instrumental Album 
2012,Best Large Jazz Ensemble Album 
.... 
1960,Best Jazz Composition Of More Than Five Minutes Duration 
1959,Best Jazz Performance - Soloist 
1959,Best Jazz Performance - Group 
1958,"Best Jazz Performance, Individual" 
1958,"Best Jazz Performance, Group" 
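For readers who want to experiment without BeautifulSoup, the same "find the year cell, then the category cell that follows it" idea can be approximated with the standard library alone. This is a sketch in Python 3 and not the method used in the answer: it pairs each year cell with the next category cell in document order, and the YearCategoryParser class, the extract_rows helper and the sample markup are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup: the first <tr> is never closed, as on a broken page.
SAMPLE_HTML = """
<table>
<tr><td class="views-field views-field-year"> 2012 </td>
<td class="views-field views-field-category-code">Best Jazz Vocal Album</td>
<tr><td class="views-field views-field-year">1958</td>
<td class="views-field views-field-category-code">Best Jazz Performance, Group</td></tr>
</table>
"""


class YearCategoryParser(HTMLParser):
    """Pair each year cell with the next category cell that follows it."""

    def __init__(self):
        super().__init__()
        self.rows = []      # collected (year, category) pairs
        self._field = None  # "year", "category" or None for other content
        self._text = []     # text fragments of the cell being read
        self._year = None   # last year seen, still waiting for a category

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        self._finish_cell()  # tolerate a missing </td> before this cell
        classes = (dict(attrs).get("class") or "").split()
        if "views-field-year" in classes:
            self._field = "year"
        elif "views-field-category-code" in classes:
            self._field = "category"

    def handle_data(self, data):
        if self._field:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "td":
            self._finish_cell()

    def _finish_cell(self):
        text = "".join(self._text).strip()
        if self._field == "year" and text:
            self._year = text
        elif self._field == "category" and text and self._year:
            self.rows.append((self._year, text))
            self._year = None
        self._field = None
        self._text = []


def extract_rows(html):
    parser = YearCategoryParser()
    parser.feed(html)
    parser.close()
    parser._finish_cell()  # flush a cell left open at the end of the input
    return parser.rows


rows = extract_rows(SAMPLE_HTML)
buffer = io.StringIO()  # stand-in for the real output file
writer = csv.writer(buffer)
writer.writerow(["Year Won", "Category"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the pairing does not depend on tr elements, the unclosed row in the sample does not matter. The csv module quotes the category that contains a comma, as in the output above.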

Just tried it, and now I get the error I listed above. – user1922698


@user1922698: Then you are trying to run the loop *outside* of the 'with' statement. –
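The effect of the indentation can be reproduced in isolation. The sketch below is written for Python 3 and assumes a hypothetical scratch file and sample rows. Rows written inside the 'with' block are saved, while a writerow() call made after the block has ended fails because the file has already been closed:

```python
import csv
import os
import tempfile

# Hypothetical sample rows and scratch file, used only for this demonstration.
rows = [("2012", "Best Jazz Vocal Album"), ("1959", "Best Jazz Performance - Group")]
path = os.path.join(tempfile.mkdtemp(), "jazz_demo.csv")

with open(path, "w", newline="") as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(["Year Won", "Category"])
    # The loop is indented under 'with', so f is still open for every row.
    for year, category in rows:
        csv_writer.writerow([year, category])

# Leaving the 'with' block closed f, so writing through csv_writer now fails.
try:
    csv_writer.writerow(["1958", "written too late"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

with open(path, newline="") as f:
    saved_rows = list(csv.reader(f))

print(error_message)
print(saved_rows)
```

Only the header and the two rows written inside the block end up in the file.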


But the output generated above shows the same category over and over again, when they are all different categories. – user1922698