
How can I return NaN for URLs where ".//*[@id='object']//tbody//tr//td//span//a[2]" matches nothing? In other words: how do I return NaN for pages where no information was scraped? I tried:

def parse(self, response):
    links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
    if not links:
        item = ToyItem()
        item['link'] = 'NaN'
        item['name'] = response.url
        return item

    for link in links:
        item = ToyItem()
        item['link'] = link.xpath('@href').extract_first()
        item['name'] = response.url # <-- see here
    yield item

    list_of_dics = []
    list_of_dics.append(item)
    df = pd.DataFrame(list_of_dics)
    print(df)
    df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)
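
(For reference, the snippets here assume an item along these lines; the real ToyItem lives in the project's items.py and its fields may differ, so treat this as a minimal sketch.)

# items.py -- minimal sketch of the item assumed above (hypothetical field set)
import scrapy

class ToyItem(scrapy.Item):
    link = scrapy.Field()  # extracted href, or 'NaN' when nothing matched
    name = scrapy.Field()  # URL of the page that was parsed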

However, instead of getting (*):

'link1.com' 'NaN' 
'link2.com' 'NaN' 
'link3.com' 'extracted3.link.com' 

I get:

'link3.com' 'extracted3.link.com' 

How can I get (*)?

Answer


You can rework this to use a Scrapy pipeline:

from scrapy import Spider

from myproject.items import ToyItem  # the item with 'link' and 'name' fields

class MySpider(Spider):
    name = 'myspider'
    start_urls = ['link1', 'link2', 'link3']

    def parse(self, response):
        links = response.xpath(".//*[@id='object']//tbody//tr//td//span//a[2]")
        if not links:
            item = ToyItem()
            item['link'] = 'NaN'
            item['name'] = response.url
            yield item
        else:
            for link in links:
                item = ToyItem()
                item['link'] = link.xpath('@href').extract_first()
                item['name'] = response.url # <-- see here
                yield item

Now, in your pipelines.py:

import pandas as pd

class PandasPipeline:

    def open_spider(self, spider):
        self.data = []

    def process_item(self, item, spider):
        self.data.append(item)
        return item

    def close_spider(self, spider):
        df = pd.DataFrame(self.data)
        print('saving dataframe')
        df.to_csv('/Users/user/Desktop/crawled_table.csv', index=False)

And in settings.py:

ITEM_PIPELINES = { 
    'myproject.pipelines.PandasPipeline': 900 
} 
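
With the pipeline registered, the CSV is written once, when the spider closes, instead of being rewritten on every parse call. A quick way to exercise it from a plain script (a sketch; the spider's module path below is an assumption about your project layout):

# run_spider.py -- sketch of running the spider with the project settings applied
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders.myspider import MySpider  # hypothetical module path

process = CrawlerProcess(get_project_settings())  # picks up ITEM_PIPELINES from settings.py
process.crawl(MySpider)
process.start()  # blocks until the crawl finishes; close_spider() then writes the CSV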

@tumbleweed Good, that means no links were found on the page. Is 'parse' being called more than once? – Granitosaurus


Yes, and for the pages where no link is found, I'd like to know how to yield 'NaN' rather than 'None', keeping their referring URL in the left column. – tumbleweed


Mate, every time 'parse' is called, 'to_csv' overwrites the old csv with the new data, so basically you only end up with the data from the last 'parse' call, i.e. the last crawled link. – Granitosaurus
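
Put differently, if you really wanted to keep writing from inside parse, each call would have to append instead of overwrite, something like the helper sketched below (append_row is a hypothetical name); the close_spider approach above avoids this bookkeeping entirely.

import os
import pandas as pd

def append_row(item, path='/Users/user/Desktop/crawled_table.csv'):
    # write the header only on the first call, then append one row per parsed page
    df = pd.DataFrame([dict(item)])
    df.to_csv(path, mode='a', header=not os.path.exists(path), index=False)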