2013-12-12

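Scrapy crawler scraping extra data

I'm new to Scrapy and am trying to scrape Hacker News. I can get all the links and titles from the site, but items with an empty title and link keep being scraped as well. How can I avoid this, or did I make a mistake when declaring the XPaths?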

spider.py

from scrapy.spider import BaseSpider
from scrapy.selector import Selector

from hn.items import HnItem

class HNSpider(BaseSpider):
    name = "hn"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["news.ycombinator.com"]
    start_urls = [
        "https://news.ycombinator.com/"
    ]

    def parse(self, response):
        selector = Selector(response)
        # One item is built for every <td class="title"> cell on the page
        sites = selector.xpath('//td[@class="title"]')
        items = []
        for site in sites:
            item = HnItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            items.append(item)
        for item in items:
            yield item

Output

2013-12-12 11:50:46+0530 [hn] DEBUG: Crawled (200) <GET https://news.ycombinator.com/> (referer: None) 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=11171475'], 
     'title': [u'Backpacker stripped of tech gear at Auckland Airport']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://sivers.org/ws'], 'title': [u'Why was this secret?']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.theatlantic.com/politics/archive/2013/12/how-americans-were-deceived-about-cell-phone-location-data/282239/'], 
     'title': [u'How Americans Were Deceived About Cell Phone Location Data']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.rockpapershotgun.com/2013/12/11/youtube-blocks-game-videos-industry-offers-help/'], 
     'title': [u'YouTube Blocks Game Videos, Industry Offers Help']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://blog.fsck.com/2013/12/better-and-better-keyboards.html'], 
     'title': [u'Prototype ergonomic mechanical keyboards']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.timmins.net/2013/12/11/how-att-verizon-and-comcast-are-working-together-to-screw-you-by-discontinuing-landline-service/'], 
     'title': [u'How AT&T, Verizon, and Comcast are working together to screw you']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://blog.samaltman.com/h5n1'], 'title': [u'H5N1']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.digitaltrends.com/gadgets/parents-dislike-infant-seat-ipad-mount/'], 
     'title': [u'Parents Revolt Over Fisher-Price Infant Seat With Face-Level iPad Mount ']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'https://www.fsf.org/news/reform-corporate-surveillance'], 
     'title': [u'Reform corporate surveillance']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://googledrive.blogspot.com/2013/12/newsheets.html?m=1'], 
     'title': [u'New Google Sheets: faster, more powerful, and works offline']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://blogs.marketwatch.com/thetell/2013/12/11/fidelity-now-allows-clients-to-put-bitcoins-in-iras/'], 
     'title': [u'Fidelity now allows clients to put bitcoins in IRAs']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://bitmason.blogspot.ca/2013/09/what-are-containers-anyway.html'], 
     'title': [u'What are Linux containers and how did they come about?']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.cbc.ca/news/canada/ottawa/canada-post-to-phase-out-urban-home-mail-delivery-1.2459618'], 
     'title': [u'Canada Post to phase out urban home mail delivery']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.reuters.com/article/2013/12/11/fda-antibiotic-idUSL3N0JQ36T20131211'], 
     'title': [u'U.S. FDA to phase out some antibiotic use in animal production']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'https://lists.gnu.org/archive/html/guix-devel/2013-12/msg00061.html'], 
     'title': [u'GNU Guix 0.5 released']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'https://sites.google.com/site/ancientbharat/home'], 
     'title': [u'Ancient Indian Texts']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.creativebloq.com/responsive-design-tools-8134180'], 
     'title': [u'Responsive design tools']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.keacher.com/1216/how-i-introduced-a-27-year-old-computer-to-the-web/'], 
     'title': [u'How I introduced a 27-year-old computer to the web']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://blog.sendtoinc.com/2013/12/11/silicon-valley-internship-j1-visa/'], 
     'title': [u'How to intern in Silicon Valley with a J1 visa']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'https://www.crowdtilt.com/campaigns/project-marilyn-part-i?utm_source=HackerNews&utm_medium=HNPost&utm_campaign=ProjectMarilyn'], 
     'title': [u'Project Marilyn Part I: Non-Patented Cancer Pharmaceutical']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://steamcommunity.com/groups/steamuniverse#announcements/detail/1930088300965516570'], 
     'title': [u'Steam Machines and Steam Controller shipping to beta participants December 13th']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://blog.alexmaccaw.com/an-engineers-guide-to-stock-options'], 
     'title': [u'An Engineer\u2019s guide to Stock Options']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.vim3d.com/'], 
     'title': [u'Vim3D \u2013 A new 3D vi clone [video]']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://da-data.blogspot.com/2013/12/briefly-profitable-alt-coin-mining-on.html'], 
     'title': [u'Briefly profitable alt-coin mining on Amazon through better code']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://blog.jetbrains.com/idea/2013/12/intellij-idea-13-brings-a-full-bag-of-goodies-to-android-developers/'], 
     'title': [u'IntelliJ IDEA 13 Brings a Full Bag of Goodies to Android Developers']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://crowdmed.theresumator.com/apply/'], 
     'title': [u'CrowdMed (YC W13) is hiring a VP of Marketing + Web Dev and Design Interns']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://jh3y.github.io/tyto/'], 'title': [u'Show HN: tyto']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://www.washingtonpost.com/blogs/the-switch/wp/2013/12/10/nsa-uses-google-cookies-to-pinpoint-targets-for-hacking/'], 
     'title': [u'NSA uses Google cookies to pinpoint targets for hacking']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'https://access.redhat.com/site/products/Red_Hat_Enterprise_Linux/Get-Beta?intcmp=70160000000cINoAAM'], 
     'title': [u'Red Hat Enterprise Linux 7 Beta']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [], 'title': []} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'http://thenextweb.com/dd/2013/12/11/digia-releases-qt-5-2-android-ios-support-previews-windows-rt-launches-qt-mobile-edition/'], 
     'title': [u'Digia releases Qt 5.2 with Android and iOS support']} 
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/> 
     {'link': [u'news2'], 'title': [u'More']} 
2013-12-12 11:50:46+0530 [hn] INFO: Closing spider (finished) 

As you may have noticed from the output, items with an empty title [] and link [] are repeated all the way through.

How can I correct this? Please help.

Answer

There are a few ways to do this:

  1. Through a Scrapy item pipeline (http://doc.scrapy.org/en/latest/topics/item-pipeline.html): you can add a simple pipeline that drops an item if it has no title or link.
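    from scrapy.exceptions import DropItem

    # Enable this class through the ITEM_PIPELINES setting in settings.py.
    class DropEmptyPipeline(object):
        def process_item(self, item, spider):
            # The spider always sets both keys (possibly to []), so test for
            # non-empty values rather than for key presence.
            if item.get("title") and item.get("link"):
                return item
            else:
                raise DropItem("Missing title or link in %s" % item)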
    
  2. By not adding an item to the collected items when it has no title or link:
    # Both keys are always set in parse(), so check that the values are non-empty
    if item.get("title") and item.get("link"):
        items.append(item)
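
The alternating empty items in the output suggest that `//td[@class="title"]` also matches cells that contain no link (presumably a rank-number cell in each row, which is an assumption about the page markup at the time). A further option may therefore be to narrow the XPath itself. The sketch below is not Scrapy code: it uses Python's standard-library ElementTree on a made-up, simplified table so that it runs on its own. The `SAMPLE` markup and the `extract_items` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for two Hacker News rows (not the real
# markup): a rank cell and a title cell that share class="title".
SAMPLE = """
<table>
  <tr>
    <td class="title">1.</td>
    <td class="title"><a href="http://sivers.org/ws">Why was this secret?</a></td>
  </tr>
  <tr>
    <td class="title">2.</td>
    <td class="title"><a href="http://blog.samaltman.com/h5n1">H5N1</a></td>
  </tr>
</table>
"""


def extract_items(root, cell_xpath):
    """Build one item per matched cell, mirroring the loop in the spider."""
    items = []
    for cell in root.findall(cell_xpath):
        links = cell.findall("a")  # direct <a> children, like 'a/...' in the spider
        items.append({
            "title": [a.text for a in links],
            "link": [a.get("href") for a in links],
        })
    return items


root = ET.fromstring(SAMPLE)

# Broad expression: also matches the rank cells, so every other item is empty.
broad = extract_items(root, ".//td[@class='title']")

# Narrowed expression: the [a] predicate keeps only cells with an <a> child.
narrow = extract_items(root, ".//td[@class='title'][a]")

# Key presence cannot tell the two kinds of item apart; truthiness can.
all_have_keys = all("title" in i and "link" in i for i in broad)
filtered = [i for i in broad if i["title"] and i["link"]]

print(len(broad), len(narrow), all_have_keys, filtered == narrow)
```

In the spider, the corresponding change would be `selector.xpath('//td[@class="title"][a]')`, assuming the live page has the same structure as the sample. Filtering at the XPath level avoids creating the empty items in the first place, while the pipeline approach keeps the spider unchanged and drops them afterwards.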