2013-04-18 59 views
2

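I want to scrape details from Wikipedia using Scrapy. I am able to scrape the page, but the result I get is very messy and poor. Since I am new to both Python and Scrapy, I am having a hard time solving this. How can I get clean results from Scrapy?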

Here is my code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from wikipedia.items import WikipediaItem


class WikipediaSpider(BaseSpider):
    name = "wiki"
    allowed_domains = ["wikipedia.org"]
    start_urls = ["http://en.wikipedia.org/wiki/Main_Page"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//table[@id="mp-upper"]/tr')
        items = []
        for site in sites:
            item = WikipediaItem()
            item['title'] = site.select('.//a/text()').extract()
            item['link'] = site.select('.//a/@href').extract()
            item['details'] = site.select('.//p/text()').extract()
            items.append(item)
        return items

And this is the result:

2013-04-19 02:18:48+0800 [wiki] DEBUG: Scraped from <200 http://en.wikipedia.org/wiki/Main_Page> 

{'details': [u' is a fungal species found in moist habitats in ', 

u'. The species produces brown ', 
       u' with ', 

       u' of varying shapes up to 40 millimetres (1.6\xa0in) across, and tall, thin ', 

       u' up to 62 millimetres (2.4\xa0in) long, at the base of which is a large and well-defined "bulb". The stem varies in colour, with whitish, pale yellow-brown, pale red-brown, pale brown and grey-brown all observed. The species produces unusually shaped, irregular ', 

       u', each with a few thick protrusions. This feature helps differentiate it from other species that would otherwise be similar in appearance and ', 

       u'. It grows in ', 

       u' association with ', 

       u', and it is for this that the species is named. However, particular species favoured by the fungus are unclear and may include ', 

       u' and ', 

       u' taxa. The mushrooms grow from the ground, often among mosses or ', 

       u'. The species was first described in 2009, and within the genus ', 

       u', it is a part of the ', 

       u' ', 

       u'. The ', 

       u' ', 

       u' was collected from the shore of a lake near ', 

       u', Finland. The species has also been recorded in Sweden and, at least in some areas, it is relatively common. (', 

       u')', 

       u'Recently featured: ', 

       u'\xa0\u2013 ', 

       u'\xa0\u2013 ', 

       u': ', 

       u' ', 

       u' ', 

       u'More anniversaries: ', 

       u' ', 

       u' '], 

    'link': [u'/wiki/File:Inocybe_saliceticola.jpg', 

       u'/wiki/Inocybe_saliceticola', 

       u'/wiki/Nordic_countries', 

       u'/wiki/Mushrooms', 

       u'/wiki/Pileus_(mycology)', 

       u'/wiki/Stipe_(mycology)', 

       u'/wiki/Spore', 

       u'/wiki/Habit_(biology)', 

       u'/wiki/Mycorrhizal', 

       u'/wiki/Willow', 

       u'/wiki/Beech', 

       u'/wiki/Alder', 

       u'/wiki/Detritus', 

       u'/wiki/Section_(botany)', 

       u'/wiki/Holotype', 

       u'/wiki/Nurmes', 

       u'/wiki/Inocybe_saliceticola', 

       u'/wiki/Thistle,_Utah', 

       u'/wiki/Be_Here_Now_(album)', 

       u'/wiki/Sumatran_rhinoceros', 

       u'/wiki/Wikipedia:Today%27s_featured_article/April_2013', 

       u'https://lists.wikimedia.org/mailman/listinfo/daily-article-l', 

       u'/wiki/Wikipedia:Featured_articles', 

       u'/wiki/Wikipedia:Recent_additions', 

       u'/wiki/File:Ezra_Meeker_1921_crop.jpg', 

       u'/wiki/Ezra_Meeker', 

       u'/wiki/Oregon_Trail', 

       u'/wiki/Bullock_cart', 

       u'/wiki/Italy_at_the_2009_Mediterranean_Games', 

       u'/wiki/2009_Mediterranean_Games_medal_table', 

       u'/wiki/Cossack_hetman', 

       u'/wiki/Ivan_Petrizhitsky-Kulaga', 

       u'/wiki/Cossacks', 

       u'/wiki/Fokus_(magazine)', 

       u'/wiki/Amir_Garrett', 

       u'/wiki/College_basketball', 


       u'/wiki/Fastball', 

       u'/wiki/Armenian_Genocide', 

       u'/wiki/Karin_dialect', 

       u'/wiki/Scottish_American', 

       u'/wiki/Daniel_Pennie_House', 

       u'/wiki/Wikipedia:Recent_additions', 

       u'/wiki/Wikipedia:Your_first_article', 

       u'/wiki/Template_talk:Did_you_know', 

       u'/wiki/Slang', 

       u'/wiki/Hammer', 

       u'/wiki/Church_(building)', 

       u'/wiki/Wikipedia:Today%27s_articles_for_improvement', 

       u'/wiki/File:2013_Boston_Marathon_aftermath_people.jpg', 

       u'/wiki/West_fertilizer_plant_explosion', 

       u'/wiki/West,_Texas', 

       u'/wiki/Texas', 

       u'/wiki/Moment_magnitude_scale', 

       u'/wiki/2013_Sistan_and_Baluchestan_earthquake', 

       u'/wiki/Sistan_and_Baluchestan_Province', 

       u'/wiki/15_April_2013_Iraq_attacks', 

       u'/wiki/Boston_Marathon_bombings', 

       u'/wiki/2013_Boston_Marathon', 

       u'/wiki/Death_and_state_funeral_of_Hugo_Ch%C3%A1vez', 

       u'/wiki/Nicol%C3%A1s_Maduro', 

       u'/wiki/Venezuelan_presidential_election,_2013', 

       u'/wiki/List_of_Presidents_of_Venezuela', 

       u'/wiki/Adam_Scott_(golfer)', 

       u'/wiki/2013_Masters_Tournament', 

       u'/wiki/Government_of_India', 

       u'/wiki/Bollywood', 

       u'/wiki/Pran', 

       u'/wiki/Dadasaheb_Phalke_Award', 

       u'/wiki/Deaths_in_2013', 

       u'/wiki/Colin_Davis', 

       u'/wiki/Maria_Tallchief', 

       u'/wiki/Jonathan_Winters', 

       u'//en.wikinews.org/wiki/Main_Page', 

       u'/wiki/Portal:Current_events', 

       u'/wiki/April_18', 

       u'/wiki/File:Stpetes.JPG', 

       u'/wiki/1506', 

       u'/wiki/St._Peter%27s_Basilica', 

       u'/wiki/Vatican_City', 

       u'/wiki/Old_St._Peter%27s_Basilica', 

       u'/wiki/1689', 

       u'/wiki/Militia_(United_States)', 

       u'/wiki/Boston', 

       u'/wiki/1689_Boston_revolt', 

       u'/wiki/Dominion_of_New_England', 

       u'/wiki/1923', 

       u'/wiki/New_York_Yankees', 

       u'/wiki/Major_League_Baseball', 

       u'/wiki/Yankee_Stadium_(1923)', 

       u'/wiki/1938', 

       u'/wiki/Superman', 

       u'/wiki/Jerry_Siegel', 

       u'/wiki/Joe_Shuster', 

       u'/wiki/Action_Comics_1', 

       u'/wiki/Superhero', 

       u'/wiki/Comic_book', 

       u'/wiki/1947', 

       u'/wiki/List_of_the_largest_artificial_non-nuclear_explosions', 

       u'/wiki/Royal_Navy', 

       u'/wiki/Tonne', 

       u'/wiki/Ammunition', 

       u'/wiki/Heligoland', 

       u'/wiki/1949', 

       u'/wiki/Republic_of_Ireland', 

       u'/wiki/Commonwealth_of_Nations', 

       u'/wiki/1996', 

       u'/wiki/1996_shelling_of_Qana', 

       u'/wiki/Qana', 

       u'/wiki/Operation_Grapes_of_Wrath', 

       u'/wiki/United_Nations_Interim_Force_in_Lebanon', 

       u'/wiki/April_17', 

       u'/wiki/April_18', 

       u'/wiki/April_19', 

       u'/wiki/Wikipedia:Selected_anniversaries/April', 

       u'https://lists.wikimedia.org/mailman/listinfo/daily-article-l', 

       u'/wiki/List_of_historical_anniversaries', 

       u'/wiki/Coordinated_Universal_Time', 

       u'//en.wikipedia.org/w/index.php?title=Main_Page&action=purge'], 
'title': [u'Inocybe saliceticola', 

u'Nordic countries', 

       u'mushrooms', 

       u'caps', 

       u'stems', 

       u'spores', 

       u'habit', 

       u'mycorrhizal', 

       u'willow', 

       u'beech', 

       u'alder', 

       u'detritus', 

       u'section', 

       u'holotype', 

       u'Nurmes', 

       u'Thistle, Utah', 

       u'Be Here Now', 

       u'Sumatran rhinoceros', 

       u'Archive' 

       u'List of historical anniversaries', 

       u'UTC', 

       u'Reload this page']} 
+1

Wikipedia provides an API: http://www.mediawiki.org/wiki/API:Main_page – dm03514

+0

If you still want to use Scrapy, please edit your post and state the format in which you want the scraped results. – alecxe

Answer

2

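I can't access the same page you did, but the result you are getting is probably so erratic because Wikipedia text is full of links. When you do site.select('.//p/text()'), you select only the text nodes that sit directly under the <p> node. This means that the content of child nodes such as <a href=..>text</a> is not selected. The link tags split the paragraph, so you end up with a strange list of fragments.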
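To make the difference concrete, here is a minimal sketch that uses only the Python 3 standard library rather than Scrapy itself. The HTML snippet is a made-up stand-in modelled on the featured-article paragraph, and the helper names direct_text and all_text are hypothetical. ElementTree does not support the XPath text() node test, so the helpers imitate what p/text() and p//text() would match.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet modelled on the featured-article paragraph.
SNIPPET = (
    '<p><a href="/wiki/Inocybe_saliceticola">Inocybe saliceticola</a>'
    ' is a fungal species found in moist habitats in '
    '<a href="/wiki/Nordic_countries">Nordic countries</a>.</p>'
)

p = ET.fromstring(SNIPPET)


def direct_text(node):
    """Text nodes that are direct children of node (analogous to p/text())."""
    parts = [node.text] + [child.tail for child in node]
    return [t for t in parts if t]


def all_text(node):
    """Every descendant text node in document order (analogous to p//text())."""
    return list(node.itertext())


direct = direct_text(p)      # link text is missing, only the pieces in between
everything = all_text(p)     # link text is included
print(direct)
print(''.join(everything))
```

The first list contains only the pieces between the links, which is the same pattern as the 'details' output in the question. The second contains every piece, so joining it gives the full sentence.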

If you want to retrieve every node, you can use

contents = site.select('.//p/node()').extract() 
item['details'] = ''.join(contents) 

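This way, you will have everything inside the <p> tags (including the <a> tags). If you only want the text without the link tags, you can use strip_html(item['details']) (actually, contents = site.select('.//p//text()').extract() would probably work as well, and it makes fuller use of XPath).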

+0

Thanks, it works. However, the Wikipedia content I wanted to scrape has since changed, so I am not getting exactly the same results. – Apple