2014-03-25 64 views
2
def parse_linkpage(self, response): 
    hxs = HtmlXPathSelector(response) 
    item = QualificationItem() 
    xpath = """ 
      //h2[normalize-space(.)="Entry requirements for undergraduate courses"] 
      /following-sibling::p 
      """ 
    item['Qualification'] = hxs.select(xpath).extract()[1:] 
    item['Country'] = response.meta['a_of_the_link'] 
    return item 

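So I am wondering whether I can get my code to stop scraping when it reaches the next <h2>. Is it possible to scrape only the content that follows a specific heading?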

Here is the web page:

<h2>Entry requirements for undergraduate courses</h2> 
<p>Example1</p> 
<p>Example2</p> 
<h2>Postgraduate Courses</h2> 
<p>Example3</p> 
<p>Example4</p> 

I want these results:

Example1 
Example2 

but I get:

Example1 
Example2 
Example3 
Example4 

I know I can change this line,

item['Qualification'] = hxs.select(xpath).extract() 

to,

item['Qualification'] = hxs.select(xpath).extract()[0:2] 

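but this scraper runs over many different pages, and some of them may have more than 2 paragraphs under the first heading, which means it would leave that information out.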

Is there a way to tell it to extract exactly the data that follows the heading I want, rather than everything?

Answers

2

It's not very pretty or easy to read, but you can use the EXSLT extensions to XPath and the set:difference() operation:

>>> selector.xpath(""" 
    set:difference(//h2[normalize-space(.)="Entry requirements for undergraduate courses"] 
        /following-sibling::p, 
        //h2[normalize-space(.)="Entry requirements for undergraduate courses"] 
        /following-sibling::h2[1] 
        /following-sibling::p)""").extract() 
[u'<p>Example1</p>', u'<p>Example2</p>'] 

The idea is to select all the p elements that follow the target h2, and exclude those p elements that come after the next h2.

Here is a somewhat easier-to-read version:

>>> for h2 in selector.xpath('//h2[normalize-space(.)="Entry requirements for undergraduate courses"]'): 
...  paragraphs = h2.xpath("""set:difference(./following-sibling::p, 
...            ./following-sibling::h2[1]/following-sibling::p)""").extract() 
...  print paragraphs 
... 
[u'<p>Example1</p>', u'<p>Example2</p>'] 
>>> 
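
To make the set-difference idea concrete without Scrapy, here is a minimal sketch in plain Python using only the standard library. It is not the EXSLT call itself: it rebuilds "all p after the target h2, minus the p after the next h2" by hand. The helper names `following_siblings` and `paragraphs_under` are made up for illustration, and the sample markup is wrapped in a `<div>` (an assumption) so that it parses as a single tree.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a <div> so it parses as one tree.
HTML = """<div>
<h2>Entry requirements for undergraduate courses</h2>
<p>Example1</p>
<p>Example2</p>
<h2>Postgraduate Courses</h2>
<p>Example3</p>
<p>Example4</p>
</div>"""

TARGET = "Entry requirements for undergraduate courses"


def following_siblings(parent, node, tag):
    """Return the siblings that come after `node` and have the given tag."""
    children = list(parent)
    start = children.index(node) + 1
    return [child for child in children[start:] if child.tag == tag]


def paragraphs_under(parent, heading_text):
    """All <p> after the matching <h2>, minus the <p> after the next <h2>."""
    target = next(
        (h2 for h2 in parent.findall("h2")
         if " ".join("".join(h2.itertext()).split()) == heading_text),
        None,
    )
    if target is None:
        return []
    all_after = following_siblings(parent, target, "p")
    next_h2 = following_siblings(parent, target, "h2")[:1]
    excluded = following_siblings(parent, next_h2[0], "p") if next_h2 else []
    # Hand-written equivalent of set:difference(all_after, excluded).
    return [p for p in all_after if p not in excluded]


root = ET.fromstring(HTML)
print([p.text for p in paragraphs_under(root, TARGET)])
```

Because the second set is empty when there is no later h2, the last section of a page keeps all of its paragraphs.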
0

Maybe you can use this XPath:

//h2[normalize-space(.)="Entry requirements for undergraduate courses"] 
     /following-sibling::p[not(preceding-sibling::h2[normalize-space(.)!="Entry requirements for undergraduate courses"])] 

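You can add another predicate to following-sibling::p that excludes any p with a preceding-sibling h2 whose text is not equal to "Entry requirements for undergraduate courses".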
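
As a rough check of what that predicate keeps, the sketch below mimics it in plain Python with the standard library only. The function name `filter_by_predicate` and the second sample page (with an extra "Overview" heading) are hypothetical and not from the question. It suggests one thing to watch for: if any other h2 appears before the target heading, every paragraph is excluded.

```python
import xml.etree.ElementTree as ET

TARGET = "Entry requirements for undergraduate courses"

# Page from the question, wrapped in a <div> so it parses as one tree.
PAGE = """<div>
<h2>Entry requirements for undergraduate courses</h2>
<p>Example1</p>
<p>Example2</p>
<h2>Postgraduate Courses</h2>
<p>Example3</p>
<p>Example4</p>
</div>"""

# Hypothetical variant: another heading comes before the target one.
PAGE_WITH_EARLIER_H2 = PAGE.replace(
    "<div>", "<div>\n<h2>Overview</h2>\n<p>Intro</p>", 1
)


def filter_by_predicate(parent, heading_text):
    """Keep <p> after the target <h2> that have no differently titled <h2> before them."""
    children = list(parent)
    kept = []
    for i, node in enumerate(children):
        if node.tag != "p":
            continue
        # Normalized text of every <h2> sibling that precedes this <p>.
        earlier_h2 = [
            " ".join("".join(c.itertext()).split())
            for c in children[:i]
            if c.tag == "h2"
        ]
        after_target = heading_text in earlier_h2
        no_other_h2 = all(text == heading_text for text in earlier_h2)
        if after_target and no_other_h2:
            kept.append(node.text)
    return kept


print(filter_by_predicate(ET.fromstring(PAGE), TARGET))
print(filter_by_predicate(ET.fromstring(PAGE_WITH_EARLIER_H2), TARGET))
```

On the page from the question this keeps Example1 and Example2; on the variant it keeps nothing, so the expression is safest when the target heading is the first h2 among its siblings.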