2016-05-19 40 views
3

Using XPath in Scrapy to select any text following a paragraph

My initial code works, but it misses some oddly formatted content on the site:

response.xpath("//*[contains(., 'Description:')]/following-sibling::p/text()").extract() 


    <div id="body"> 
    <a name="main_content" id="main_content"></a> 
    <!-- InstanceBeginEditable name="main_content" --> 
<div class="return_to_div"><a href="../../index.html">HOME</a> | <a href="../index.html">DEATH ROW</a> | <a href="index.html">INFORMATION</a> | text</div> 
<h1>text</h1> 
<h2>text</h2> 
<p class="text_bold">text:</p> 
<p>text</p> 
<p class="text_bold">text:</p> 
<p>text</p> 
<p class="text_bold">Description:</p> 
<p>Line1</p> 
<p>Line2</p> 
Line3 <!-- InstanceEndEditable --> 
    </div> 

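I have no problem pulling Line1 and Line2, but Line3 is not a sibling of my <p> elements. This only happens on some of the pages I am trying to scrape from the table.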

Here is the link: https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html

Sorry, XPath just confuses me. Is there a way to extract all the data that comes after the node matching //*[contains(., 'Description:')], without it having to be a sibling?

Thanks in advance.

Edit: Changed the example to better reflect the actual page. Added a link to the original page.

What do you want from the page? –

Answers

3

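You can select all sibling nodes (elements and text nodes) that follow the <p> containing "Description:" (following-sibling::node()), and then get all of their text nodes (descendant-or-self::text()):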

>>> import scrapy 
>>> response = scrapy.Selector(text="""<div> 
... <p> Name </p> 
... <p> Age </p> 
... <p class="text-bold"> Description: </p> 
... <p> Line 1 </p> 
... <p> Line 2 </p> 
... Line 3 
... </div>""", type="html") 
>>> response.xpath("""//div/p[contains(., 'Description:')] 
...  /following-sibling::node() 
...   /descendant-or-self::text()""").extract() 
[u'\n ', u' Line 1 ', u'\n ', u' Line 2 ', u'\nLine 3\n'] 
>>> 

Let's break it down.

So, you already know how to find the right <p>, the one containing "Description:" (using the XPath //div/p[contains(., 'Description:')]):

>>> response.xpath("//div/p[contains(., 'Description:')]").extract() 
[u'<p class="text-bold"> Description: </p>'] 

You want the <p>s that follow it (the following-sibling:: axis + a p element test):

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::p").extract() 
[u'<p> Line 1 </p>', u'<p> Line 2 </p>'] 

That does not give you the third line. So you read up on XPath and try the "catch-all" *:

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::*").extract() 
[u'<p> Line 1 </p>', u'<p> Line 2 </p>'] 

Still no luck. Why? Because * selects only elements (commonly called "tags", to simplify).

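The third line you are after is a text node, a child of the parent <div> element. But a text node is also a node (!), so you can select it as a sibling of that famous <p> above: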

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()").extract() 
[u'\n ', u'<p> Line 1 </p>', u'\n ', u'<p> Line 2 </p>', u'\nLine 3\n'] 
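The element-versus-node distinction can also be seen outside XPath. The following is a minimal sketch that uses the standard library's xml.dom.minidom instead of Scrapy (a swapped-in tool, chosen only because it exposes text nodes as ordinary siblings). SAMPLE and following_siblings are hypothetical names introduced for illustration, and the sample is a well-formed copy of the snippet above without indentation.

```python
from xml.dom import minidom

# Hypothetical sample mirroring the answer's snippet (well-formed, so an XML parser accepts it).
SAMPLE = """<div>
<p> Name </p>
<p> Age </p>
<p class="text-bold"> Description: </p>
<p> Line 1 </p>
<p> Line 2 </p>
Line 3
</div>"""


def following_siblings(node):
    """Yield every node after `node` under the same parent,
    comparable to the following-sibling::node() step."""
    sibling = node.nextSibling
    while sibling is not None:
        yield sibling
        sibling = sibling.nextSibling


doc = minidom.parseString(SAMPLE)
# Locate the <p> whose text contains 'Description:'.
anchor = next(p for p in doc.getElementsByTagName("p")
              if "Description:" in p.firstChild.data)

siblings = list(following_siblings(anchor))
# Label each sibling as a text node or by its tag name.
kinds = ["text" if n.nodeType == n.TEXT_NODE else n.tagName for n in siblings]
print(kinds)  # ['text', 'p', 'text', 'p', 'text']
```

The last sibling is the text node holding "Line 3", which is exactly what an element-only test such as * skips.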

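OK, now it looks like we have the nodes we want ("tag" elements and text nodes). But you still get those "<p>" in the output of .extract() (the XPath selected the elements, not their "inner" text).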

So you read up some more on XPath and use the .//text() step (roughly, "all descendant text nodes from this node"):

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()//text()").extract() 
[u' Line 1 ', u' Line 2 '] 

Err, wait, where did Line 3 go?

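In fact, // is short for /descendant-or-self::node()/, so ./descendant-or-self::node()/text() will only select text nodes that are children of those following <p>s (text nodes have no children, so self::text()/text() will never match any text node):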

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()/descendant-or-self::node()/text()").extract() 
[u' Line 1 ', u' Line 2 '] 

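What you can do here is use the handy descendant-or-self axis + the text() node test, so that if following-sibling::node() reaches a text node, the "self" in descendant-or-self matches that text node and the text() node test is true: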

>>> response.xpath("//div/p[contains(., 'Description:')]/following-sibling::node()/descendant-or-self::text()").extract() 
[u'\n ', u' Line 1 ', u'\n ', u' Line 2 ', u'\nLine 3\n'] 
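The difference between the two forms can be emulated in plain Python. This is a rough sketch, not how Scrapy or lxml evaluate XPath: it uses the standard library's xml.dom.minidom, and SAMPLE, following_siblings, descendant_or_self, child_text_of_descendants and descendant_or_self_text are all hypothetical helpers written for this illustration. The sample has no indentation, so the whitespace-only strings are '\n' rather than '\n '.

```python
from xml.dom import minidom

# Hypothetical sample mirroring the answer's snippet (well-formed XML).
SAMPLE = """<div>
<p> Name </p>
<p> Age </p>
<p class="text-bold"> Description: </p>
<p> Line 1 </p>
<p> Line 2 </p>
Line 3
</div>"""


def following_siblings(node):
    """Yield the nodes after `node` under the same parent."""
    sibling = node.nextSibling
    while sibling is not None:
        yield sibling
        sibling = sibling.nextSibling


def descendant_or_self(node):
    """Yield `node` followed by all of its descendants in document order."""
    yield node
    for child in node.childNodes:
        yield from descendant_or_self(child)


def child_text_of_descendants(node):
    """Roughly node//text(): text nodes that are *children* of node or of a descendant."""
    return [child.data
            for n in descendant_or_self(node)
            for child in n.childNodes
            if child.nodeType == child.TEXT_NODE]


def descendant_or_self_text(node):
    """Roughly node/descendant-or-self::text(): node itself may be the text node."""
    return [n.data for n in descendant_or_self(node) if n.nodeType == n.TEXT_NODE]


doc = minidom.parseString(SAMPLE)
anchor = next(p for p in doc.getElementsByTagName("p")
              if "Description:" in p.firstChild.data)
siblings = list(following_siblings(anchor))

short_form = [t for s in siblings for t in child_text_of_descendants(s)]
full_form = [t for s in siblings for t in descendant_or_self_text(s)]
print(short_form)  # [' Line 1 ', ' Line 2 ']
print(full_form)   # ['\n', ' Line 1 ', '\n', ' Line 2 ', '\nLine 3\n']
```

A sibling that is itself a text node has no children, so the first helper returns nothing for it, while the second one matches it through the "self" part of the axis.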

Using the example URL from the OP's edited question:

$ scrapy shell https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html 
2016-05-19 13:14:44 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot) 
2016-05-19 13:14:44 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'} 
(...) 
2016-05-19 13:14:48 [scrapy] INFO: Spider opened 
2016-05-19 13:14:50 [scrapy] DEBUG: Crawled (200) <GET https://www.tdcj.state.tx.us/death_row/dr_info/wardadamlast.html> (referer: None) 

>>> t = response.xpath(""" 
...  //div/p[contains(., 'Last Statement:')] 
...   /following-sibling::node() 
...    /descendant-or-self::text()""").extract() 
>>> 
>>> 
>>> print(''.join(t)) 

I would like to thank everyone that has showed up on my behalf, Kathryn Cox, I love you dearly.  Thank you Randy Cannon for showing up and being a lifelong friend.  Thank you Dr. Steve Ball for trying to bring the right out.  There are a lot of injustices that are happening with this.  This is wrong.  Thank you Reverend Leon Harrison for showing me the grace of God.  Thank you for all of my friends that are out there.  This is not a capital case.  I never had intended to do anything.  I feel very grieved for the loss of Walker, and for Donovan and Marissa Walker.  I hope they can find peace and be productive in society.  I would like to thank all of my friends on the row even though everything didn’t work, close isn’t good enough.  I hope that positive change will come out of this. 
I would like to thank my father and mother for everything that they showed me.  I would like to apologize for putting them through this.  I would like to ask for the truth to come out and make positive changes.  Above all else Donovan and Marissa can find love and peace.  I hope they overcome the loss of their father.  At no time did I intend to hurt him. 
When the truth comes out I hope that they can find closure.  There are a lot of things that are not right in this world, I have had to overcome them myself.  I hope all that are on the row, I hope they find peace and solace in their life. Everyone can find peace in a Christian God or whatever God they believe in.  I thank you mom and dad for everything, I love you dearly.  One last thing, I thank all of my friends that showed loyalty and graced my life with more positive.  I would also like to thank Gustav’s mother for having such a great son, and showing me much love.  I have met good people on the row, not all of them are bad.  I hope everyone can see that.  I just want to thank everybody that came to witness this.  I thank everyone, I am sorry things didn’t work out.  May God forgive us all?  I am sorry mother and I am sorry father.  I hope you find peace and solace in your heart.  I know there is something else I need to say.  I feel that.  
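The extracted list keeps the page's raw whitespace, including strings that contain nothing else. One possible cleanup, sketched here with a hypothetical helper named clean_text_nodes and a sample list copied from the earlier example output, is to strip each piece and drop the empty ones before joining:

```python
def clean_text_nodes(pieces):
    """Strip each extracted string and drop those that were only whitespace."""
    stripped = (piece.strip() for piece in pieces)
    return [piece for piece in stripped if piece]


# Same shape as the list returned by .extract() in the earlier example.
raw = ['\n ', ' Line 1 ', '\n ', ' Line 2 ', '\nLine 3\n']
lines = clean_text_nodes(raw)
print(lines)  # ['Line 1', 'Line 2', 'Line 3']
print('\n'.join(lines))
```

Joining with a newline rather than an empty string keeps each paragraph on its own line; whether that is wanted depends on how the text is used afterwards.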
Do you think you could comment some more, based on the link provided? – BernardL

Any chance I could get a breakdown of '/following-sibling::node() ... /descendant-or-self::text()'? – BernardL

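Also, for that part of the code, will it raise an error if there are no text nodes? Sorry, I can't wait to test it, but I still don't have a console available. – BernardL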