Scrapy: how to scrape links from a table conditionally

I am a total newbie to Python and Scrapy, and I have to scrape a website that is built entirely out of tables (almost 80 tables).

The structure of the site looks like this:

<table> 
<tr> 
<td class="header" colspan="2">something</td> 
</tr> 

</table> 
<br/> 
<table> 
<tr> 
<td class="header" colspan="2">something2</td> 
</tr> 

</table> 
<br/> 
<table> 
<tr> 
<td class="header" colspan="2">something3</td> 
</tr> 
</table>

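But one of those tables contains a list of members, and I need to extract each member's profile information. The profiles vary, though: depending on the privacy settings, the information shown in the profile table changes.

The table I need to scrape looks like this, but with many more members: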

<table> 
      <tr> 
       <td colspan="4" class="header">members</td> 
      </tr> 
      <tr> 
       <td class="title">Name</td> 
       <td class="title">position</td> 
       <td class="title">hours</td> 
       <td class="title">observ</td> 
      </tr> 

      <tr> 
       <td class="c1">  
        1.- <a href="http://profiletype1" target="_blank">Homer Simpson</a> 
       </td> 
       <td class="c1"> 
        safety inspector 
       </td> 
       <td class="c1"> 
        10 
       </td> 
       <td class="c1"> 
        Neglect his duties 
       </td> 
      </tr> 
</table>

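Then I looked at the code and noticed that there are 2 types of profiles, and the XPath queries that work for one type do not work for the other.

So the question is: how can I extract each member's profile information, given that when I open a link I may find either of two different profile types? I think I need to do something like this: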

def parse(self, response):
    # if this XPath query doesn't work:
    #     try this one
    pass

Source

2017-07-18 Lena Von Engel

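I think you have pretty much answered your own question with that code, and the solution is too domain-specific for me to give a proper answer. Anyway, I will try to give you an idea of how I would approach the problem.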

def parse(self, response):
    # Probe for something that only exists on the first profile type.
    test = response.xpath("//some expression that only works in method one").extract_first()
    if test is not None:
        return self.parse_with_method_one(response)
    return self.parse_with_method_two(response)

def parse_with_method_one(self, response):
    # your logic
    pass

def parse_with_method_two(self, response):
    # your logic
    pass

Source

2017-07-18 01:22:48

Thank you for your answer, but now I am facing another problem.
