Python BeautifulSoup: scraping nth-of-type elements

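I recently started scraping a multi-page site whose page structure is really hard to scrape. It has many "nth-of-type" elements, none of which has a class of its own, but their parents share the same class. I was working with BeautifulSoup, which was great, until I ran into this horrible markup: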

<div class="detail-50"> 
    <div class="detail-panel-wrap"> 
     <h3>Contact details</h3> 
      Website: <a href="http://www.somewebsitefrompage.com">http://www.somewebsitefrompage.com</a><br />Email: <a href="mailto:somemailfrompage.com">somemailfrompage.com</a><br />Tel: 11111111 111 
        </div> 
         </div>

That looks fine so far, but I want to scrape the website, email and telephone separately. I have tried many approaches, such as

website = soup.select('div.detail-panel-wrap')[1].text

but none of them worked. Now here is the big problem: other elements use the same classes as the contact details:

<div class="detail-50"> 
    <div class="detail-panel-wrap"> 
     <h3>Public address</h3> 
      Mr Martin Austin, Some street, Some city, some ZIP 
        </div> 
         </div>

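This one is the address, and I need it too. There are many other divs with these two class names. Does anyone have a solution? If anything is unclear I can explain it better; sorry for the poor explanation.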

UPDATE
Using some software I have found how these should be selected, but it is hard to write in Python. Here is how the telephone is found on the page:

div#ContentPlaceHolderDefault_cp_content_ctl00_CharityDetails_4_TabContainer1_tpOverview_plContact.detail-panel div.detail-50:nth-of-type(1) div.detail-panel-wrap

This one is the address:

div#ContentPlaceHolderDefault_cp_content_ctl00_CharityDetails_4_TabContainer1_tpOverview_plContact.detail-panel div.detail-50:nth-of-type(2) div.detail-panel-wrap

This one is the website:

div.detail-50 a:nth-of-type(1)

And this one is the contact's email:

div.detail-panel-wrap a:nth-of-type(2)

Note: ContentPlaceHolderDefault_cp_content_ctl00_CharityDetails_4_TabContainer1_tpOverview_plContact

is the id of the parent div that sits above all of these.

Does anyone have an idea how to write these in BS4 Python?


2016-08-11 Ukii

If there are multiple divs with the class detail-panel-wrap, you can use the h3 text to get the ones you want:

contact = soup.find("h3", text="Contact details").parent 
address = soup.find("h3", text="Public address").parent

If we run this against the sample, you can see that we get both divs:

In [22]: html = """ 
    ....: <div class="detail-50"> 
    ....:  <div class="detail-panel-wrap"> 
    ....:   <h3>Contact details</h3> 
    ....:    Website: <a href="http://www.somewebsitefrompage.com">http://www.somewebsitefrompage.com</a><br />Email: <a href="mailto:somemailfrompage.com">somemailfrompage.com</a><br />Tel: 11111111 111 
    ....:      </div> 
    ....:  </div> 
    ....:  <div class="detail-50"> 
    ....:   <div class="detail-panel-wrap"> 
    ....:    <h3>Public address</h3> 
    ....:     Mr Martin Austin, Some street, Some city, some ZIP 
    ....:   </div> 
    ....:  </div> 
    ....:  <div class="detail-panel-wrap"> 
    ....:   <h3>foo</h3> 
    ....:  </div> 
    ....:  <div class="detail-panel-wrap"> 
    ....:   <h3>bar</h3> 
    ....:  </div> 
    ....: </div> 
    ....:  """ 

In [23]: from bs4 import BeautifulSoup 

In [24]: soup = BeautifulSoup(html,"lxml") 

In [25]: contact = soup.find("h3", text="Contact details").parent 

In [26]: address = soup.find("h3", text="Public address").parent 

In [27]: print(contact) 
<div class="detail-panel-wrap"> 
<h3>Contact details</h3> 
      Website: <a href="http://www.somewebsitefrompage.com">http://www.somewebsitefrompage.com</a><br/>Email: <a href="mailto:somemailfrompage.com">somemailfrompage.com</a><br/>Tel: 11111111 111 
        </div> 

In [28]: print(address) 
<div class="detail-panel-wrap"> 
<h3>Public address</h3> 
       Mr Martin Austin, Some street, Some city, some ZIP 
     </div>

There may be other ways, but without seeing the full HTML structure it is impossible to know.
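Once the two panels are located by their h3 text, the individual fields can be pulled out of them. The following is only a minimal sketch against the sample markup from the question, not the real page: the helper name `panel_for` is hypothetical, and the rules used here (the email link has a `mailto:` href, the phone number is a bare text node starting with `Tel:`) are assumptions taken from that sample.

```python
from bs4 import BeautifulSoup

# Sample markup from the question (two panels that share the same classes).
html = """
<div class="detail-50">
  <div class="detail-panel-wrap">
    <h3>Contact details</h3>
    Website: <a href="http://www.somewebsitefrompage.com">http://www.somewebsitefrompage.com</a><br />Email: <a href="mailto:somemailfrompage.com">somemailfrompage.com</a><br />Tel: 11111111 111
  </div>
</div>
<div class="detail-50">
  <div class="detail-panel-wrap">
    <h3>Public address</h3>
    Mr Martin Austin, Some street, Some city, some ZIP
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def panel_for(soup, heading):
    """Hypothetical helper: return the parent of the <h3> whose text is
    exactly `heading`, or None when no such heading exists."""
    h3 = soup.find("h3", string=heading)
    return h3.parent if h3 is not None else None


contact = panel_for(soup, "Contact details")
address_panel = panel_for(soup, "Public address")

website = email = telephone = None

# Assumption: the email link uses a mailto: href; the first other link is the website.
for a in contact.find_all("a", href=True):
    if a["href"].startswith("mailto:"):
        email = a.get_text(strip=True)
    elif website is None:
        website = a["href"]

# Assumption: the phone number is a plain text node that starts with "Tel:".
for text in contact.stripped_strings:
    if text.startswith("Tel:"):
        telephone = text[len("Tel:"):].strip()

# The address is the panel text with the heading left out.
address = " ".join(
    s for s in address_panel.stripped_strings if s != "Public address"
)

print(website, email, telephone, address, sep="\n")
```

On the real page the headings may carry extra whitespace or the labels may differ, in which case the exact-match lookup and the `Tel:` prefix check would need adjusting.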

As for your edit, you just need to pass the selectors to select_one:

telephone = soup.select_one("#ContentPlaceHolderDefault_cp_content_ctl00_CharityDetails_4_TabContainer1_tpOverview_plContact.detail-panel div.detail-50:nth-of-type(1) div.detail-panel-wrap")    

address = soup.select_one("#ContentPlaceHolderDefault_cp_content_ctl00_CharityDetails_4_TabContainer1_tpOverview_plContact.detail-panel div.detail-50:nth-of-type(2) div.detail-panel-wrap") 


website = soup.select_one("div.detail-50 a:nth-of-type(1)") 

email = soup.select_one("div.detail-panel-wrap a:nth-of-type(2)")

But there is no guarantee that selectors which work in Chrome's tools will also work on the source you get back.
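Because `select_one` returns `None` when nothing matches, a selector copied from the browser can be checked against the downloaded source before its text is used. The sketch below is illustrative only: the helper `select_text` is hypothetical, and the short id `plContact` stands in for the long id from the question. Note also that `:nth-of-type` counts sibling elements by tag name, not by class.

```python
from bs4 import BeautifulSoup


def select_text(soup, selector):
    """Hypothetical helper: text of the first match, or None if the
    selector matches nothing in the parsed source."""
    node = soup.select_one(selector)
    return node.get_text(" ", strip=True) if node is not None else None


# Reduced stand-in for the real page; "plContact" is an assumed short id.
html = """
<div id="plContact" class="detail-panel">
  <div class="detail-50"><div class="detail-panel-wrap"><h3>Contact details</h3>Tel: 11111111 111</div></div>
  <div class="detail-50"><div class="detail-panel-wrap"><h3>Public address</h3>Some street</div></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

base = "#plContact.detail-panel div.detail-50:nth-of-type({n}) div.detail-panel-wrap"

first = select_text(soup, base.format(n=1))    # first div.detail-50 under the panel
missing = select_text(soup, base.format(n=3))  # there is no third div, so no match

print(first)
print(missing)
```

If a selector that works in the browser yields `None` here, the markup returned to the script probably differs from what the browser renders, and falling back to a lookup by heading text may be more robust.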


2016-08-12 00:28:19

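Hello, thank you very much for your help. As you can see, I have added an update; it should be straightforward, but the problem is how to write it in Python code. – Ukii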

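@Ukii, the selectors you added in your edit look like they were copied from Chrome's tools and may not work on the actual source, but either way, you literally just call select with them –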
