Using selectors to scrape items in a Python script

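I have written some Python code to fetch a company's details and contact name from a web page, using CSS selectors to collect the items. When I run it, though, I get only the first part of "Contact Details" and "Contact"; the full string in each is split by "br" tags, and everything after the first tag is missing. How can I get the whole string rather than just the part I have now?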

The script I tried:

import requests
from lxml import html

tree = html.fromstring(requests.get("https://www.austrade.gov.au/SupplierDetails.aspx?ORGID=ORG8000000314&folderid=1736").text) 
for title in tree.cssselect("div.contact-details"): 
    cDetails = title.cssselect("h3:contains('Contact Details')+p")[0].text 
    cContact = title.cssselect("h4:contains('Contact')+p")[0].text 
    print(cDetails, cContact)

The element the results are searched in:

<div class="contact-details block dark"> 
       <h3>Contact Details</h3><p>Company Name: Distance Learning Australia Pty Ltd<br>Phone: +61 2 6262 2964<br>Fax: +61 2 6169 3168<br>Email: <a href="mailto:[email protected]">[email protected]</a><br>Web: <a target="_blank" href="http://dla.edu.au">http://dla.edu.au</a></p><h4>Address</h4><p>Suite 108A, 49 Phillip Avenue<br>Watson<br>ACT<br>2602</p><h4>Contact</h4><p>Name: Christine Jarrett<br>Phone: +61 2 6262 2964<br>Fax: +61 2 6169 3168<br>Email: <a href="mailto:[email protected]">[email protected]</a></p> 
      </div>

The result I get:

Company Name: Distance Learning Australia Pty Ltd Name: Christine Jarrett

The result I'm after:

Company Name: Distance Learning Australia Pty Ltd 
Phone: +61 2 6262 2964 
Fax: +61 2 6169 3168 
Email: [email protected] 

Name: Christine Jarrett 
Phone: +61 2 6262 2964 
Fax: +61 2 6169 3168 
Email: [email protected]

By the way, my intention is to do this with selectors rather than xpath. Thanks in advance.


2017-08-23 SIM

Just replace the text attribute with the text_content() method, as below, to get the required output:

cDetails = title.cssselect("h3:contains('Contact Details')+p")[0].text_content() 
cContact = title.cssselect("h4:contains('Contact')+p")[0].text_content()


2017-08-23 10:58:38 Andersson

Where there is Andersson, there is hope! Thanks a lot, sir. It worked like magic. – SIM

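One thing beyond this question that I would like to understand, Mr. Andersson: why can't I use "::after" or "::before" in my selectors? Whenever I try, I get the error "Pseudo-elements are not supported." Yet I found them in a document about CSS selectors. Is there some version-related conflict? – SIM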

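You can't locate pseudo-elements because they are not part of the DOM. They can be used in CSS selectors to style parts of the HTML source, but they don't work for web scraping – Andersson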

text returns only the first text node. If you want to grab all the text nodes, use xpath to iterate over all the child nodes, like this:

company_details = title.cssselect("h3:contains('Contact Details')+p")[0] 
for node in company_details.xpath("child::node()"): 
    print(node)

Result:

Company Name: Distance Learning Australia Pty Ltd 
<Element br at 0x7f625419eaa0> 
Phone: +61 2 6262 2964 
<Element br at 0x7f625419ed08> 
Fax: +61 2 6169 3168 
<Element br at 0x7f625419e940> 
Email: 
<Element a at 0x7f625419e8e8> 
<Element br at 0x7f625419eba8> 
Web: 
<Element a at 0x7f6254155af8>
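
Building on the loop above, here is a minimal sketch of how the child nodes could be stitched back into one line per field: text nodes are appended to the current line, a br closes it, and any other element (such as a link) contributes its text. The markup is a shortened copy of the "Contact" paragraph, and info@example.com is a placeholder for the obfuscated address.

```python
from lxml import html

# Shortened copy of the "Contact" paragraph; info@example.com is a placeholder.
p = html.fromstring(
    '<p>Name: Christine Jarrett<br>Phone: +61 2 6262 2964<br>'
    'Email: <a href="mailto:info@example.com">info@example.com</a></p>'
)

lines, current = [], ""
for node in p.xpath("child::node()"):
    if isinstance(node, str):      # text node (lxml returns a str subclass)
        current += node
    elif node.tag == "br":         # line break: close the current line
        lines.append(current.strip())
        current = ""
    else:                          # any other element, e.g. <a>
        current += node.text_content()
if current.strip():                # keep the text after the last <br>
    lines.append(current.strip())

print("\n".join(lines))
```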


2017-08-23 11:01:28
