Parsing HTML data with lxml

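I'm a beginner at programming, and a friend told me to use BeautifulSoup instead of htmlparser. After running into some problems, I got a tip to use lxml instead of BeautifulSoup because its performance is 10 times better.

I hope someone can give me a hint on how to scrape the text I'm looking for.

What I want is to find a table with the following rows and data: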

<tr> 
    <td><a href="website1.com">website1</a></td> 
    <td>info1</td> 
    <td>info2</td>    
    <td><a href="spam1.com">spam1</a></td> 
</tr> 
<tr> 
    <td><a href="website2.com">website2</a></td> 
    <td>info1</td> 
    <td>info2</td>    
    <td><a href="spam2.com">spam2</a></td> 
</tr>

How can I scrape the website URL together with info1 and info2, but without the spam link, using lxml, and get the following result?

[['url', 'info1', 'info2'], ['url', 'info1', 'info2']]

2011-12-26 Retrace

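import lxml.html as LH

# `content` is the HTML source as a string
doc = LH.fromstring(content)
# For each row, collect the first cell's link target and the text of cells 2 and 3
print([tr.xpath('td[1]/a/@href | td[position()=2 or position()=3]/text()')
       for tr in doc.xpath('//tr')])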

The long XPath has the following meaning:

td[1]                               find the first <td>
    /a                              find the <a>
    /@href                          return its href attribute value
|                                   or
td[position()=2 or position()=3]    find the second or third <td>
    /text()                         return its text value
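
As a self-contained check of this expression, the following sketch runs it on the two sample rows from the question. The `<table>` wrapper and the names `sample_html` and `rows` are assumptions added for illustration; they are not part of the original answer.

```python
import lxml.html as LH

# Sample rows from the question, wrapped in a <table> (an assumption) so the
# snippet parses as a complete fragment.
sample_html = """
<table>
<tr>
    <td><a href="website1.com">website1</a></td>
    <td>info1</td>
    <td>info2</td>
    <td><a href="spam1.com">spam1</a></td>
</tr>
<tr>
    <td><a href="website2.com">website2</a></td>
    <td>info1</td>
    <td>info2</td>
    <td><a href="spam2.com">spam2</a></td>
</tr>
</table>
"""

doc = LH.fromstring(sample_html)

# The union returns nodes in document order, so within each row the href
# comes before the two text values.
rows = [
    [str(value) for value in tr.xpath(
        'td[1]/a/@href | td[position()=2 or position()=3]/text()')]
    for tr in doc.xpath('//tr')
]
print(rows)
```

Because the expression selects cells by position, the fourth cell is never visited, whatever its link text is.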

2011-12-26 13:37:42 unutbu

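You made my day with just a few lines of code, and thank you for the explanation. Actually, all the answers are great. I had been studying XPath so I could get the expression with Firebug, but it is easier to find the first table row and process the data inside it. Thanks again, guys, and Merry Christmas :) – Retrace 2011-12-26 14:20:34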

The XPath I use: td/a[not(contains(.,"spam"))]/@href | td[not(a)]/text()

$ python3 
>>> import lxml.html 
>>> doc = lxml.html.parse('data.xml') 
>>> [[j for j in i.xpath('td/a[not(contains(.,"spam"))]/@href | td[not(a)]/text()')] for i in doc.xpath('//tr')] 
[['website1.com', 'info1', 'info2'], ['website2.com', 'info1', 'info2']]
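
To see what each half of this union contributes, a small sketch can evaluate the two halves separately on one sample row. It assumes, as the expression itself does, that unwanted links contain "spam" in their link text; the `row_html` string, the `<table>` wrapper, and the variable names are illustrative only.

```python
import lxml.html

# One sample row from the question, wrapped in a <table> (an assumption).
row_html = """
<table><tr>
    <td><a href="website1.com">website1</a></td>
    <td>info1</td>
    <td>info2</td>
    <td><a href="spam1.com">spam1</a></td>
</tr></table>
"""

tr = lxml.html.fromstring(row_html).xpath('//tr')[0]

# Left half: href of every link whose text does not contain "spam".
links = [str(h) for h in tr.xpath('td/a[not(contains(., "spam"))]/@href')]

# Right half: text of every cell that has no <a> child.
texts = [str(t) for t in tr.xpath('td[not(a)]/text()')]

print(links)
print(texts)
```

Note that the filter looks at the link text, not the URL, so a wanted link whose text happened to contain "spam" would be dropped as well.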

2011-12-26 13:02:39 kev

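All the rows in the table have the same structure. I'm using Python 2.7.2+. Within a table row I only want the first 3 results, so [['url (website1)', 'info1', 'info2'], ['url (website2)', 'info1', 'info2']]. Thanks for your reply – Retrace 2011-12-26 13:13:51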

@Retrace. I updated the `xpath`. – kev 2011-12-26 13:21:03

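I think it's safe to assume that the actual content won't contain "spam", though only @Retrace can really tell us which aspects of the data are consistent. – Acorn 2011-12-26 13:45:02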

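import lxml.html as lh

# `your_html` is the HTML source as a string
tree = lh.fromstring(your_html)

result = []
# "tr" selects rows that are direct children of the root element;
# use "//tr" to find rows at any depth in a full page
for row in tree.xpath("tr"):
    # Keep only the first three cells; any later cell is ignored
    url, info1, info2 = row.xpath("td")[:3]
    result.append([url.xpath("a")[0].attrib['href'],
                   info1.text_content(),
                   info2.text_content()])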

Result:

[['website1.com', 'info1', 'info2'], ['website2.com', 'info1', 'info2']]
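
On a real page some rows may not fit this shape, for example a header row made of `<th>` cells or a first cell without a link. In those cases the tuple unpacking or the `[0]` index would raise an exception. The sketch below is one possible defensive variant of the same slicing idea. The helper name `extract_rows`, the header row in the sample, and the choice to skip non-matching rows are all assumptions made for illustration, not part of the original answer.

```python
import lxml.html as lh


def extract_rows(html):
    """Return [href, text, text] for each row whose first cell holds a link.

    Hypothetical helper: rows with fewer than three <td> cells, or without
    a link in the first cell, are skipped instead of raising an exception.
    """
    tree = lh.fromstring(html)
    result = []
    for row in tree.xpath("//tr"):
        cells = row.xpath("td")
        if len(cells) < 3:
            continue  # e.g. a header row built from <th> cells
        hrefs = cells[0].xpath("a/@href")
        if not hrefs:
            continue  # first cell has no link
        result.append([str(hrefs[0]),
                       cells[1].text_content().strip(),
                       cells[2].text_content().strip()])
    return result


# Sample input: one row from the question plus an assumed header row.
sample = """
<table>
<tr><th>site</th><th>a</th><th>b</th><th>other</th></tr>
<tr>
    <td><a href="website1.com">website1</a></td>
    <td>info1</td>
    <td>info2</td>
    <td><a href="spam1.com">spam1</a></td>
</tr>
</table>
"""
rows = extract_rows(sample)
print(rows)
```

Whether silently skipping such rows is appropriate depends on the data; logging or raising may be preferable if every row is expected to match.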

2011-12-26 13:24:24 Acorn
