Python - Parsing an HTML class

I am trying to parse the following representative HTML extract, using BeautifulSoup and lxml:

[<p class="fullDetails"> 
<strong>Abacus Trust Company Limited</strong> 
<br/>Sixty Circular Road 

      <br/>DOUGLAS 

      <br/>ISLE OF MAN 
      <br/>IM1 1SA 
      <br/> 
<br/>Tel: 01624 689600 
      <br/>Fax: 01624 689601 
     <br/> 
<br/> 
<span class="displayBlock" id="ctl00_ctl00_bodycontent_MainContent_Email">E-mail: </span> 
<a href="mailto:[email protected]" id="ctl00_ctl00_bodycontent_MainContent_linkToEmail">[email protected]</a> 
<br/> 
<span id="ctl00_ctl00_bodycontent_MainContent_Web">Web: </span> 
<a href="http://www.abacusiom.com" id="ctl00_ctl00_bodycontent_MainContent_linkToSite">http://www.abacusiom.com</a> 
<br/> 
<br/><b>Partners(s) - ICAS members only:</b> S H Fleming, M J MacBain 
     </p>]

What I want to do:

Extract the 'strong' text into company_name
Extract the text after each 'br' tag into company_line_x
Extract the 'MainContent_Email' text into company_email
Extract the 'MainContent_Web' text into company_web

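I have these problems:

1) I can extract all the text by using .findAll(text=True), but every line comes back with a lot of padding

2) Non-ASCII characters are sometimes returned, which causes csv.writer to fail. I am not 100% sure how to handle this correctly. (Previously I just used unicodecsv.writer)

Any advice would be much appreciated!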

At the moment, my function just receives the page data and uses findAll() to isolate the 'p' class:

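def get_company_data(page_data):
    # Nothing to parse: return None, as the original "pass" branch did
    if not page_data:
        return None
    # Isolate every <p class="fullDetails"> block
    company_dets = page_data.findAll("p", {"class": "fullDetails"})
    print company_dets
    return company_dets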

Source

2014-09-02 Chris Finlayson

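How do you get the page data? – alecxe 2014-09-02 12:01:22

Thanks for the reply. I am using the requests module to fetch the data, and I pass the page data to this function – 2014-09-02 12:25:42

OK, are you using the response's text attribute or its content attribute? – alecxe 2014-09-02 12:49:35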

Here is a complete solution:

from bs4 import BeautifulSoup, NavigableString, Tag 

data = """ 
your html here 
""" 

soup = BeautifulSoup(data) 
p = soup.find('p', class_='fullDetails') 

company_name = p.strong.text 
company_lines = [] 
for element in p.strong.next_siblings:
    # keep only direct text nodes, skipping tags and whitespace-only strings
    if isinstance(element, NavigableString):
        text = element.strip()
        if text:
            company_lines.append(text)

# the e-mail and web links each follow the <span> that labels them
company_email = p.find('span', text=lambda x: x.startswith('E-mail:')).find_next_sibling('a').text
company_web = p.find('span', text=lambda x: x.startswith('Web:')).find_next_sibling('a').text

print company_name
print company_lines
print company_email, company_web

Prints:

Abacus Trust Company Limited 
[u'Sixty Circular Road', u'DOUGLAS', u'ISLE OF MAN', u'IM1 1SA', u'Tel: 01624 689600', u'Fax: 01624 689601', u'S H Fleming, M J MacBain'] 
[email protected] http://www.abacusiom.com

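Note that for the company lines we have to iterate over the next siblings of the strong tag and collect all the text nodes. company_email and company_web are retrieved by their labels, in other words, by the text of the span tag that precedes each of them.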
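On the second problem from the question (csv.writer failing on non-ASCII characters), here is a minimal sketch of writing the extracted values to CSV. It assumes Python 3, where the csv module accepts text strings directly; the code above is Python 2, where csv works on byte strings, so there you would encode to UTF-8 first or keep using unicodecsv as the asker did. The helper write_company_row, the column order, the file name companies.csv and the second row are all made up for illustration.

```python
import csv
import io

# Values copied from the printed output above.
company_name = "Abacus Trust Company Limited"
company_lines = ["Sixty Circular Road", "DOUGLAS", "ISLE OF MAN", "IM1 1SA",
                 "Tel: 01624 689600", "Fax: 01624 689601",
                 "S H Fleming, M J MacBain"]
company_email = "[email protected]"
company_web = "http://www.abacusiom.com"


def write_company_row(fileobj, name, lines, email, web):
    """Write one company as a CSV row: name, e-mail, web, then each line."""
    writer = csv.writer(fileobj)
    writer.writerow([name, email, web] + list(lines))


# With a real file, open it in text mode with an explicit encoding:
#   open("companies.csv", "w", newline="", encoding="utf-8")
# An in-memory buffer stands in for the file here.
buffer = io.StringIO(newline="")
write_company_row(buffer, company_name, company_lines,
                  company_email, company_web)

# Hypothetical row (not from the page) containing a non-ASCII character.
write_company_row(buffer, "Café Example Ltd", [], "", "")

csv_text = buffer.getvalue()
print(csv_text)
```

Fields that contain a comma, such as the partners line, are quoted by csv.writer automatically, so they read back as a single field.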

Source

2014-09-02 12:13:28 alecxe

Start from the p data the same way you already do (I am using lxml for the sample code below).

To get the company name:

company_name = '' 
for strg in root.findall('strong'): 
    company_name = strg.text  # this will give you Abacus Trust Company Limited

To get the company lines/details:

company_line_x = '' 
lines = [] 
for b in root.findall('br'):
    # the text that follows a <br/> is stored as that element's tail
    if b.tail:
        addr_line = b.tail.strip()
        if addr_line:
            lines.append(addr_line)

company_line_x = ', '.join(lines) # this will give you Sixty Circular Road, DOUGLAS, ISLE OF MAN, IM1 1SA, Tel: 01624 689600, Fax: 01624 689601
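The e-mail and web links can be picked out in the same style by matching the id of each a tag. This is only a sketch: it uses the standard library's xml.etree.ElementTree, whose findall/get/text interface lxml.etree mirrors, and it parses a trimmed copy of the extract, which works only because that fragment happens to be well-formed XML. The helper link_text is made up for illustration; with real pages you would build root with lxml as above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the extract from the question.
FRAGMENT = """<p class="fullDetails">
<strong>Abacus Trust Company Limited</strong>
<br/>Sixty Circular Road
<br/>
<span class="displayBlock" id="ctl00_ctl00_bodycontent_MainContent_Email">E-mail: </span>
<a href="mailto:[email protected]" id="ctl00_ctl00_bodycontent_MainContent_linkToEmail">[email protected]</a>
<br/>
<span id="ctl00_ctl00_bodycontent_MainContent_Web">Web: </span>
<a href="http://www.abacusiom.com" id="ctl00_ctl00_bodycontent_MainContent_linkToSite">http://www.abacusiom.com</a>
<br/>
</p>"""


def link_text(root, id_suffix):
    """Return the stripped text of the first child <a> whose id ends with id_suffix."""
    for a in root.findall('a'):
        if a.get('id', '').endswith(id_suffix):
            return (a.text or '').strip()
    return None  # no matching link in this block


root = ET.fromstring(FRAGMENT)
company_email = link_text(root, 'linkToEmail')
company_web = link_text(root, 'linkToSite')
print(company_email, company_web)
```

Matching on the end of the id rather than the whole string is a guess that the ctl00_... prefix may differ between pages; if it never does, comparing the full id is simpler.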

Source

2014-09-02 12:09:16 sk11

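The OP is using 'BeautifulSoup'. – alecxe 2014-09-02 12:11:48

The OP said _using BeautifulSoup and lxml_, so I based my suggestion on lxml. Either way, the idea stays much the same. – sk11 2014-09-02 12:14:05

You are right, I misread that part. Note that you are currently missing the 'email' and 'web' parts. Thanks. – alecxe 2014-09-02 12:15:45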
