2017-08-09 95 views
2

Python Beautiful Soup: how to extract the text next to a tag?

I have the following HTML:

<p> 
<b>Father:</b> Michael Haughton 
<br> 
<b>Mother:</b> Diane 
<br><b>Brother:</b> 
Rashad Haughton<br> 
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year) 
<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p> 

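I need to separate the label and the text, for example: Mother, Diane ..

So in the end I would have a list of dictionaries:

[{"label":"Mother","value":"Diane"}] 

I tried the following, but it does not work: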

def parse(u):
    u = u.rstrip('\n')
    r = requests.get(u, headers=headers)
    if r.status_code == 200:
        html = r.text.strip()
        soup = BeautifulSoup(html, 'lxml')
        headings = soup.select('table p')
        for h in headings:
            b = h.find('b')
            if b is not None:
                print(b.text)
                print(h.text + '\n')
                print('=================================')


url = 'http://www.nndb.com/people/742/000024670/'

Answers

1
from bs4 import BeautifulSoup 
from urllib.request import urlopen 

#html = '''<p> 
#<b>Father:</b> Michael Haughton 
#<br> 
#<b>Mother:</b> Diane 
#<br><b>Brother:</b> 
#Rashad Haughton<br> 
#<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year) 
#<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>''' 

page = urlopen('http://www.nndb.com/people/742/000024670/') 
source = page.read() 

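soup = BeautifulSoup(source, 'lxml')  # name the parser explicitly; 'lxml' is the one used in the question

# On this particular page the family details are in the <p> at index 8
needed_p = soup.find_all('p')[8]

bs = needed_p.find_all('b')

res = {}

for b in bs:
    sibling = b.next_sibling
    if isinstance(sibling, str) and sibling.strip():
        # Plain text right after the label, e.g. "<b>Mother:</b> Diane"
        res[b.text] = sibling.strip()
    else:
        # Otherwise take the text of the next <a>, if there is one
        link = b.find_next('a')
        if link is not None:
            res[b.text] = link.text.strip()

res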

Output:

{'Brother:': 'Rashad Haughton', 
'Mother:': 'Diane', 
'Husband:': 'R. Kelly', 
'Father:': 'Michael Haughton', 
'Boyfriend:': 'Damon Dash'} 

Edit: to also get the additional information at the top of the page:

... (code above) ... 
soup = BeautifulSoup(source, 'lxml')

# Explicitly select the <p> tags needed for further parsing
all_p = soup.find_all('p')
needed_p = all_p[1:4] + [all_p[8]]

res = {}

for p in needed_p:
    for b in p.find_all('b'):
        sibling = b.next_sibling
        if isinstance(sibling, str) and sibling.strip():
            # Plain text right after the label
            res[b.text] = sibling.strip()
        else:
            # Otherwise take the text of the next <a>, if there is one
            link = b.find_next('a')
            if link is not None:
                res[b.text] = link.text.strip()

res

Output:

{'Race or Ethnicity:': 'Black', 
'Husband:': 'R. Kelly', 
'Died:': '25-Aug', 
'Nationality:': 'United States', 
'Executive summary:': 'R&B singer, died in plane crash', 
'Mother:': 'Diane', 
'Birthplace:': 'Brooklyn, NY', 
'Born:': '16-Jan', 
'Boyfriend:': 'Damon Dash', 
'Sexual orientation:': 'Straight', 
'Occupation:': 'Singer', 
'Cause of death:': 'Accident - Airplane', 
'Brother:': 'Rashad Haughton', 
'Remains:': 'Interred,', 
'Gender:': 'Female', 
'Father:': 'Michael Haughton', 
'Location of death:': 'Marsh Harbour, Abaco Island, Bahamas'} 

For precisely this page, you can also scrape the high school, for example like this:

res['High School'] = soup.find_all('p')[9].text.split(':')[1].strip() 
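
Positional indexes such as find_all('p')[9] stop working as soon as the page layout shifts. A minimal sketch of a label-based lookup follows; the HTML fragment, the school name and the helper value_for_label are hypothetical, and the real page's markup may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; it only assumes the label sits in a <b> inside its own <p>.
html = '<p><b>High School:</b> Example High School, Detroit, MI (1997)</p>'
soup = BeautifulSoup(html, 'html.parser')


def value_for_label(soup, label):
    """Return the text that follows the <b> whose text equals `label`, or None."""
    b = soup.find('b', string=label)
    if b is None or b.parent is None:
        return None
    # Take everything after the first colon in the enclosing paragraph.
    full_text = b.parent.get_text()
    return full_text.split(':', 1)[1].strip()


high_school = value_for_label(soup, 'High School:')
missing = value_for_label(soup, 'University:')
print(high_school)
print(missing)
```

The helper returns None for a label that is not on the page, so the caller can decide how to handle a missing entry instead of hitting an IndexError.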
+0

Would you mind explaining your code? –

+0

@Rightleg, what is it that you do not understand? –

+0

@DmitriyFialkovskiy When run against the URL (url = 'http://www.nndb.com/people/742/000024670/') it gives the error: 'res[b.text] = b.next_sibling.strip().strip('\n') AttributeError: 'NoneType' object has no attribute 'strip'' – Volatil3

0

You are looking for the next_sibling attribute of a tag. It gives you the next NavigableString or the next Tag, whichever comes first.

Here is how you can use it:

from bs4 import BeautifulSoup

html = """..."""
soup = BeautifulSoup(html, 'lxml')

bTags = soup.find_all('b')
for it_tag in bTags:
    print(it_tag.string)
    print(it_tag.next_sibling)

Output:

Father: 
Michael Haughton 

Mother: 
Diane 

Brother: 

Rashad Haughton 
Husband: 

Boyfriend: 

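This looks a bit off. Part of the reason is the line breaks and spaces, which you can easily remove with the str.strip method.

Still, the Boyfriend and Husband entries lack a value. This is because next_sibling is either a NavigableString (i.e. a str) or a Tag. The whitespace between the <b> tag and the <a> tag here is interpreted as a non-empty text node: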

<b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> 
                 ^

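If it were absent, the next sibling of <b>Boyfriend:</b> would be the <a> tag. Since it is present, you have to check:

  • whether the next sibling is a string or a tag;
  • if it is a string, whether it contains only whitespace.

If the sibling is a whitespace-only string, then the information you are looking for is in the next sibling of that NavigableString, which will be an <a> tag.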
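
To make the effect of that single space visible, here is a small sketch that compares two fragments. Both fragments and the helper describe_next_sibling are made up for illustration, and 'html.parser' is used only so that the example has no extra dependency:

```python
from bs4 import BeautifulSoup

# Two illustrative fragments: one with a space between </b> and <a>, one without.
with_space = '<p><b>Boyfriend:</b> <a href="#">Damon Dash</a></p>'
without_space = '<p><b>Boyfriend:</b><a href="#">Damon Dash</a></p>'


def describe_next_sibling(html):
    """Return (type name, text form) of the node right after the first <b>."""
    b = BeautifulSoup(html, 'html.parser').find('b')
    sibling = b.next_sibling
    return type(sibling).__name__, str(sibling)


spaced = describe_next_sibling(with_space)
unspaced = describe_next_sibling(without_space)
print(spaced)    # the sibling is a whitespace-only NavigableString
print(unspaced)  # the sibling is the <a> Tag itself
```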

Edited code:

import bs4

bTags = soup.find_all('b')

for it_tag in bTags:
    print(it_tag.string)

    nextSibling = it_tag.next_sibling
    if nextSibling is not None:
        if isinstance(nextSibling, str):
            if nextSibling.isspace():
                # Whitespace-only text: the value is in the node after it
                following = nextSibling.next_sibling
                if following is not None and following.string:
                    print(following.string.strip())
            else:
                print(nextSibling.strip())

        elif isinstance(nextSibling, bs4.Tag):
            print(nextSibling.string)

Output:

Father: 
Michael Haughton 
Mother: 
Diane 
Brother: 
Rashad Haughton 
Husband: 
R. Kelly 
Boyfriend: 
Damon Dash 

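Now you can easily build your dictionary:

import bs4

entries = {}
bTags = soup.find_all('b')

for it_tag in bTags:
    # get_text() always returns a str; .string is None when <b> is empty
    # or has more than one child node
    key = it_tag.get_text().replace(':', '').strip()
    value = None

    nextSibling = it_tag.next_sibling
    if nextSibling is not None:
        if isinstance(nextSibling, str):
            if nextSibling.isspace():
                # Whitespace-only text: the value is in the node after it
                following = nextSibling.next_sibling
                if following is not None and following.string:
                    value = following.string.strip()
            else:
                value = nextSibling.strip()

        elif isinstance(nextSibling, bs4.Tag):
            value = nextSibling.string

    entries[key] = value

Output dictionary: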

{'Father': 'Michael Haughton', 
'Mother': 'Diane', 
'Brother': 'Rashad Haughton', 
'Husband': 'R. Kelly', 
'Boyfriend': 'Damon Dash'} 
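
The question asked for a list of dictionaries with "label" and "value" keys rather than a single dictionary. The following is a sketch of one way to get that shape, not the approach from the answers above: it walks all siblings of each <b> up to the next <br> or <b>, so it also keeps the parenthetical notes after the links. The helper name extract_pairs is hypothetical, and 'html.parser' is assumed so that no extra parser has to be installed:

```python
from bs4 import BeautifulSoup, Tag

html = """<p>
<b>Father:</b> Michael Haughton
<br>
<b>Mother:</b> Diane
<br><b>Brother:</b>
Rashad Haughton<br>
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>"""


def extract_pairs(markup):
    """Collect {"label": ..., "value": ...} dicts, one per <b> label."""
    soup = BeautifulSoup(markup, 'html.parser')
    pairs = []
    for b in soup.find_all('b'):
        label = b.get_text().strip().rstrip(':')
        parts = []
        # Gather everything after the label until the next <br> or <b>.
        for node in b.next_siblings:
            if isinstance(node, Tag) and node.name in ('br', 'b'):
                break
            parts.append(node.get_text() if isinstance(node, Tag) else str(node))
        # Collapse line breaks and repeated spaces into single spaces.
        value = ' '.join(''.join(parts).split())
        pairs.append({'label': label, 'value': value})
    return pairs


pairs = extract_pairs(html)
for pair in pairs:
    print(pair)
```

If only the linked name is wanted for entries such as Husband, the inner loop could stop after the first <a> instead of collecting the trailing text.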
+0

I get the error 'line 27, in parse: if it_tag.next_sibling.isspace(): AttributeError: 'NoneType' object has no attribute 'isspace'' – Volatil3

+0

@Volatil3 I edited my code. Please check whether it works for you. I added a None check and condensed the tests. –

+0

'key = it_tag.string.replace(':', '') AttributeError: 'NoneType' object has no attribute 'replace'' – Volatil3