Python Beautiful Soup: how to extract the text next to a tag?

I have the following HTML:

<p> 
<b>Father:</b> Michael Haughton 
<br> 
<b>Mother:</b> Diane 
<br><b>Brother:</b> 
Rashad Haughton<br> 
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year) 
<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>

I need to separate each label from its text, for example Mother: Diane ..

So in the end I would have a list of dictionaries:

[{"label":"Mother","value":"Diane"}]

I tried the following, but it does not work:

def parse(u):
    u = u.rstrip('\n')
    r = requests.get(u, headers=headers)
    if r.status_code == 200:
        html = r.text.strip()
        soup = BeautifulSoup(html, 'lxml')
        headings = soup.select('table p')
        for h in headings:
            b = h.find('b')
            if b is not None:
                print(b.text)
                print(h.text + '\n')
                print('=================================')


url = 'http://www.nndb.com/people/742/000024670/'

Source

2017-08-09 Volatil3

from bs4 import BeautifulSoup 
from urllib.request import urlopen 

#html = '''<p> 
#<b>Father:</b> Michael Haughton 
#<br> 
#<b>Mother:</b> Diane 
#<br><b>Brother:</b> 
#Rashad Haughton<br> 
#<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year) 
#<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>''' 

page = urlopen('http://www.nndb.com/people/742/000024670/') 
source = page.read() 

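# Parser named explicitly (the same one the question uses); the index below may differ with another parser
soup = BeautifulSoup(source, 'lxml')

# On this particular page the family details sit in the ninth <p>
needed_p = soup.find_all('p')[8]

bs = needed_p.find_all('b')

res = {}

for b in bs:
    sibling = b.next_sibling
    if isinstance(sibling, str) and sibling.strip():
        # Plain text directly after the label
        res[b.text] = sibling.strip()
    else:
        # Otherwise fall back to the next link, if there is one
        link = b.find_next('a')
        if link is not None:
            res[b.text] = link.text.strip()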

res

Output:

{'Brother:': 'Rashad Haughton', 
'Mother:': 'Diane', 
'Husband:': 'R. Kelly', 
'Father:': 'Michael Haughton', 
'Boyfriend:': 'Damon Dash'}

Edit: for the additional information at the top of the page:

... (code above) ... 
soup = BeautifulSoup(source, 'lxml')

all_p = soup.find_all('p')
needed_p = all_p[1:4] + [all_p[8]]  # explicitly selecting the p-tags needed for further parsing

res = {}

for p in needed_p:
    for b in p.find_all('b'):
        sibling = b.next_sibling
        if isinstance(sibling, str) and sibling.strip():
            # Plain text directly after the label
            res[b.text] = sibling.strip()
        else:
            # Otherwise fall back to the next link, if there is one
            link = b.find_next('a')
            if link is not None:
                res[b.text] = link.text.strip()

res

Output:

{'Race or Ethnicity:': 'Black', 
'Husband:': 'R. Kelly', 
'Died:': '25-Aug', 
'Nationality:': 'United States', 
'Executive summary:': 'R&B singer, died in plane crash', 
'Mother:': 'Diane', 
'Birthplace:': 'Brooklyn, NY', 
'Born:': '16-Jan', 
'Boyfriend:': 'Damon Dash', 
'Sexual orientation:': 'Straight', 
'Occupation:': 'Singer', 
'Cause of death:': 'Accident - Airplane', 
'Brother:': 'Rashad Haughton', 
'Remains:': 'Interred,', 
'Gender:': 'Female', 
'Father:': 'Michael Haughton', 
'Location of death:': 'Marsh Harbour, Abaco Island, Bahamas'}

For precisely this page, you can also scrape the high school, for example like this:

res['High School'] = soup.find_all('p')[9].text.split(':')[1].strip()
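The loops above keep only the first piece of text after each label, so an entry that mixes a link with trailing text is stored only in part. As a rough sketch, assuming every entry ends at the next <br> or <b> tag, the pieces can be joined instead. The helper name full_value is made up for this illustration, and the HTML is two lines taken from the question rather than the live page:

```python
from bs4 import BeautifulSoup, Tag

# Two lines taken from the HTML in the question.
html = ('<p><b>Mother:</b> Diane <br>'
        '<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a>'
        ' (m. 1994, annulled that same year) <br></p>')


def full_value(b):
    """Join everything between a <b> label and the next <br> or <b> (hypothetical helper)."""
    parts = []
    for node in b.next_siblings:
        if isinstance(node, Tag) and node.name in ('br', 'b'):
            break  # reached the end of this entry
        # Tags contribute their text, strings contribute themselves.
        parts.append(node.get_text() if isinstance(node, Tag) else str(node))
    # Collapse runs of whitespace and newlines into single spaces.
    return ' '.join(''.join(parts).split())


soup = BeautifulSoup(html, 'html.parser')
full = {b.get_text(strip=True): full_value(b) for b in soup.find_all('b')}
print(full)
```

With this, the Husband entry keeps the parenthesised note as well as the name. Whether that is wanted depends on how the values are used afterwards.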

Source

2017-08-09 09:43:53

Would you mind explaining your code? –

@Rightleg, what is it that you don't understand? –

@DmitriyFialkovskiy when run against the URL (url = 'http://www.nndb.com/people/742/000024670/') it gives an error: 'res[b.text] = b.next_sibling.strip().strip('\n') AttributeError: 'NoneType' object has no attribute 'strip'' – Volatil3

You are looking for the next_sibling attribute of a tag. It gives you either the next NavigableString or the next Tag, whichever comes first.

Here is how you use it:

from bs4 import BeautifulSoup

html = """..."""
soup = BeautifulSoup(html, 'html.parser')

bTags = soup.find_all('b')
for it_tag in bTags:
    print(it_tag.string)
    print(it_tag.next_sibling)

Output:

Father: 
Michael Haughton 

Mother: 
Diane 

Brother: 

Rashad Haughton 
Husband: 

Boyfriend:

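This looks a bit off. Part of the reason is the newlines and spaces, which you can easily remove with the str.strip method.

Still, the Boyfriend and Husband entries lack a value. This is because next_sibling is either a NavigableString (i.e. a str) or a Tag. The whitespace between the <b> tag and the <a> tag is interpreted here as a non-empty text node: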

<b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> 
       ^

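If that whitespace were absent, the next sibling of <b>Boyfriend:</b> would be the <a> tag. Since it is present, you have to check:

whether the next sibling is a string or a tag;
if it is a string, whether it contains only whitespace.

If the next sibling is a whitespace-only string, then the information you are looking for is in that NavigableString's own next sibling, which will be an <a> tag.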
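The two cases can be seen side by side in a small sketch. It assumes a shortened version of the question's HTML and the built-in html.parser; the variable names mother, husband and gap are only for illustration:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Shortened from the HTML in the question.
html = ('<p><b>Mother:</b> Diane <br>'
        '<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a></p>')
soup = BeautifulSoup(html, 'html.parser')

mother, husband = soup.find_all('b')

# Case 1: text follows <b> directly, so the sibling string holds the value.
print(repr(mother.next_sibling))

# Case 2: only a space sits between </b> and <a>, so the sibling is a
# whitespace-only string and the <a> tag is one step further along.
gap = husband.next_sibling
print(isinstance(gap, NavigableString), gap.isspace())
print(isinstance(gap.next_sibling, Tag), gap.next_sibling.name)
```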

Edited code:

import bs4

bTags = soup.find_all('b')

for it_tag in bTags:
    print(it_tag.string)

    nextSibling = it_tag.next_sibling
    # Skip a whitespace-only string; the value is then in the node after it
    if isinstance(nextSibling, str) and nextSibling.isspace():
        nextSibling = nextSibling.next_sibling

    if isinstance(nextSibling, str):
        print(nextSibling.strip())
    elif isinstance(nextSibling, bs4.Tag):
        print(nextSibling.get_text().strip())

Output:

Father: 
Michael Haughton 
Mother: 
Diane 
Brother: 
Rashad Haughton 
Husband: 
R. Kelly 
Boyfriend: 
Damon Dash

Now you can easily build your dictionary:

import bs4

entries = {}
bTags = soup.find_all('b')

for it_tag in bTags:
    # get_text() also works when <b> contains nested tags, where .string is None
    key = it_tag.get_text().replace(':', '').strip()
    value = None

    nextSibling = it_tag.next_sibling
    # Skip a whitespace-only string; the value is then in the node after it
    if isinstance(nextSibling, str) and nextSibling.isspace():
        nextSibling = nextSibling.next_sibling

    if isinstance(nextSibling, str):
        value = nextSibling.strip()
    elif isinstance(nextSibling, bs4.Tag):
        value = nextSibling.get_text().strip()

    entries[key] = value

Output dictionary:

{'Father': 'Michael Haughton', 
'Mother': 'Diane', 
'Brother': 'Rashad Haughton', 
'Husband': 'R. Kelly', 
'Boyfriend': 'Damon Dash'}
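Since the question asks for a list of dictionaries with "label" and "value" keys rather than one flat dictionary, the same sibling checks can be wrapped as follows. This is only a sketch under a few assumptions: the function name extract_pairs is made up, the HTML is the snippet from the question, and the built-in html.parser is used. It reads the label with get_text() because .string is None for a tag that has more than one child:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = """<p>
<b>Father:</b> Michael Haughton
<br>
<b>Mother:</b> Diane
<br><b>Brother:</b>
Rashad Haughton<br>
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>"""


def extract_pairs(soup):
    """Return [{'label': ..., 'value': ...}] for every <b> label (hypothetical helper)."""
    pairs = []
    for b in soup.find_all('b'):
        label = b.get_text(strip=True).rstrip(':')
        node = b.next_sibling
        # Skip a whitespace-only string sitting between <b> and the value.
        if isinstance(node, NavigableString) and not node.strip():
            node = node.next_sibling
        if isinstance(node, NavigableString):
            value = node.strip()
        elif isinstance(node, Tag):
            value = node.get_text(strip=True)
        else:
            value = None  # nothing follows this <b>
        pairs.append({'label': label, 'value': value})
    return pairs


records = extract_pairs(BeautifulSoup(html, 'html.parser'))
print(records)
```

A list keeps the order of the entries and tolerates repeated labels (two Brother lines, say), which a plain dictionary would silently overwrite.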

Source

2017-08-09 09:43:46

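I get the error 'line 27, in parse if it_tag.next_sibling.isspace(): AttributeError: 'NoneType' object has no attribute 'isspace'' – Volatil3

@Volatil3 I edited my code. Please check whether it works for you. I added a None check and condensed the tests. –

'key = it_tag.string.replace(':', '') AttributeError: 'NoneType' object has no attribute 'replace'' – Volatil3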
