BeautifulSoup: scrape the content of the cell next to another cell

I want to scrape the content of the cells that sit next to certain other cells, e.g. next to "Staatsform", "Amtssprache", "Postleitzahl", etc. In the picture, the wanted content is always in the right-hand cell.

The basic code is the following, but I'm stuck:

import requests
from bs4 import BeautifulSoup

source_code = requests.get('https://de.wikipedia.org/wiki/Hamburg')
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
stastaform = soup.find(text="Staatsform:")...???

Thanks a lot in advance!

Source

2017-06-12 saitam

Please include an HTML fragment showing the two cells of interest. – DyZ

Do you want only the text in the cell, or more? –

This works most of the time:

def get_content_from_right_column_for_left_column_containing(text):
    """Return the text contents of the cell adjoining a cell that contains `text`."""

    # `soup` is the BeautifulSoup object built in the question.
    navigable_strings = soup.find_all(string=text)

    if len(navigable_strings) > 1:
        raise Exception('more than one element with that text!')

    if len(navigable_strings) == 0:

        # left-column contents that are links don't have a colon in their text content...
        if ":" in text:
            altered_text = text.replace(':', '')

        # but `td`s and `th`s do.
        else:
            altered_text = text + ":"

        navigable_strings = soup.find_all(string=altered_text)

    try:
        return navigable_strings[0].find_parent('td').find_next('td').text
    except IndexError:
        raise IndexError('there are no elements containing that text.')
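
To see the idea in isolation, here is a minimal sketch that runs on a hand-written table instead of the live page. The HTML fragment, its placeholder values and the helper name `right_cell_text` are all made up for illustration, and the real Wikipedia markup may differ. Note that it is a variation on the function above, not the same method: it compares the full text of each cell rather than searching for a matching string, so a label wrapped in a link needs no special case.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating two rows of the info table (placeholder values).
HTML = """
<table>
  <tr><td>Staatsform:</td><td>example value 1</td></tr>
  <tr><td><a href="/wiki/Amtssprache">Amtssprache</a>:</td><td>example value 2</td></tr>
</table>
"""

def right_cell_text(soup, label):
    """Return the text of the cell after the first cell whose text equals `label`.

    A trailing colon is ignored on both sides. Returns None if nothing matches.
    """
    wanted = label.rstrip(":")
    for cell in soup.find_all(["td", "th"]):
        # get_text() joins nested strings, so a link inside the label cell is fine
        if cell.get_text(strip=True).rstrip(":") == wanted:
            neighbour = cell.find_next_sibling(["td", "th"])
            if neighbour is not None:
                return neighbour.get_text(strip=True)
    return None

soup = BeautifulSoup(HTML, "html.parser")
print(right_cell_text(soup, "Staatsform"))
print(right_cell_text(soup, "Amtssprache:"))
```

Returning None for a missing label, instead of raising, is just one possible design choice; it makes it easy to loop over a list of labels when some of them may be absent.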

Source

2017-06-12 17:01:12

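I wanted to be careful to limit the search to what is called the "infobox" in the English Wikipedia. Therefore I first search for the heading 'Basisdaten', requiring that it be inside a th element. That is perhaps not definitive, but it makes a correct match more likely. Having found it, I walk through the tr elements below 'Basisdaten' until I reach another tr that contains a (presumably different) heading. In this case I search for 'Postleitzahlen:', but the same approach can find any or all of the items between 'Basisdaten' and the next heading.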

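PS: I should also mention the reason for `if not current.name`. I noticed that some of the siblings are just newlines, which BeautifulSoup treats as strings. These have no tag name, so the code has to handle them separately.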

import requests
import bs4

page = requests.get('https://de.wikipedia.org/wiki/Hamburg').text
soup = bs4.BeautifulSoup(page, 'lxml')

def getInfoBoxBasisDaten(s):
    # match the string 'Basisdaten' only when it sits directly inside a th
    return str(s) == 'Basisdaten' and s.parent.name == 'th'

basisdaten = soup.find_all(string=getInfoBoxBasisDaten)[0]

wanted = 'Postleitzahlen:'
# start at the sibling after the row that holds the 'Basisdaten' heading
current = basisdaten.parent.parent.next_sibling
while current is not None:
    if not current.name:
        # whitespace between rows is a string without a tag name: skip it
        current = current.next_sibling
        continue
    if wanted in current.text:
        items = current.find_all('td')
        print(items[0])
        print(items[1])
    if '<th ' in str(current): break
    current = current.next_sibling

The result is this: two separate td elements, as requested.

<td><a href="/wiki/Postleitzahl_(Deutschland)" title="Postleitzahl (Deutschland)">Postleitzahlen</a>:</td> 
<td>20095–21149,<br/> 
22041–22769,<br/> 
<a href="/wiki/Neuwerk_(Insel)" title="Neuwerk (Insel)">27499</a></td>
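
If only the text is wanted rather than the td elements themselves, `get_text()` can flatten each cell. The following is a small sketch under that assumption: it re-parses the two cells printed above (wrapped in a table row so they can be parsed on their own) instead of fetching the page, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# The two cells printed above, wrapped in a row so they can be parsed on their own.
ROW = """<table><tr>
<td><a href="/wiki/Postleitzahl_(Deutschland)" title="Postleitzahl (Deutschland)">Postleitzahlen</a>:</td>
<td>20095–21149,<br/>
22041–22769,<br/>
<a href="/wiki/Neuwerk_(Insel)" title="Neuwerk (Insel)">27499</a></td>
</tr></table>"""

cells = BeautifulSoup(ROW, "html.parser").find_all("td")

# Label cell: the link text and the trailing colon are joined into one string.
label = cells[0].get_text(strip=True)

# Value cell: the " " separator keeps the pieces split by <br/> apart,
# and strip=True drops the newlines around each piece.
value = cells[1].get_text(" ", strip=True)

print(label)
print(value)
```

Without the separator argument the three postcode ranges would run together, which is why a single space is passed here.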

Source

2017-06-12 18:09:50

This seems to work for me if I use 'BeautifulSoup.get_text()' to strip out the HTML markup etc. Unfortunately, I get an error on this page: 'https://de.wikipedia.org/wiki/Bremen'. Do you know why that is? – saitam

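I have just looked at the wiki code for the two pages (in the *Bearbeiten* view). They take entirely different approaches to formatting the page, so the HTML is different. My German is no better than high-school level. I now see that the Bremen page has an "Infobox" but the Hamburg page does not. It is the same situation as in the English Wikipedia. If you want to scrape such a page, you need to be able to recognise which format you are dealing with and handle it accordingly. –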
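
One possible way to depend less on a particular layout is to ignore the headings and collect every table row that has exactly two cells into a dictionary. This is only a sketch: the function name `two_cell_rows` and both HTML fragments are invented for illustration, they are not the actual markup of the Hamburg or Bremen pages, and on a real page the result will also include rows from unrelated tables.

```python
from bs4 import BeautifulSoup

def two_cell_rows(soup):
    """Map label -> value for every table row that has exactly two cells."""
    data = {}
    for row in soup.find_all("tr"):
        # only direct children, so cells of nested tables are not counted
        cells = row.find_all(["th", "td"], recursive=False)
        if len(cells) == 2:
            # normalise the label: flatten links, drop a trailing colon
            label = cells[0].get_text(" ", strip=True).rstrip(":").strip()
            # keep the first occurrence if a label appears more than once
            data.setdefault(label, cells[1].get_text(" ", strip=True))
    return data

# Two made-up layouts: label in a td with a colon, and label in a th without one.
LAYOUT_A = "<table><tr><td>Amtssprache:</td><td>example A</td></tr></table>"
LAYOUT_B = "<table><tr><th>Amtssprache</th><td>example B</td></tr></table>"

for html in (LAYOUT_A, LAYOUT_B):
    info = two_cell_rows(BeautifulSoup(html, "html.parser"))
    print(info.get("Amtssprache"))
```

Looking values up with `info.get(...)` returns None for a missing label, so a page that lacks one of the rows does not raise an error.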
