Exclude tags based on content in BeautifulSoup

I am scraping HTML data similar to the following:

<div class="target-content"> 
    <p id="random1"> 
     "the content of the p" 
    </p> 

    <p id="random2"> 
     "the content of the p" 
    </p> 

    <p> 
     <q class="semi-predictable"> 
     "q tag content that I don't want 
     </q> 
    </p> 

    <p id="random3"> 
     "the content of the p" 
    </p> 

</div>

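My goal is to get all of the <p> tags, along with their content, while excluding the <q> tag and its content. Currently, I get all of the <p> tags with the following approach: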

contentlist = soup.find('div', class_='target-content').find_all('p')

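My question: once I have the result set of all the <p> tags, how can I filter out the single <p>, along with its content, that contains the <q>?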

Note: after getting the result set from soup.find('div', class_='target-content').find_all('p'), I iterate over it and append each <p> to a string in the following manner:

content = ''
for p in contentlist:
    content += str(p)

Source

2016-06-27 theeastcoastwest

You can simply skip the p tags that have a q tag inside:

for p in soup.select('div.target-content > p'):
    if p.q:  # if q is present - skip
        continue
    print(p)

Here p.q is a shortcut for p.find("q"), and div.target-content > p is a CSS selector that matches all p tags that are direct children of the div element with the target-content class.
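As a rough end-to-end sketch of this approach, the snippet below parses markup modeled on the question, drops every p that contains a q, and joins the rest into one string as the question does. The sample text, the variable names kept and content, and the choice of the built-in "html.parser" are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question; the ids and text are placeholders.
html = """
<div class="target-content">
    <p id="random1">the content of the p</p>
    <p id="random2">the content of the p</p>
    <p><q class="semi-predictable">q tag content that I don't want</q></p>
    <p id="random3">the content of the p</p>
</div>
"""

# "html.parser" ships with Python, so no extra parser needs to be installed.
soup = BeautifulSoup(html, "html.parser")

# Keep only the direct-child <p> tags that have no <q> inside them.
# p.q returns None when no <q> descendant exists.
kept = [p for p in soup.select("div.target-content > p") if p.q is None]

# Join the surviving tags into a single string, as in the question.
content = "".join(str(p) for p in kept)
print(content)
```

A list comprehension is used here instead of the loop with continue only because the question needs the tags collected rather than printed; the filtering condition is the same.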

Source

2016-06-27 15:54:11 alecxe

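Thanks, this is exactly what I was trying to understand. Thanks for the explanation; I don't think to use CSS selectors with BeautifulSoup as often as I should. – theeastcoastwest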

You can use filter to achieve this:

filter(lambda e: e.find('q') is None, soup.find('div', class_='target-content').find_all('p'))
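One detail worth keeping in mind: in Python 3, filter() returns a lazy iterator rather than a list, so it usually needs to be wrapped in list() or consumed in a loop. A minimal sketch under that assumption follows; the sample markup and the names paragraphs, without_q, and content are illustrative only.

```python
from bs4 import BeautifulSoup

# Small sample document; the text is a placeholder.
html = """
<div class="target-content">
    <p id="random1">keep me</p>
    <p><q class="semi-predictable">drop me</q></p>
    <p id="random2">keep me too</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find("div", class_="target-content").find_all("p")

# filter() is lazy in Python 3, so materialize it with list().
without_q = list(filter(lambda e: e.find("q") is None, paragraphs))

# Build the combined string from the filtered tags.
content = "".join(str(p) for p in without_q)
print(content)
```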

Source

2016-06-27 15:57:10 Greg

Thanks for your help. I ended up using a variation of @alexce's answer above, though your demonstration was also useful. – theeastcoastwest
