Finding an exact text match with BeautifulSoup

I want to use BeautifulSoup to extract only the text in some HTML that exactly matches a given string, but I also get text that merely contains my string. My code is:

from bs4 import BeautifulSoup
import re
import urllib2  # Python 2 standard library

url = "http://www.somesite.com"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "lxml")
# A compiled pattern matches any text node that contains "exact text"
for elem in soup(text=re.compile("exact text")):
    print elem

The output of the code above looks like this:

1.exact text 
2.almost exact text

How can I get only the exact match with BeautifulSoup? Note: the variable (elem) should be of type <class 'bs4.element.Comment'>.

Source

2017-05-22 karthi

Use BeautifulSoup's find_all method with its string argument for this.

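As an example, here I parse a small Wikipedia page about a place in Jamaica. I search for every string whose text is 'Jamaica stubs', although I expect to find only one. When I find it, I display the text and its parent.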

>>> url = 'https://en.wikipedia.org/wiki/Cassava_Piece' 
>>> from bs4 import BeautifulSoup 
>>> import requests 
>>> page = requests.get(url).text 
>>> soup = BeautifulSoup(page, 'lxml') 
>>> for item in soup.find_all(string="Jamaica stubs"): 
...  item 
...  item.find_parent() 
... 
'Jamaica stubs' 
<a href="/wiki/Category:Jamaica_stubs" title="Category:Jamaica stubs">Jamaica stubs</a>
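The difference between a plain string and a compiled pattern can be checked without fetching a live page. The sketch below is only an illustration: the HTML snippet is made up to mirror the two lines of output in the question, and it uses the built-in html.parser so that lxml is not required.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page in the question.
html = """
<div>
  <p>exact text</p>
  <p>almost exact text</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A compiled pattern is applied with search(), so a text node that merely
# contains "exact text" also matches.
regex_hits = soup.find_all(string=re.compile("exact text"))

# A plain string has to equal the whole text node.
exact_hits = soup.find_all(string="exact text")

print(len(regex_hits), [str(s) for s in regex_hits])
print(len(exact_hits), [str(s) for s in exact_hits])
```

With this snippet the pattern finds both paragraphs, while the plain string finds only the first one. Note that the plain-string comparison is strict: a text node with surrounding whitespace would not match.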

On second thought, after reading the comments, a better approach would be:

>>> url = 'https://en.wikipedia.org/wiki/Hockey' 
>>> from bs4 import BeautifulSoup 
>>> import requests 
>>> import re 
>>> page = requests.get(url).text 
>>> soup = BeautifulSoup(page, 'lxml') 
>>> for i, item in enumerate(soup.find_all(string=re.compile('women', re.IGNORECASE))): 
...  i, item.find_parent().text[:100] 
... 
(0, "Women's Bandy World Championships") 
(1, "The governing body is the 126-member International Hockey Federation (FIH). Men's field hockey has b") 
(2, 'The governing body of international play is the 77-member International Ice Hockey Federation (IIHF)') 
(3, "women's")

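My regular expression uses IGNORECASE so that both 'Women' and 'women' are found in the Wikipedia article. I used enumerate in the for loop so that I could number the displayed items, which makes them easier to read.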
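Because a compiled pattern is applied with search(), the regular expression approach also matches substrings. If the goal is an exact match on Comment nodes, as the question asks, one possible sketch is to anchor the pattern and filter on the node type. The HTML snippet and the helper name is_exact_comment are made up for illustration, and tolerating whitespace around the text is an assumption about what "exact" should mean here.

```python
import re
from bs4 import BeautifulSoup, Comment

# Hypothetical markup: two comments and one ordinary text node.
html = "<div><!-- exact text --><!--almost exact text--><p>exact text</p></div>"
soup = BeautifulSoup(html, "html.parser")

# ^ and $ anchor the pattern to the whole string; \s* tolerates padding.
pattern = re.compile(r"^\s*exact text\s*$")

def is_exact_comment(node):
    """Hypothetical helper: keep only Comment nodes whose text matches exactly."""
    return isinstance(node, Comment) and pattern.search(node) is not None

comments = soup.find_all(string=is_exact_comment)

for elem in comments:
    print(type(elem), repr(elem))
```

Here only the first comment should be returned: the second contains extra words, and the paragraph text is a plain NavigableString rather than a Comment.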

Source

2017-05-22 13:46:41

Thanks for your help. The code above doesn't work for me: 'soup.find_all(string="Jamaica stubs")' returns nothing. – karthi

It would help if you provided a sample of the HTML you are trying to search, or some examples. –

I think I have improved on this in the second version. –

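You can search the soup for the element you want, using its tag and any attribute value.

For example, this code searches for all a elements whose id equals some_id_value.

It then loops over each element found and tests whether its text equals "exact text".

If it does, it prints the whole element.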

for elem in soup.find_all('a', {'id': 'some_id_value'}):
    if elem.text == "exact text":
        print(elem)
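A self-contained version of the same idea might look like the sketch below. The markup is invented for illustration (repeating an id is not valid HTML, but it keeps the example short), and get_text(strip=True) is used instead of .text on the assumption that surrounding whitespace should be ignored.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; some_id_value is the placeholder id used above.
html = """
<a id="some_id_value">exact text</a>
<a id="some_id_value">almost exact text</a>
<a id="other">exact text</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Narrow by tag and attribute first, then compare the element's text exactly.
matches = [
    elem
    for elem in soup.find_all("a", {"id": "some_id_value"})
    if elem.get_text(strip=True) == "exact text"
]

for elem in matches:
    print(elem)
```

Only the first link should pass both checks: the second fails the text comparison and the third fails the attribute filter.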

Source

2017-05-22 12:26:24

Thanks for the reply. I just want to search for occurrences of the text without using any tag. – karthi
