Find the part of a string between one keyword and another

So, my code retrieves an HTML file from a URL and saves it as a text document.

import urllib

urllib.urlretrieve("http://www.testlink.com", "example.txt")
retrieve = open("example.txt", "r")

Then I want it to pull specific text out of the lines that contain a keyword. Such a line looks like this:

<b class="whb">This is the text I want to retrieve</b> This is additional text that I don't want.

Currently, my code prints the whole line, like this:

for line in retrieve.readlines():
    if '<b class="whb">' in line:
        print line

How do I specify which part of a line to print? I want whatever is between <b class="whb"> and </b>.

Thanks.

Source

2015-09-26 Ryan Broman

Use an HTML parser, then pull out all the 'b' tags that have the class whb. You can do this easily with the [HTMLParser class](https://docs.python.org/2.7/library/htmlparser.html#module-HTMLParser) from the standard library. – ekhumoro

@ekhumoro The code snippet from the documentation didn't work. Cannot concatenate str and file objects. –
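
The standard-library suggestion above can be sketched roughly as follows. This is a minimal sketch, not ekhumoro's code: it is written for Python 3, where the module is `html.parser` (the linked 2.7 docs call it `HTMLParser`), and the names `WhbCollector` and `extract_whb` are made up for illustration. It assumes `<b>` tags are not nested inside one another.

```python
from html.parser import HTMLParser


class WhbCollector(HTMLParser):
    """Collect the text inside every <b class="whb"> element."""

    def __init__(self):
        super().__init__()
        self.inside = False  # True while between <b class="whb"> and </b>
        self.results = []    # one string per matching element

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names
        classes = (dict(attrs).get("class") or "").split()
        if tag == "b" and "whb" in classes:
            self.inside = True
            self.results.append("")

    def handle_endtag(self, tag):
        if tag == "b":
            self.inside = False

    def handle_data(self, data):
        # Keep text only while inside a matching element
        if self.inside:
            self.results[-1] += data


def extract_whb(html_text):
    """Hypothetical helper: return the text of all <b class="whb"> tags."""
    parser = WhbCollector()
    parser.feed(html_text)  # feed() takes a str, not a file object
    parser.close()
    return parser.results


sample = ('<b class="whb">This is the text I want to retrieve</b> '
          "This is additional text that I don't want.")
print(extract_whb(sample))
```

`feed()` expects a string, so a saved page would be passed as `extract_whb(open("example.txt").read())` rather than as the file object itself. Passing the file object is a likely cause of the str/file error mentioned above, though that is a guess.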

I would use [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). Don't reinvent the wheel! – jorgeh

As I said in the comments, I would use BeautifulSoup. Here is a small example:

from bs4 import BeautifulSoup 

html_doc = "<b class='whb'>This is the text I want to retrieve</b> This is additional text that I don't want." 

soup = BeautifulSoup(html_doc, 'html.parser') 

print soup.b.text

If html_doc is a larger HTML document with several <b> tags, you can replace the last line with:

print soup.find("b", {"class":"whb"}).text

If html_doc has multiple <b class='whb'> tags and you want all of them, use findAll():

all_bs = [b.text for b in soup.findAll("b", {"class":"whb"})]

BeautifulSoup is an awesome, full-featured web scraper. Read the documentation to find what you need for your specific case.

Source

2015-09-26 18:06:57 jorgeh

I forgot to mention this (sorry!!): I want every blah blah blah blah. Your code prints the first instance. How do I get every instance of this? –

I put my actual code on GitHub: [link](https://github.com/Ph0enix0/WikiBot/tree/master) –

If you want every instance, you can use BeautifulSoup's findAll() method, e.g. 'all_bs = [b.text for b in soup.findAll("b", {"class":"whb"})]'. I just updated my answer to include this. – jorgeh
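
For the literal question in the title, slicing a line between two markers, a parser-free version could look like the sketch below. It is written for Python 3, the helper name `text_between` is hypothetical, and it only works when both markers sit on the same line and the opening tag is spelled exactly as given. The parser-based approaches in this thread are more robust for real HTML.

```python
def text_between(line, start, end):
    """Return the part of line between start and end, or None if either is missing."""
    begin = line.find(start)
    if begin == -1:
        return None
    begin += len(start)            # skip past the opening marker
    stop = line.find(end, begin)   # look for the closing marker after it
    if stop == -1:
        return None
    return line[begin:stop]


line = ('<b class="whb">This is the text I want to retrieve</b> '
        "This is additional text that I don't want.")
print(text_between(line, '<b class="whb">', '</b>'))
```

In the loop from the question, this would replace printing the whole line with printing `text_between(line, '<b class="whb">', '</b>')` for lines where the result is not `None`.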
