Extracting numbers using BeautifulSoup

I have a line of text containing two numbers (NUM1, NUM2) that I am trying to extract across webpages that share the same format:

<div style="margin-left:0.5em;"> 
    <div style="margin-bottom:0.5em;"> 
    NUM1 and NUM2 are always followed by the same text across webpages 
    </div>

I thought a regular expression might be the way to go for these particular fields. Here is my attempt (borrowed from various sources):

def nums(self): 
    nums_regex = re.compile(r'\d+ and \d+ are always followed by the same text across webpages') 
    nums_match = nums_regex.search(self) 
    nums_text = nums_match.group(0) 
    digits = [int(s) for s in re.findall(r'\d+', nums_text)] 
    return digits

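On its own, outside a function, this code works when I specify the actual source of the text (e.g., nums_regex.search(text)). However, I am modifying another person's code, and I have never really worked with classes or functions before. Here is an example of their code: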

@property 
def title(self): 
    tag = self.soup.find('span', class_='summary') 
    title = unicode(tag.string) 
    return title.strip()

As you have probably guessed, my code does not work. I get this error:

nums_match = nums_regex.search(self) 
TypeError: expected string or buffer

It looks like I am not feeding in the source text properly, but how do I fix it?

Source

2016-02-11 Matt

Try 'nums_regex.search(self.soup.text)' – yurib

[I've heard this one before...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)

You can use the same regular expression pattern to find the elements by text with BeautifulSoup, and then extract the desired numbers:

import re 

pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages") 

for elm in soup.find_all("div", text=pattern): 
    print(pattern.search(elm.text).groups())

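Note that, since you are trying to match a piece of text that is not really tied to the HTML structure, I think it is perfectly fine to apply the regular expression to the complete document.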

Complete working code samples follow.

With the BeautifulSoup regex / "by text" search:

import re 

from bs4 import BeautifulSoup 

data = """<div style="margin-left:0.5em;"> 
    <div style="margin-bottom:0.5em;"> 
    10 and 20 are always followed by the same text across webpages 
    </div> 
</div> 
""" 

soup = BeautifulSoup(data, "html.parser") 
pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages") 

for elm in soup.find_all("div", text=pattern): 
    print(pattern.search(elm.text).groups())

With a regex-only search:

import re 

data = """<div style="margin-left:0.5em;"> 
    <div style="margin-bottom:0.5em;"> 
    10 and 20 are always followed by the same text across webpages 
    </div> 
</div> 
""" 

pattern = re.compile(r"(\d+) and (\d+) are always followed by the same text across webpages") 
print(pattern.findall(data)) # prints [('10', '20')]

Source

2016-02-11 21:45:23 alecxe

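The BeautifulSoup code works fine by itself. I added self. to soup.findall to integrate it with the other code, but that just resulted in "()" output, even though there should be numbers. – Matt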

@Matt, it works for the input you have provided. Could you share the complete HTML you are parsing and the code you currently have? Thanks. – alecxe
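One possible explanation for the "()" output, offered only as a guess since the code that produced it is not shown: calling .groups() on a match whose pattern has no capturing groups returns an empty tuple. The pattern in the question has no parentheses, while the one in the answer wraps each \d+ in a group. A minimal sketch comparing the two (the variable names are made up for illustration):

```python
import re

text = "10 and 20 are always followed by the same text across webpages"

# Pattern from the question: no parentheses, so no capturing groups.
without_groups = re.compile(
    r"\d+ and \d+ are always followed by the same text across webpages"
)

# Pattern from the answer: each \d+ is wrapped in a capturing group.
with_groups = re.compile(
    r"(\d+) and (\d+) are always followed by the same text across webpages"
)

print(without_groups.search(text).groups())  # prints ()
print(with_groups.search(text).groups())     # prints ('10', '20')
```

If the output is "()" even though the text matches, checking the pattern for missing parentheses may be a reasonable first step.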

Works! I am not sure what I did wrong yesterday, but your BeautifulSoup code works when I add self. to soup today. Thanks! – Matt
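For anyone landing here with the same TypeError, here is a rough sketch of how the pieces might fit together inside a class. It assumes the class keeps its parsed document in self.soup, as the title property in the question does; the class name Page and its constructor are hypothetical stand-ins, not the original code. The key point is that self is the object itself, not a string, so the regex has to be run against the soup's text instead.

```python
import re

from bs4 import BeautifulSoup

# Same pattern as in the answer; the two groups capture NUM1 and NUM2.
NUMS_PATTERN = re.compile(
    r"(\d+) and (\d+) are always followed by the same text across webpages"
)


class Page:  # hypothetical stand-in for the existing class
    def __init__(self, html):
        # Assumption: the real class stores its parsed document as self.soup.
        self.soup = BeautifulSoup(html, "html.parser")

    @property
    def nums(self):
        # Search the document's text rather than `self`; passing the object
        # itself to re is what raises "expected string or buffer".
        match = NUMS_PATTERN.search(self.soup.get_text())
        if match is None:
            return None  # the sentence was not found on this page
        return [int(group) for group in match.groups()]


html = """<div style="margin-left:0.5em;">
    <div style="margin-bottom:0.5em;">
    10 and 20 are always followed by the same text across webpages
    </div>
</div>
"""

page = Page(html)
print(page.nums)  # prints [10, 20]
```

Returning None when nothing matches is just one choice; raising an exception or returning an empty list would work as well, depending on how the rest of the code treats pages without the sentence.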
