Getting all instances of a regular expression in Python

I'm trying to get all matches of a regular expression in Python using the following

import re 

s = '<div><a href="page1.html" title="page1">Go to 1</a>, <a href="page2.html" title="page2">Go to page 2</a><a href="page3.html" title="page3">Go to page 3</a>, <a href="page4.html" title="page4">Go to page 4</a></div>' 
match = re.findall(r'<a.*>(.*)</a>', s) 

for string in match: 
    print(string)

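to get the innerHTML of all the links, but I only get the last occurrence, "Go to page 4". I think it sees one big string with several matches for the regex, and they are treated as overlapping and ignored. So how can I get a collection that matches

['Go to page 1', 'Go to page 2', 'Go to page 3', 'Go to page 4']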


2013-07-26 SteveC

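The immediate problem is that regexps are greedy: they try to consume as long a string as possible. So you're correct that it matches all the way up to the last </a> it can find. Change it to be non-greedy (.*?):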

match = re.findall(r'<a.*?>(.*?)</a>', s)
                         ^    ^
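
To see the difference side by side, here is a minimal sketch that runs both patterns on a shortened version of the string from the question (two links instead of four; the variable names greedy and lazy are just for illustration):

```python
import re

s = ('<div><a href="page1.html">Go to 1</a>, '
     '<a href="page2.html">Go to page 2</a></div>')

# Greedy: '.*' runs to the end of the string and backtracks only as far
# as it must, so everything collapses into a single match.
greedy = re.findall(r'<a.*>(.*)</a>', s)

# Non-greedy: '.*?' stops at the first place the rest of the pattern can
# match, so each link is matched separately.
lazy = re.findall(r'<a.*?>(.*?)</a>', s)

print(greedy)  # ['Go to page 2']
print(lazy)    # ['Go to 1', 'Go to page 2']
```

So the matches were never "overlapping and ignored"; the greedy pattern simply found one long match that swallowed the earlier links.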

However, this is a terrible way to parse HTML: it is not robust and will break on the smallest change in the markup.

Here's a better way to do it:

from bs4 import BeautifulSoup 

s = '<div><a href="page1.html" title="page1">Go to 1</a>, <a href="page2.html" title="page2">Go to page 2</a><a href="page3.html" title="page3">Go to page 3</a>, <a href="page4.html" title="page4">Go to page 4</a></div>' 
soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly
print([el.string for el in soup('a')])
# ['Go to 1', 'Go to page 2', 'Go to page 3', 'Go to page 4']

You can then use its power to get the href as well as the text, for example:

print([[el.string, el['href']] for el in soup('a', href=True)])
# [['Go to 1', 'page1.html'], ['Go to page 2', 'page2.html'], ['Go to page 3', 'page3.html'], ['Go to page 4', 'page4.html']]


2013-07-26 22:38:05

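Thanks! I really didn't understand the ? in regular expressions, so this was a good learning experience. Here is what worked for me: match = re.findall(r'<a.*?>(.*?)</a>', s) – SteveC

@user1450120 I didn't see the other .* :) Anyway, expect this to break later or possibly return wrong results... Please look at using 'beautifulsoup' to parse HTML; it's easy to learn and flexible –

What kind of input could cause this to break? – SteveC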
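
A few inputs that trip up the non-greedy pattern, as a rough sketch. The sample strings below are made up for illustration, and the list is certainly not exhaustive:

```python
import re

PATTERN = r'<a.*?>(.*?)</a>'

# Made-up inputs; each is an ordinary way to write a link or nearby markup.
samples = {
    'newline in text': '<a href="p.html">Go to\npage 1</a>',
    'uppercase tag': '<A HREF="p.html">Go to page 1</A>',
    'nested markup': '<a href="p.html"><b>Go</b> to page 1</a>',
    'abbr before link': '<abbr title="x">HTML</abbr> <a href="p.html">link</a>',
}

results = {name: re.findall(PATTERN, html) for name, html in samples.items()}
for name, found in results.items():
    print(name, '->', found)

# newline in text -> []            ('.' does not match '\n' by default)
# uppercase tag -> []              (the pattern is case-sensitive)
# nested markup -> ['<b>Go</b> to page 1']   (tags leak into the "text")
# abbr before link -> ['HTML</abbr> <a href="p.html">link']
#                                  ('<a.*?>' also matches '<abbr ...>')
```

Flags such as re.DOTALL and re.IGNORECASE patch the first two cases, but the last two need an actual HTML parser.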

I'd suggest using lxml:

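from lxml import etree

s = '<div><a href="page1.html">Go to 1</a>, <a href="page2.html">Go to page 2</a></div>'
# etree.HTMLParser tolerates markup that is not well-formed XML
tree = etree.fromstring(s, etree.HTMLParser())
for ele in tree.iter('a'):
    print(ele.text, ele.get('href'))  # do something with each link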

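It provides the iterparse function for processing large files, and it also accepts file-like objects such as urllib2 request objects. I've been using it for a long time to parse HTML and XML.

See: http://lxml.de/tutorial.html#the-element-class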


2013-07-26 22:45:51 Mai

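I would avoid parsing HTML with regex at ALL costs. Check out this article and this SO post for the reasons. But to sum it up...

Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp

Instead, I'd take a look at a Python HTML parsing package such as BeautifulSoup or pyquery. They provide nice interfaces for traversing, retrieving, and editing HTML.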
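
If installing a third-party package is not an option, the standard library's html.parser module can do the same job with a bit more code. This is only a sketch of that alternative, not what BeautifulSoup or pyquery do internally; LinkTextParser is a hypothetical helper name, and the input is a shortened version of the string from the question:

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect (href, text) pairs for every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._chunks = None  # None means "not currently inside a link"

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so <A HREF=...> works too.
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._chunks = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) is collected as well.
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._chunks is not None:
            self.links.append((self._href, ''.join(self._chunks)))
            self._chunks = None


s = ('<div><a href="page1.html" title="page1">Go to 1</a>, '
     '<a href="page2.html" title="page2">Go to page 2</a></div>')
parser = LinkTextParser()
parser.feed(s)
parser.close()
print(parser.links)
# [('page1.html', 'Go to 1'), ('page2.html', 'Go to page 2')]
```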


2013-07-26 22:48:33 FastTurtle
