Complex regex to extract author names in Python

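I'm trying, rather unsuccessfully, to create a regex. What I want is to get the content of any HTML element whose class is one of (author | byline | writer).

Here is what I have so far: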

<([A-Z][A-Z0-9]*)class=\"(byLineTag|byline|author|by)\"[^>]*>(.*?)</\1>

What I need to match:

<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6>

or


<div class="noindex"><span class="by">By </span><span class="byline"><a href="javascript:NewWindow(575,480,'/apps/pbcs.dll/personalia?ID=sshemkus',0)" title="Email Reporter">Sarah Shemkus</a></span></div>

Any help would be greatly appreciated. -Stefan

Source

2011-07-04 Stefan Harris

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454 – MitMaro

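Don't do this. See MitMaro's link. Imagine something like '

hello world

another block

'. It can't be done. HTML is not a regular language. Use a proper parser. –

Could you post some sample input and the expected output? – Stephan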

Try this:

<([A-Z][A-Z0-9]*).*?class=\"(byLineTag|byline|author|by)\"[^>]*?>(.*?)</\1>

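What I added:
- .*? after the tag name, in case the class attribute does not come right after the opening tag name.
- *? in [^>]*?, which makes the * operator non-greedy when looking for the closing >

Source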
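To see what this pattern actually captures, here is a minimal Python sketch that runs it against shortened copies of the two samples from the question. The re.IGNORECASE flag is an assumption on my part (the pattern only lists A-Z, so without the flag it cannot match lowercase tag names), first_match is a hypothetical helper name, and the href values are abbreviated to "#" for readability.

```python
import re

# Pattern from the answer above, copied as posted.
PATTERN = r'<([A-Z][A-Z0-9]*).*?class=\"(byLineTag|byline|author|by)\"[^>]*?>(.*?)</\1>'

# Shortened copies of the question's samples (href values abbreviated to "#").
SAMPLE_1 = ('<h6 class="byline">By <a rel="author" href="#" class="meta-per">JACK EWING</a>'
            ' and <a rel="author" href="#" class="meta-per">LANDON THOMAS Jr.</a></h6>')
SAMPLE_2 = ('<div class="noindex"><span class="by">By </span>'
            '<span class="byline"><a href="#" title="Email Reporter">Sarah Shemkus</a></span></div>')


def first_match(markup, flags=re.IGNORECASE | re.DOTALL):
    """Return (tag name, class value, inner markup) for the first match, or None."""
    m = re.search(PATTERN, markup, flags)
    return m.groups() if m else None


for sample in (SAMPLE_1, SAMPLE_2):
    print(first_match(sample))
```

On the first sample the match starts at the h6 element and the third group holds its inner markup. On the second sample the lazy .*? runs past the end of the div tag until it reaches class="by", so the captured tag is div and the third group is everything inside that div, not just the author name. That is one likely reason the pattern appears to work for one sample and not the other.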

2011-07-04 22:04:42 Stephan

Thanks for the quick response. This works for my first example, but not for the second. –

I added a small enhancement at the end of the regex; give it a try – Stephan

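You forgot to account for the space between the tag name and the first attribute name. Also, unless you are sure that class is always the first attribute, your expression should handle the case where it is not. If you really care about the closing tag, note that \1 is the correct backreference: groups are numbered from 1, and group 0 is the whole match. As I pointed out in the comments, you should also include lowercase characters in the character classes.

Here is a better expression (I have ignored the closing tag to keep it simpler):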

<[A-Za-z][A-Za-z0-9]*.*? class=["'](byLineTag|byline|author|by)["'][^>]*>
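A quick way to check what this expression matches is to run it over a shortened copy of the second sample from the question. This is only a sketch: OPEN_TAG_STRICT is a hypothetical variant of mine that swaps .*? for [^>]*? so the attribute scan cannot leave the tag it started in, matched_tags is an illustrative helper name, and the href value is abbreviated to "#".

```python
import re

# Expression from the answer above, copied as posted.
OPEN_TAG = r"""<[A-Za-z][A-Za-z0-9]*.*? class=["'](byLineTag|byline|author|by)["'][^>]*>"""

# Hypothetical variant: [^>]*? keeps the attribute scan inside a single tag.
OPEN_TAG_STRICT = r"""<[A-Za-z][A-Za-z0-9]*[^>]*? class=["'](byLineTag|byline|author|by)["'][^>]*>"""

# Shortened copy of the question's second sample (href value abbreviated to "#").
SAMPLE_2 = ('<div class="noindex"><span class="by">By </span>'
            '<span class="byline"><a href="#" title="Email Reporter">Sarah Shemkus</a></span></div>')


def matched_tags(pattern, markup):
    """Return the full text of every match of pattern in markup."""
    return [m.group(0) for m in re.finditer(pattern, markup)]


print(matched_tags(OPEN_TAG, SAMPLE_2))
# ['<div class="noindex"><span class="by">', '<span class="byline">']
print(matched_tags(OPEN_TAG_STRICT, SAMPLE_2))
# ['<span class="by">', '<span class="byline">']
```

Both versions only locate opening tags. Getting at the text inside them still needs either a capture group plus a closing tag, or a real parser.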

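Remember to join all the lines together first, to avoid errors when a tag is split across several lines. Of course, if you use Python's HTML parser, you will probably save yourself a lot of work.

Source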
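As a rough sketch of the HTML parser route, the standard library's html.parser.HTMLParser can collect the text inside matching elements without any regex. The names BylineParser, extract_bylines and TARGET_CLASSES are illustrative choices of mine, the href values in the samples are abbreviated to "#", and the sketch assumes well-formed markup with no unclosed void tags such as <br> inside a byline.

```python
from html.parser import HTMLParser

# Class names taken from the regex in the question.
TARGET_CLASSES = {"byLineTag", "byline", "author", "by"}


class BylineParser(HTMLParser):
    """Collect the text inside every element whose class is in TARGET_CLASSES."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while we are inside a matching element
        self.chunks = []    # text pieces of the element currently being read
        self.bylines = []   # finished results, in document order

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # Start counting at a matching element; keep counting for its children.
        if self.depth or TARGET_CLASSES.intersection(classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:  # the matching element just closed
                self.bylines.append("".join(self.chunks).strip())
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_bylines(markup):
    """Return the stripped text of each matching element in markup."""
    parser = BylineParser()
    parser.feed(markup)
    parser.close()
    return parser.bylines


SAMPLE_1 = ('<h6 class="byline">By <a rel="author" href="#" class="meta-per">JACK EWING</a>'
            ' and <a rel="author" href="#" class="meta-per">LANDON THOMAS Jr.</a></h6>')
SAMPLE_2 = ('<div class="noindex"><span class="by">By </span>'
            '<span class="byline"><a href="#" title="Email Reporter">Sarah Shemkus</a></span></div>')

print(extract_bylines(SAMPLE_1))  # ['By JACK EWING and LANDON THOMAS Jr.']
print(extract_bylines(SAMPLE_2))  # ['By', 'Sarah Shemkus']
```

Because the parser tracks nesting depth itself, it does not matter whether the markup is spread over several lines or where the class attribute sits in the tag.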

2011-07-04 22:29:12 jforberg

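Thanks, but this doesn't capture the content of the tag. –

HTMLParser is your friend. – jforberg

Regular expressions are not particularly well suited to parsing HTML. Thankfully, there are tools created specifically for parsing HTML, such as BeautifulSoup and lxml; the latter is demonstrated below: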

markup = '''<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6><div class="noindex"><span class="by">By </span><span class="byline"><a href="javascript:NewWindow(575,480,'/apps/pbcs.dll/personalia?ID=sshemkus',0)" title="Email Reporter">Sarah Shemkus</a></span></div>''' 

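import lxml.html  # recent lxml versions also need the separate cssselect package for .cssselect()

doc = lxml.html.fromstring(markup)
# Select every element carrying one of the byline-style classes
for a in doc.cssselect('.author, .by, .byline, .byLineTag'):
    print(a.text_content())
# By JACK EWING and LANDON THOMAS Jr.
# By 
# Sarah Shemkus

Source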

2011-07-04 22:29:32 bernie

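+1 for the alternative approach using CSS selectors. I must have missed .cssselect() –

For the reasons already mentioned, parsing HTML with regular expressions is strongly discouraged. Use an existing HTML parser. As a simple example, I have included one that uses lxml and its CSS selectors.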

from lxml import etree 
from lxml.cssselect import CSSSelector 

## Your html string 
html_string = '''<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6>''' 

## lxml html parser 
html = etree.HTML(html_string) 

## lxml CSS selector 
sel = CSSSelector('.author, .byline, .writer') 

## Call the selector to get matches 
matching_elements = sel(html) 

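for elem in matching_elements:
    # elem.text alone stops at the first child tag ("By "), so join all nested text
    print("".join(elem.itertext()))

Source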

2011-07-04 22:30:34
