Python web scraping

On using regular expressions in Python for web scraping:

import re

# Greedy pattern from the question; (.*) runs to the last </span> on the line
pathstring = '<span class="titletext">(.*)</span>'
pathFinderTitle = re.compile(pathstring)

My output is:

Govt has nothing to do with former CAG official RP Singh: 
Sibal</span></a></h2></div><div class="esc-lead-article-source-wrapper"> 
<table class="al-attribution single-line-height" cellspacing="0" cellpadding="0"> 
<tbody><tr><td class="al-attribution-cell source-cell"> 
<span class='al-attribution-source'>Times of India</span></td> 
<td class="al-attribution-cell timestamp-cell"> 
<span class='dash-separator'>&nbsp;- </span> 
<span class='al-attribution-timestamp'>&lrm;46 minutes ago&lrm;

The match should have stopped at the first "</span>".

Please suggest what is wrong here.

Source

2012-11-23 Kundan Kumar

http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

.* is a greedy match for any character; it consumes as many characters as possible. Instead, use the non-greedy version .*?, as in

pathstring = '<span class="titletext">(.*?)</span>'
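
To see the difference concretely, here is a minimal sketch on a made-up snippet with two headlines. The sample markup and the variable names are assumptions for illustration, not the asker's actual page:

```python
import re

# Hypothetical sample: two headlines on a single line, as on the scraped page.
sample = (
    '<span class="titletext">First headline</span>'
    '<span class="source">Times of India</span>'
    '<span class="titletext">Second headline</span>'
)

greedy = re.compile('<span class="titletext">(.*)</span>')
lazy = re.compile('<span class="titletext">(.*?)</span>')

# Greedy: the group runs up to the LAST </span> on the line.
greedy_match = greedy.search(sample).group(1)

# Non-greedy: each match stops at the FIRST </span> after the opening tag.
lazy_matches = lazy.findall(sample)

print(greedy_match)
print(lazy_matches)
```

Keep in mind that . does not match a newline by default, so a title that spans several lines would additionally need the re.DOTALL flag.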

Source

2012-11-23 22:24:38 phihag

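+1 for answering the question instead of telling him not to use regular expressions – codeape

His question is titled "Python web scraping", so there is an implicit question: "How do I do web scraping?" Regular expressions are not the answer. http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

I would use pyquery instead of going crazy with regular expressions... it is based on lxml and makes HTML parsing as easy as using jQuery.

Something like this is all you need: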

from pyquery import PyQuery

doc = PyQuery(html)  # html: the page source as a string
doc('span.titletext').text()

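You can also use BeautifulSoup, but the conclusion is always the same: don't parse HTML with regular expressions; there are tools built to make your life easier.

Source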

2012-11-23 22:28:05 StefanoP

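.* will also match </span>, so it keeps going until the last one.

The best answer is: don't parse HTML with regular expressions. Use the lxml library (or something similar).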

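from lxml import html

# Placeholder; replace with the real page source.
html_string = '<blah>'
tree = html.fromstring(html_string)
# Select every <span> whose class attribute is exactly "titletext".
titles = tree.xpath("//span[@class='titletext']")
for title in titles:
    print(title.text)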

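Using a proper XML/HTML parser will save you a lot of time and trouble. If you roll your own parser, you have to cater for malformed tags, comments, and many other things. Don't reinvent the wheel.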
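
If installing a third-party library is not an option, the same idea can be sketched with the standard library's html.parser module. The class name TitleTextParser and the sample markup are made up for illustration. This is a rough sketch that assumes reasonably well-nested span tags, not a replacement for lxml:

```python
from html.parser import HTMLParser


class TitleTextParser(HTMLParser):
    """Collects the text of every <span class="titletext"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._depth = 0    # > 0 while inside a titletext span
        self._chunks = []  # text pieces of the span being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1  # nested span inside a title
        elif "titletext" in classes:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.titles.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


# Hypothetical sample, including a commented-out title the parser should skip.
sample = (
    '<!-- <span class="titletext">commented out</span> -->'
    '<span class="titletext">First headline</span>'
    "<span class='al-attribution-source'>Times of India</span>"
    '<span class="titletext">Second <b>headline</b></span>'
)

parser = TitleTextParser()
parser.feed(sample)
parser.close()
print(parser.titles)
```

The parser reports the comment through a separate callback, so the commented-out span never reaches the title list, which is one of the cases a regular expression would get wrong.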

Source

2012-11-23 22:28:48

You can also easily use BeautifulSoup, which is well suited to this kind of thing.

# Using BeautifulSoup4; install with "pip install BeautifulSoup4"
from bs4 import BeautifulSoup
# html is the page source as a string; naming the parser avoids a warning
soup = BeautifulSoup(html, 'html.parser')
result = soup.find('span', 'titletext')

Then result will hold the <span> with the titletext class that you are looking for.
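
To get from the element to the headline text, something like the following sketch may help. The sample markup is invented for illustration, and the built-in "html.parser" backend is an assumption chosen so that no extra parser needs to be installed:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
html = (
    '<h2><a><span class="titletext">First headline</span></a></h2>'
    "<span class='al-attribution-source'>Times of India</span>"
    '<h2><a><span class="titletext">Second headline</span></a></h2>'
)

soup = BeautifulSoup(html, "html.parser")

# find() returns the first match, or None when nothing matches.
first = soup.find("span", "titletext")
first_title = first.get_text() if first is not None else None

# find_all() returns every matching element.
all_titles = [span.get_text() for span in soup.find_all("span", "titletext")]

print(first_title)
print(all_titles)
```

Checking for None before calling get_text() avoids an AttributeError on pages that contain no matching span.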

Source

2012-11-24 00:12:13 jdotjdot
