2017-03-02 64 views
-2

First of all, sorry for the terrible question title, but I couldn't come up with a better one. Regex: what is wrong with this RegEx?

I'm trying to build a small tool in Python to improve my skills. It scrapes data from Imdb.com and outputs the titles, filtering everything else out of the HTML.

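I'm using this regex for my search: <h3 class="findSectionHeader"><a name="tt"><\/a>Titles<\/h3>[\s]{0,3}(.*?)<\/td> <\/tr><\/table> This should give me everything after a>Titles<\/h3> and before <\/tr><\/table>, but I'm doing something wrong. I added [\s]{0,3} because I thought the problem might be a \n or something similar, but that didn't solve it.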

Here is the source block:

<div class="findSection"> 
<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3> 
<table class="findList"> 
<tr class="findResult odd"> <td class="primary"> <a href="/title/tt1474684/?ref_=fn_al_tt_1" > 
<img src="https://images-na.ssl-images-amazon.com/images/M/_AL_.jpg" /> 
</a> </td> <td class="result_text"> 
<a href="/title/tt1474684<a href="/title/tt3155298/?ref_=fn_al_tt_3" > 
<img src="http://ia.media-imdb.com/imagestd class="primary_photo"> 
<a href="/tiopicture/32x44/film-3119741174._CB522736599_.png" /></a> 
</td> <td class="result_text"> 
<a href="/title/tt1501661/?ref_=fn_al_tt_10" >Luther</a> (1968) (TV Movie) </td> </tr></table> 
+0

Don't try to process HTML with regular expressions; use a DOM parser instead. [Beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) should be a good starting point for Python.
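
To illustrate the parser-based approach, here is a minimal sketch. The comment recommends BeautifulSoup; to stay dependency-free, this sketch swaps in the standard library's `html.parser`, which is event-based rather than a DOM parser. The `TitleCollector` class and the shortened, well-formed HTML snippet are hypothetical stand-ins, not the real IMDb markup.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text of each <td class="result_text"> cell."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.parts = []
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # Start collecting text when a result cell opens.
        if tag == "td" and dict(attrs).get("class") == "result_text":
            self.in_cell = True
            self.parts = []

    def handle_data(self, data):
        if self.in_cell:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            self.in_cell = False
            # Collapse runs of whitespace left over from the markup.
            self.titles.append(" ".join("".join(self.parts).split()))


# Hypothetical, well-formed stand-in for one result row.
html = (
    '<table class="findList"><tr class="findResult odd">'
    '<td class="result_text">\n'
    '<a href="/title/tt1501661/?ref_=fn_al_tt_10" >Luther</a> (1968) (TV Movie) </td>'
    '</tr></table>'
)

parser = TitleCollector()
parser.feed(html)
parser.close()
print(parser.titles)
```

Because the parser tracks tags rather than raw text, line breaks inside the table do not matter here.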

+0

The problem is that your '.*?' doesn't match newlines. If you enable single-line mode 's', it works as expected.

+1

@rawing Ah, it also works when using '([\s\S]*?)', which matches any character, whitespace as well as non-whitespace. Thank you!
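
The two fixes mentioned in these comments can be compared side by side. This is a small sketch on a shortened, hypothetical stand-in for the page source; the variable names `head` and `tail` are only for illustration.

```python
import re

# Shortened, hypothetical stand-in for the page source in the question.
html = (
    '<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3>\n'
    '<table class="findList">\n'
    '<tr> <td class="result_text">\n'
    '<a href="/title/tt1501661/">Luther</a> (1968) (TV Movie) </td> </tr></table>'
)

head = r'<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3>\s{0,3}'
tail = r'</td> </tr></table>'

# Default mode: "." stops at "\n", so the multi-line block cannot match.
no_flag = re.findall(head + r'(.*?)' + tail, html)

# re.DOTALL (single-line mode) lets "." match newlines as well.
with_dotall = re.findall(head + r'(.*?)' + tail, html, flags=re.DOTALL)

# [\s\S] matches any character regardless of flags.
with_class = re.findall(head + r'([\s\S]*?)' + tail, html)

print(no_flag)                    # empty list: no match without the flag
print(with_dotall == with_class)  # both variants capture the same text
```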

Answers

0

Try the following regex:

(?s)(?<=<\/h3>\n).*?(?=</tr></table>) 

See the regex demo/explanation.

import re 
regex = r"(?s)(?<=<\/h3>\n).*?(?=</tr></table>)" 
str = """<div class="findSection"> 
<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3> 
<table class="findList"> 
<tr class="findResult odd"> <td class="primary"> <a href="/title/tt1474684/?ref_=fn_al_tt_1" > 
<img src="https://images-na.ssl-images-amazon.com/images/M/_AL_.jpg" /> 
</a> </td> <td class="result_text"> 
<a href="/title/tt1474684<a href="/title/tt3155298/?ref_=fn_al_tt_3" > 
<img src="http://ia.media-imdb.com/imagestd class="primary_photo"> 
<a href="/tiopicture/32x44/film-3119741174._CB522736599_.png" /></a> 
</td> <td class="result_text"> 
<a href="/title/tt1501661/?ref_=fn_al_tt_10" >Luther</a> (1968) (TV Movie) </td> </tr></table>""" 
matches = re.finditer(regex, str)
# Number the matches from 1 and report where each one was found.
for match_num, match in enumerate(matches, start=1):
    print("Match {} was found at {}-{}: {}".format(match_num, match.start(), match.end(), match.group()))
0

You can add the re.DOTALL flag to your re call so that . matches newlines:

src = '''<div class="findSection"> 
<h3 class="findSectionHeader"><a name="tt"></a>Titles</h3> 
<table class="findList"> 
<tr class="findResult odd"> <td class="primary"> <a href="/title/tt1474684/?ref_=fn_al_tt_1" > 
<img src="https://images-na.ssl-images-amazon.com/images/M/_AL_.jpg" /> 
</a> </td> <td class="result_text"> 
<a href="/title/tt1474684<a href="/title/tt3155298/?ref_=fn_al_tt_3" > 
<img src="http://ia.media-imdb.com/imagestd class="primary_photo"> 
<a href="/tiopicture/32x44/film-3119741174._CB522736599_.png" /></a> 
</td> <td class="result_text"> 
<a href="/title/tt1501661/?ref_=fn_al_tt_10" >Luther</a> (1968) (TV Movie) </td> </tr></table>''' 

expr = r'<h3 class="findSectionHeader"><a name="tt"><\/a>Titles<\/h3>[\s]{0,3}(.*?)<\/td> <\/tr><\/table>' 

import re 

print(re.findall(expr, src, re.DOTALL))

Output:

['<table class="findList">\n<tr class="findResult odd"> <td class="primary"> <a href="/title/tt1474684/?ref_=fn_al_tt_1" >\n<img src="https://images-na.ssl-images-amazon.com/images/M/_AL_.jpg" />\n</a> </td> <td class="result_text"> \n<a href="/title/tt1474684<a href="/title/tt3155298/?ref_=fn_al_tt_3" >\n<img src="http://ia.media-imdb.com/imagestd class="primary_photo"> \n<a href="/tiopicture/32x44/film-3119741174._CB522736599_.png" /></a>\n</td> <td class="result_text"> \n<a href="/title/tt1501661/?ref_=fn_al_tt_10" >Luther</a> (1968) (TV Movie) '] 
+0

Actually, I already tried this yesterday: 'result = re.findall(r'REGEX', str(result), flags=re.DOTALL)', but it didn't work. Maybe I made a mistake somewhere.
