Parsing URLs from a web page with re.findall()

I am trying to parse statements like the following out of a web page with a regular expression.

ex) location.href = "login/html"; 
ex) location.href = "featureId/html";

I want to get the full matching strings, but I couldn't get them.

The code is as follows:

# -*- coding: utf-8 -*- 
import urllib2 
import re 

url_reg= re.compile('(location\.(href|assign|replace)|window\.location)\s*(=|\()+.*(;|$)') 
url ='http://zero.webappsecurity.com/' 
request = urllib2.Request(url) 
res = urllib2.urlopen(request) 
html = res.read().decode('utf-8') 

print html 
print re.findall(url_reg, html)

Running it prints the following result:

[(u'location.href', u'href', u'=', u';'), (u'location.href', u'href', u'=', u';'), (u'location.href', u'href', u'=', u';'), (u'location.href', u'href', u'=', u';'), (u'location.href', u'href', u'=', u';'), (u'location.href', u'href', u'=', u';')]

Originally, I planned to get output like the following.

location.href = path + "login" + ".html"; 
location.href = path + featureId + ".html"; 
location.href = "/" + "online-banking" + ".html"; 
location.href = path + featureName +".html";

Please give me some advice.

Source

2014-07-24 kjh_passion

It looks like your strings are missing.

Please let me know which strings are missing.

You did not describe your problem clearly. These are the strings you want, right?

window.location.href = path + "login" + ".html"; 
window.location.href = path + featureId + ".html"; 
window.location.href = "/" + "online-banking" + ".html"; 
window.location.href = path + featureName +".html"; 
window.location.href = link.page; 
window.location.href = path + link.page + ".html";

Then you should use a pattern like this.

... 
url_reg = re.compile(r'window\.location\.href = ["/ +\-\.; a-zA-Z]*')
print url_reg.findall(html) 
...
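
A note on why the original call printed tuples: when a pattern contains capturing groups, re.findall returns the contents of those groups rather than the whole match. The sketch below is one way to see this and to get the full statements back, either by making the groups non-capturing with (?:...) or by reading group(0) from re.finditer. The sample html string is a made-up stand-in for the downloaded page, not the real content of the site, and the variable names are illustrative only.

```python
import re

# Hypothetical sample standing in for the page source fetched in the question.
html = '''<script>
window.location.href = path + "login" + ".html";
window.location.href = path + featureId + ".html";
window.location.href = "/" + "online-banking" + ".html";
</script>
'''

# Pattern from the question. It has four capturing groups, so findall
# returns one 4-tuple of group contents per match.
with_groups = re.compile(
    r'(location\.(href|assign|replace)|window\.location)\s*(=|\()+.*(;|$)')

# Same pattern with every group rewritten as non-capturing (?:...).
# With no capturing groups, findall returns the whole matched text.
without_groups = re.compile(
    r'(?:location\.(?:href|assign|replace)|window\.location)\s*(?:=|\()+.*(?:;|$)')

tuples = with_groups.findall(html)
whole = without_groups.findall(html)

# Alternative that keeps the original pattern: take group(0) of each match.
whole_alt = [m.group(0) for m in with_groups.finditer(html)]

print(tuples)
for statement in whole:
    print(statement)
```

With this sample, each full match starts at location.href rather than window.location.href, because in these lines window.location is followed by .href, not directly by = or (, so only the first alternative can match. Single-argument print(...) is used so the snippet runs on both Python 2 and Python 3.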

Source

2014-07-24 09:26:47

Thank you very much.
