2015-06-30

How can I use a Ruby regexp to extract URLs from HTML content? Here is an example, since it is not easy to explain:

<li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> <a href="javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com');%20" onclick="visited('f6a1ok3n4d4p');" style="float:left;">random strings - 4</a> <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&amp;a_aid=&amp;a_bid=&amp;chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style 

In the content above, from

javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')

I want to extract the strings "f6a1ok3n4d4p" and "site2.com" and turn them into

http://site2.com/f6a1ok3n4d4p

and the same for

javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com')

which should become

http://site1.com/zsgn82c4b96d

I need to do this with a Ruby regular expression.

+2

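Welcome to Stack Overflow. What have you tried to solve this? You are asking for several things but have not shown any code you have written, so it sounds like you are asking us to write a solution for you, which is not how Stack Overflow works. Also, *why* do you need to use a regular expression for this? For which part? Finally, please reduce your sample HTML to the bare minimum that demonstrates what you are working with. Anything beyond that wastes our time as we try to help you.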

Answers

1

You can do it like this:

require 'uri' 
str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')" 

# regex scan to get values within javascript:show 
vals = str.scan(/javascript:show\((.*)\)/)[0][0].split(',') 
# => ["'f6a1ok3n4d4p'", "'random%20strings%204'", "%20'site2.com'"] 

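# Join the first and last array elements to build the URL.
# URI.decode was removed in Ruby 3.0, so URI.decode_www_form_component is used here.
url = "http://" + URI.decode_www_form_component(vals.last).tr("'", '').strip + "/" + vals.first.tr("'", '')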
# => "http://site2.com/f6a1ok3n4d4p" 

Obviously my answer is not foolproof. You can make it more robust by checking for the case where scan returns [].
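
A minimal sketch of that check, assuming the input may contain no `javascript:show(...)` call at all. The helper name `show_link_to_url` is hypothetical and not part of the answer; it returns `nil` instead of raising when nothing matches.

```ruby
require 'uri'

# Hypothetical helper: rebuilds the URL from a javascript:show(...) string,
# or returns nil when the string has no such call or too few arguments.
def show_link_to_url(str)
  match = str.match(/javascript:show\((.*?)\)/)
  return nil unless match

  # Decode %20 and similar escapes, then drop surrounding spaces and quotes.
  args = match[1].split(',').map do |arg|
    URI.decode_www_form_component(arg).strip.delete("'")
  end
  return nil if args.size < 3

  "http://#{args.last}/#{args.first}"
end

puts show_link_to_url("javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')")
# => http://site2.com/f6a1ok3n4d4p
p show_link_to_url("javascript:report(3191274,%203,%202164691,%201)")
# => nil
```

Using `match` rather than `scan(...)[0][0]` avoids calling `[]` on `nil` when nothing is found. The sketch still splits on commas, so it assumes no argument contains a comma.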

1

This should do it, although the regex is not particularly flexible.

js_link_regex = /href=\"javascript:show\('([^']+)','[^']+',%20'([^']+)'\)/ 
link = <<eos 
    <li id="l_f6a1ok3n4d4p" class="online"> <div class="link"> <a href="javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com');%20" onclick="visited('f6a1ok3n4d4p');" style="float:left;">random strings - 4</a> <a style="float:left; display:block; padding-top:3px;" href="http://www.webtrackerplus.com/?page=flowplayerregister&amp;a_aid=&amp;a_bid=&amp;chan=flow"><img border="0" src="/resources/img/fdf.gif"></a> <!-- a class="none" href="#">random strings - 4 site2.com - # - </a --> </div> <div class="params"> <span>Submited: </span>7 June 2015 | <span>Host: </span>site2.com </div> <div class="report"> <a title="" href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a> <a title="" href="javascript:report(3191274,%203,%202164691,%200)" class="work"></a> <b>100% said work</b> </div> <div class="clear"></div> </li> <li id="l_zsgn82c4b96d" class="online"> <div class="link"> <a href="javascript:show('zsgn82c4b96d','random%20strings%204',%20'site1.com');%20" onclick="visited('zsgn82c4b96d');" style 
eos 

matches = link.scan(js_link_regex) 
matches.each do |match| 
    puts "http://#{match[1]}/#{match[0]}" 
end 
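
One way to loosen the pattern a little, offered as a sketch rather than a definitive fix: allow optional whitespace or `%20` around the commas. The constant names `SHOW_ARG_SEP` and `JS_SHOW_REGEX`, the method `extract_show_urls`, and the shortened sample HTML are assumptions made for illustration.

```ruby
# Separator between arguments: a comma with optional spaces or %20 on either side.
SHOW_ARG_SEP = /(?:\s|%20)*,(?:\s|%20)*/

# Captures the first argument (the id) and the third argument (the host).
JS_SHOW_REGEX = /javascript:show\('([^']+)'#{SHOW_ARG_SEP}'[^']*'#{SHOW_ARG_SEP}'([^']+)'\)/

# Returns one rebuilt URL per javascript:show(...) call found in the HTML.
def extract_show_urls(html)
  html.scan(JS_SHOW_REGEX).map { |id, host| "http://#{host}/#{id}" }
end

html = <<~HTML
  <a href="javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com');%20">a</a>
  <a href="javascript:report(3191274,%203,%202164691,%201)" class="alert"></a>
  <a href="javascript:show('zsgn82c4b96d', 'random%20strings%204', 'site1.com');%20">b</a>
HTML

p extract_show_urls(html)
# => ["http://site2.com/f6a1ok3n4d4p", "http://site1.com/zsgn82c4b96d"]
```

The `javascript:report(...)` link is ignored because the pattern is anchored on `javascript:show(`. It still assumes single-quoted arguments with no embedded quotes.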
1

To match just your case:

str = "javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')" 

# Inside a character class "|" is a literal pipe, not alternation, so it is dropped here.
parts = str.scan(/'([\w.]+)'/).flatten # => ["f6a1ok3n4d4p", "site2.com"]

puts "http://#{parts[1]}/#{parts[0]}" # => http://site2.com/f6a1ok3n4d4p
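
Note that the pattern above skips the middle argument only because that argument contains `%`, which the character class does not match. If the middle argument were a plain word, `parts[1]` would be that word instead of the host. A hedged alternative, assuming the id is always the first quoted argument and the host always the last, is to capture every quoted argument and pick by position. The method name `url_from_quoted_args` is hypothetical.

```ruby
# Hypothetical helper: captures every single-quoted argument, then builds the
# URL from the first (id) and the last (host). Returns nil if fewer than two
# quoted arguments are found.
def url_from_quoted_args(str)
  parts = str.scan(/'([^']*)'/).flatten
  return nil if parts.size < 2

  "http://#{parts.last}/#{parts.first}"
end

puts url_from_quoted_args("javascript:show('f6a1ok3n4d4p','random%20strings%204',%20'site2.com')")
# => http://site2.com/f6a1ok3n4d4p
puts url_from_quoted_args("javascript:show('zsgn82c4b96d','title','site1.com')")
# => http://site1.com/zsgn82c4b96d
```

This is meant for a single `javascript:show(...)` string, as in the example above; run against a whole HTML page it would also pick up other quoted values.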