2014-03-03 45 views

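Ruby: parsing a text file efficiently

I want to parse the href tags in some HTML. Basically, I am trying to extract each link's URL and description. I also split the description on whitespace, count how many times each word occurs, and finally write the results to two separate files. My parser works, but it is very inefficient: it takes about 2 minutes to parse 1 MB of text.

Here is my code: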

# Delimiters used to locate the link target and its description.
hrefTag = "<a href=\""
qtMark = "\""
descStart = "\">"
hrefEnd = "</a>"
if line.include? hrefTag
    # Non-greedy match: everything between <a href=" and the next quote.
    dest = line[/#{hrefTag}(.*?)#{qtMark}/m, 1]
    descStIn = line.rindex(descStart)
    descEndIn = line.rindex(hrefEnd)
    if (descStIn != nil && descEndIn != nil)
        # Skip the two characters of "\">" to reach the description text.
        desc = line[(descStIn+2)..(descEndIn-1)]
    end
end
if (source != "" && dest != "")
    occur = Hash.new(0)
    mainEntry = "original-url=\"" + source +
        "\", dest-url=\"" + dest + "\""
    descEntry = ""
    if (desc != nil && desc != "")
        descEntry = ", desc=\"" + desc + "\""
        # Count how often each whitespace-separated word occurs.
        words = desc.split(' ')
        words.each { |word| occur[word] += 1 }
    end
    firstEntry = mainEntry+descEntry+"\n\n"
    # The output file is reopened in append mode for every entry.
    File.open(firstOutput, 'a') { |file|
        file.write(firstEntry)
    }
    occur.each { |word, occurrance|
        wordEntry = ", word=\"" + word +
            "\", count=" + occurrance.to_s
        secondEntry = mainEntry+wordEntry+"\n\n"
        # The second file is reopened once per word.
        File.open(secondOutput, 'a') { |file|
            file.write(secondEntry)
        }
    }
end

How can I make it more efficient? Which parts are the least efficient?

Use [nokogiri](http://nokogiri.org/). It will make your life easier.
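For comparison, here is a minimal stdlib-only sketch of the same extraction, for the case where adding a gem is not an option. This is not the Nokogiri approach from the comment above; it keeps the regex style of the question and only tightens it. The names `LINK_RE`, `extract_links`, and `word_counts` are made up for illustration, and the regex assumes `href` is the first attribute and is double-quoted, which real HTML does not guarantee.

```ruby
# Hypothetical pattern: captures the href value and the link text of each
# <a href="...">...</a>. Assumes href comes first and uses double quotes.
LINK_RE = /<a\s+href="([^"]*)"[^>]*>(.*?)<\/a>/mi

# Returns an array of [href, description] pairs found in the given HTML.
def extract_links(html)
  html.scan(LINK_RE).map { |href, desc| [href, desc.strip] }
end

# Counts whitespace-separated words in a description.
def word_counts(desc)
  desc.split.each_with_object(Hash.new(0)) { |word, counts| counts[word] += 1 }
end

html = '<p><a href="http://example.com/a">foo bar foo</a> and ' \
       '<a href="http://example.com/b">baz</a></p>'

extract_links(html).each do |href, desc|
  puts "dest-url=\"#{href}\", desc=\"#{desc}\""
  word_counts(desc).each { |word, count| puts "  word=\"#{word}\", count=#{count}" }
end
```

Unlike the `rindex` calls in the question, `scan` picks up every link on a line rather than only the last description, so the output may differ when a line holds several links.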

Answer

To see where most of the time is being spent, profile the code with ruby-prof or a similar tool. Install ruby-prof:

gem install ruby-prof 

Then run your script under it:

ruby-prof <script.rb> 

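When your script finishes (or if you press CTRL-C), ruby-prof prints a summary of the method calls, the time spent in each method, and so on. Here is an excerpt of such output: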

Sort by: self_time

 %self     total      self      wait     child     calls  name
  8.67     0.008     0.008     0.000     0.000         2  JSON::Ext::Parser#parse
  8.45     0.022     0.008     0.000     0.014        99  IO#read_nonblock
  6.66     0.006     0.006     0.000     0.000        99  <Module::Kernel>#select
  2.78     0.003     0.003     0.000     0.000       235  IO#write
  1.17     0.001     0.001     0.000     0.000        57  Enumerator#next
  0.99     0.049     0.001     0.000     0.048       207  *Array#each
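
Once a profile points at a suspect, a quick way to check it in isolation is the stdlib `Benchmark` module. The sketch below is only an illustration, under the assumption that the repeated `File.open(..., 'a')` calls in the question are worth testing; it is not a measured result for that script. The helper name `compare_write_strategies` and the sample entries are hypothetical.

```ruby
require 'benchmark'
require 'tmpdir'

# Hypothetical helper: writes the same entries with two strategies and
# returns the elapsed wall-clock time of each, plus whether the two
# resulting files are byte-for-byte identical.
def compare_write_strategies(entries)
  Dir.mktmpdir do |dir|
    reopen_path = File.join(dir, 'reopen.txt')
    single_path = File.join(dir, 'single.txt')

    # Strategy 1: reopen the file in append mode for every entry,
    # as the code in the question does.
    reopen_time = Benchmark.realtime do
      entries.each do |entry|
        File.open(reopen_path, 'a') { |file| file.write(entry) }
      end
    end

    # Strategy 2: open the file once and reuse the handle.
    single_time = Benchmark.realtime do
      File.open(single_path, 'w') do |file|
        entries.each { |entry| file.write(entry) }
      end
    end

    { reopen: reopen_time,
      single: single_time,
      identical: File.read(reopen_path) == File.read(single_path) }
  end
end

# Made-up workload standing in for the parser's output lines.
entries = Array.new(5_000) do |i|
  "original-url=\"src\", dest-url=\"http://example.com/#{i}\"\n\n"
end

result = compare_write_strategies(entries)
puts format('reopen per write: %.4fs', result[:reopen])
puts format('single handle:    %.4fs', result[:single])
puts "identical output:  #{result[:identical]}"
```

The timings depend on the machine and filesystem, so treat them as a hint rather than proof; the profile of the real script is what decides which part to change.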