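I want to parse the href tags in an HTML file. Basically, I am trying to extract each link's URL and description. I also split the description on whitespace and count how many times each word occurs, then write the results to two separate files. My parser works, but it is very inefficient: it takes about 2 minutes to parse 1 MB of text.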
Here is my code:
    hrefTag = "<a href=\""
    qtMark = "\""
    descStart = "\">"
    hrefEnd = "</a>"

    # Extract the link target and its description from the current line.
    if line.include? hrefTag
      dest = line[/#{hrefTag}(.*?)#{qtMark}/m, 1]
      descStIn = line.rindex(descStart)
      descEndIn = line.rindex(hrefEnd)
      if (descStIn != nil && descEndIn != nil)
        desc = line[(descStIn+2)..(descEndIn-1)]
      end
    end

    if (source != "" && dest != "")
      occur = Hash.new(0)
      mainEntry = "original-url=\"" + source +
                  "\", dest-url=\"" + dest + "\""
      descEntry = ""
      if (desc != nil && desc != "")
        descEntry = ", desc=\"" + desc + "\""
        # Count how often each word appears in the description.
        words = desc.split(' ')
        words.each { |word| occur[word] += 1 }
      end
      firstEntry = mainEntry+descEntry+"\n\n"
      # The first file gets one entry per link.
      File.open(firstOutput, 'a') { |file|
        file.write(firstEntry)
      }
      # The second file gets one entry per word of the link's description.
      occur.each { |word, occurrance|
        wordEntry = ", word=\"" + word +
                    "\", count=" + occurrance.to_s
        secondEntry = mainEntry+wordEntry+"\n\n"
        File.open(secondOutput, 'a') { |file|
          file.write(secondEntry)
        }
      }
    end
How can I make this more efficient? Which parts are the least efficient?
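Two likely candidates for the slowdown (not measured here) are reopening both output files in append mode for every entry, and rebuilding the interpolated regex on every line. A minimal sketch of one way around both, using only the standard library: `LINK_RE` and `write_link_entries` are hypothetical names, and the regex assumes links of the form `<a href="...">text</a>`, so it may not match exactly what the `rindex` logic above matches.

```ruby
require 'stringio'

# Compiled once instead of on every line. Captures the href value and the
# link text. Assumption: links look like <a href="...">text</a>.
LINK_RE = /<a href="([^"]*)"[^>]*>(.*?)<\/a>/m

# Hypothetical helper: writes one entry per link to links_out and one entry
# per (link, word) pair to words_out. Both arguments are IO objects that the
# caller has already opened, so no file is reopened inside the loop.
def write_link_entries(html, source, links_out, words_out)
  html.scan(LINK_RE) do |dest, desc|
    next if source.empty? || dest.empty?

    main_entry = %(original-url="#{source}", dest-url="#{dest}")
    desc_entry = desc.empty? ? "" : %(, desc="#{desc}")
    links_out << main_entry << desc_entry << "\n\n"

    # Count how often each word appears in this link's description.
    counts = desc.split.each_with_object(Hash.new(0)) { |word, h| h[word] += 1 }
    counts.each do |word, count|
      words_out << main_entry << %(, word="#{word}", count=#{count}\n\n)
    end
  end
end
```

In real use, the two files would be opened once around the whole run, for example `File.open(firstOutput, 'a') { |links| File.open(secondOutput, 'a') { |words| write_link_entries(html, source, links, words) } }`. Passing `StringIO` objects instead makes the helper easy to try out without touching the disk.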
Use [nokogiri](http://nokogiri.org/)... it will make your life easier.