Ruby Threads - Out of Resources

I wrote the scraper below, which takes a list of URLs from a file and fetches each page. The problem is that after about 2 hours the system becomes very slow and virtually unusable. The system is a quad-core Linux box with 8 GB of RAM. Can someone tell me how to fix this?
require 'rubygems'
require 'net/http'
require 'uri'

threads = []
to_get = File.readlines(ARGV[0])   # input file: one "entity\turl1::url2::..." line per query
dir = ARGV[1]                      # output directory
errorFile = ARGV[2]
error_f = File.open(errorFile, "w")

puts "Need to get #{to_get.length} queries ..!!"
start_time = Time.now

100.times do
  threads << Thread.new do
    while q_word = to_get.pop
      toks = q_word.chomp.split("\t")    # chomp (not chop) so we only strip the newline
      entity = toks[0]
      urls = toks[1].chomp.split("::")
      count = 1
      urls.each do |url|
        q_final = URI.escape(url)
        q_parsed = URI.parse(q_final)
        filename = dir + "/" + entity + "_" + count.to_s
        if File.exists?(filename)
          # already fetched on a previous run; skip it
          count = count + 1
        else
          begin
            res_http = Net::HTTP.get(q_parsed.host, q_parsed.request_uri)
            File.open(filename, 'w') { |f| f.write(res_http) }
          rescue Timeout::Error
            error_f.write("timeout error " + url + "\n")
          rescue
            error_f.write($!.inspect + " " + filename + " " + url + "\n")
          end
          count = count + 1
        end
      end
    end
  end
end

puts "waiting here"
threads.each { |x| x.join }
puts "finished in #{Time.now - start_time}"
#puts "#{dup} duplicates found"
puts "writing output ..."
error_f.close()
puts "Done."
Yes, that's it. With 100 threads running, the program keeps the last 100 downloads in memory. It could free them sooner by setting res_http = nil after the file is written, or better, move the download-and-write into a subroutine so that res_http quickly goes out of scope. The GC should take care of the rest. – 2011-03-01 00:00:23
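A minimal sketch of that refactor, assuming a hypothetical helper named fetch_and_store (the name is not in the original script): the download and the file write move into one method, so res_http becomes unreachable as soon as the method returns, rather than each thread pinning its most recent response body until the next download overwrites it.

def fetch_and_store(q_parsed, filename, url, error_f)
  # res_http is local to this method: it goes out of scope on return,
  # so the GC can reclaim the body promptly.
  res_http = Net::HTTP.get(q_parsed.host, q_parsed.request_uri)
  File.open(filename, 'w') { |f| f.write(res_http) }
rescue Timeout::Error
  error_f.write("timeout error " + url + "\n")
rescue
  error_f.write($!.inspect + " " + filename + " " + url + "\n")
end

Inside the urls.each loop, the whole begin/rescue block would then collapse to a single call: fetch_and_store(q_parsed, filename, url, error_f).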