2013-01-10 37 views

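Malformed CSV error using rails CSV (FasterCSV)

I'm having serious trouble parsing some CSV files in Rails. Basically, my app takes a CSV file uploaded by the user. The app then converts the file to make sure it is UTF-8, and then attempts to parse and process it. However, whenever the app tries to parse it, I get a MalformedCSVError stating "Illegal quoting on line 1".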

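What I don't get is that if I copy the contents of the original file into a new file and save it, I can parse that new file on the Rails console without any problem.

If I try to parse the original file, it complains about invalid characters for the UTF-8 encoding (the file isn't in UTF-8, which is why the app converts it).

If I try to parse the file after the app has converted it to UTF-8 and changed the line endings to LF, it still fails to parse.

If I run a file diff between the version the app produces and the copy/paste version I made (which works), there are 0 differences, so I really can't figure out why one can be parsed and the other can't.

Any suggestions? The code my app uses to process the file is as follows: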

def create
  @survey = Survey.new(params[:survey])

  # Now we need to try and convert this to UTF-8 if it isn't already
  encoded = File.read(@survey.survey_data.current_path)
  encoding = CharlockHolmes::EncodingDetector.detect(encoded)

  # We've got a guess at the encoding, so we can try and convert it,
  # but it may still fail, so we need to handle that
  begin
    re_encoded = CharlockHolmes::Converter.convert(encoded, encoding[:encoding], 'UTF-8')
    re_encoded = re_encoded.gsub(/\r\n?/, "\n")

    # Now replace the uploaded file
    File.open(@survey.survey_data.current_path, 'w') { |f| f.write(re_encoded) }
  rescue ArgumentError
    puts "UH OH!!!!!"
  end

  puts "#{@survey.survey_data.current_path}"
  @parsed = CSV.read(@survey.survey_data.current_path)

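The file upload gem is CarrierWave, if that makes any difference.

Please can someone help me, as this is driving me crazy!

EDIT

The error says it is on line 1. Line 1 (assuming lines are not indexed from 0) is: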

"Survey","RD","GarrysMDs","NigelsMDs","PaulsMDs","StephensMDs","BrinleyJ","CarolineP","DaveL","GrantR","GregS","Kent","NeilC","NicolaP","AndyC","DarrenS","DeanB","KarenF","PaulR","RichardF","SteveG","BrianG","GordonA","NickD","NickR","NickT","RayL","SimonH","EdmondH","JasonF","MikeS","SamanthaN","TimB","TravisF","AlanS","Q1","Q2","Q3","Q4","Q5","Q6","Q7","Q8PM","Q8N","Q9","Q10","Q11","Q12","Q13","Q14","Q15","Q16PM","Q16N","Q17PM","Q17N","Q18PM","Q18N","Q19","Q20","Q21","Q22","comment","Q23.1","Q23.2","Q23.3","TQ23.1","TQ23.2","VPM","VN","VQ1","VQ2","VQ3","VQ4","VQ5","VQ6","VQ7","VQ8N","VQ8PM","VQ9","VQ10","VQ11","VQ12","VQ13","VQ14","VQ15","VQ16","VQ16N","VQ16PM","VQ17","VQ17N","VQ17PM","VQ18","VQ18N","VQ18PM","VQ19","VQ20","VQ21","VQ22","VQ23.1","VQ23.2","VQ23.3","VRD","XQ16","XQ17","XQ18" 

Which line is the error on? –


It says line 1. I'll add that line to the question now. – PaReeOhNos


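How are you doing the diff? If one file parses and the other doesn't, then there has to be a difference between the two. Don't just run 'diff', run 'cmp' instead. It will catch the exact byte differences. – Casper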
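For anyone who wants to do the same byte-level check from Ruby rather than with 'cmp', here is a minimal sketch. The helper name first_byte_difference and the two sample strings are made up for illustration; with real files you would read each one with File.binread(path) instead.

```ruby
# Return the offset of the first differing byte between two strings,
# or nil when they are byte-for-byte identical.
# (Hypothetical helper, not part of any library.)
def first_byte_difference(a, b)
  a_bytes = a.bytes
  b_bytes = b.bytes
  limit = [a_bytes.length, b_bytes.length].max
  (0...limit).find { |i| a_bytes[i] != b_bytes[i] }
end

# Made-up sample data: the same CSV header, once with a UTF-8 BOM in front.
# With real files, use File.binread(path) to get the raw bytes.
with_bom    = "\xEF\xBB\xBF".b + %("Survey","RD"\n).b
without_bom = %("Survey","RD"\n).b

offset  = first_byte_difference(with_bom, without_bom)
leading = with_bom.byteslice(0, 3).unpack1("H*")

puts "first difference at byte offset: #{offset.inspect}"
puts "leading bytes (hex): #{leading}"
```

Comparing raw bytes like this sidesteps any text-level normalisation a diff tool or editor might apply, which is the point of reaching for 'cmp' here.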

Answer


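Well, this was irritating!

It turns out the original file had a BOM (byte order mark), which was causing the CSV parser to break. Loading the file with

CSV.open("path/to/file.csv", "rb:bom|encoding")

allows it to parse perfectly (here "encoding" presumably stands for the file's actual encoding name, such as utf-8). It was annoying how long this took to track down, but it's working now, and there is no longer any need to convert to UTF-8!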
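To see the effect in isolation, here is a small sketch that writes a throwaway CSV file with a UTF-8 BOM and reads it back both ways. The file name and the two sample rows are made up, and "utf-8" is assumed as the encoding. The explicit "r:utf-8" mode is used for the failing case because newer versions of the csv library may detect a BOM automatically when no encoding is given.

```ruby
require "csv"
require "tmpdir"

# Made-up sample file: a UTF-8 BOM followed by two quoted CSV rows.
path = File.join(Dir.mktmpdir, "survey_with_bom.csv")
File.binwrite(path, "\xEF\xBB\xBF".b + %("Survey","RD"\n"1","2"\n).b)

# Plain UTF-8 read: the BOM stays in the data, so the first field starts
# with U+FEFF instead of a quote and the parser rejects the quote after it.
plain_error =
  begin
    CSV.open(path, "r:utf-8", &:read)
    nil
  rescue CSV::MalformedCSVError => e
    e
  end

# "bom|utf-8" tells Ruby to skip a leading BOM before CSV sees any data.
rows = CSV.open(path, "r:bom|utf-8", &:read)

puts "without BOM handling: #{plain_error.class}"
puts "with BOM handling: #{rows.inspect}"
```

The same "bom|utf-8" string can be passed as the encoding: option to CSV.read or CSV.foreach if you prefer those over CSV.open.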


Oh dear :) Well, you figured it out. For those wondering: http://en.wikipedia.org/wiki/Byte_order_mark. You know you can accept your own answer... – Casper