2016-03-26 24 views
1

CSV parsing in Java where the text qualifier is used only when the content contains a comma. I have a CSV file with the following content:

1,"hello, there",I have a csv in which,"only when ""double quote"" or comma are there in the content",it will be wrapped in the double quotes,otherwise not,something like 1/2" will not be wrapped up in double quotes. 

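I tried parsing it with CSV libraries such as OpenCSV, but that did not work.

I also tried the regular expression referenced in a StackOverflow question, but that did not work either.

However, the file opens correctly in Excel. Can someone give me a hint on how to parse this CSV file?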

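Please note that content is wrapped in the text qualifier only when it contains a comma. When such content is wrapped in double quotes and a double quote is itself part of the content, that quote is escaped with another double quote; in other words, it becomes a doubled double quote. However, if the content contains a double quote but no comma, it is not wrapped in the text qualifier.

Please advise.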
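The quoting rule in the note above can be sketched as a small hand-written splitter. This is only an illustration under stated assumptions, not the behaviour of OpenCSV or any other library: the class `LenientCsvLine` and its `parse` method are hypothetical names, a field counts as quoted only when its very first character is a double quote, and each record is assumed to fit on a single line.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of any CSV library. */
final class LenientCsvLine {

    /** Splits one line. A field is treated as quoted only if it starts with a double quote. */
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        int n = line.length();
        int i = 0;
        while (true) {
            field.setLength(0);
            if (i < n && line.charAt(i) == '"') {
                i++; // skip the opening qualifier
                while (i < n) {
                    char c = line.charAt(i);
                    if (c == '"' && i + 1 < n && line.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote becomes one literal quote
                        i += 2;
                    } else if (c == '"') {
                        i++; // closing qualifier
                        break;
                    } else {
                        field.append(c);
                        i++;
                    }
                }
            }
            // Unquoted content (or anything left before the next comma):
            // a double quote here is an ordinary character, as in 1/2".
            while (i < n && line.charAt(i) != ',') {
                field.append(line.charAt(i));
                i++;
            }
            fields.add(field.toString());
            if (i >= n) {
                break; // end of line reached
            }
            i++; // skip the comma; a trailing comma yields a final empty field
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = "1,\"hello, there\",\"say \"\"hi\"\" first\",1/2\" drill,";
        for (String f : parse(sample)) {
            System.out.println(f.isEmpty() ? "<BLANK>" : f);
        }
    }
}
```

An unquoted field that happens to begin with a double quote would still be read as quoted, so the rule stays ambiguous for such data; multi-line quoted fields are not handled either.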

When parsed, the content above should produce the following output:

1 
hello, there 
I have a csv in which 
only when "double quote" or comma are there in the content 
it will be wrapped in the double quotes 
otherwise not 
something like 1/2" will not be wrapped up in double quotes. 

I tried OpenCSV, and I also tried splitting with the following regular expression:

",(?=([^\"]*\"[^\"]*\")*[^\"]*$)" 

But it did not work.
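A likely reason the split fails: the lookahead accepts a comma only when an even number of double quotes follows it, so a single unpaired quote such as the one in 1/2" flips the parity for every comma before it. The sketch below demonstrates this on two short made-up lines; the class name `SplitRegexDemo` is hypothetical, and the pattern is the one quoted above.

```java
import java.util.Arrays;

/** Hypothetical demo class; the pattern is the one from the question. */
final class SplitRegexDemo {

    // Splits on a comma only if an even number of double quotes follows it.
    static final String SPLIT = ",(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    static String[] split(String line) {
        return line.split(SPLIT);
    }

    public static void main(String[] args) {
        // Balanced quotes: the comma inside the quoted field is protected.
        System.out.println(Arrays.toString(split("a,\"b, c\",d")));
        // One stray quote (1/2"): the parity flips, so the protected comma
        // becomes a split point and the real delimiters are skipped.
        System.out.println(Arrays.toString(split("a,\"b, c\",1/2\" x")));
    }
}
```

With balanced quotes the line splits into three parts, with the qualifiers still attached to the middle one. With the stray quote it splits into only two parts, cut at the comma inside the quoted field.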

My data looks like this:

PRODUCT,,1/2" 18V CORDLESS XRP LI-LON DRILL/DRIVE,P,2510906459,,DEWALT TOOLS,,,<br><img src="http://example.com/image.png"><br><br><p><b>UNIT OF MEASURE: EA<br><br> QTY PER UNIT OF MEASURE: 1<br><br> MINIMUM ORDER QUANTITY: 1<br></P></b>DEWALT TOOLS DCD960KL - 1/2" 18V CORDLESS XRP LI-LON DRILL/DRIVER KIT - XRP™ CORDLESS DRILLS - BEST IN CLASS LENGTH FOR IMPROVED BALANCE AND BETTER CONTROL|LED WORKLIGHT PROVIDES INCREASED VISIBILITY IN CONFINED SPACES|PATENTED 3-SPEED ALL-METAL TRANSMISSION MATCHES THE TOOL TO TASK FOR FASTEST APPLICATION SPEED AND IMPROVED - EQUAL TO 115-DCD960KL, 

I would like this to be parsed as follows (I use <BLANK> to denote an empty cell, as we would see it in Excel):

PRODUCT 
<BLANK> 
1/2" 18V CORDLESS XRP LI-LON DRILL/DRIVE 
P 
2510906459 
<BLANK> 
DEWALT TOOLS 
<BLANK> 
<BLANK> 
<br><img src="http://example.com/image.png"><br><br><p><b>UNIT OF MEASURE: EA<br><br> QTY PER UNIT OF MEASURE: 1<br><br> MINIMUM ORDER QUANTITY: 1<br></P></b>DEWALT TOOLS DCD960KL - 1/2" 18V CORDLESS XRP LI-LON DRILL/DRIVER KIT - XRP™ CORDLESS DRILLS - BEST IN CLASS LENGTH FOR IMPROVED BALANCE AND BETTER CONTROL|LED WORKLIGHT PROVIDES INCREASED VISIBILITY IN CONFINED SPACES|PATENTED 3-SPEED ALL-METAL TRANSMISSION MATCHES THE TOOL TO TASK FOR FASTEST APPLICATION SPEED AND IMPROVED - EQUAL TO 115-DCD960KL 
+0

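Your question is currently hard to read and understand. Could you format it properly, add an example of what your content may look like and how you want to split it, and show what you have tried so far? –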

+0

Thank you, Sebastian, for reading this. I have edited the content. Please let me know whether it is more readable now. –

+1

As stated, I would say this problem is impossible to solve unless you give a stricter definition of "something like 1/2"". –

Answers

1

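I had no problem parsing your input with uniVocity-parsers:

import java.io.Reader;
import java.io.StringReader;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
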
String input = "PRODUCT,,1/2\" 18V CORDLESS XRP LI-LON DRILL/DRIVE,P,2510906459,,DEWALT TOOLS,,,<br><img src=\"http://example.com/image.png\"><br><br><p><b>UNIT OF MEASURE: EA<br><br> QTY PER UNIT OF MEASURE: 1<br><br> MINIMUM ORDER QUANTITY: 1<br></P></b>DEWALT TOOLS DCD960KL - 1/2\" 18V CORDLESS XRP LI-LON DRILL/DRIVER KIT - XRP™ CORDLESS DRILLS - BEST IN CLASS LENGTH FOR IMPROVED BALANCE AND BETTER CONTROL|LED WORKLIGHT PROVIDES INCREASED VISIBILITY IN CONFINED SPACES|PATENTED 3-SPEED ALL-METAL TRANSMISSION MATCHES THE TOOL TO TASK FOR FASTEST APPLICATION SPEED AND IMPROVED - EQUAL TO 115-DCD960KL,"; 
Reader reader = new StringReader(input);

CsvParserSettings settings = new CsvParserSettings(); // many options here, check the tutorial.
settings.setNullValue("<BLANK>"); // use that to obtain <BLANK> to represent nulls

String[] row = new CsvParser(settings).parseAll(reader).get(0);
for (String element : row) {
    System.out.println(element);
}

Output:

PRODUCT 
<BLANK> 
1/2" 18V CORDLESS XRP LI-LON DRILL/DRIVE 
P 
2510906459 
<BLANK> 
DEWALT TOOLS 
<BLANK> 
<BLANK> 
<br><img src="http://example.com/image.png"><br><br><p><b>UNIT OF MEASURE: EA<br><br> QTY PER UNIT OF MEASURE: 1<br><br> MINIMUM ORDER QUANTITY: 1<br></P></b>DEWALT TOOLS DCD960KL - 1/2" 18V CORDLESS XRP LI-LON DRILL/DRIVER KIT - XRP™ CORDLESS DRILLS - BEST IN CLASS LENGTH FOR IMPROVED BALANCE AND BETTER CONTROL|LED WORKLIGHT PROVIDES INCREASED VISIBILITY IN CONFINED SPACES|PATENTED 3-SPEED ALL-METAL TRANSMISSION MATCHES THE TOOL TO TASK FOR FASTEST APPLICATION SPEED AND IMPROVED - EQUAL TO 115-DCD960KL 
<BLANK> 

Disclaimer: I am the author of this library. It is open source and free (Apache 2.0 license).

+1

Hi @JeronimoBackes, this is my dream API for parsing my CSV. You rock. –

+1

Thank you, this saved a lot of time. None of the other parsers was able to detect the multi-line case – ashoka

1

Try the following regular expression:

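import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Files.lines throws IOException; try-with-resources closes the file when done.
try (Stream<String> lines = Files.lines(Paths.get("path to csv file"))) {
    // Group 1 captures a quoted field without its qualifiers;
    // the second alternative matches an unquoted field up to the next comma.
    Pattern regex = Pattern.compile("\"(.*?)\"(?=,|$)|(?<=(?:,|^))(.*?)(?=,|$)",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    lines.forEach(line -> {
        Matcher matcher = regex.matcher(line);
        while (matcher.find()) {
            String content = matcher.group(1) == null ? matcher.group() : matcher.group(1);
            System.out.println(content);
        }
    });
}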

Given the sample input text

1,"hello, there",I have a csv in which, 
"only when ""double quote"" or comma are there in the content", 
it will be wrapped in the double quotes,otherwise not, 
something like 1/2" will not be wrapped up in double quotes. 

it emits:

1 
hello, there 
I have a csv in which 
only when ""double quote"" or comma are there in the content 
it will be wrapped in the double quotes 
otherwise not 
something like 1/2" will not be wrapped up in double quotes. 
+0

@saleem-mirza Thank you for your answer, it works like a charm. The only thing missing is that it does not return the last item. Could you suggest something? –

+0

Also, the doubled double quotes are passed through as-is: they are not reduced to single, unescaped double quotes. –

+0

Guys, check out the updated code. Hopefully it works for you. – Saleem
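On the comment about doubled quotes being passed through as-is: one possible post-processing step is to collapse `""` into `"` in the captured quoted group. The sketch below reuses the pattern from the answer (without the flags, which single-line input does not need); the class `RegexFields` and its `fields` method are hypothetical names, and this is not necessarily what the updated answer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper built around the pattern from the answer above. */
final class RegexFields {

    private static final Pattern FIELD =
            Pattern.compile("\"(.*?)\"(?=,|$)|(?<=(?:,|^))(.*?)(?=,|$)");

    /** Returns the fields of one line, un-escaping doubled quotes in quoted fields. */
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            String quoted = m.group(1);
            if (quoted != null) {
                out.add(quoted.replace("\"\"", "\"")); // "" inside qualifiers -> "
            } else {
                out.add(m.group(2)); // unquoted field, quotes kept literally
            }
        }
        return out;
    }

    public static void main(String[] args) {
        fields("1,\"hello, there\",\"only when \"\"double quote\"\" or comma\",1/2\" x")
                .forEach(System.out::println);
    }
}
```

A limitation remains: because the quoted alternative stops at the first quote that is followed by a comma, a quoted field in which a doubled quote sits directly before a comma (for example `"say ""hi"", ok"`) is still cut short. A character-by-character parser or a CSV library avoids that case.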