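I agree with Matt Jacob's answer: you should parse CSV with Text::CSV unless you have a very good reason not to.
If you are going to handle it with regular expressions, I think you will do better with m// than with split. For example, the following seems to cover most single-line CSV data variants, though it does not strip the quotes from quoted fields as Text::CSV would; that requires a separate post-processing step.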
use strict;
use warnings;
sub splitter
{
    my($row) = @_;
    my @fields;
    my $i = 0;
    while ($row =~ m/((?=,)|[^",][^,]*|"([^"]|"")*")(?:,|$)/g)
    {
        print "Found [$1]\n";
        $fields[$i++] = $1;
    }
    for (my $j = 0; $j < @fields; $j++)
    {
        print "$j = [$fields[$j]]\n";
    }
}
my $row;
$row = q'ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6';
print "Row 1: $row\n";
splitter($row);
$row = q'ACC000121,",",2290,"01009900,""aux data"",01009902,01009903,01009904",,5"abc",6,""';
print "Row 2: $row\n";
splitter($row);
Clearly, this contains a fair amount of diagnostic code. The output (from Perl 5.22.0 on Mac OS X 10.11.1) is:
Row 1: ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6
Found [ACC000121]
Found [2290]
Found ["01009900,01009901,01009902,01009903,01009904"]
Found [4]
Found [5]
Found [6]
0 = [ACC000121]
1 = [2290]
2 = ["01009900,01009901,01009902,01009903,01009904"]
3 = [4]
4 = [5]
5 = [6]
Row 2: ACC000121,",",2290,"01009900,""aux data"",01009902,01009903,01009904",,5"abc",6,""
Found [ACC000121]
Found [","]
Found [2290]
Found ["01009900,""aux data"",01009902,01009903,01009904"]
Found []
Found [5"abc"]
Found [6]
Found [""]
0 = [ACC000121]
1 = [","]
2 = [2290]
3 = ["01009900,""aux data"",01009902,01009903,01009904"]
4 = []
5 = [5"abc"]
6 = [6]
7 = [""]
In the Perl code, the match is:
m/((?=,)|[^",][^,]*|"([^"]|"")*")(?:,|$)/
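This looks for and captures (in $1) one of three things: an empty field followed by a comma; a character that is neither a double quote nor a comma, followed by zero or more non-commas; or a double quote, followed by zero or more occurrences of "not a double quote, or two consecutive double quotes", and a closing double quote. It then expects either a comma or the end of the string.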
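To make the three alternatives easier to see, here is a sketch of the same pattern laid out with the /x modifier. The sub name csv_fields is made up for illustration, and the inner group is written as non-capturing, which does not change what ends up in $1.

```perl
use strict;
use warnings;

# Hypothetical helper: the same pattern as above, spread out with /x.
sub csv_fields {
    my ($row) = @_;
    my @fields;
    while ($row =~ m/
            (                       # capture one field
                (?=,)               # empty field: the next character is a comma
              | [^",][^,]*          # unquoted field: no leading quote or comma
              | "(?:[^"]|"")*"      # quoted field: "" is an embedded quote
            )
            (?:,|$)                 # field ends at a comma or end of string
        /gx)
    {
        push @fields, $1;
    }
    return @fields;
}

my @f = csv_fields(q'ACC000121,",",2290,"a,""b"",c",,5"abc",6,""');
print scalar(@f), " fields\n";
print "[$_]\n" for @f;
```

The comments inside the pattern deliberately avoid `$`, `@` and `/`, since Perl interpolates variables and looks for the closing delimiter before it strips /x comments.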
Handling multi-line fields requires a bit more work. Removing the escaping double quotes also takes more work.
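For the quote removal only (not multi-line fields), a minimal sketch of the post-processing step could look like this. The sub name unquote_field is hypothetical, and it assumes each field arrives exactly as captured by the match above, with any enclosing quotes still attached.

```perl
use strict;
use warnings;

# Hypothetical helper: strip one layer of CSV quoting from a captured field.
sub unquote_field {
    my ($field) = @_;
    # Only a field wholly wrapped in double quotes is treated as quoted.
    if ($field =~ m/\A"(.*)"\z/s) {
        $field = $1;
        $field =~ s/""/"/g;    # a doubled quote stands for one literal quote
    }
    return $field;
}

for my $raw ('"01009900,""aux data"",01009902"', '5"abc"', '""', '2290') {
    printf "%-35s => [%s]\n", $raw, unquote_field($raw);
}
```

A field such as 5"abc" is left untouched because it does not start with a double quote; whether that is the right treatment depends on how strict you want to be about such data.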
Using Text::CSV is much simpler and less error-prone (and it can handle more variants than this does).
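For comparison, a minimal sketch of the Text::CSV approach on the first sample row. It assumes the module is installed from CPAN (it is not part of core Perl); the wrapper name parse_row is made up for illustration.

```perl
use strict;
use warnings;
use Text::CSV;    # CPAN module, not part of core Perl

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag();

# Hypothetical wrapper: parse one line and return its fields.
sub parse_row {
    my ($row) = @_;
    $csv->parse($row) or die "Parse failed: " . $csv->error_diag();
    return $csv->fields();
}

my @fields = parse_row(q'ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6');
print "$_ = [$fields[$_]]\n" for 0 .. $#fields;
```

Here the quoted field comes back with its enclosing quotes already removed, so no separate post-processing step is needed.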
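It looks like 'split' is working as designed. Also, your first line isn't doing what you think it is doing: you get one empty string and 3 'undef's.
Is it possible to get the whole double-quoted string into one scalar variable? How can this be achieved? – user
I'm curious why you expect 6 fields in the output when the code explicitly handles only 4 fields.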