What is the best way to compare arrays of strings in Perl?

I want to compare several arrays of strings that contain directory file listings. The goal is to determine which files exist in each directory and which do not. Consider:

List1   List2   List3   List4
a       a       e       f
b       b       d       g
c       f       a       h

The result should be:

List1:

        List1   List2   List3   List4
a       yes     yes     yes     no
b       yes     yes     no      no
c       yes     no      no      no

List2:

        List1   List2   List3   List4
a       yes     yes     yes     no
b       yes     yes     no      no
f       no      yes     no      yes

...

I could go through all the arrays and walk each entry, checking it against all the other arrays with a grep:

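for my $curfile (@currentdirfiles) {
    # Compare with eq: a regex match would also accept substrings
    # and treat characters such as "." as metacharacters.
    if (grep { $_ eq $curfile } @otherarrsfiles) {
        # set 'yes'
    } else {
        # set 'no'
    }
}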

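My only worry is that I end up with something on the order of O(n^2). I may not be able to do anything about that, since I would end up looping through all the arrays anyway. There may be an improvement to be had in the grep call, but I am not sure.

Any ideas?

Source

2011-04-27 EDJ

Now that the question has been amended, this produces the answer you want. It works in O(N^2) time, which is optimal for the problem (there are O(N^2) outputs).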

#!/usr/bin/env perl 

use strict; 
use warnings; 

#List1 List2 List3 List4 
#a  a  e  f 
#b  b  d  g 
#c  f  a  h 

my(@lists) = ({ a => 1, b => 1, c => 1 }, 
       { a => 1, b => 1, f => 1 }, 
       { e => 1, d => 1, a => 1 }, 
       { f => 1, g => 1, h => 1 }, 
      ); 

my $i = 0; 
foreach my $list (@lists) 
{ 
    analyze(++$i, $list, @lists); 
} 

sub analyze
{
    my($num, $ref, @lists) = @_;
    printf "List %d\n", $num;

    # Header row: a blank cell above the file names, then one column per list.
    printf "%-8s", "";
    printf "%-8s", "List$_" foreach 1 .. @lists;
    print "\n";

    foreach my $file (sort keys %{$ref})
    {
        printf "%-8s", $file;
        foreach my $list (@lists)
        {
            # Look the file up through the reference instead of copying the hash.
            printf "%-8s", (defined $list->{$file}) ? "yes" : "no";
        }
        print "\n";
    }
    print "\n";
}

The output I get is:

List 1
        List1   List2   List3   List4
a       yes     yes     yes     no
b       yes     yes     no      no
c       yes     no      no      no

List 2
        List1   List2   List3   List4
a       yes     yes     yes     no
b       yes     yes     no      no
f       no      yes     no      yes

List 3
        List1   List2   List3   List4
a       yes     yes     yes     no
d       no      no      yes     no
e       no      no      yes     no

List 4
        List1   List2   List3   List4
f       no      yes     no      yes
g       no      no      no      yes
h       no      no      no      yes
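
The lists in the question are plain arrays, whereas the script above starts from hand-written hashes. Here is a minimal sketch of the conversion, assuming each listing is held as an array reference. The names `@arrays` and `to_lookup` are only illustrative and are not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical input: one array reference per directory listing.
my @arrays = (
    [qw(a b c)],
    [qw(a b f)],
    [qw(e d a)],
    [qw(f g h)],
);

# Turn one listing into a lookup hash: { filename => 1, ... }.
sub to_lookup {
    my ($files) = @_;
    my %seen = map { $_ => 1 } @$files;
    return \%seen;
}

# Same shape as the hand-written list of hashes used by analyze().
my @lists = map { to_lookup($_) } @arrays;

# Each membership test is now a single hash lookup.
my $answer = defined $lists[1]{f} ? "yes" : "no";
print "f in List2: $answer\n";
```

Building each hash costs one pass over its array, after which no grep over the other arrays is needed.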

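Source

2011-04-27 05:09:11

The 'defined' check is what makes this easier to bear, along with the fact that working with hashes will be the most efficient way to search through hundreds or thousands of lines (files). Thanks. – EDJ 2011-04-27 18:27:00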

For lots of string lookups, you generally want to use a hash. Here's one way to do it:

use strict; 
use warnings; 

# Define the lists: 
my @lists = (
    [qw(a b c)], # List 1 
    [qw(a b f)], # List 2 
    [qw(e d a)], # List 3 
    [qw(f g h)], # List 4 
); 

# For each file, determine which lists it is in: 
my %included; 

for my $n (0 .. $#lists) { 
    for my $file (@{ $lists[$n] }) {
        $included{$file}[$n] = 1;
    } # end for each $file in this list
} # end for each list number $n 

# Print out the results: 
my $fileWidth = 8; 

for my $n (0 .. $#lists) { 

    # Print the header rows: 
    printf "\nList %d:\n", $n+1; 

    print ' ' x $fileWidth; 
    printf "%-8s", "List $_" for 1 .. @lists; 
    print "\n"; 

    # Print a line for each file: 
    for my $file (@{ $lists[$n] }) {
        printf "%-${fileWidth}s", $file;

        printf "%-8s", ($_ ? 'yes' : 'no') for @{ $included{$file} }[0 .. $#lists];
        print "\n";
    } # end for each $file in this list
} # end for each list number $n

Source

2011-04-27 05:05:26 cjm

Your comment about hashes helped me a lot. Thanks. – EDJ 2011-04-27 18:27:54

The most obvious way is to use perl5i and autoboxing:

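use perl5i::2;   # perl5i has to be loaded with an explicit major version
my @list1 = qw(one two three);
my @list2 = qw(one two four);

my $missing = @list1->diff(\@list2);       # in @list1 but not in @list2
my $both    = @list1->intersect(\@list2);  # in both lists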

In a more restricted setup, use hashes for this, since the filenames will be unique:

sub in_list {
    my ($one, $two) = @_;
    my (@in, @out);
    my %a = map {$_ => 1} @$one;

    foreach my $f (@$two) {
        if ($a{$f}) {
            push @in, $f;
        }
        else {
            push @out, $f;
        }
    }
    return (\@in, \@out);
}

my @list1 = qw(one two three); 
my @list2 = qw(one two four);  
my ($in, $out) = in_list(\@list1, \@list2); 

print "In list 1 and 2:\n"; 
print " $_\n" foreach @$in; 

print "In list 2 and not in list 1\n"; 
print " $_\n" foreach @$out;

Source

2011-04-27 05:06:18 Alex

Why not just remember where each file is as you read them in?

Let's say you have a list of directories to read from in @dirlist:

use File::Slurp qw(read_dir); 
my %in_dir; 
my %dir_files; 

foreach my $dir (@dirlist) { 
    die "No such directory $dir" unless -d $dir; 
    foreach my $file (read_dir($dir)) {
        $in_dir{$file}{$dir} = 1;
        push @{ $dir_files{$dir} }, $file;
    }
}

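Now $in_dir{filename} will have an entry defined for each directory of interest that contains that file, and $dir_files{directory} will hold the list of files in each directory...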

foreach my $dir (@dirlist) {
    print "$dir\n";
    # Header row; the trailing newline keeps it off the first data row.
    print join("\t", "", @dirlist), "\n";
    foreach my $file (@{ $dir_files{$dir} }) {
        my @info = ($file);
        foreach my $dir_for_file (@dirlist) {
            if (defined $in_dir{$file}{$dir_for_file}) {
                push @info, "Yes";
            } else {
                push @info, "No";
            }
        }
        print join("\t", @info), "\n";
    }
}

Source

2011-04-27 05:07:29 unpythonic

Thanks, but the lists were sent to me as files. I am not reading them from directories. Good point, though :) – EDJ 2011-04-27 18:25:08
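
Since the listings arrive as files, one possible way to get them into arrays is sketched below. It assumes one filename per line, which the thread does not actually state, and `read_list` is a made-up helper name.

```perl
use strict;
use warnings;

# Read one listing file into an array reference.
# Assumption: one filename per line; blank lines are skipped.
sub read_list {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @files;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';
        push @files, $line;
    }
    close $fh;
    return \@files;
}

# Hypothetical usage: listing files named on the command line.
my @lists = map { read_list($_) } @ARGV;
printf "%s: %d entries\n", $ARGV[$_], scalar @{ $lists[$_] } for 0 .. $#ARGV;
```

The resulting array references can be fed to any of the hash-based answers in place of the hard-coded lists.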

My code is simpler, but the output isn't quite what you want:

use strict;
use warnings;

my @lst1 = ('a', 'b', 'c');
my @lst2 = ('a', 'b', 'f');
my @lst3 = ('e', 'd', 'a');
my @lst4 = ('f', 'g', 'h');

# Maps each item to a space-separated string of the lists it appears in.
my %hsh = ();

foreach my $item (@lst1) {
    $hsh{$item} = "list1";
}

foreach my $item (@lst2) {
    if (defined($hsh{$item})) {
        $hsh{$item} = $hsh{$item}." list2";
    }
    else {
        $hsh{$item} = "list2";
    }
}

foreach my $item (@lst3) {
    if (defined($hsh{$item})) {
        $hsh{$item} = $hsh{$item}." list3";
    }
    else {
        $hsh{$item} = "list3";
    }
}

foreach my $item (@lst4) {
    if (defined($hsh{$item})) {
        $hsh{$item} = $hsh{$item}." list4";
    }
    else {
        $hsh{$item} = "list4";
    }
}

foreach my $key (sort keys %hsh) {
    printf("%s %s\n", $key, $hsh{$key});
}

Which gives:

a list1 list2 list3 
b list1 list2 
c list1 
d list3 
e list3 
f list2 list4 
g list4 
h list4

Source

2011-04-27 05:12:05

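I would build a hash that uses the directory entries as keys, with each value being a hash (effectively a set) of the listings in which that entry was found. Iterate over each listing; for each new entry, add it to the outer hash with a single set (or hash) containing the identifier of the listing in which it was first encountered. For any entry that is already in the hash, simply add the current listing identifier to the value's set/hash.

From there you can simply post-process the sorted keys of the hash and create the rows of your result table.

Personally I think Perl is ugly, but here is a sample in Python: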

#!/usr/bin/env python3
import sys

if len(sys.argv) < 2:
    print("Must supply arguments", file=sys.stderr)
    sys.exit(1)
args = sys.argv[1:]

# build hash entries by iterating over each listing
d = dict()
for each_file in args:
    name = each_file
    with open(each_file, 'r') as f:
        for line in f:
            line = line.strip()
            if line not in d:
                d[line] = set()
            d[line].add(name)

# post process the hash
report_template = "%-20s" + (" %-10s" * len(args))
print(report_template % (("Dir Entries",) + tuple(args)))
for k in sorted(d.keys()):
    row = list()
    for col in args:
        # one yes/no cell per listing
        row.append("yes" if col in d[k] else "no")
    print(report_template % ((k,) + tuple(row)))

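This should mostly be legible as if it were pseudocode. The (k,) and ("Dir Entries",) expressions might look a little odd, but they are there to force those values to be tuples, which is what the % operator for strings needs in order to unpack them into the format string. They could also have been written as tuple([k]+row), for example (wrapping the first item in [] makes it a list, which can be added to the other list and then converted to a tuple as a whole).

Other than that, a translation into Perl should be pretty straightforward, just using hashes instead of dictionaries and sets. (Incidentally, this example handles an arbitrary number of listings, supplied as arguments and output as columns. Obviously, after a dozen or so columns the output becomes rather cumbersome to print or display, but it was an easy generalization to make.)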
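
For readers who do need Perl, the translation might look roughly like the sketch below. It is not the answerer's code. The helper names `index_listings` and `report_lines` are invented here, and the listings are passed in as in-memory arrays rather than read from files.

```perl
use strict;
use warnings;

# Build { entry => { listing_name => 1, ... } } from named listings.
sub index_listings {
    my (%listings) = @_;    # listing name => array ref of entries
    my %found_in;
    for my $name (keys %listings) {
        $found_in{$_}{$name} = 1 for @{ $listings{$name} };
    }
    return \%found_in;
}

# Produce one report line per entry, with a yes/no column per listing.
sub report_lines {
    my ($found_in, @names) = @_;
    my $format = "%-20s" . (" %-10s" x @names) . "\n";
    my @lines = sprintf($format, "Dir Entries", @names);
    for my $entry (sort keys %$found_in) {
        push @lines, sprintf($format, $entry,
            map { $found_in->{$entry}{$_} ? "yes" : "no" } @names);
    }
    return @lines;
}

my @names = qw(List1 List2 List3 List4);
my $found_in = index_listings(
    List1 => [qw(a b c)], List2 => [qw(a b f)],
    List3 => [qw(e d a)], List4 => [qw(f g h)],
);
print report_lines($found_in, @names);
```

The inner hash plays the role of the Python set, and the column order comes from `@names` rather than from the hash, so the table stays stable between runs.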

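Source

2011-04-27 05:30:28

Please don't answer Perl questions in Python. Most of the time, people do not get to choose which language they use. Their boss has already made that decision for them. – shawnhcorey 2011-04-27 13:23:30

I stated that it was intended to be read as pseudocode. The fact that it also happens to be Python is basically orthogonal to that (apart from the small fact that it allowed me to test it). – 2011-04-27 19:44:24

Sorry for the late reply. I have been polishing this for a while, because I did not want yet another negative score (it bums me out).

This is an interesting efficiency problem. I do not know whether my solution will work for you, but I thought I would share it anyway. It is probably only efficient if your arrays do not change too often and if they contain many duplicate values. I have not run any efficiency checks on it.

Basically, the solution removes one dimension of the cross-checking by turning the array values into bits and doing a bitwise comparison against the entire array in one go. The array values are deduplicated, sorted and given a serial number. The serial numbers of each array are then combined with bitwise OR and stored in a single value. A single array can thereby be checked for a given serial number with just one operation, e.g.:

if (array & serialno)

It requires one run to prepare the data, which can then be saved in a cache or similar. That data can be used until your data changes (e.g. files or folders are removed or added). I have added a fatal exit on undefined values, which means the data must be refreshed when that happens.

Good luck!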

use strict; 
use warnings; 

my @list1=('a', 'b', 'c'); 
my @list2=('a', 'b', 'f'); 
my @list3=('e', 'd', 'a'); 
my @list4=('f', 'g', 'h'); 

# combine arrays 
my @total = (@list1, @list2, @list3, @list4); 

# dedupe (Thanks Xetius for this code snippet) 
my %unique =(); 
foreach my $item (@total) 
{ 
    $unique{$item} ++; 
} 
# Default sort(), don't think it matters 
@total = sort keys %unique; 

# translate to serial numbers 
my %serials =(); 
for (my $num = 0; $num <= $#total; $num++) 
{ 
    $serials{$total[$num]} = $num; 
} 

# convert array values to serial numbers, and combine them 
my @tx =(); 
for my $entry (@list1) { $tx[0] |= 2**$serials{$entry}; } 
for my $entry (@list2) { $tx[1] |= 2**$serials{$entry}; } 
for my $entry (@list3) { $tx[2] |= 2**$serials{$entry}; } 
for my $entry (@list4) { $tx[3] |= 2**$serials{$entry}; } 

&print_all; 

sub inList
{
    my ($value, $list) = @_;
    # Undefined serial numbers are not accepted
    if (! defined ($serials{$value})) {
        print "$value is not in the predefined list.\n";
        exit;
    }
    return (2**$serials{$value} & $tx[$list]);
}

sub yesno 
{ 
    my ($value, $list) = @_; 
    return (&inList($value, $list) ? "yes":"no"); 
} 
# 
# The following code is for printing purposes only 
# 
sub print_all 
{ 
    printf "%-6s %-6s %-6s %-6s %-6s\n", "", "List1", "List2", "List3", "List4"; 
    print "-" x 33, "\n"; 
    &table_print(@list1); 
    &table_print(@list2); 
    &table_print(@list3); 
    &table_print(@list4); 
} 

sub table_print 
{ 
    my @list = @_; 
    for my $entry (@list) {
        printf "%-6s %-6s %-6s %-6s %-6s\n", $entry,
            &yesno($entry, 0),
            &yesno($entry, 1),
            &yesno($entry, 2),
            &yesno($entry, 3);
    }
    print "-" x 33, "\n"; 
}
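
One caveat with packing each list into a single integer: it can only tell apart as many distinct names as a native integer has bits (64 on typical builds). A possible workaround, sketched here as an assumption rather than as part of the answer above, is to keep the bits in a string with Perl's built-in vec. The names `%serial`, `@mask` and `list_has` are made up for this example.

```perl
use strict;
use warnings;

my @lists = ([qw(a b c)], [qw(a b f)], [qw(e d a)], [qw(f g h)]);

# Assign each distinct name a serial number (its bit position).
my %serial;
my $next = 0;
for my $name (sort map { @$_ } @lists) {
    $serial{$name} = $next++ unless exists $serial{$name};
}

# One bit string per list; vec() grows the string as needed.
my @mask = ('') x @lists;
for my $n (0 .. $#lists) {
    vec($mask[$n], $serial{$_}, 1) = 1 for @{ $lists[$n] };
}

# True if $name was seen in list number $n (0-based).
sub list_has {
    my ($name, $n) = @_;
    return 0 unless exists $serial{$name};
    return vec($mask[$n], $serial{$name}, 1);
}

my $answer = list_has('f', 1) ? "yes" : "no";
print "f in List2: $answer\n";
```

Unlike the script above, this sketch returns false for unknown names instead of exiting, which may or may not suit a cached setup.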

Source

2011-04-27 21:34:41 TLP
