2011-06-22 38 views
0

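OK, this is more for my own learning than out of actual need. Is there a more elegant solution for removing items from a batch of files?

I have files in the following format: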

Loading parser from serialized file ./englishPCFG.ser.gz ... done [2.8 sec]. 
Parsing file: chpt1_1.txt 
Parsing [sent. 1 len. 42]: [1.1, Organisms, Have, Changed, over, Billions, of, Years, 1, Long, before, the, mechanisms, of, biological, evolution, were, understood, ,, some, people, realized, that, organisms, had, changed, over, time, and, that, living, organisms, had, evolved, from, organisms, no, longer, alive, on, Earth, .] 
(ROOT 
    (S 
    (S 
     (NP (CD 1.1) (NNS Organisms)) 
     (VP (VBP Have) 
     (VP (VBN Changed) 
      (PP (IN over) 
      (NP 
       (NP (NNS Billions)) 
       (PP (IN of) 
       (NP (NNP Years) (CD 1))))) 
      (SBAR 
      (ADVP (RB Long)) 
      (IN before) 
      (S 
       (NP 
       (NP (DT the) (NNS mechanisms)) 
       (PP (IN of) 
        (NP (JJ biological) (NN evolution)))) 
       (VP (VBD were) 
       (VP (VBN understood)))))))) 
    (, ,) 
    (NP (DT some) (NNS people)) 
    (VP (VBD realized) 
     (SBAR 
     (SBAR (IN that) 
      (S 
      (NP (NNS organisms)) 
      (VP (VBD had) 
       (VP (VBN changed) 
       (PP (IN over) 
        (NP (NN time))))))) 
     (CC and) 
     (SBAR (IN that) 
      (S 
      (NP (NN living) (NNS organisms)) 
      (VP (VBD had) 
       (VP (VBN evolved) 
       (PP (IN from) 
        (NP 
        (NP (NNS organisms)) 
        (ADJP 
         (ADVP (RB no) (RBR longer)) 
         (JJ alive)))) 
       (PP (IN on) 
        (NP (NNP Earth))))))))) 
    (. .))) 

num(Organisms-2, 1.1-1) 
nsubj(Changed-4, Organisms-2) 
aux(Changed-4, Have-3) 
ccomp(realized-22, Changed-4) 
prep_over(Changed-4, Billions-6) 
prep_of(Billions-6, Years-8) 
num(Years-8, 1-9) 
advmod(understood-18, Long-10) 
dep(understood-18, before-11) 
det(mechanisms-13, the-12) 
nsubjpass(understood-18, mechanisms-13) 
amod(evolution-16, biological-15) 
prep_of(mechanisms-13, evolution-16) 
auxpass(understood-18, were-17) 
ccomp(Changed-4, understood-18) 
det(people-21, some-20) 

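I need to remove every dependency (the last section of the file) that is not one of the important relations, and then save the result as a new file. Here is my working code: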

#!/usr/bin/perl
use strict;
use warnings;

##Call with *.txt on command line
##EDIT TO ONLY FIND FILES YOU WANT CHANGED:
my @files = glob("parsed" . "*.txt");

foreach my $file (@files) {
    my @newfile;
    open(my $parse_corpus, '<', $file) or die $!;
    while (my $sentences = <$parse_corpus>) {
        #print $sentences, "\n\n";
        # Dependency lines look like: name(word-N, word-N)
        if ($sentences =~ /(\w+)\(\S+\-\d+\,\s\S+\-\d+\)/) {
            # Keep only the relations of interest
            if ($sentences =~ /subj\w*\(|obj\w*\(|prep\w*\(|xcomp\w*\(|agent\w*\(|purpcl\w*\(|conj_and\w*\(/) {
                push(@newfile, $sentences);
            }
        }
        else {
            # Every other line (header, parse tree, blank lines) is kept as-is
            push(@newfile, $sentences);
        }
    }
    close($parse_corpus);
    open(FILE, '>', "select$file") or die $!;
    print FILE @newfile;
    close(FILE);
}

And here is part of the resulting output file:

nsubj(Changed-4, Organisms-2) 
prep_over(Changed-4, Billions-6) 
prep_of(Billions-6, Years-8) 
nsubjpass(understood-18, mechanisms-13) 
prep_of(mechanisms-13, evolution-16) 
nsubj(realized-22, people-21) 
nsubj(changed-26, organisms-24) 
prep_over(changed-26, time-28) 
nsubj(evolved-34, organisms-32) 
conj_and(changed-26, evolved-34) 
prep_from(evolved-34, organisms-36) 
prep_on(evolved-34, Earth-41) 

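Is there a significantly better way to do this, or a more elegant or clever solution?

Thanks for your time. Again, this is purely out of interest, so please don't feel obliged to help if you don't have the time.

Answers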

3

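If I understand your logic, you want to print each line to the outfile by default, unless the line is a 'sentence' that matches the first condition. When the first condition matches, you only want the line written if the second condition matches as well. In that situation I tend to prefer "if this, unless that" logic, but that's just me. ;) Here is a reworked version of your code.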

use strict;
use warnings;
use autodie;

##Call with *.txt on command line
##EDIT TO ONLY FIND FILES YOU WANT CHANGED:
my @files = glob("parsed" . "*.txt");

foreach my $file (@files) {
    open my $parse_corpus, '<', "$file";
    open my $outfile, '>', "select$file";
    while (my $sentences = <$parse_corpus>) {
        if ($sentences =~ /(\w+)\(\S+\-\d+\,\s\S+\-\d+\)/) {
            next unless $sentences =~ /subj\w*\(|obj\w*\(|prep\w*\(|xcomp\w*\(|agent\w*\(|purpcl\w*\(|conj_and\w*\(/;
        }
        print $outfile $sentences;
    }
}

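I didn't try to refactor your regexes. I do find it more in keeping with my sense of efficiency to write the output file line by line while reading the input file. That eliminates the second loop, as well as the need for an output array.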
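For anyone who does want to tidy the regexes, here is one possible sketch rather than a definitive rewrite. It builds the alternation from a list with `qr//` and wraps both tests in a helper. The names `@keep`, `$dep_line`, `$keep_dep` and `keep_line` are made up for illustration; the patterns are meant to match the same lines as the originals.

```perl
use strict;
use warnings;

# Relation names to keep: the same list as in the original alternation.
my @keep     = qw(subj obj prep xcomp agent purpcl conj_and);
my $keep_alt = join '|', map { quotemeta } @keep;

# A dependency line looks like: name(word-N, word-N)
my $dep_line = qr/\w+\(\S+-\d+,\s\S+-\d+\)/;

# Matches when a listed name appears in the relation name, e.g. "nsubjpass(".
my $keep_dep = qr/(?:$keep_alt)\w*\(/;

# Returns 1 if the line should be written to the output file, 0 otherwise.
sub keep_line {
    my ($line) = @_;
    return 1 unless $line =~ $dep_line;    # non-dependency lines pass through
    return $line =~ $keep_dep ? 1 : 0;
}
```

Inside the read loop this would be used as `print {$outfile} $sentences if keep_line($sentences);`. Adding or dropping a relation then only means editing the `@keep` list.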

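Also, I used the autodie pragma instead of writing 'or die' after every I/O operation. Because the output file uses a lexical filehandle, it closes itself when it goes out of scope. One caveat: autodie only checks explicit calls, so an implicit close is not error-checked; add an explicit close if you want a failed close to be reported.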
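A minimal sketch of autodie together with explicit closes, in case you want errors at close time (a full disk, for example) to be reported as well. The helper name `copy_lines` is hypothetical. As far as I know, autodie only wraps explicit calls such as `open` and `close`, not the implicit close that happens when a lexical handle goes out of scope.

```perl
use strict;
use warnings;
use autodie;    # open, close, etc. now throw an exception on failure

# Copies a file line by line; both file names are supplied by the caller.
sub copy_lines {
    my ($in_name, $out_name) = @_;

    open my $in,  '<', $in_name;     # dies automatically if the open fails
    open my $out, '>', $out_name;

    while (my $line = <$in>) {
        print {$out} $line;
    }

    close $in;
    close $out;                      # explicit, so a failed close is reported
    return;
}
```

Wrapping a call in `eval { ... }` lets you catch the exception if you would rather log the failure and continue with the next file.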

+0

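Nice work! I learned from this, especially the point about the efficiency of the outfile. I like the "unless" change, and "autodie". Thanks! P.S. Change "my $newfile" to "my $outfile" to make the code perfect. – Jon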

+1

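Is there a way to do a "double next", so that you skip the current iteration of both the inner and the outer loop? (I don't know why I'd need it, I'm just curious.) – Jon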

+0

You can label your loops and then use 'next LABEL'. In this particular code, though, that would skip to the next file. See the perlsyn documentation. – DavidO
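A small sketch of the labeled-loop idea, using made-up data in place of real files (the label `FILE` and the `'SKIP'` marker are only for illustration):

```perl
use strict;
use warnings;

# Hypothetical data: each inner list stands in for the lines of one file.
my @files = ( [ 'a', 'SKIP', 'b' ], [ 'c', 'd' ] );
my @seen;

FILE:
foreach my $lines (@files) {
    foreach my $line (@$lines) {
        # Abandon the rest of this "file" and continue with the next one.
        next FILE if $line eq 'SKIP';
        push @seen, $line;
    }
}

print "@seen\n";    # prints "a c d"
```

Without the label, a plain `next` only affects the innermost loop, so 'b' would still be collected.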
