0
好的,这对我自己的学习来说比实际需要更多。从一批文件中删除项目的更优雅的解决方案?
我有以下格式文件:
Loading parser from serialized file ./englishPCFG.ser.gz ... done [2.8 sec].
Parsing file: chpt1_1.txt
Parsing [sent. 1 len. 42]: [1.1, Organisms, Have, Changed, over, Billions, of, Years, 1, Long, before, the, mechanisms, of, biological, evolution, were, understood, ,, some, people, realized, that, organisms, had, changed, over, time, and, that, living, organisms, had, evolved, from, organisms, no, longer, alive, on, Earth, .]
(ROOT
(S
(S
(NP (CD 1.1) (NNS Organisms))
(VP (VBP Have)
(VP (VBN Changed)
(PP (IN over)
(NP
(NP (NNS Billions))
(PP (IN of)
(NP (NNP Years) (CD 1)))))
(SBAR
(ADVP (RB Long))
(IN before)
(S
(NP
(NP (DT the) (NNS mechanisms))
(PP (IN of)
(NP (JJ biological) (NN evolution))))
(VP (VBD were)
(VP (VBN understood))))))))
(, ,)
(NP (DT some) (NNS people))
(VP (VBD realized)
(SBAR
(SBAR (IN that)
(S
(NP (NNS organisms))
(VP (VBD had)
(VP (VBN changed)
(PP (IN over)
(NP (NN time)))))))
(CC and)
(SBAR (IN that)
(S
(NP (NN living) (NNS organisms))
(VP (VBD had)
(VP (VBN evolved)
(PP (IN from)
(NP
(NP (NNS organisms))
(ADJP
(ADVP (RB no) (RBR longer))
(JJ alive))))
(PP (IN on)
(NP (NNP Earth)))))))))
(. .)))
num(Organisms-2, 1.1-1)
nsubj(Changed-4, Organisms-2)
aux(Changed-4, Have-3)
ccomp(realized-22, Changed-4)
prep_over(Changed-4, Billions-6)
prep_of(Billions-6, Years-8)
num(Years-8, 1-9)
advmod(understood-18, Long-10)
dep(understood-18, before-11)
det(mechanisms-13, the-12)
nsubjpass(understood-18, mechanisms-13)
amod(evolution-16, biological-15)
prep_of(mechanisms-13, evolution-16)
auxpass(understood-18, were-17)
ccomp(Changed-4, understood-18)
det(people-21, some-20)
我需要删除所有不属于重要的依存关系(最后一节)。然后保存新文件。这是我的工作代码:
#!usr/bin/perl
use strict;
use warnings;
##Call with *.txt on command line
##EDIT TO ONLY FIND FILES YOU WANT CHANGED:
my @files = glob("parsed"."*.txt");
foreach my $file (@files) {
my @newfile;
open(my $parse_corpus, '<', "$file") or die $!;
while (my $sentences = <$parse_corpus>) {
#print $sentences, "\n\n";
if ($sentences =~ /(\w+)\(\S+\-\d+\,\s\S+\-\d+\)/) {
if ($sentences =~ /subj\w*\(|obj\w*\(|prep\w*\(|xcomp\w*\(|agent\w*\(|purpcl\w*\(|conj_and\w*\(/) {
push (@newfile, $sentences);
}
}
else {
push (@newfile, $sentences);
}
}
open(FILE ,'>', "select$file");
print FILE @newfile;
close FILE
}
而改变的输出文件的一部分:
nsubj(Changed-4, Organisms-2)
prep_over(Changed-4, Billions-6)
prep_of(Billions-6, Years-8)
nsubjpass(understood-18, mechanisms-13)
prep_of(mechanisms-13, evolution-16)
nsubj(realized-22, people-21)
nsubj(changed-26, organisms-24)
prep_over(changed-26, time-28)
nsubj(evolved-34, organisms-32)
conj_and(changed-26, evolved-34)
prep_from(evolved-34, organisms-36)
prep_on(evolved-34, Earth-41)
是否有显著更好的方法,或者用一个更优雅的/聪明的解决方案?
感谢您的时间,这又纯粹是为了您的兴趣,所以如果您没有时间,请不要帮忙。
干得好!我从中学到了,特别是outfile的效率部分。我喜欢“除非”更改,以及“autodie”。谢谢! p.s.将“my $ newfile”更改为“my $ outfile”以获得完美的代码。 – Jon
有没有办法做一个“双下一个”,所以你跳过(内部和外部)循环迭代(我不知道为什么,只是想知道) – Jon
你可以标记你的循环,然后'标签下'。但是在这个代码的情况下,它会跳转到下一个文件。请参阅perlsyn文档。 – DavidO