Extract unique blocks of lines from a file using a shell script

I am facing a few problems while extracting blocks of lines from a file. Consider the following two files

File-1 
1.20/abc/this_is_test_1 
perl/RRP/RRP-1.30/JEDI/JEDIExportSuccess2 
exec perl/RRP/RRP-1.30/JEDI/CommonReq/confAbvExp 
perl/LRP/BaseLibs/close-MMM 
exec perl/LRP/BaseLibs/launchLRPCHURRTA("TYRE") 
this/or/that 

File-2 
exec 1.20/setup/testird 
exec 1.20/sql/temp/Test3 
exec 1.20/setup/testxyz 
exec 1.20/sql/fondle_opr_sql_labels 
exec 1.20/setup/testird 
exec 1.20/sql/temp/NEWTest 
exec 1.20/setup/testxyz 
exec 1.20/sql/fondle_opr_sql_xfer 
exec 1.20/setup/testird 
exec 1.20/sql/set_sec_not_0 
exec 1.20/setup/testpqr 
exec 1.20/sql/sql_ba_statuses_on_mult 
exec perl/RRP/SetupReq/testdef_ijk 
exec perl/RRP/RRP-1.30/JEDI/SetupReq/confAbvExp 
exec perl/RRP/RRP-1.30/JEDI/JEDIExportSuccess1 
exec perl/RRP/SetupReq/testdef_ijk 
exec perl/RRP/RRP-1.30/JEDI/SetupReq/confAbvExp 
exec perl/RRP/RRP-1.30/JEDI/JEDIExportSuccess2 
exec perl/RRP/SetupReq/testdef_ijk 
exec perl/RRP/RRP-1.30/JEDI/SetupReq/confAbvExp 
exec perl/RRP/RRP-1.30/JEDI/JEDIExportSuccess3 
exec 1.20/setup/testird 
exec 1.20/sql/sqlmenu_purr_labl 
exec 1.20/sql/est_time_at_non_drp_plc 
exec 1.20/sql/half_Brd_Supply_mix_single 
exec 1.20/setup/testird 
exec 1.20/sql/temp/Test 
exec 1.20/setup/testird 
exec 1.20/sql/temp/Test2 
exec perl/LRP/SetupReq/testird_LRP("LRP") 
exec perl/BaseLibs/launch_client("LRP") 
exec perl/LRP/LRP-classic-4.14/churrip/chorSingle 
exec perl/LRP/BaseLibs/setupLRPMMMTab 
exec perl/LRP/BaseLibs/launchMMM 
exec perl/LRP/BaseLibs/launchLRPCHURRTA("TYRE") 
#PAUSE Expand Churrip tree view & open all nodes 
exec perl/LRP/LRP-classic-4.14/Corrugator/multipleSeriesWeb 
exec perl/BaseLibs/ShutApp("Self Destruction System") 
exec perl/LRP/BaseLibs/close-MMM 
exec 1.20/setup/testmiddle 
exec 1.20/sql/collective_reads 
exec 1.20/setup/testinit 
exec 1.20/abc/this_is_test_1 
exec 1.20/abc/this_is_test_1 
exec perl/LRP/SetupReq/abcDEF 
exec perl/BaseLibs/launch_client("sqlC","LRP") 
exec perl/LRP/LRP-perl-4.20/fireTrigger

Now, for each line in File-1, I want to extract the related block of lines from File-2. A block in File-2 is defined as below

exec 1.20/setup/xxxxx 
blah blah blah 
blah blah blah 
. 
. 
. 
all lines till next setup line is found

For example

exec 1.20/setup/testinit 
exec 1.20/abc/this_is_test_1 
exec 1.20/abc/this_is_test_1

or

exec perl/LRP/SetupReq/xxxxx 
blah blah blah 
blah blah blah 
. 
. 
. 
all lines till next setup line is found

For example

exec perl/LRP/SetupReq/testird_LRP("LRP") 
exec perl/BaseLibs/launch_client("LRP") 
exec perl/LRP/LRP-classic-4.14/churrip/chorSingle 
exec perl/LRP/BaseLibs/setupLRPMMMTab 
exec perl/LRP/BaseLibs/launchMMM 
exec perl/LRP/BaseLibs/launchLRPCHURRTA("TYRE") 
#PAUSE Expand Churrip tree view & open all nodes 
exec perl/LRP/LRP-classic-4.14/Corrugator/multipleSeriesWeb 
exec perl/BaseLibs/ShutApp("Self Destruction System") 
exec perl/LRP/BaseLibs/close-MMM
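
As a rough sketch of the block definition above (not part of the original script), the boundaries can be made visible by printing a blank line before every line that starts a new block. The two patterns are borrowed from the awk call in the script further down, with [A-Za-z] standing in for [[:alpha:]]; the function name split_blocks is only illustrative.

```shell
# Sketch: print a batch file with a blank line between blocks.
# split_blocks is a hypothetical helper name, not something from the thread.
split_blocks() {
    awk '
        # A block starts at a ".../setup/test..." line or at a
        # "perl/NAME/SetupReq/..." line.
        /[Ss]etup.*[Tt]est/ || /perl\/[A-Za-z]*\/[Ss]etup[Rr]eq/ {
            if (seen_block) print ""   # separate from the previous block
            seen_block = 1
        }
        { print }
    ' "$1"
}
```

Calling `split_blocks File-2` should make it easy to check by eye whether the patterns cut the file where expected before any matching is attempted.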

So far I have managed to extract the relevant blocks from File-2 with the following script

Shell Script 
#set -x 
FLBATCHLIST=$1 
BATCHFILE=$2 

TEMPDIR="/usr/tmp/tempBatchDir" 
rm -rf $TEMPDIR/* 

WORKFILE="$TEMPDIR/failedTestList.txt" 
CPBATCHFILE="$TEMPDIR/orig.test" 
TESTSETFILE="$TEMPDIR/testset.txt" 
TEMPFILE="$TEMPDIR/temp.txt" 
DIFFFILE="$TEMPDIR/diff.txt" 

#Output 
FAILEDBATCH="$TEMPDIR/FailedBatch.test" 
LOGFILE="$TEMPDIR/log.txt" 

createBatch() 
{ 

TESTNAME=$1 
#First process the $CPBATCHFILE to not have any blank lines, leading and trailing whitespaces 
# delete BOTH leading and trailing whitespace from each line and blank lines from file 
sed -i 's/^[[:space:]]*//;s/[[:space:]]*$//g;/^$/d' $CPBATCHFILE 
FOUND=0 
STATUS=1 
while [ $STATUS -ne "0" ] 
do 
     if [ ! -s $CPBATCHFILE ]; then 
       echo "$CPBATCHFILE is empty" >> $LOGFILE 
       STATUS=0 
     fi 
     awk '/[Ss]etup.*[Tt]est/ || /perl\/[[:alpha:]]*\/[Ss]etup[rR]eq/{if(b) exit; else b=1}1' $CPBATCHFILE > $TESTSETFILE 
     grep -i "$TESTNAME$" $TESTSETFILE >> $LOGFILE 2>&1 
     if [ $? -eq "0" ]; then 
       echo "test found" >> $LOGFILE 
       cat $TESTSETFILE >> $FAILEDBATCH 
       FOUND=1 
     fi 
     TSTFLLINES=`wc -l < $TESTSETFILE` 
     CPBTCHLINES=`wc -l < $CPBATCHFILE` 
     DIFF=`expr $CPBTCHLINES - $TSTFLLINES` 
     tail -n $DIFF $CPBATCHFILE > $DIFFFILE 
     mv $DIFFFILE $CPBATCHFILE 
done 

if [ $FOUND -eq 0 ]; then 
     echo $TESTNAME > $TEMPDIR/test.txt 
     ABSTEST=$(echo $TESTNAME | sed 's/\\//g') 
     echo "FATAL ERROR: Test \"$ABSTEST\" not found in batch" | tee -a $LOGFILE 
fi 

} 

####STARTS HERE#### 
mkdir -p $TEMPDIR 
#cat $TEMPDIR/test.txt 
#FLBATCHLIST="$TEMPDIR/test.txt" 
# delete run, BOTH leading and trailing whitespace and blank lines from file 
sed 's/^[eE][xX][eE][cC]//g;s/^[[:space:]]*//;s/[[:space:]]*$//g;/^$/d' $FLBATCHLIST > $WORKFILE 

# escaping special characters like '\' and '.' in the path names for better grepping 
sed -i 's/\([\/\.\"]\)/\\\1/g' $WORKFILE 

for fltest in $(cat $WORKFILE) 
do 
     echo $fltest >> $LOGFILE 
     cp $BATCHFILE $CPBATCHFILE 
     createBatch $fltest 
done 

sed -i 's/\//\\/g' $FAILEDBATCH 
## Clean up 
cp $FAILEDBATCH .

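The problems I need help with in this script are

It takes some time, because it traverses File-2 once for each line of File-1. I would like to know whether there is a better solution in which I only need to traverse File-2 once.
The script does solve my problem, but I am left with a file that contains duplicate blocks of lines. I would like to know whether there is a way to remove the duplicate blocks.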

This is my output when I execute the script

exec 1.20\setup\testinit 
exec 1.20\abc\this_is_test_1 
exec 1.20\abc\this_is_test_1 
exec perl\RRP\SetupReq\testdef_ijk 
exec perl\RRP\RRP-1.30\JEDI\SetupReq\confAbvExp 
exec perl\RRP\RRP-1.30\JEDI\JEDIExportSuccess2 
exec perl\RRP\SetupReq\testdef_ijk 
exec perl\RRP\RRP-1.30\JEDI\SetupReq\confAbvExp 
exec perl\RRP\RRP-1.30\JEDI\JEDIExportSuccess1 
exec perl\RRP\SetupReq\testdef_ijk 
exec perl\RRP\RRP-1.30\JEDI\SetupReq\confAbvExp 
exec perl\RRP\RRP-1.30\JEDI\JEDIExportSuccess2 
exec perl\RRP\SetupReq\testdef_ijk 
exec perl\RRP\RRP-1.30\JEDI\SetupReq\confAbvExp 
exec perl\RRP\RRP-1.30\JEDI\JEDIExportSuccess3 
exec perl\LRP\SetupReq\testird_LRP("LRP") 
exec perl\BaseLibs\launch_client("LRP") 
exec perl\LRP\LRP-classic-4.14\churrip\chorSingle 
exec perl\LRP\BaseLibs\setupLRPMMMTab 
exec perl\LRP\BaseLibs\launchMMM 
exec perl\LRP\BaseLibs\launchLRPCHURRTA("TYRE") 
#PAUSE Expand Churrip tree view & open all nodes 
exec perl\LRP\LRP-classic-4.14\Corrugator\multipleSeriesWeb 
exec perl\BaseLibs\ShutApp("Self Destruction System") 
exec perl\LRP\BaseLibs\close-MMM 
exec perl\LRP\SetupReq\testird_LRP("LRP") 
exec perl\BaseLibs\launch_client("LRP") 
exec perl\LRP\LRP-classic-4.14\churrip\chorSingle 
exec perl\LRP\BaseLibs\setupLRPMMMTab 
exec perl\LRP\BaseLibs\launchMMM 
exec perl\LRP\BaseLibs\launchLRPCHURRTA("TYRE") 
#PAUSE Expand Churrip tree view & open all nodes 
exec perl\LRP\LRP-classic-4.14\Corrugator\multipleSeriesWeb 
exec perl\BaseLibs\ShutApp("Self Destruction System") 
exec perl\LRP\BaseLibs\close-MMM

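I tried searching the net for an answer, but could not find one specific to my needs.

Given File-1 and File-2, this is what I want my script to output (I list the expected output for each line of FILE-1)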

For line "1.20/abc/this_is_test_1" in FILE-1 
Output 
exec 1.20/setup/testinit 
exec 1.20/abc/this_is_test_1 
exec 1.20/abc/this_is_test_1 

For line "perl/RRP/RRP-1.30/JEDI/JEDIExportSuccess2" in FILE-1 
Output 
exec perl/RRP/SetupReq/testdef_ijk 
exec perl/RRP/RRP-1.30/JEDI/SetupReq/confAbvExp 
exec perl/RRP/RRP-1.30/JEDI/JEDIExportSuccess2 

For line "exec perl/RRP/RRP-1.30/JEDI/CommonReq/confAbvExp" in FILE-1 
Output 
Do nothing, as there is no line matching this in FILE-2 

For line "perl/LRP/BaseLibs/close-MMM" in FILE-1 
Output 
exec perl/LRP/SetupReq/testird_LRP("LRP") 
exec perl/BaseLibs/launch_client("LRP") 
exec perl/LRP/LRP-classic-4.14/churrip/chorSingle 
exec perl/LRP/BaseLibs/setupLRPMMMTab 
exec perl/LRP/BaseLibs/launchMMM 
exec perl/LRP/BaseLibs/launchLRPCHURRTA("TYRE") 
#PAUSE Expand Churrip tree view & open all nodes 
exec perl/LRP/LRP-classic-4.14/Corrugator/multipleSeriesWeb 
exec perl/BaseLibs/ShutApp("Self Destruction System") 
exec perl/LRP/BaseLibs/close-MMM  

For line "exec perl/LRP/BaseLibs/launchLRPCHURRTA("TYRE")" in FILE-1 
Output 
Do nothing, as it would generate the same block as line "perl/LRP/BaseLibs/close-MMM" in FILE-1 did 

For Line "this/or/that" in FILE-1 
Output 
Do nothing, as there is no line matching this in FILE-2

So my final output should look like

exec 1.20/setup/testinit 
exec 1.20/abc/this_is_test_1 
exec 1.20/abc/this_is_test_1 

exec perl/RRP/SetupReq/testdef_ijk 
exec perl/RRP/RRP-1.30/JEDI/SetupReq/confAbvExp 
exec perl/RRP/RRP-1.30/JEDI/JEDIExportSuccess2 

exec perl/LRP/SetupReq/testird_LRP("LRP") 
exec perl/BaseLibs/launch_client("LRP") 
exec perl/LRP/LRP-classic-4.14/churrip/chorSingle 
exec perl/LRP/BaseLibs/setupLRPMMMTab 
exec perl/LRP/BaseLibs/launchMMM 
exec perl/LRP/BaseLibs/launchLRPCHURRTA("TYRE") 
#PAUSE Expand Churrip tree view & open all nodes 
exec perl/LRP/LRP-classic-4.14/Corrugator/multipleSeriesWeb 
exec perl/BaseLibs/ShutApp("Self Destruction System") 
exec perl/LRP/BaseLibs/close-MMM

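If anyone could give me some pointers on how to proceed, that would be great. And yes, I forgot to mention, this is not a homework question :-).

Thanks a lot

Source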

2012-12-04 aazim

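Almost a great question. Given your file1 and file2, please consider editing to include a sample output block. Good luck. – shellter

What does 'blah blah blah' stand for? Do you always need three lines after each match? Anyway, you could generate a 'sed' script from your 'file-1', and then you would only need to run it once over the big input file. – tripleee

@tripleee I have added information to the question about what blah blah blah is. There can be any number of lines between the lines containing the setup keyword. Could you also give some more insight into what you are suggesting with the sed script? I don't think I understood it correctly. Thanks a lot – aazim

Thanks @tripleee and @Jarmund for your suggestions. From your input I was finally able to figure out a solution to my problem. I took the hint from the associative array to make a unique key for each block, so here is what I did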

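Take File-2 and convert each block into a single line

awk '/[Ss]etup.*[Tt]est/ || /perl\/[[:alpha:]]*\/[Ss]etup[Rr]eq/{if(b) exit; else b=1}1' file-2 > $TESTSETFILE
cat $TESTSETFILE | sed ':a;N;$!ba;s/\n//g;s/ //g' >> $SINGLELINEFILE

Now each line in this file is a unique entry
After this I grep for each line of File-1 and find the respective block (which has been converted to a single line)
Then I use awk or sort -u to find the unique entries in my solution file

Maybe this solution is not the best, but it is faster than the previous one.

Here is my new script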

FLBATCHLIST=$1 
BATCHFILE=$2 

TEMPDIR="./tempBatchdir" 
rm -rf $TEMPDIR/* 
WORKFILE="$TEMPDIR/failedTestList.txt" 
CPBATCHFILE="$TEMPDIR/orig.test" 
TESTSETFILE="$TEMPDIR/testset.txt" 
DIFFFILE="$TEMPDIR/diff.txt" 
SINGLELINEFILE="$TEMPDIR/singleline.txt" 
TEMPFILE="$TEMPDIR/temp.txt" 
#Output 
FAILEDBATCH="$TEMPDIR/FailedBatch.test" 
LOGFILE="$TEMPDIR/log.txt" 

convertSingleLine() 
{ 
sed -i 's/^[[:space:]]*//;s/[[:space:]]*$//g;/^$/d' $CPBATCHFILE 
STATUS=1 
while [ $STATUS -ne "0" ] 
do 
     if [ ! -s $CPBATCHFILE ]; then 
       echo "$CPBATCHFILE is empty" >> $LOGFILE 
       STATUS=0 
     fi 
     awk '/[Ss]etup.*[Tt]est/ || /perl\/[[:alpha:]]*\/[Ss]etup[Rr]eq/{if(b) exit; else b=1}1' $CPBATCHFILE > $TESTSETFILE 
     cat $TESTSETFILE | sed ':a;N;$!ba;s/\n//g;s/ //g' >> $SINGLELINEFILE 
     echo "**" >> $SINGLELINEFILE 
     TSTFLLINES=`wc -l < $TESTSETFILE` 
     CPBTCHLINES=`wc -l < $CPBATCHFILE` 
     DIFF=`expr $CPBTCHLINES - $TSTFLLINES` 
     tail -n $DIFF $CPBATCHFILE > $DIFFFILE 
     mv $DIFFFILE $CPBATCHFILE 
done 
} 

####STARTS HERE#### 
mkdir -p $TEMPDIR 

sed 's/^[eE][xX][eE][cC]//g;s/^[[:space:]]*//;s/[[:space:]]*$//g;/^$/d' $FLBATCHLIST > $WORKFILE 
sed -i 's/\([\/\.\"]\)/\\\1/g' $WORKFILE 

cp $BATCHFILE $CPBATCHFILE 
convertSingleLine 

for fltest in $(cat $WORKFILE) 
do 
     echo $fltest >> $LOGFILE 
     grep "$fltest" $SINGLELINEFILE >> $FAILEDBATCH 
     if [ $? -eq "0" ]; then 
       echo "TEST FOUND" >> $LOGFILE 
     else 
       ABSTEST=$(echo $fltest | sed 's/\\//g') 
       echo "FATAL ERROR: Test \"$ABSTEST\" not found in $BATCHFILE" | tee -a $LOGFILE 
     fi 
done 

awk '!x[$0]++' $FAILEDBATCH > $TEMPFILE 
mv $TEMPFILE $FAILEDBATCH 

sed -i "s/exec/\\nexec /g;s/#/\\n#/g" $FAILEDBATCH 
sed -i '1d;s/\//\\/g' $FAILEDBATCH

Here is the output

$ crflbatch file-1 file-2 
FATAL ERROR: Test "perl/RRP/RRP-1.30/JEDI/CommonReq/confAbvExp" not found in file-2 
FATAL ERROR: Test "this/or/that" not found in file-2 

$ cat tempBatchdir/FailedBatch.test 
exec 1.20\setup\testinit 
exec 1.20\abc\this_is_test_1 
exec 1.20\abc\this_is_test_1 

exec perl\RRP\SetupReq\testdef_ijk 
exec perl\RRP\RRP-1.30\JEDI\SetupReq\confAbvExp 
exec perl\RRP\RRP-1.30\JEDI\JEDIExportSuccess2 

exec perl\LRP\SetupReq\testird_LRP("LRP") 
exec perl\BaseLibs\launch_client("LRP") 
exec perl\LRP\LRP-classic-4.14\churrip\chorSingle 
exec perl\LRP\BaseLibs\setupLRPMMMTab 
exec perl\LRP\BaseLibs\launchMMM 
exec perl\LRP\BaseLibs\launchLRPCHURRTA("TYRE") 
#PAUSEExpandChurriptreeview&openallnodes 
exec perl\LRP\LRP-classic-4.14\Corrugator\multipleSeriesWeb 
exec perl\BaseLibs\ShutApp("SelfDestructionSystem") 
exec perl\LRP\BaseLibs\close-MMM 
$
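
One visible side effect in the output above is that the #PAUSE line and the "Self Destruction System" argument come back without their spaces, because the flattening step uses s/ //g. A possible alternative, sketched here under the assumption that the control character \001 never occurs in the batch file, is to join the lines of a block with that character and split on it again afterwards. The function names flatten_blocks and unflatten are made up for this example.

```shell
# Join every block of a batch file onto one line, with \001 between the
# original lines (assumption: \001 never occurs in the data).
flatten_blocks() {
    awk '
        BEGIN { sep = "\001" }
        # A setup line closes the block collected so far.
        /[Ss]etup.*[Tt]est/ || /perl\/[A-Za-z]*\/[Ss]etup[Rr]eq/ {
            if (buf != "") print buf
            buf = ""
        }
        { buf = (buf == "" ? $0 : buf sep $0) }
        END { if (buf != "") print buf }
    ' "$1"
}

# Split flattened blocks back into lines, one blank line after each block.
unflatten() {
    awk '
        BEGIN { sep = "\001" }
        {
            n = split($0, part, sep)
            for (i = 1; i <= n; i++) print part[i]
            print ""
        }
    '
}
```

With these two helpers, something like `flatten_blocks file-2 | grep -F 'perl/LRP/BaseLibs/close-MMM' | awk '!x[$0]++' | unflatten` would keep the spaces intact, and grep -F treats the test name literally, so the backslash-escaping pass would no longer be needed.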

Source

2012-12-11 03:57:38 aazim

As long as the order of the lines does not matter, you can remove duplicates from a file this way, from the command prompt:

sort filename | uniq
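
If the order does matter, the awk idiom that the asker's later script already uses (awk '!x[$0]++') keeps the first occurrence of every line and leaves the order alone. A minimal sketch, with dedupe_keep_order as an illustrative name:

```shell
# Print each distinct line once, in order of first appearance.
# Reads the files given as arguments, or standard input if there are none.
dedupe_keep_order() {
    awk '!seen[$0]++' "$@"
}
```

This only removes duplicate lines, so it helps with duplicate blocks only once every block has been collapsed onto a single line.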

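To find out which lines are present in both files, I used a perl script that builds a hash (or associative array, if you prefer). I scanned through file A and added each line to the hash, using the line as the key and setting the value to 1. Then I did the same for file B, but setting the value to 2, and if the key already existed I added 2 to it. The result is that each file is traversed only once, and at the end I know that a key with value 1 exists only in file A, a key with value 2 exists only in file B, and a key with value 3 exists in both.

Edit: I found some perl code lying around from a project that does exactly what I described above. In this code I am only after the differences, but it should be easy to modify to your needs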

my %found; 
foreach my $item (@qlist) { $found{$item} += 2 }; 
foreach my $item (@xlist) { $found{$item} += 1 }; 

foreach my $found (keys(%found)) 
{ 
    if ($found{$found} == 3) 
    { 
    # It's in both files. Not doing anything. 
    } 
    elsif ($found{$found} == 2) 
    { 
    print "$found found in the QC-list, but not the x-list.\n"; 
    } 
    elsif ($found{$found} == 1) 
    { 
    print "$found found in the x-list, but not the QC-list.\n"; 
    } 
}
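
Since the asker would rather stay in shell, roughly the same tagging idea can be sketched in awk. This is an adaptation, not the code from the project mentioned above: it assumes the first file is not empty (otherwise NR==FNR also holds for the second file), and it sets the tags explicitly instead of adding to them, so that a line repeated within one file is not counted twice. The name compare_files is hypothetical.

```shell
# Tag every distinct line: 1 = first file only, 2 = second file only,
# 3 = present in both. Each file is read exactly once.
compare_files() {
    awk '
        NR == FNR { tag[$0] = 1; next }          # first file
        {                                        # second file
            if (!($0 in tag)) tag[$0] = 2
            else if (tag[$0] == 1) tag[$0] = 3
        }
        END {
            # Iteration order of "for (x in array)" is unspecified in awk.
            for (line in tag) {
                if (tag[line] == 1) print "only in first: " line
                else if (tag[line] == 2) print "only in second: " line
            }
        }
    ' "$1" "$2"
}
```

Because awk does not guarantee any order when looping over an array, pipe the result through sort if a stable listing is needed.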

Source

2012-12-04 22:36:58 Jarmund

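Or even 'sort -u asdf' ... – twalberg

Thanks Jarmund for the immediate response. It would have been quite easy if I did not have to care about the order of the lines. That is what makes it complicated; I cannot use sort to help me here. I know a bit of perl, but I would prefer to first see whether there is a solution in shell script. Thanks – aazim

The following assumes that the "setup" line is unique for each block. We use this line as the key into an associative array which keeps track of which blocks we have already printed.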

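The first line of the script reads the first file into a variable named regex, which collects the lines we want to match (the idiom NR==FNR means that the line number within the current file equals the overall line number across all files read so far, which is only true while we are reading the first file in the argument list). The rest of the script is fairly straightforward, I hope.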

awk 'NR==FNR { gsub(/\//,"\\/"); regex = regex sep $0; sep = "|" ; next} 
    /[Ss]etup/ { label = $0; printing = 0; collected = nl = "" } 
    { collected = collected nl $0; nl=RS } 
    $0 ~ regex { if(!printed[label]) { 
     printed[label] = printing = 1; print collected } } 
    printing { print }' File-1 File-2

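If the "setup" lines are not necessarily unique, maybe you could use the collected value as the key instead.

This should (I hope) cope with multiple lines from File-1 matching the same block in File-2.

I know I hinted at a sed solution in the comments, but this turned out to be the kind of problem where awk feels more natural. Of course, it could also be done in Perl or Python, or what have you.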
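
A variant of the same single-pass idea is sketched below. It is not the script from this answer, and it changes three things: the names from File-1 are compared literally against the end of each line (mirroring the grep "$TESTNAME$" in the question) instead of being joined into one regular expression, a block is only printed once it is complete, and the whole block text is the de-duplication key, as suggested above for non-unique setup lines. The name extract_blocks is hypothetical, File-1 is assumed to be non-empty, and the blocks come out in File-2 order rather than File-1 order.

```shell
# Usage: extract_blocks File-1 File-2
extract_blocks() {
    awk '
        # Print the buffered block if it matched and was not printed before.
        function emit() {
            if (wanted && !(block in printed)) {
                if (nprinted++) print ""      # blank line between blocks
                print block
                printed[block] = 1
            }
            block = ""; wanted = 0
        }
        # First file: store each name without "exec" and outer whitespace.
        NR == FNR {
            sub(/^[ \t]*([eE][xX][eE][cC][ \t]+)?/, "")
            sub(/[ \t]+$/, "")
            if ($0 != "") names[$0] = 1
            next
        }
        # Second file: a setup line ends the previous block.
        /[Ss]etup.*[Tt]est/ || /perl\/[A-Za-z]*\/[Ss]etup[Rr]eq/ { emit() }
        {
            line = $0
            sub(/^[ \t]+/, "", line); sub(/[ \t]+$/, "", line)
            if (line == "") next
            block = (block == "" ? line : block "\n" line)
            # Literal suffix comparison: ".", "(" and "/" need no escaping.
            for (n in names)
                if (length(line) >= length(n) &&
                    substr(line, length(line) - length(n) + 1) == n)
                    wanted = 1
        }
        END { emit() }
    ' "$1" "$2"
}
```

Reporting the names that were never found would need one more array that records which entries of names matched; that is left out here to keep the sketch short.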

Source

2012-12-06 12:36:43 tripleee

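Thanks tripleee. It did not really solve my problem, but it gave me a valuable suggestion on how I could proceed :-). Unfortunately, as I do not have enough reputation, I cannot upvote your answer. – aazim

I did a quick run of the awk solution and got the unwanted line "exec perl/RRP/RRP-1.30/JEDI/JEDIExportSuccess1" in the output. I am not very comfortable with awk, so it will take me quite a while to understand what you did above :-). Thanks a lot – aazim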
