2014-09-27 57 views

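Output the left or right part of a matched string

I have two files; file1 contains substrings of the strings in file2. I want to match file1 against file2 and output the part to the left of the match, but not the match itself. I would also like to know how to output the part to the right of the match, again without the match itself. Here is part of my data (these particular strings may not actually match; they are only sample data):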

File 1:

ACUGUACAGGCCACUGCCUUGC 
CUGCGCAAGCUACUGCCUUGCU 
UGGAAUGUAAAGAAGUAUGUAU 
CGAAUCAUUAUUUGCUGCUCUA 
AUCACAUUGCCAGGGAUUACC 
UUCACAGUGGCUAAGUUCUGC 

File 2:

CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG 
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG 
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC 
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG 
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC 

For example:

File 1:

            GCUGUGGAGAUAACUGCGC 

File 2:

CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGC 

Output:

CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCC 

Answers

1

Here are a couple of ways in R to keep only the text that comes before the match, if the match exists:

a <- "GCUGUGGAGAUAACUGCGC" 
b <- "CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGC" 

strsplit(b, a)[[1]][1] 
sub(paste0(a, ".*$"), "", b) 
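
For comparison, here is a minimal Python sketch of the same idea that also returns the right-hand side, which the question asks about. It assumes the pattern should be matched literally rather than as a regular expression, and the helper name `split_around` is made up for illustration. Note that both R one-liners above return `b` unchanged when `a` does not occur in it, whereas this sketch returns `None` in that case so a miss can be told apart from a hit.

```python
def split_around(pattern, sequence):
    """Return (left, right) around the first literal occurrence of pattern.

    Returns None when pattern does not occur in sequence.
    """
    # str.partition splits at the first occurrence; the middle element
    # is the empty string when the separator was not found.
    left, found, right = sequence.partition(pattern)
    if not found:
        return None
    return left, right


a = "GCUGUGGAGAUAACUGCGC"
b = "CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGC"

result = split_around(a, b)
if result is not None:
    left, right = result
    print("left: ", left)
    print("right:", right)
```

With the example strings from the question, the left part is the expected output given there, and the right part is the short tail that follows the match.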

Now you just need to read the files into R and loop over each pattern. I'm not entirely sure what you're looking for, but here is one idea:

# read data into 2 variables, a and b
# you could use readLines() to read from disk
a <- readLines(textConnection("ACUGUACAGGCCACUGCCUUGC
CUGCGCAAGCUACUGCCUUGCU
UGGAAUGUAAAGAAGUAUGUAU
CGAAUCAUUAUUUGCUGCUCUA
AUCACAUUGCCAGGGAUUACC
UUCACAGUGGCUAAGUUCUGC"))

b <- readLines(textConnection("CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC"))

Now loop over each value from the first file:

lapply(a, function(x) sapply(strsplit(b, x), "[", 1)) 
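
The loop above returns the full string from `b` wherever a pattern does not match, so hits and misses look alike in the result. As a rough sketch of one alternative, assuming literal matching and that every pattern should be tried against every sequence, the Python below keeps only real hits and records both the left and the right flank. The function name `flanks` and the variable names are hypothetical.

```python
file1_text = """ACUGUACAGGCCACUGCCUUGC
CUGCGCAAGCUACUGCCUUGCU
UGGAAUGUAAAGAAGUAUGUAU
CGAAUCAUUAUUUGCUGCUCUA
AUCACAUUGCCAGGGAUUACC
UUCACAGUGGCUAAGUUCUGC"""

file2_text = """CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC"""


def clean_lines(text):
    """Split text into lines, dropping surrounding whitespace and empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def flanks(patterns, sequences):
    """Return (pattern, left, right) for every pattern found in any sequence."""
    hits = []
    for pattern in patterns:
        for sequence in sequences:
            start = sequence.find(pattern)  # -1 when the pattern is absent
            if start != -1:
                left = sequence[:start]
                right = sequence[start + len(pattern):]
                hits.append((pattern, left, right))
    return hits


patterns = clean_lines(file1_text)
sequences = clean_lines(file2_text)

for pattern, left, right in flanks(patterns, sequences):
    print(pattern, left, right, sep="\t")
```

Stripping each line first matters here: a stray trailing space on a pattern would silently prevent it from matching.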

@GracieD: Every element of the output is identical. Try: ll = lapply(a, function(i) sapply(strsplit(b, a[i]), "[[", 1)); for (i in 2:length(ll)) print(identical(ll[[i]], ll[[i-1]])) – rnso 2014-09-28 02:04:31

@rnso Thanks. Updated. – GracieD 2014-09-28 04:06:04

1

Using filehandles opened on strings for testing:

use strict; 
use warnings; 
use autodie; 

open my $fh1, '<', \ "ACUGUACAGGCCACUGCCUUGC\nCUGCGCAAGCUACUGCCUUGCU\nUGGAAUGUAAAGAAGUAUGUAU\nCGAAUCAUUAUUUGCUGCUCUA\nAUCACAUUGCCAGGGAUUACC\nUUCACAGUGGCUAAGUUCUGC\n"; 
open my $fh2, '<', \ "CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG\nCUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG\nGCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC\nCUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG\nGGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC\n"; 

while (!eof $fh1 && !eof $fh2) { 
    chomp(my $line1 = <$fh1>); 
    chomp(my $line2 = <$fh2>); 

    print join(' ', split /$line1/, $line2, 2), "\n"; 
} 

Output:

CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUA CAGG
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAA AG 
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUA UUCAGGC 
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUG G 
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAA ACGCAACC 
1

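You can also try the Perl code below, which retrieves the text before the match, after the match, and the match itself using $PREMATCH ($`), $POSTMATCH ($'), and $MATCH ($&):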

Input files:

file1.txt:

ACUGUACAGGCCACUGCCUUGC 
CUGCGCAAGCUACUGCCUUGCU 
UGGAAUGUAAAGAAGUAUGUAU 
CGAAUCAUUAUUUGCUGCUCUA 
AUCACAUUGCCAGGGAUUACC 
UUCACAGUGGCUAAGUUCUGC 

file2.txt:

CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUAACUGUACAGGCCACUGCCUUGCCAGG 
CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAACUGCGCAAGCUACUGCCUUGCUAG 
GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUAUGGAAUGUAAAGAAGUAUGUAUUUCAGGC 
CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG 
GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC 

Code:

use strict;
use warnings;

open my $fh1, '<', "file1.txt" or die "Couldn't open the file file1.txt: $!";
open my $fh2, '<', "file2.txt" or die "Couldn't open the file file2.txt: $!";

# Read both files in parallel: line N of file1 is matched against line N of file2
while (!eof $fh1 && !eof $fh2)
{
    chomp(my $line1 = <$fh1>);
    chomp(my $line2 = <$fh2>);

    if ($line2 =~ /$line1/i)
    {
        print "Prematch: $`\t";    # text to the left of the match
        print "Postmatch: $'\n";   # text to the right of the match
    }
}
close($fh1);
close($fh2);

Output:

Prematch: CCAGGCUGAGGUAGUAGUUUGUACAGUUUGAGGGUCUAUGAUACCACCCGGUACAGGAGAUA Postmatch: CAGG 
Prematch: CUGGCUGAGGUAGUAGUUUGUGCUGUUGGUCGGGUUGUGACAUUGCCCGCUGUGGAGAUAA Postmatch: AG 
Prematch: GCUUGGGACACAUACUUCUUUAUAUGCCCAUAUGAACCUGCUAAGCUA Postmatch: UUCAGGC 
Prematch: CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUG Postmatch: G 
Prematch: GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAA Postmatch: ACGCAACC
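
Python's `re` module has no direct counterpart to `$`` and `$'`, but the same two pieces can be sliced out using the match object's `start()` and `end()` positions. The sketch below is only an illustration of that approach: it pairs lines by position as the Perl code does, matches case-insensitively to mirror the `/i` flag, and, as an added assumption, escapes the pattern with `re.escape` so it is always treated literally. The name `pre_post` is hypothetical, and only two of the sample pairs are included to keep it short.

```python
import re


def pre_post(pattern, sequence):
    """Return (prematch, postmatch) for the first match, or None if no match."""
    # re.escape makes every character in pattern literal (an assumption here;
    # the Perl version interpolates the line as a regular expression).
    match = re.search(re.escape(pattern), sequence, flags=re.IGNORECASE)
    if match is None:
        return None
    return sequence[:match.start()], sequence[match.end():]


file1_lines = [
    "CGAAUCAUUAUUUGCUGCUCUA",
    "AUCACAUUGCCAGGGAUUACC",
]
file2_lines = [
    "CUGUAGCAGCACAUCAUGGUUUACAUACUACAGUCAAGAUGCGAAUCAUUAUUUGCUGCUCUAG",
    "GGCUGCUUGGGUUCCUGGCAUGCUGAUUUGUGACUUGAGAUUAAAAUCACAUUGCCAGGGAUUACCACGCAACC",
]

# zip stops at the shorter list, like the eof checks in the Perl loop
for line1, line2 in zip(file1_lines, file2_lines):
    result = pre_post(line1, line2)
    if result is not None:
        prematch, postmatch = result
        print(f"Prematch: {prematch}\tPostmatch: {postmatch}")
```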