Most efficient way to split a long string into substrings in MATLAB

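I'm working on a MATLAB function that compares two gene sequences and determines how similar they are. To do this, I split both sequences into smaller substrings by walking over each one with a for loop, moving one nucleotide at a time, and appending each substring to a cell array.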

So, for example, with a substring length of 4, the string ATGCAAAT would not be split as

ATGC, AAAT

but rather as

ATGC, TGCA, GCAA, CAAA, AAAT

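I'm trying to make the function run faster, and since these two for loops account for almost 90% of the execution time, I'd like to know whether MATLAB offers a faster way to do this.

Here is the code I'm currently using: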

SubstrSequence1 = {};             
SubstrSequence2 = {}; 
for i = 1:length(Sequence1)-(SubstringLength-1)     
    SubstrSequence1 = [SubstrSequence1, Sequence1(i:i+SubstringLength-1)]; 
end 

for i = 1:length(Sequence2)-(SubstringLength-1)     
    SubstrSequence2 = [SubstrSequence2, Sequence2(i:i+SubstringLength-1)]; 
end

asked 2015-11-26 by dacm

How about this?

str = 'ATGCAAAT'; 
n = 4; 
strs = str(bsxfun(@plus, 1:n, (0:numel(str)-n).'));

The result is a 2D char array:

strs = 
ATGC 
TGCA 
GCAA 
CAAA 
AAAT

So the substrings are strs(1,:), strs(2,:), and so on.

If you want the result as a cell array of strings, add this at the end:

strs = cellstr(strs);

which produces

strs = 
    'ATGC' 
    'TGCA' 
    'GCAA' 
    'CAAA' 
    'AAAT'

and then the substrings are strs{1}, strs{2}, and so on.
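As a side note, a minimal sketch of the same idea using implicit expansion instead of bsxfun. This assumes MATLAB R2016b or later, where adding a row vector and a column vector expands automatically. The variable name idx is introduced here for illustration only.

```matlab
str = 'ATGCAAAT';
n = 4;

% Row vector 1:n plus column vector of offsets gives one row of indices per substring.
% Assumes R2016b or later (implicit expansion); on older releases use bsxfun as above.
idx = (1:n) + (0:numel(str)-n).';
strs = str(idx);   % 5-by-4 char array, one substring per row

% Sanity check against the bsxfun version
assert(isequal(strs, str(bsxfun(@plus, 1:n, (0:numel(str)-n).'))));
```

Whether this is any faster than bsxfun is not something the thread measures, so it should be read as a readability alternative rather than a speed claim.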

answered 2015-11-26 22:26:55

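Thank you very much, this works great. It's about 3 times faster than my loops, and slightly faster than the other suggestions here. – dacm

Glad it helped! –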

Here's one approach that uses hankel to get SubstrSequence1:

A = 1:numel(Sequence1); 
out = cellstr(Sequence1(hankel(A(1:SubstringLength),A(SubstringLength:end)).'))

You can follow the same steps to get SubstrSequence2.

Sample run:

>> Sequence1 = 'ATGCAAAT'; 
>> SubstringLength = 4; 
>> A = 1:numel(Sequence1); 
>> cellstr(Sequence1(hankel(A(1:SubstringLength),A(SubstringLength:end)).')) 
ans = 
    'ATGC' 
    'TGCA' 
    'GCAA' 
    'CAAA' 
    'AAAT'
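To make it easier to see why this works, here is a small sketch that looks at the index matrix on its own for the sample input. The intermediate names H and idx are introduced for illustration and are not part of the answer's code.

```matlab
Sequence1 = 'ATGCAAAT';
SubstringLength = 4;
A = 1:numel(Sequence1);

% hankel(c, r) builds a matrix whose first column is c and whose last row is r.
% Here c = 1:4 and r = 4:8, so each column holds the indices of one sliding window.
H = hankel(A(1:SubstringLength), A(SubstringLength:end));
% H =
%      1     2     3     4     5
%      2     3     4     5     6
%      3     4     5     6     7
%      4     5     6     7     8

idx = H.';                       % transpose so each row indexes one substring
out = cellstr(Sequence1(idx));   % 5-by-1 cell array of substrings
```

The transpose matters: indexing with H directly would place each substring in a column rather than a row.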

answered 2015-11-26 22:33:29 by Divakar

I started with 'hankel' too, but couldn't get it to work! –

@LuisMendo I started with bsxfun, but was too late! :) – Divakar

I'm sure you did :-P –

One approach is to generate a matrix of indices that extracts the substrings you want:

>> sequence = 'ATGCAAAT'; 
>> subSequenceLength = 4; 
>> numSubSequence = length(sequence) - subSequenceLength + 1; 
>> idx = repmat((1:numSubSequence)', 1, subSequenceLength) + repmat(0:subSequenceLength-1, numSubSequence, 1); 
>> result = sequence(idx) 

    result = 

     ATGC 
     TGCA 
     GCAA 
     CAAA 
     AAAT
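For anyone who wants to check the approaches on their own data, here is a rough sketch that confirms an index-matrix version returns the same substrings as a growing-cell loop, and then prints one timing for each. The 5000-character test sequence and all variable names are assumptions made for illustration. A single tic/toc measurement is noisy, so the printed numbers are only indicative; timeit would give steadier results.

```matlab
% Illustrative test input: 5000 characters built by repeating the sample sequence
seq = repmat('ATGCAAAT', 1, 625);
n = 4;                              % substring length
numSub = numel(seq) - n + 1;        % number of sliding windows

% Loop that grows a cell array one substring at a time, as in the question
tic;
viaLoop = {};
for i = 1:numSub
    viaLoop{end+1} = seq(i:i+n-1); %#ok<SAGROW>
end
tLoop = toc;

% Index-matrix version: build all indices at once, then index the string once
tic;
idx = bsxfun(@plus, 1:n, (0:numSub-1).');
viaIndex = cellstr(seq(idx)).';     % transpose to a row to match the loop output
tIndex = toc;

% Both approaches should yield identical substrings in the same order
assert(isequal(viaLoop, viaIndex));
fprintf('loop: %.4g s, index matrix: %.4g s\n', tLoop, tIndex);
```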

answered 2015-11-26 22:45:06
