Protein sequence pattern matching in Python

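I am working on a matching algorithm for protein sequences. I start with an aligned protein sequence, and I am trying to convert a misaligned sequence into the correctly aligned one. Here is an example: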

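Original aligned sequence: ----AB--C-D-----

Unaligned sequence: --A--BC---D-

The expected output should look like this:

Original aligned sequence: ----AB--C-D-----

Unaligned sequence: ----AB--C-D----- (both are now the same)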

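I have been told to be very specific about my problem, but the sequences I want to match are more than 4000 characters long and look ridiculous when pasted. Instead, I will post two sequences that represent my problem, and that should be enough.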

seq = "---A-A--AA---A--"
newseq = "AA---A--A-----A-----"
seq = list(seq)  # convert the master sequence from a string to a list
newseq = list(newseq)  # convert the new sequence from a string to a list
n = len(seq)  # length of the master sequence
newseq.extend('.')  # append a tag to the new sequence to account for terminal gaps

print(seq, newseq, n)  # check the sequences in list form and the length

for i in range(n):
    if seq[i] != newseq[i]:
        if seq[i] != '-':  # gap deletion
            del newseq[i]
        elif newseq[i] != '-':
            newseq.insert(i, '-')  # gap insertion
        elif newseq[i] == '-':
            del newseq[i]

old = ''.join(seq)  # convert the list back to a string
new = ''.join(newseq)  # convert the list back to a string
new = new.strip('.')  # remove the tag

print(old)  # check that the master sequence is unchanged
print(new)  # check the matched sequence

The output I get from this particular code and set of sequences is:

---A-A--AA---A--

---A-A--A----A-----A-----

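I cannot seem to get the loop to delete the unwanted dashes between the characters more than once, because the remaining loop iterations are used up by insert-dash/delete-dash pairs.
This is a good start on the problem.

How can I write this loop so that I get my desired output (two identical sequences)?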
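One possible way to let the loop remove several dashes at the same position is to advance the index only once the current position agrees with the master sequence. The following is a minimal sketch, not the asker's code: the names `realign`, `master`, and `query` are hypothetical, and it assumes both sequences contain the same residues in the same order.

```python
def realign(master, query, gap="-"):
    """Re-gap `query` so that it follows the gap pattern of `master`.

    Assumption: both sequences hold the same residues in the same order.
    A ValueError is raised when that assumption turns out to be false.
    """
    out = list(query)
    i = 0
    while i < len(master):
        if i >= len(out):
            # The query is exhausted, so only trailing gaps may remain.
            if master[i] != gap:
                raise ValueError(f"query has no residue for position {i}")
            out.append(gap)
            i += 1
        elif master[i] == out[i]:
            i += 1  # position already agrees
        elif master[i] == gap:
            out.insert(i, gap)  # master has a gap here: insert one
            i += 1
        elif out[i] == gap:
            del out[i]  # surplus gap: delete it and re-check the same index
        else:
            raise ValueError(f"residue mismatch at position {i}")
    # Anything left beyond the master length must be gaps only.
    if any(ch != gap for ch in out[len(master):]):
        raise ValueError("query has residues left over")
    return "".join(out[:len(master)])


print(realign("---A-A--AA---A--", "AA---A--A-----A-----"))  # ---A-A--AA---A--
print(realign("----AB--C-D-----", "--A--BC---D-"))          # ----AB--C-D-----
```

Using a `while` loop instead of `for i in range(n)` is the key design choice here: after a deletion the index is not advanced, so a run of unwanted dashes is removed one by one before the loop moves on.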

Source

2012-06-13 AHuck

There is no loop in this code sample. –

Thanks for pointing that out! I think I lost the loop statement in the shuffle. – AHuck

I edited your code, and it now gives the correct output:

seq = "----AB--C-D-----"
newseq = "--A--BC---D-"
seq = list(seq)  # convert the master sequence from a string to a list
newseq = list(newseq)  # convert the new sequence from a string to a list
n = len(seq)  # length of the master sequence
newseq.extend('.')  # append a tag to the new sequence to account for terminal gaps

print(seq, newseq, n)  # check the sequences in list form and the length
for i in range(len(seq)):
    if seq[i] != newseq[i]:
        if seq[i] == '-':
            newseq.insert(i, '-')  # master has a gap here: insert one
        elif newseq[i] == '-':
            newseq.insert(i, seq[i])  # new sequence has a gap: insert the master residue
        else:
            newseq.insert(i, seq[i])  # residues differ: insert the master residue

# once the loop has finished, trim the new sequence to the master length
newseq = newseq[0:len(seq)]

old = ''.join(seq)  # convert the list back to a string
new = ''.join(newseq)  # convert the list back to a string
new = new.strip('.')  # remove the tag

print(old)  # check that the master sequence is unchanged
print(new)  # check the matched sequence

Output:

----AB--C-D----- 
----AB--C-D-----

And with AA---A--A-----A-----:

---A-A--AA---A-- 
---A-A--AA---A--
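Note that every branch of the loop above inserts `seq[i]` into `newseq`, so after trimming the result equals the master sequence whatever `newseq` originally contained. If that matters, a cheap check before re-gapping is to compare the two sequences with their gaps removed. This is only a sketch; the helper names `residues` and `same_residues` are hypothetical.

```python
def residues(aligned, gap="-"):
    """Return the sequence with all gap characters removed."""
    return aligned.replace(gap, "")


def same_residues(master, query, gap="-"):
    """True when both sequences hold the same residues in the same order."""
    return residues(master, gap) == residues(query, gap)


print(same_residues("----AB--C-D-----", "--A--BC---D-"))  # True
print(same_residues("----AB--C-D-----", "--A--XC---D-"))  # False
```

When the check passes, copying the gap pattern of the master sequence is safe; when it fails, the two sequences genuinely differ and a real alignment method is needed.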

Source

2012-06-13 16:31:48

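This algorithm, like the previous one, does not account for possible mismatches at specific positions between strings of different sizes, and it does not backtrack if a better solution turns up later. Please consider studying dynamic programming. – rlinden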

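I will definitely look into dynamic programming for future work. For my immediate purposes, though, this code generally works (the sequences are always in the same order, there is only one solution, and the code handles strings of different sizes). Thanks! – AHuck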

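The sequence alignment problem is well known, and its solution is well described. For an introductory text, see Wikipedia. The best solution I know of involves dynamic programming, and you can see a sample implementation in Java at this site.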
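To make the dynamic programming idea concrete, here is a minimal Python sketch of global alignment in the Needleman-Wunsch style. It is not the Java implementation linked above; the function name and the scoring values (`match=1`, `mismatch=-1`, `gap=-1`) are assumptions chosen for illustration, and the input strings are a made-up example with gaps already removed.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two ungapped strings; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two characters
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


aligned_a, aligned_b, total = needleman_wunsch("ABCD", "ABD")
print(aligned_a)
print(aligned_b)
print(total)
```

Unlike the position-by-position loops above, the score table considers every way of placing gaps, so a better placement found later is never lost. The cost is time and memory proportional to the product of the two sequence lengths.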

Source

2012-06-13 16:30:32 rlinden
