Extracting FASTA moonlighting protein sequences with Python

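I want to use Python to extract the FASTA files containing the amino acid sequences from the Moonlighting Protein Database (www.moonlightingproteins.org/results.php?search_text=). Since it's an iterative process, I'd rather learn how to program it than do it by hand, b/c come on, we are in 2016. The problem is that I don't know how to write the code, because I'm a rookie programmer :(. The basic pseudocode would be: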

for protein_name in site: www.moonlightingproteins.org/results.php?search_text=:
    go to the uniprot option
    download the fasta file
    store it in a .txt file inside a given folder

Thanks in advance.

Source

2016-09-20 Manolo Flores

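I'd suggest googling "introduction to web scraping with Python" or similar terms and messing around with it a bit. Right now your question is too abstract. – Swier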

I would strongly recommend asking the authors of the database first! From their site:

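I would like to use the MoonProt database in a project to analyze the amino acid sequences or structures using bioinformatics.

Please contact us at [email protected] if you are interested in using the MoonProt database for analysis of sequences and/or structures of moonlighting proteins.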

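Suppose you find something interesting: how are you going to cite it in your paper or thesis? "The sequences were scraped from a public webpage without the consent of the authors"? It is much better to give credit to the original researchers.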

Here is a good introduction to scraping.

But, getting back to your original question.

import requests
from lxml import html

# Download one protein at a time; change 3 to any other id
page = requests.get('http://www.moonlightingproteins.org/detail.php?id=3')
# Convert the HTML document into a tree we can query from Python
tree = html.fromstring(page.content)
# Get all table cells
cells = tree.xpath('//td')

for i, cell in enumerate(cells):
    # If the cell text looks like a FASTA record, print it
    if cell.text and cell.text.strip().startswith('>'):
        print(cell.text)
    # If a table cell mentions UniProt, print the link from the next cell
    # (the bounds check avoids an IndexError on the last cell)
    if 'UniProt' in cell.text_content() and i + 1 < len(cells):
        link = cells[i + 1].find('a')
        if link is not None and 'href' in link.attrib:
            print(link.attrib['href'])
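To cover the last step of the pseudocode (storing each sequence in a .txt file inside a given folder), something along these lines might work. It is only a rough sketch under several assumptions: the helper names (fasta_filename, save_fasta, fasta_cells, download_range) are made up for illustration, the detail pages are assumed to be reachable through consecutive id values, and the FASTA record is assumed to sit in the text of a single table cell, as in the snippet above. Please check with the database authors before running it over many ids.

```python
import re
import time
from pathlib import Path

# URL pattern taken from the single-page example; consecutive ids are an assumption
DETAIL_URL = 'http://www.moonlightingproteins.org/detail.php?id={}'


def fasta_filename(record):
    """Build a safe .txt file name from the first token of the FASTA header."""
    header = record.strip().splitlines()[0].lstrip('>')
    tokens = header.split()
    if not tokens:
        return 'unnamed.txt'
    # Replace characters that are awkward in file names (such as '|') with '_'
    return re.sub(r'[^A-Za-z0-9._-]+', '_', tokens[0]) + '.txt'


def save_fasta(record, folder):
    """Write one FASTA record to <folder>/<name>.txt and return the path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / fasta_filename(record)
    path.write_text(record.strip() + '\n', encoding='utf-8')
    return path


def fasta_cells(page_bytes):
    """Return the text of every table cell that looks like a FASTA record."""
    from lxml import html  # imported here so the helpers above need no extras
    tree = html.fromstring(page_bytes)
    return [td.text for td in tree.xpath('//td')
            if td.text and td.text.strip().startswith('>')]


def download_range(first_id, last_id, folder, delay=1.0):
    """Fetch detail pages first_id..last_id and save any FASTA records found."""
    import requests
    for protein_id in range(first_id, last_id + 1):
        page = requests.get(DETAIL_URL.format(protein_id), timeout=30)
        if page.status_code != 200:
            continue  # skip ids that do not resolve to a page
        for record in fasta_cells(page.content):
            save_fasta(record, folder)
        time.sleep(delay)  # pause between requests to go easy on the server
```

Calling download_range(1, 5, 'fasta_files') would then try ids 1 to 5 and write one file per record it finds. Records whose headers start with the same token would overwrite each other, so adjust fasta_filename if that matters for your data.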

Source

2016-09-20 21:17:04
