2014-01-07
0

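Download a file using Python, BeautifulSoup, and Selenium

I want to download the first PDB file from the search results (the download link is given below). I am using Python, Selenium, and BeautifulSoup. This is the code I have developed so far.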

import urllib2 
from BeautifulSoup import BeautifulSoup 
from selenium import webdriver 


uni_id = "P22216" 

# set parameters 
download_dir = "/home/home/Desktop/" 
url = "http://www.rcsb.org/pdb/search/smart.do?smartComparator=and&smartSearchSubtype_0=UpAccessionIdQuery&target=Current&accessionIdList_0=%s" % uni_id 

print "url - ", url 


# opening the url 
text = urllib2.urlopen(url).read(); 

#print "text : ", text 
soup = BeautifulSoup(text); 
#print soup 
print 


table = soup.find("table", {"class":"queryBlue"}) 
#print "table : ", table 

status = 0
rows = table.findAll('tr')
for tr in rows:
    try:
        cols = tr.findAll('td')
        if cols:
            link = cols[1].find('a').get('href')
            print "link : ", link
            if link:
                if status==1:
                    main_url = "http://www.rcsb.org" + link
                    print "main_url-----", main_url
                    status = False
                    browser.click(main_url)
        status+=1

    except:
        pass

I am getting None.
How do I download the first file in the search list? (i.e., 2YGV in this case)

The download link is: /pdb/protein/P32447
+0

Works for me. I get '/pdb/explore/explore.do?structureId=2YGV'. What's the problem? You can't download it? – ton1c

+0

I get that too, but how do I download the file? That is my question. – sam

Answers

2

I'm not sure exactly what you want to download, but here is how to download the 2YGV file:

import urllib 
import urllib2 
from bs4 import BeautifulSoup  

uni_id = "P22216"  
url = "http://www.rcsb.org/pdb/search/smart.do?smartComparator=and&smartSearchSubtype_0=UpAccessionIdQuery&target=Current&accessionIdList_0=%s" % uni_id  
text = urllib2.urlopen(url).read()  
soup = BeautifulSoup(text)  
link = soup.find("span", {"class":"iconSet-main icon-download"}).parent.get("href")  
urllib.urlretrieve("http://www.rcsb.org/" + str(link), str(link.split("=")[-1]) + ".pdb") 

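This script downloads the file from the link on the page. It does not need Selenium; I used urllib to retrieve the file. You can read this post for more information on how to download files with urllib.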
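
For readers on Python 3, the URL and file-name handling can be sketched with the standard library alone. This is a minimal sketch rather than the code from the answer: the helper names `pdb_filename`, `absolute_url`, and `download` are hypothetical, and it assumes the link has the form reported in the comments ('/pdb/explore/explore.do?structureId=2YGV'). Whether that URL returns the structure file itself or an HTML page depends on the site.

```python
import os
from urllib.parse import parse_qs, urljoin, urlparse
from urllib.request import urlretrieve

BASE_URL = "http://www.rcsb.org/"  # site root used in the scripts in this thread


def pdb_filename(href):
    """Return '<structureId>.pdb' for an href that carries a structureId query."""
    query = parse_qs(urlparse(href).query)
    ids = query.get("structureId")
    if not ids:
        raise ValueError("no structureId in %r" % href)
    return ids[0] + ".pdb"


def absolute_url(href):
    # urljoin avoids the double slash that plain concatenation produces
    # when href already starts with '/'.
    return urljoin(BASE_URL, href)


def download(href, target_dir="."):
    # Network call: only succeeds if the URL is still served.
    path = os.path.join(target_dir, pdb_filename(href))
    urlretrieve(absolute_url(href), path)
    return path
```

Parsing the query string instead of splitting on "=" makes the file name independent of the order of the URL parameters.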


Edit:

Or use this code to find the download link (it all depends on which file you want to download and from which URL):

import urllib 
import urllib2 
from bs4 import BeautifulSoup 


uni_id = "P22216" 
url = "http://www.rcsb.org/pdb/search/smart.do?smartComparator=and&smartSearchSubtype_0=UpAccessionIdQuery&target=Current&accessionIdList_0=%s" % uni_id 
text = urllib2.urlopen(url).read() 
soup = BeautifulSoup(text) 
table = soup.find("table", {"class":"queryBlue"}) 
link = table.find("a", {"class":"tooltip"}).get("href") 
urllib.urlretrieve("http://www.rcsb.org/" + str(link), str(link.split("=")[-1]) + ".pdb") 
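
The "first link in the results table" lookup can also be tried offline. The sketch below swaps BeautifulSoup for the standard-library `html.parser` so it runs without extra packages; the class name `FirstResultLink`, the function `first_result_href`, and the sample HTML are illustrative assumptions. Only the `queryBlue` and `tooltip` class names come from the scripts in this thread.

```python
from html.parser import HTMLParser


class FirstResultLink(HTMLParser):
    """Record the href of the first <a class="tooltip"> inside
    <table class="queryBlue">."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # > 0 while inside the queryBlue table
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "table":
            # Start counting at the queryBlue table; count nested tables too.
            if self.table_depth or "queryBlue" in classes:
                self.table_depth += 1
        elif tag == "a" and self.table_depth and self.href is None:
            if "tooltip" in classes:
                self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self.table_depth -= 1


def first_result_href(html):
    """Return the first matching href, or None if there is no result."""
    parser = FirstResultLink()
    parser.feed(html)
    return parser.href


# Hypothetical page fragment, not copied from the real site.
SAMPLE = """
<a class="tooltip" href="/outside-the-table">ignored</a>
<table class="queryBlue">
  <tr><td><a class="tooltip" href="/pdb/explore/explore.do?structureId=2YGV">2YGV</a></td></tr>
  <tr><td><a class="tooltip" href="/pdb/explore/explore.do?structureId=0000">later hit</a></td></tr>
</table>
"""

print(first_result_href(SAMPLE))
```

Returning None when nothing matches lets the caller distinguish "no results" from a parsing error.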

Here is an example of how to do what you asked about in the comments:

import mechanize 
from bs4 import BeautifulSoup 


SEARCH_URL = "http://www.rcsb.org/pdb/home/home.do" 

l = ["YGL130W", "YDL159W", "YOR181W"] 
browser = mechanize.Browser() 

for item in l: 
    browser.open(SEARCH_URL) 
    browser.select_form(nr=0) 
    browser["q"] = item 
    html = browser.submit() 

    soup = BeautifulSoup(html) 
    table = soup.find("table", {"class":"queryBlue"}) 
    if table:
        # Take the first result link in the table and save it as <ID>.pdb.
        link = table.find("a", {"class":"tooltip"}).get("href")
        browser.retrieve("http://www.rcsb.org/" + str(link), str(link.split("=")[-1]) + ".pdb")
        print "Downloaded " + item + " as " + str(link.split("=")[-1]) + ".pdb"
    else:
        print item + " was not found"

Output:

Downloaded YGL130W as 3KYH.pdb 
Downloaded YDL159W as 3FWB.pdb 
YOR181W was not found 
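
The found / not-found branching of the loop can be exercised without touching the network by passing the search and download steps in as functions. This is only a sketch: `process`, `find_href`, and `save` are hypothetical names, and the stub results simply mirror the output shown above.

```python
def process(ids, find_href, save):
    """Return one status line per search term.

    find_href(term) -> href of the first result, or None if nothing matched.
    save(href, filename) -> performs the actual download.
    Both are supplied by the caller (assumptions, not a real library API).
    """
    lines = []
    for item in ids:
        href = find_href(item)
        if href is None:
            lines.append(item + " was not found")
            continue
        filename = href.split("=")[-1] + ".pdb"
        save(href, filename)
        lines.append("Downloaded " + item + " as " + filename)
    return lines


# Stubs standing in for the real search and download steps.
CANNED = {
    "YGL130W": "/pdb/explore/explore.do?structureId=3KYH",
    "YDL159W": "/pdb/explore/explore.do?structureId=3FWB",
}
saved = []

for line in process(["YGL130W", "YDL159W", "YOR181W"],
                    CANNED.get,
                    lambda href, name: saved.append(name)):
    print(line)
```

In a real run, `find_href` would wrap the form submission and table lookup, and `save` would wrap the retrieval call.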
+0

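I read and understood your code. Thank you. I have a list l = [YGL130W, YDL159W, YOR181W]. With it I have to go to http://www.rcsb.org/pdb/home/home.do, then take each ID and search for it on that site. The results page has a link to search the PDB. I have to click it, and then I either get the download page for a single PDB entry or get multiple PDB entries. If there are multiple, I have to download the first PDB in the search results. – sam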

+1

Edited the answer. Hope this helps. – ton1c

+0

You are an amazing coder. Thanks – sam