
This is my first Python project, so it is very basic and rudimentary. I often have to clean viruses off friends' computers, and the free programs I use are updated frequently. Instead of manually downloading each program, I am trying to create a simple way to automate the process: downloading the files from multiple websites. Since I am also learning Python, I figured this would be a good practice opportunity.

The problem:

For some of the links I have to find the .exe file on the page. I can find the correct URLs, but I get an error when trying to download them.

Is there a way to add all the links to a list, and then create a function that goes through the list and runs the download on each URL? I have Googled quite a bit and can't seem to make it work. Maybe I'm not thinking in the right direction?

import urllib, urllib2, re, os 
from BeautifulSoup import BeautifulSoup 

# Website List 
sas = 'http://cdn.superantispyware.com/SUPERAntiSpyware.exe' 
tds = 'http://support.kaspersky.com/downloads/utils/tdsskiller.exe' 
mbam = 'http://www.bleepingcomputer.com/download/malwarebytes-anti-malware/dl/7/?1' 
tr = 'http://www.simplysup.com/tremover/download.html' 
urllist = [sas, tds, mbam, tr]
urllist2 = []

# Find exe files to download 

match = re.compile('\.exe') 
data = urllib2.urlopen(urllist) 
page = BeautifulSoup(data) 

# Check links 
#def findexe(): 
for link in page.findAll('a'):
    try:
        href = link['href']
        if re.search(match, href):
            urllist2.append(href)
    except KeyError:
        pass

os.chdir(r"C:\_VirusFixes") 
urllib.urlretrieve(urllist2, os.path.basename(urllist2)) 

As you can see, I have left the function commented out because I couldn't get it to work properly.

Should I give up on the list and just download each file individually? I was trying to be efficient.

Any suggestions, or a pointer in the right direction, would be greatly appreciated.

Answers

0

In addition to mikez302's answer, here is a slightly more readable way to write your code:

import os 
import re 
import urllib 
import urllib2 

from BeautifulSoup import BeautifulSoup 

websites = [ 
    'http://cdn.superantispyware.com/SUPERAntiSpyware.exe',
    'http://support.kaspersky.com/downloads/utils/tdsskiller.exe',
    'http://www.bleepingcomputer.com/download/malwarebytes-anti-malware/dl/7/?1',
    'http://www.simplysup.com/tremover/download.html',
] 

download_links = [] 

for url in websites: 
    connection = urllib2.urlopen(url) 
    soup = BeautifulSoup(connection) 
    connection.close() 

    for link in soup.findAll('a', {'href': re.compile(r'\.exe$')}):
        download_links.append(link['href'])

for url in download_links: 
    urllib.urlretrieve(url, os.path.join(r'C:\_VirusFixes', os.path.basename(url)))
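
One thing worth noting: the href values scraped from a download page (such as the simplysup one) may be relative paths rather than full URLs, and urlretrieve needs an absolute URL. Here is a minimal sketch of resolving them with the standard library's urlparse module; the example path is made up:

import urlparse

page_url = 'http://www.simplysup.com/tremover/download.html'
href = '/downloads/trjsetup.exe'  # hypothetical relative link found on the page

# urljoin resolves relative links against the page they were scraped from
# and leaves links that are already absolute untouched.
download_url = urlparse.urljoin(page_url, href)
print download_url  # http://www.simplysup.com/downloads/trjsetup.exe
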
+0

Thank you for the help. I think I see now that I was missing the loop. Unfortunately, it still isn't working for me; it is still having problems with the URLs. I'll keep troubleshooting. – MBH

0

urllib2.urlopen is a function for accessing a single URL. If you want to access multiple ones, you should loop over the list. You should do something like this:

for url in urllist:
    data = urllib2.urlopen(url)
    page = BeautifulSoup(data)

    # Check the links on this page
    for link in page.findAll('a'):
        try:
            href = link['href']
            if re.search(match, href):
                urllist2.append(href)
        except KeyError:
            pass

# Download each collected link into the target folder
os.chdir(r"C:\_VirusFixes")
for href in urllist2:
    urllib.urlretrieve(href, os.path.basename(href))
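
Since the question specifically asked about wrapping this in a function, the same logic can also be split into two small helpers, one that collects the .exe links and one that downloads them. This is only a sketch of that structure, reusing the urllist from the question; the helper names are made up:

import os
import re
import urllib
import urllib2

from BeautifulSoup import BeautifulSoup

def find_exe_links(page_urls):
    # Collect every href containing .exe from each page in the list.
    exe_links = []
    for url in page_urls:
        page = BeautifulSoup(urllib2.urlopen(url))
        for link in page.findAll('a', href=re.compile(r'\.exe')):
            exe_links.append(link['href'])
    return exe_links

def download_all(links, folder=r'C:\_VirusFixes'):
    # Save each link into the given folder, named after the file.
    for link in links:
        urllib.urlretrieve(link, os.path.join(folder, os.path.basename(link)))

download_all(find_exe_links(urllist))  # urllist as defined in the question
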
0

The code above did not work for me; in my case that was because the pages assemble their links with scripts instead of including them in the HTML. When I ran into that problem I used the following code, which is just a scraper:

import os 
import re 
import urllib 
import urllib2 

from bs4 import BeautifulSoup 

url = ''  # the page you want to scrape goes here

connection = urllib2.urlopen(url)
soup = BeautifulSoup(connection)  # Everything is the same up to here
regex = r'(.+?)\.zip'  # Here we insert the pattern we are looking for
pattern = re.compile(regex)
link = re.findall(pattern, str(soup))  # This finds all the .zip (.exe) in the text

# The matches usually come back with a lot of undesirable text around them.
# Luckily the file name is almost always separated from the rest of the text
# by a space, which is why we split and keep the last piece.
link = [i.split(' ')[-1] for i in link]

os.chdir(r"F:\Documents")
# This is the filepath where I want to save everything I download

for i in link:
    # The captured text doesn't include the .zip (or .exe in your case),
    # so re-append it to form the download link and the saved filename.
    urllib.urlretrieve(i + '.zip', filename=os.path.basename(i) + '.zip')

It is not as efficient as the code in the previous answers, but it will work for almost any site.
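
Whichever version gets used, it can also help to catch download failures so that one dead link does not stop the whole run; urllib.urlretrieve raises IOError when it cannot reach a file. A minimal sketch of that, with a hypothetical wrapper name:

import os
import urllib

def safe_download(link, folder):
    # Try to save a single file, reporting failures instead of crashing.
    target = os.path.join(folder, os.path.basename(link))
    try:
        urllib.urlretrieve(link, target)
        print 'Saved %s' % target
    except IOError as e:
        print 'Could not download %s: %s' % (link, e)

for link in download_links:  # the list of .exe links collected earlier
    safe_download(link, r'C:\_VirusFixes')
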