2017-07-31

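Python 3 urllib with FancyURLopener throws "file not found"

I want to use a proxy in a Python scraper.
However, it seems I can't use the proxies parameter with urlopen() the way the tutorials I have seen suggest (maybe a version thing?!):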

proxy = {'http' : 'http://example:8080' } 
req = urllib.request.Request(Site,headers=hdr, proxies=proxy) 
resp = urllib.request.urlopen(req).read() 

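So I tried to make sense of the urllib.request documentation, which suggests creating an opener. However, that takes no headers parameter; the documentation suggests something like opener.addheaders = [] instead. Nothing I tried worked. (The proxy IP check in the test code does work.)
The following arrangement looks like the best approach to me, but it throws a "cannot find the file" error. I'm not sure why.
It would be great if you could show me how to use the proxy together with the full set of headers.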

Code:

import bs4 as bs 
import urllib.request 
import ssl 
import re 
from pprint import pprint ## for printing out a readable dict. can be deleted afterwards 

######################################################### 
##     Parsing with beautiful soup 
######################################################### 

ssl._create_default_https_context = ssl._create_unverified_context 
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11', 
     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
     'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3', 
     'Accept-Encoding': 'none', 
     'Accept-Language': 'en-US,en;q=0.8', 
     'Connection': 'keep-alive'} 
Site = 'https://example.com' 
proxy = {'http' : 'http://example:8080' } 

def openPage(Site, hdr): 
    ## IP check 
    print('Actual IP', urllib.request.urlopen('http://httpbin.org/ip').read()) 

    req = urllib.request.Request(Site,headers=hdr) 
    opener = urllib.request.FancyURLopener(proxy) 
    opener.addheaders = [('User-agent', 'Mozilla/5.0')] 

    ## IP check 
    print('Fake IP', opener.open('http://httpbin.org/ip').read()) 
    resp = opener.open(req).read() 
## soup = bs.BeautifulSoup(resp,'lxml') 
## return(soup) 

soup = openPage(Site,hdr).... 

Error:

Traceback (most recent call last):
  File "C:\Program Files\Python36\lib\urllib\request.py", line 1990, in open_local_file
    stats = os.stat(localname)
FileNotFoundError: [WinError 2] The system cannot find the file specified: 'urllib.request.Request object at 0x000001D94816A908'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Projects/Python/Programms/WebScraper/scraper.py", line 72, in <module>
    mainNav()
  File "C:/Projects/Python/Programms/WebScraper/scraper.py", line 40, in mainNav
    soup = openPage(Site,hdr,ean)
  File "C:/Projects/Python/Programms/WebScraper/scraper.py", line 32, in openPage
    resp = opener.open(req).read()
  File "C:\Program Files\Python36\lib\urllib\request.py", line 1762, in open
    return getattr(self, name)(url)
  File "C:\Program Files\Python36\lib\urllib\request.py", line 1981, in open_file
    return self.open_local_file(url)
  File "C:\Program Files\Python36\lib\urllib\request.py", line 1992, in open_local_file
    raise URLError(e.strerror, e.filename)
urllib.error.URLError: <urlopen error The system cannot find the file specified>
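
A note on what the traceback seems to show: the "file" that could not be found is the text form of the Request object itself, so FancyURLopener.open() apparently treated the object as a URL string. As far as I know, the legacy opener classes (deprecated since Python 3.3) expect a plain URL string and fall back to a local-file lookup when they find no scheme. The sketch below only illustrates that there is no scheme to find in that text. The URL is a placeholder, and the stripping of the angle brackets is my assumption about how the message in the traceback came about.

```python
import urllib.request
from urllib.parse import urlsplit

req = urllib.request.Request('https://example.com')

# Text form of the object, with the surrounding angle brackets stripped
# so that it looks like the name quoted in the traceback.
as_text = str(req).strip('<>')
print(as_text)

# No URL scheme can be read from that text ...
print(repr(urlsplit(as_text).scheme))

# ... whereas the real URL is still available on the object itself.
print(req.full_url)
```

If that reading is right, passing a plain URL string to the legacy opener would avoid this particular error, but the per-request headers stored on the Request object would then not be sent.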

Answer

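The code below worked. I switched from FancyURLopener to installing my own opener, built with a ProxyHandler from the previously defined proxy dict. The headers are added afterwards.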

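def openPage(site, hdr, proxy): 

    ## Create an opener that sends requests through the proxy 
    proxy_support = urllib.request.ProxyHandler(proxy) 
    opener = urllib.request.build_opener(proxy_support) 
    ## Install it globally so that urllib.request.urlopen() uses it as well 
    urllib.request.install_opener(opener) 
    ## addheaders expects a list of (name, value) tuples; 
    ## this assumes hdr is a dict, as in the question 
    opener.addheaders = list(hdr.items())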
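
For completeness, here is a minimal sketch of how the same approach could be carried through to actually fetching the page. It assumes hdr is a dict as in the question. The names build_proxy_opener and open_page are hypothetical helpers of mine, not part of urllib, and the proxy address and header values are placeholders. Note that ProxyHandler picks the proxy by URL scheme, so an https:// site needs an 'https' entry in the proxy dict in addition to the 'http' one.

```python
import urllib.request


def build_proxy_opener(proxy, headers):
    """Return an opener that routes requests through `proxy` and sends `headers`.

    Hypothetical helper name; `headers` is assumed to be a dict.
    """
    proxy_support = urllib.request.ProxyHandler(proxy)
    opener = urllib.request.build_opener(proxy_support)
    # addheaders must be a list of (name, value) tuples, not a dict.
    opener.addheaders = list(headers.items())
    return opener


def open_page(site, headers, proxy):
    """Fetch `site` through the proxy and return the raw response bytes."""
    opener = build_proxy_opener(proxy, headers)
    with opener.open(site, timeout=10) as resp:
        return resp.read()


# Placeholder values for illustration only.
proxy = {'http': 'http://example:8080', 'https': 'http://example:8080'}
hdr = {'User-Agent': 'Mozilla/5.0', 'Accept-Language': 'en-US,en;q=0.8'}

opener = build_proxy_opener(proxy, hdr)
print(opener.addheaders)
```

Calling open_page(Site, hdr, proxy) would then return the bytes to hand to BeautifulSoup. This version does not call install_opener(), so the proxy applies only to requests made through this opener and not to every urlopen() call in the program.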