Not entirely sure where the problem lies here. How do I connect to HTTPS sites with Scrapy, through Tor via Polipo?
Running Python 2.7.3 and Scrapy 0.16.5.
I've created a very simple Scrapy spider to test connecting to my local Polipo proxy, so that I can send requests through Tor. The basic code of my spider is as follows:
from scrapy.spider import BaseSpider


class TorSpider(BaseSpider):
    name = "tor"
    allowed_domains = ["check.torproject.org"]
    start_urls = [
        "https://check.torproject.org"
    ]

    def parse(self, response):
        print response.body
For my proxy middleware, I have defined:
from scrapy.conf import settings


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = settings.get('HTTP_PROXY')
In my settings file, HTTP_PROXY is defined as HTTP_PROXY = 'http://localhost:8123'.
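For reference, wiring this up in a Scrapy 0.16 settings.py looks roughly like the sketch below. The arachnid module path is an assumption based on the bot name in the log; adjust the dotted path to wherever the middleware actually lives in your project.

```python
# settings.py -- a sketch; `arachnid.middlewares` is a hypothetical module path

# Polipo listens here and forwards requests on to Tor
HTTP_PROXY = 'http://localhost:8123'

# Register the custom proxy middleware with the downloader
DOWNLOADER_MIDDLEWARES = {
    'arachnid.middlewares.ProxyMiddleware': 100,
}
```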
Now, if I change my start URL to http://check.torproject.org, everything works fine, no problems.
If I try to run against https://check.torproject.org, I get a 400 Bad Request error every time (I've also tried several other https:// sites, and they all have the same problem):
2013-07-23 21:36:18+0100 [scrapy] INFO: Scrapy 0.16.5 started (bot: arachnid)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, RandomUserAgentMiddleware, ProxyMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Enabled item pipelines:
2013-07-23 21:36:18+0100 [tor] INFO: Spider opened
2013-07-23 21:36:18+0100 [tor] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-23 21:36:18+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 1 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Retrying <GET https://check.torproject.org> (failed 2 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Gave up retrying <GET https://check.torproject.org> (failed 3 times): 400 Bad Request
2013-07-23 21:36:18+0100 [tor] DEBUG: Crawled (400) <GET https://check.torproject.org> (referer: None)
2013-07-23 21:36:18+0100 [tor] INFO: Closing spider (finished)
Just to double-check that there's nothing wrong with my Tor/Polipo setup, I can run the following curl command in a terminal and it connects fine: curl --proxy localhost:8123 https://check.torproject.org/
Any suggestions as to what's going wrong here?
What is your https_proxy set to? HTTP and HTTPS are usually sent over different ports etc., and need different proxies. – Andenthal
Not sure I follow. Surely connecting to an HTTP proxy and then requesting an HTTPS URL should work fine? Why would I have to connect to an HTTPS proxy in order to reach an HTTPS URL? And if that were the case, wouldn't the cURL command above fail? –
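As background for this comment thread: a single HTTP proxy can indeed serve both schemes, but the wire protocol differs. Plain HTTP is relayed as a GET with an absolute URI, while HTTPS is tunnelled with a CONNECT request to host:port, after which TLS runs end-to-end through the proxy (this is what curl does). A minimal sketch with a hypothetical helper:

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def proxy_request_line(url):
    """Return the request line a client sends to an HTTP proxy for `url`.

    Hypothetical helper for illustration only: https URLs are tunnelled
    via CONNECT (default port 443); plain http URLs are forwarded as a
    GET with the absolute URI.
    """
    parts = urlparse(url)
    if parts.scheme == 'https':
        port = parts.port or 443
        return 'CONNECT %s:%d HTTP/1.1' % (parts.hostname, port)
    return 'GET %s HTTP/1.1' % url


print(proxy_request_line('http://check.torproject.org'))
# GET http://check.torproject.org HTTP/1.1
print(proxy_request_line('https://check.torproject.org'))
# CONNECT check.torproject.org:443 HTTP/1.1
```

So a 400 on HTTPS-only suggests the client is not issuing the CONNECT tunnel correctly, rather than the proxy needing a separate HTTPS port.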