How to write a web proxy in Python

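I'm trying to write a web proxy in Python. The goal is to visit a URL like http://proxyurl/http://anothersite.com/ and see the contents of http://anothersite.com just as you normally would. I've gotten pretty decent results by abusing the requests library, but this isn't really the intended use of the requests framework. I've written proxies with Twisted before, but I'm not sure how to connect that to what I'm trying to do here. This is what I have so far: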

import os 
import urlparse 

import requests 

import tornado.ioloop 
import tornado.web 
from tornado import template 

ROOT = os.path.dirname(os.path.abspath(__file__)) 
path = lambda *a: os.path.join(ROOT, *a) 

loader = template.Loader(path(ROOT, 'templates')) 


class ProxyHandler(tornado.web.RequestHandler):
    def get(self, slug):
        if slug.startswith("http://") or slug.startswith("https://"):
            if self.get_argument("start", None) == "true":
                parsed = urlparse.urlparse(slug)
                self.set_cookie("scheme", value=parsed.scheme)
                self.set_cookie("netloc", value=parsed.netloc)
                self.set_cookie("urlpath", value=parsed.path)
            # external resource
            else:
                response = requests.get(slug)
                headers = response.headers
                if 'content-type' in headers:
                    self.set_header('Content-type', headers['content-type'])
                if 'length' in headers:
                    self.set_header('length', headers['length'])
                for block in response.iter_content(1024):
                    self.write(block)
                self.finish()
                return
        else:
            # absolute
            if slug.startswith('/'):
                slug = "{scheme}://{netloc}{original_slug}".format(
                    scheme=self.get_cookie('scheme'),
                    netloc=self.get_cookie('netloc'),
                    original_slug=slug,
                )
            # relative
            else:
                slug = "{scheme}://{netloc}{path}{original_slug}".format(
                    scheme=self.get_cookie('scheme'),
                    netloc=self.get_cookie('netloc'),
                    path=self.get_cookie('urlpath'),
                    original_slug=slug,
                )
        response = requests.get(slug)
        # get the headers
        headers = response.headers
        # get doctype
        doctype = None
        if '<!doctype' in response.content.lower()[:9]:
            doctype = response.content[:response.content.find('>')+1]
        if 'content-type' in headers:
            self.set_header('Content-type', headers['content-type'])
        if 'length' in headers:
            self.set_header('length', headers['length'])
        self.write(response.content)


application = tornado.web.Application([ 
    (r"/(.+)", ProxyHandler), 
]) 

if __name__ == "__main__": 
    application.listen(8888) 
    tornado.ioloop.IOLoop.instance().start()

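Just a note: I set cookies to preserve the scheme, netloc, and urlpath if start=true is in the query string. That way, any relative or absolute link that then hits the proxy uses those cookies to resolve the full URL.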
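To illustrate the cookie-based resolution described above, here is a minimal sketch using Python 3's urllib.parse (the code in this thread is Python 2, so that is an assumption). The function name `upstream_url` and its parameters are hypothetical. One thing worth keeping in mind: the browser resolves path-relative links against the current page before sending the request, so a request that reaches the proxy without a full URL in its path is always relative to the proxy's root, and it is enough to join the path onto the stored scheme and netloc.

```python
from urllib.parse import urljoin, urlunsplit


def upstream_url(slug, scheme, netloc, query=""):
    """Rebuild the upstream URL for a request that arrived without a full URL.

    slug   -- the captured request path (hypothetical name, mirrors the handler)
    scheme -- value stored in the "scheme" cookie
    netloc -- value stored in the "netloc" cookie
    query  -- raw query string of the incoming request, if any
    """
    origin = urlunsplit((scheme, netloc, "/", "", ""))
    # The slug is treated as root-relative, so force a single leading slash.
    target = urljoin(origin, "/" + slug.lstrip("/"))
    return (target + "?" + query) if query else target


print(upstream_url("header.png", "http", "anothersite.com"))
```

Joining with urljoin instead of string formatting avoids doubled or missing slashes when the stored values and the slug are concatenated.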

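With this code, if you go to http://localhost:8888/http://espn.com/?start=true you'll see the contents of ESPN. However, it doesn't work at all on the following site: http://www.bottegaveneta.com/us/shop/. My question is, what's the best way to do this? Is the way I'm currently implementing this robust, or are there some horrible pitfalls to doing it this way? If it is correct, why do certain sites like the one I pointed out not work at all?

Thank you for any help.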


2013-05-13 Kang Roodle

Bottega Veneta doesn't let you access resources directly. For example, try hitting http://www.bottegaveneta.com/us/shop/css/bottegaveneta/form.css - I get an HTML 404 page. – 2013-05-14 02:29:40

I'm guessing this has to do with the HTTP Referrer. You could try setting that as well. – 2013-05-14 02:30:49

@Cole Oh, you mean the referer? (https://en.wikipedia.org/wiki/HTTP_referer#Origin_of_the_term_referer) – rakslice 2013-10-04 01:32:42
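As a rough sketch of the suggestion in the comments above, the proxy could build the upstream request headers itself and set Referer (the header name really is spelled that way) to the upstream page that linked to the resource. The function name `upstream_headers` and the choice of forwarded headers are assumptions for illustration, and whether this is what Bottega Veneta actually checks is untested.

```python
def upstream_headers(client_headers, upstream_page_url):
    """Build headers for the upstream request.

    client_headers    -- mapping of the incoming request's headers
    upstream_page_url -- URL of the proxied page that referenced the resource
    """
    headers = {}
    # Forward a few headers that sites commonly inspect (an assumed selection).
    for name in ("User-Agent", "Accept", "Accept-Language"):
        value = client_headers.get(name)
        if value:
            headers[name] = value
    # Make the request look like it came from the original page, not the proxy.
    headers["Referer"] = upstream_page_url
    return headers


example = upstream_headers({"User-Agent": "Mozilla/5.0"},
                           "http://www.bottegaveneta.com/us/shop/")
print(example)
```

The resulting dictionary can be passed to requests through its `headers` keyword argument, for example `requests.get(url, headers=upstream_headers(self.request.headers, page_url))`, where `page_url` would have to be rebuilt from the stored cookies.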

I think you don't need your last block. This seems to work for me.

class ProxyHandler(tornado.web.RequestHandler):
    def get(self, slug):
        print 'get: ' + str(slug)

        if slug.startswith("http://") or slug.startswith("https://"):
            if self.get_argument("start", None) == "true":
                parsed = urlparse.urlparse(slug)
                self.set_cookie("scheme", value=parsed.scheme)
                self.set_cookie("netloc", value=parsed.netloc)
                self.set_cookie("urlpath", value=parsed.path)
            # external resource
            else:
                response = requests.get(slug)
                headers = response.headers
                if 'content-type' in headers:
                    self.set_header('Content-type', headers['content-type'])
                if 'length' in headers:
                    self.set_header('length', headers['length'])
                for block in response.iter_content(1024):
                    self.write(block)
                self.finish()
                return
        else:
            slug = "{scheme}://{netloc}/{original_slug}".format(
                scheme=self.get_cookie('scheme'),
                netloc=self.get_cookie('netloc'),
                original_slug=slug,
            )
            print self.get_cookie('scheme')
            print self.get_cookie('netloc')
            print self.get_cookie('urlpath')
            print slug
        response = requests.get(slug)
        # get the headers
        headers = response.headers
        # get doctype
        doctype = None
        if '<!doctype' in response.content.lower()[:9]:
            doctype = response.content[:response.content.find('>')+1]
        if 'content-type' in headers:
            self.set_header('Content-type', headers['content-type'])
        if 'length' in headers:
            self.set_header('length', headers['length'])
        self.write(response.content)


2013-05-13 16:53:50

-3

You can use the requests module:

import requests 

proxies = { 
    "http": "http://10.10.1.10:3128", 
    "https": "http://10.10.1.10:1080", 
} 

requests.get("http://example.org", proxies=proxies)

request docs


2013-05-21 08:25:53 sinceq

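Why isn't this +1 or more? – sinceq 2013-06-28 10:29:24

Because he's trying to *write* a proxy, not *use* one – Xavier 2013-08-05 15:22:33

You can use the socket module from the standard library, and if you're on Linux, epoll as well.

You can see example code for a simple async server here: https://github.com/aychedee/octopus/blob/master/octopus/server.py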


2013-08-03 18:14:56 aychedee

If you want a real proxy, you can use:

tornado-proxy

or

simple proxy based on Twisted

But I think it wouldn't be hard to adapt them to your case.


2013-08-26 14:57:37 shirk3y

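I recently wrote a similar web application. Note that this is the way I did it; I'm not saying you should do it like this. These are some of the pitfalls I came across:

Changing attribute values from relative to absolute

There is much more involved than just fetching a page and presenting it to the client. Many times you won't be able to proxy a web page without any errors.

Why do certain sites like the one I pointed out not work at all?

Many web pages rely on relative paths to resources in order to display the page in a well-formatted manner. For example, this image tag:

<img src="/header.png" />

will result in the client making a request to:

http://proxyurl/header.png

which fails. The 'src' value should be converted to:

http://anothersite.com/header.png.

So you need to parse the HTML document with something like BeautifulSoup, loop over all the tags, and check for attributes such as:

'src', 'lowsrc', 'href'

and change their values accordingly, so that the tag becomes:

<img src="http://anothersite.com/header.png" />

This method applies to more tags than just the image one. a, script, link, li and frame are a few you should change as well.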
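The attribute rewriting described above could look roughly like this. The answer suggests BeautifulSoup; to stay dependency-free this sketch swaps in the standard library's html.parser (Python 3 assumed), so it is not the answerer's actual code. The names `LinkRewriter` and `absolutize` are hypothetical, and the re-serialized markup may differ cosmetically from the input (attribute quoting, for instance).

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes named in the answer; extend as needed.
URL_ATTRS = {"src", "lowsrc", "href"}


class LinkRewriter(HTMLParser):
    """Re-emit the document while making URL attributes absolute."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=False)
        self.base_url = base_url
        self.out = []

    def _tag(self, tag, attrs, closing):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute such as "disabled"
                parts.append(name)
                continue
            if name in URL_ATTRS:
                value = urljoin(self.base_url, value)
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        return "<" + " ".join(parts) + closing

    def handle_starttag(self, tag, attrs):
        self.out.append(self._tag(tag, attrs, ">"))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._tag(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def absolutize(html_text, base_url):
    """Return html_text with src/lowsrc/href resolved against base_url."""
    parser = LinkRewriter(base_url)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(absolutize('<img src="/header.png" />', "http://anothersite.com/"))
```

If the rewritten links should keep going through the proxy rather than straight to the origin, the proxy URL could be prepended to the result of urljoin.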

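HTML shenanigans

The previous method should get you far, but you're not done yet.

Both

<style type="text/css" media="all">@import "/stylesheet.css?version=120215094129002";</style>

and

<div style="position:absolute;right:8px;background-image:url('/Portals/_default/Skins/BE/images/top_img.gif');height:200px;width:427px;background-repeat:no-repeat;background-position:right top;" >

are examples of code that is difficult to reach and modify using BeautifulSoup.

In the first example there's a CSS @import with a relative URI. The second one involves the 'url()' function in an inline CSS statement.

In my case, I ended up writing horrible code to modify these values manually. You may want to use a regex for this, but I'm not sure.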
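For what it's worth, a regex-based pass over CSS text might look like the sketch below. It is only an illustration of the regex idea mentioned above, not the answerer's code: the patterns cover the two cases shown (a quoted @import and url(...) with or without quotes) and will miss more exotic CSS. The name `absolutize_css` is hypothetical, and Python 3 is assumed.

```python
import re
from urllib.parse import urljoin

# url('x'), url("x") or url(x)
CSS_URL = re.compile(r"""url\(\s*(['"]?)(.*?)\1\s*\)""", re.IGNORECASE)
# @import "x" or @import 'x' (the @import url(...) form is caught by CSS_URL)
CSS_IMPORT = re.compile(r"""(@import\s+)(['"])(.*?)\2""", re.IGNORECASE)


def absolutize_css(css_text, base_url):
    """Resolve relative URIs in url(...) and @import "..." against base_url."""

    def fix_url(match):
        quote, target = match.group(1), match.group(2)
        return "url(%s%s%s)" % (quote, urljoin(base_url, target), quote)

    def fix_import(match):
        prefix, quote, target = match.groups()
        return "%s%s%s%s" % (prefix, quote, urljoin(base_url, target), quote)

    css_text = CSS_URL.sub(fix_url, css_text)
    return CSS_IMPORT.sub(fix_import, css_text)


print(absolutize_css("background-image:url('/images/top_img.gif');",
                     "http://anothersite.com/"))
```

The same function can be applied to the text of style elements, to style attribute values, and to fetched stylesheets, using the stylesheet's own URL as the base in the last case.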

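Redirects

With Python's requests or urllib2 you can easily follow redirects automatically. Just remember to save the new (base) URI; you'll need it for the 'changing attribute values from relative to absolute' operation.

You also need to deal with 'hardcoded' redirects, such as this one:

<meta http-equiv="refresh" content="0;url=http://anothersite.com/">

which needs to be changed to:

<meta http-equiv="refresh" content="0;url=http://proxyurl/http://anothersite.com/">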
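A small sketch of that meta refresh rewrite, again with a regex and Python 3 assumed. The function name `proxify_meta_refresh` and the `proxy_prefix` parameter are hypothetical, and the pattern only handles the common `content="N;url=..."` form with an unquoted URL.

```python
import re
from urllib.parse import urljoin

# Matches a <meta http-equiv="refresh" ...> tag up to the start of its target URL.
META_REFRESH = re.compile(
    r"""(<meta\b(?=[^>]*\bhttp-equiv\s*=\s*["']?refresh\b)"""
    r"""[^>]*?\bcontent\s*=\s*["']\s*\d+\s*;\s*url\s*=\s*)([^"'>]+)""",
    re.IGNORECASE,
)


def proxify_meta_refresh(html_text, proxy_prefix, base_url):
    """Point meta refresh targets back at the proxy.

    proxy_prefix -- e.g. "http://proxyurl/" (assumed URL scheme of the proxy)
    base_url     -- URL of the page being rewritten, for relative targets
    """

    def fix(match):
        target = urljoin(base_url, match.group(2).strip())
        return match.group(1) + proxy_prefix + target

    return META_REFRESH.sub(fix, html_text)


page = '<meta http-equiv="refresh" content="0;url=http://new-website.com/">'
print(proxify_meta_refresh(page, "http://proxyurl/", "http://anothersite.com/"))
```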

Base tag

The base tag specifies the base URL/target for all relative URLs in a document. You probably want to change its value.
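One way to change or add that value is sketched below (Python 3 assumed; the name `set_base_href` is hypothetical). Which href to use is a design choice: pointing it at the original site makes relative links resolve correctly but bypass the proxy, while pointing it at the proxied URL keeps path-relative links going through the proxy but still leaves root-relative links such as /header.png resolving against the proxy's root.

```python
import re
from html import escape

BASE_TAG = re.compile(r"<base\b[^>]*>", re.IGNORECASE)
HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)


def set_base_href(html_text, href):
    """Replace the first <base> tag, or add one right after <head>.

    Note: replacing an existing tag drops its other attributes (e.g. target).
    """
    tag = '<base href="%s">' % escape(href, quote=True)
    if BASE_TAG.search(html_text):
        return BASE_TAG.sub(lambda m: tag, html_text, count=1)
    if HEAD_OPEN.search(html_text):
        return HEAD_OPEN.sub(lambda m: m.group(0) + tag, html_text, count=1)
    # No <head> found: leave the document untouched.
    return html_text


print(set_base_href("<html><head><title>t</title></head></html>",
                    "http://anothersite.com/"))
```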

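Finally done?

Nope. Some websites rely heavily on JavaScript to draw their content on screen. These sites are the hardest to proxy. I've been thinking about using something like PhantomJS or Ghost to fetch and evaluate web pages and present the result to the client.

Maybe my source code can help you. You can use it in any way you want.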


2013-11-01 15:10:50 cpb2

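You can stick a <base> tag in the head of the document to fix the relative URLs in one fell swoop. (Unless there's already one!) – kindall 2013-11-01 15:38:40

I hadn't thought of that! I'll try it out. Thanks! – cpb2 2013-11-01 15:40:55

I'm obviously quite late in answering this, but I just stumbled upon it. I've been writing something similar to what you're asking for myself.

It's more of an HTTP repeater, but its first task is the proxy itself. It isn't fully complete yet, and there's no README for it at the moment, but both are on my to-do list.

I've used mitmproxy to achieve this. It may not be the most elegant piece of code out there, and I've used lots of hacks here and there to get the repeater functionality working. I know mitmproxy has ways to implement a repeater easily by default, but in my case there were some specific requirements for which I couldn't use the features mitmproxy offers.

You can find the project at https://github.com/c0n71nu3/python_repeater/. The repo is still being updated as and when I make progress.

Hope it helps you.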


2015-09-01 11:28:59 qre0ct
