Making asynchronous, Node.js-like HTTP requests in Tornado

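I have written applications in Node.js before, in particular data scrapers. These applications have no web front end; they are just processes run on a schedule by cron jobs that asynchronously make a large number of possibly complex HTTP GET requests to pull web pages, then scrape and store the data from the results.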

A sample of a function I might write looks like this:

// Node.js

var request = require("request");

function scrapeEverything() {
    var listOfIds = [23423, 52356, 63462, 34673, 67436];

    for (var i = 0; i < listOfIds.length; i++) {
        // Each call returns immediately; the callback runs when the response arrives
        request({uri: "http://mydatasite.com/?data_id=" + listOfIds[i]},
            function(err, response, body) {
                if (err) { return console.error(err); }
                var jsonobj = JSON.parse(body);
                storeMyData(jsonobj);
            });
    }
}

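This function iterates over the IDs, fires off a batch of asynchronous GET requests, and then stores the data from each response.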

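I am now writing a scraper in Python and trying to do the same thing with Tornado, but everything I see in the documentation refers to Tornado acting as a web server, which is not what I am looking for. Does anyone know how to do this?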

Source

2012-07-18 jdotjdot

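For anyone who comes across this question later: I ended up using Twisted instead (http://twistedmatrix.com/trac/), which is a great asynchronous framework for Python. It has a learning curve, but I was able to do this without having to work around the web server side of things. – jdotjdot 2012-10-01 04:14:07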

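This answer turned out slightly more involved than I thought it would be when I threw it together, but it is a quick demonstration of how to use the Tornado ioloop and AsyncHTTPClient to fetch some data. I have actually written a web crawler in Tornado, so it can be used "headless".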

import tornado.ioloop
import tornado.httpclient

# Note: written against the Tornado API of the time. The io_loop argument
# was removed in Tornado 5.0 and the callback argument to fetch() in 6.0.

class Fetcher(object):
    def __init__(self, ioloop):
        self.ioloop = ioloop
        self.client = tornado.httpclient.AsyncHTTPClient(io_loop=ioloop)

    def fetch(self, url):
        # Returns immediately; handle_response runs when the request completes
        self.client.fetch(url, self.handle_response)

    @property
    def active(self):
        """True if any fetches are still in progress"""
        # `active` is the simple HTTP client's dict of in-flight requests
        return len(self.client.active) != 0

    def handle_response(self, response):
        if response.error:
            print("Error: %s" % response.error)
        else:
            print("Got %d bytes" % len(response.body))

        # Stop the loop once the last outstanding request has finished
        if not self.active:
            self.ioloop.stop()

def main():
    ioloop = tornado.ioloop.IOLoop.instance()
    ioloop.add_callback(scrapeEverything)
    ioloop.start()

def scrapeEverything():
    fetcher = Fetcher(tornado.ioloop.IOLoop.instance())

    listOfIds = [23423, 52356, 63462, 34673, 67436]

    for data_id in listOfIds:
        fetcher.fetch("http://mydatasite.com/?data_id=%d" % data_id)

if __name__ == '__main__':
    main()
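
For readers on a current Tornado release, the same fan-out is usually written with coroutines rather than callbacks. The following is only a minimal sketch and not part of the answer above. It assumes Tornado 6 or later on Python 3.7+, and `fetch_all` and `main` are illustrative names rather than Tornado functions. The Tornado import sits inside `main()` so that `fetch_all` can be tried with any coroutine-based client.

```python
import asyncio


async def fetch_all(fetch, urls):
    """Start one request per URL and wait until every one has finished.

    `fetch` is any callable that takes a URL and returns an awaitable.
    """
    # return_exceptions=True hands back a failed request's exception in its
    # slot of the result list instead of aborting the whole batch.
    return await asyncio.gather(*(fetch(url) for url in urls),
                                return_exceptions=True)


async def main():
    # Assumption: Tornado >= 6, where AsyncHTTPClient.fetch returns a Future.
    from tornado.httpclient import AsyncHTTPClient

    client = AsyncHTTPClient()
    list_of_ids = [23423, 52356, 63462, 34673, 67436]
    urls = ["http://mydatasite.com/?data_id=%d" % i for i in list_of_ids]

    for url, result in zip(urls, await fetch_all(client.fetch, urls)):
        if isinstance(result, Exception):
            print("Error fetching %s: %s" % (url, result))
        else:
            print("Got %d bytes from %s" % (len(result.body), url))
```

Calling `asyncio.run(main())` starts all five requests and returns once the last one has completed, so no explicit `ioloop.stop()` bookkeeping is needed.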

Source

2012-07-18 18:29:14 koblas

Nice answer. The Tornado documentation for HTTPClient is here: http://www.tornadoweb.org/documentation/httpclient.html (since the OP couldn't find it). – 2012-07-19 09:25:15

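If you are open to alternatives to Tornado (I assume you scrape using socket programming rather than urllib2), you may be interested in asyncoro, a framework for asynchronous, concurrent (and distributed, fault-tolerant) programming. Programming with asyncoro is very similar to programming with threads, apart from a few syntactic changes. Your problem could be implemented with asyncoro as: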

import asyncoro, socket 

def process(url, coro=None): 
    # create asynchronous socket 
    sock = asyncoro.AsynCoroSocket(socket.socket()) 
    # parse url to get host, port; prepare get_request 
    yield sock.connect((host, port)) 
    yield sock.send(get_request) 
    body = yield sock.recv() 
    # ... 
    # process body 

for i in [23423, 52356, 63462, 34673, 67436]: 
    asyncoro.Coro(process, "http://mydatasite.com/?data_id=%s" % i)
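
The "parse url to get host, port; prepare get_request" step is left as a comment in the code above. One way it could be filled in with the standard library is sketched below. This is an assumption about what that step might look like, not the answer author's code: `build_get_request` is a hypothetical helper, it targets Python 3, and it handles plain HTTP only.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Return ((host, port), request_bytes) for a plain-HTTP GET of url."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80  # this sketch does not cover HTTPS
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        # HTTP/1.0 so the server closes the connection after the response
        "GET %s HTTP/1.0" % path,
        "Host: %s" % parts.netloc,
        "",  # blank line that terminates the header block
        "",
    ]
    return (host, port), "\r\n".join(lines).encode("ascii")
```

The returned address tuple is what a socket `connect` call expects, and the bytes are what would be passed to `send`.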

Source

2012-07-18 15:12:11

You can also try a native solution that does not require any external libraries. On Linux it is based on epoll and might look like this. Usage example:

# ------------------------------------------------------------------------------------
def sampleCallback(status, data, request):
    print('fetched: %s %d' % (status, len(data)))
    print(data)

# ------------------------------------------------------------------------------------
fetch(HttpRequest('google.com:80', 'GET', '/', None, sampleCallback))
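
The `fetch` and `HttpRequest` used above are not shown here. As a rough idea of what a library-free event loop involves, here is an independent sketch built on the standard `selectors` module, which uses epoll on Linux. It is not the code this answer refers to: `fetch_all` and its callback signature are assumptions made for illustration, it targets Python 3, and it only issues `GET /` over plain HTTP/1.0.

```python
import selectors
import socket


def fetch_all(targets, callback, timeout=10.0):
    """Send "GET /" to every (host, port) in targets from a single thread.

    callback(host, data) runs once per target; data is the raw response
    bytes (headers included), or None if the request failed or timed out.
    """
    selector = selectors.DefaultSelector()  # epoll-backed on Linux
    for host, port in targets:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setblocking(False)
        # Returns at once; the socket turns writable when the connect is done.
        # Resolving a host name at this point still blocks.
        sock.connect_ex((host, port))
        request = ("GET / HTTP/1.0\r\nHost: %s\r\n\r\n" % host).encode("ascii")
        state = {"host": host, "pending": request, "chunks": []}
        selector.register(sock, selectors.EVENT_WRITE, state)

    def finish(sock, state, data):
        selector.unregister(sock)
        sock.close()
        callback(state["host"], data)

    while selector.get_map():
        events = selector.select(timeout)
        if not events:
            # Nothing became ready in time: give up on whatever is left.
            for key in list(selector.get_map().values()):
                finish(key.fileobj, key.data, None)
            break
        for key, mask in events:
            sock, state = key.fileobj, key.data
            done, data = False, None
            try:
                if mask & selectors.EVENT_WRITE:
                    sent = sock.send(state["pending"])
                    state["pending"] = state["pending"][sent:]
                    if not state["pending"]:
                        # Request fully sent: now wait for the response.
                        selector.modify(sock, selectors.EVENT_READ, state)
                else:
                    chunk = sock.recv(65536)
                    if chunk:
                        state["chunks"].append(chunk)
                    else:
                        # Empty read: the server closed the connection,
                        # which is how an HTTP/1.0 response ends.
                        done, data = True, b"".join(state["chunks"])
            except OSError:
                done = True  # connection refused, reset, and similar
            if done:
                finish(sock, state, data)
    selector.close()
```

All sockets are registered before the loop starts, so the requests overlap in the same way the Node.js version does, with one callback call per target.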

Source

2015-06-10 11:11:27
