Speeding up HTTP requests in Python and 500 errors

I have code that retrieves news results from this newspaper using a query and a time frame (which can be up to a year long).

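Results are paginated at a maximum of 10 articles per page, and since I couldn't find a way to increase that, I issue a request for each page and then retrieve the title, URL and date of each article. Each cycle (HTTP request plus parsing) takes 30 seconds to a minute, which is extremely slow, and eventually it stops with a response code of 500. I'd like to know whether there is a way to speed this up, or possibly to make several requests at once. I simply want to retrieve the article details from all the pages. Here is the code: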

    import csv
    import re

    import requests
    from bs4 import BeautifulSoup

    URL = 'http://www.gulf-times.com/AdvanceSearchNews.aspx?Pageindex={index}&keywordtitle={query}&keywordbrief={query}&keywordbody={query}&category=&timeframe=&datefrom={datefrom}&dateTo={dateto}&isTimeFrame=0'


    def run(**params):
        # Append to the CSV; the with-block closes the file even if a request fails.
        with open("EgyptDaybyDay.csv", "a", newline="") as countryFile:
            # Create the writer once instead of once per article.
            w = csv.writer(countryFile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
            i = 1
            while True:
                params["index"] = str(i)
                response = requests.get(URL.format(**params))
                print(response.status_code)
                htmlFile = BeautifulSoup(response.content, "html.parser")
                articles = htmlFile.find_all("div", {"class": "newslist"})
                if not articles:
                    break  # an empty page means we are past the last result

                for article in articles:
                    url = article.a['href']
                    title = article.img['alt']
                    dateline = article.find("div", {"class": "floatright"})
                    m = re.search(r"([0-9]{2}-[0-9]{2}-[0-9]{4})", dateline.get_text())
                    date = m.group(1)
                    w.writerow((date, title, url))
                i += 1


    run(query="Egypt", datefrom="12-01-2010", dateto="12-01-2011")

Source

2013-03-22 Jiyda Moussa

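This is a good opportunity to try gevent.

You should have a separate routine for the requests.get part, so that your application doesn't have to wait on blocking IO.

You can then spawn multiple workers and use queues to pass requests and articles between them. Maybe something similar to this: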

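import gevent.monkey
gevent.monkey.patch_all()  # patch the standard library before anything else is imported

import gevent
from gevent import sleep
from gevent.queue import Queue

MAX_REQUESTS = 10

requests = Queue(MAX_REQUESTS)  # page numbers waiting to be fetched (bounded)
articles = Queue()              # results handed back to the consumer

# Stand-in for real HTTP responses: pop() yields 0, 1, ..., 99.
mock_responses = list(range(100))
mock_responses.reverse()

def request():
    print("worker started")
    while True:
        print("request %s" % requests.get())
        sleep(1)  # simulates the network round trip

        try:
            articles.put('article response %s' % mock_responses.pop())
        except IndexError:
            # No responses left: StopIteration ends iteration over the queue.
            articles.put(StopIteration)
            break

def run():
    print("run")

    i = 1
    while True:
        requests.put(i)  # blocks while the queue already holds MAX_REQUESTS items
        i += 1

if __name__ == '__main__':
    for worker in range(MAX_REQUESTS):
        gevent.spawn(request)

    gevent.spawn(run)
    for article in articles:
        print("Got article: %s" % article)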
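
For comparison, here is a rough sketch of the same idea applied to paginated fetching, using the standard library's thread pool instead of gevent (a swapped-in technique, not what the code above does). The names `fetch_page`, `fetch_with_retry` and `fetch_all` are made up for illustration: `fetch_page(index)` is assumed to return the list of articles on one page and to raise an exception on a failed request. The retry only helps if the 500 responses are transient, which is an assumption, since the cause of the 500s is not known.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_retry(fetch_page, index, retries=3, delay=1.0):
    """Call the (hypothetical) fetch_page(index), retrying if it raises."""
    for attempt in range(retries):
        try:
            return fetch_page(index)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay * (attempt + 1))  # wait a little longer each time


def fetch_all(fetch_page, batch_size=10, retries=3, delay=1.0):
    """Fetch pages 1, 2, 3, ... in concurrent batches until a page is empty."""
    results = []
    first = 1
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        while True:
            indexes = range(first, first + batch_size)
            # pool.map returns results in page order, not completion order.
            pages = list(pool.map(
                lambda i: fetch_with_retry(fetch_page, i, retries, delay),
                indexes))
            for page in pages:
                if not page:
                    return results  # the first empty page marks the end
                results.extend(page)
            first += batch_size
```

In the question's setting, `fetch_page` would wrap the `requests.get` call and the BeautifulSoup parsing for a single page index, and could call `response.raise_for_status()` so that a 500 triggers the retry. Fetching in batches means up to `batch_size - 1` requests past the last page are wasted, which seems a fair price for not needing to know the page count in advance.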

Source

2013-03-24 19:04:30 baloo

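You can also do this with Twisted Python and a list of deferred events – 2013-03-24 19:23:03

I realize now that the iteration might stop before an actual article has been found. But you get the idea – baloo 2013-03-24 19:55:28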
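
One possible way around the early-stop problem mentioned in the last comment is to have every worker emit its own end marker and let the consumer stop only after it has seen one marker per worker. The sketch below shows that idea with the standard library's `threading` and `queue` modules rather than gevent; `DONE`, `worker` and `collect` are illustrative names, and the "article response" strings are mock data in the spirit of the answer's example.

```python
import queue
import threading

DONE = object()  # sentinel that each worker emits exactly once when it finishes


def worker(tasks, results):
    while True:
        item = tasks.get()
        if item is None:  # stop signal: no more work for this worker
            results.put(DONE)
            return
        results.put("article response %s" % item)  # mock result


def collect(items, num_workers=4):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)  # one stop signal per worker

    collected, finished = [], 0
    while finished < num_workers:  # stop only after every worker reported DONE
        result = results.get()
        if result is DONE:
            finished += 1
        else:
            collected.append(result)
    for t in threads:
        t.join()
    return collected
```

Because each worker queues all of its results before its own `DONE`, nothing can still be in flight once the consumer has counted `num_workers` markers. The results arrive in completion order, so sort them afterwards if page order matters.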