Fetching a lot of URLs in Python with Google App Engine

In my subclass of RequestHandler, I am trying to fetch a range of URLs:

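import urllib2
import webapp2

class GetStats(webapp2.RequestHandler):
    def post(self):
        heap = []  # parsed results, one entry per page
        lastpage = 50
        # range(1, lastpage) covers pages 1..49
        for page in range(1, lastpage):
            tmpurl = url + str(page)  # url: base URL, defined elsewhere
            response = urllib2.urlopen(tmpurl, timeout=5)
            html = response.read()
            # some parsing html
            heap.append(result_of_parsing)

        self.response.write(heap)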

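This works with about 30 URLs (the page takes a long time to load, but it works). With more than 30, I get an error:

Error: Server Error

The server encountered an error and could not complete your request.

Please try again in 30 seconds.

Is there any way to fetch a lot of URLs? Perhaps a more efficient approach? Up to several hundred pages?

Update:

I use BeautifulSoup to parse every page. I found this traceback in the GAE logs: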

Traceback (most recent call last): 
    File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 267, in Handle 
result = handler(dict(self._environ), self._StartResponse) 
    File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__ 
rv = self.router.dispatch(request, response) 
    File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher 
return route.handler_adapter(request, response) 
    File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__ 
return handler.dispatch() 
    File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch 
return method(*args, **kwargs) 
    File "/base/data/home/apps/s~gae/1.379703839015039430/main.py", line 68, in post 
heap = get_times(tmp_url, 160) 
    File "/base/data/home/apps/s~gae/1.379703839015039430/main.py", line 106, in get_times 
soup = BeautifulSoup(html) 
    File "libs/bs4/__init__.py", line 168, in __init__ 
self._feed() 
    File "libs/bs4/__init__.py", line 181, in _feed 
self.builder.feed(self.markup) 
    File "libs/bs4/builder/_htmlparser.py", line 56, in feed 
super(HTMLParserTreeBuilder, self).feed(markup) 
    File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/HTMLParser.py", line 114, in feed 
self.goahead(0) 
    File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/HTMLParser.py", line 155, in goahead 
startswith = rawdata.startswith 
DeadlineExceededError


2014-10-28 xiº

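Are all the requests made to the same server/domain? – jDourlens 2014-10-28 13:59:26

@jDourlens Yes. – 2014-10-28 14:01:25

Do all the requests complete within 60 seconds? You only have 60 seconds to return from a request. Try putting this in a task or similar. – 2014-10-28 14:02:11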

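It's failing because you only have 60 seconds to return a response to the user, and I'm going to guess it's taking longer than that.

You will want to use this: https://cloud.google.com/appengine/articles/deferred

to create a task that has a 10-minute timeout. Then you can return to the user immediately, and they can "pick up" the results later via another handler (which you create). If collecting all the URLs takes longer than 10 minutes, you will need to split them up into more tasks.

See this: https://cloud.google.com/appengine/articles/deadlineexceedederrors

to understand why you cannot go longer than 60 seconds.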
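As a rough illustration of the "split them up into more tasks" idea, the sketch below divides a page range into batches and hands each batch to a queueing function. The names `split_pages`, `enqueue_batches`, and `fetch_batch` are hypothetical and not from the question. On App Engine the `enqueue` argument would presumably be `deferred.defer` from `google.appengine.ext.deferred`; it is passed in as a parameter here so the sketch runs without the SDK.

```python
def split_pages(first_page, last_page, batch_size):
    """Split the inclusive range first_page..last_page into (start, stop) batches."""
    batches = []
    start = first_page
    while start <= last_page:
        stop = min(start + batch_size - 1, last_page)
        batches.append((start, stop))
        start = stop + 1
    return batches


def enqueue_batches(enqueue, worker, first_page, last_page, batch_size=20):
    """Hand one task per batch to `enqueue` and return how many were queued.

    Assumption: on App Engine, `enqueue` would be deferred.defer, which
    accepts a callable followed by its arguments.
    """
    batches = split_pages(first_page, last_page, batch_size)
    for start, stop in batches:
        enqueue(worker, start, stop)
    return len(batches)


def fetch_batch(start, stop):
    """Hypothetical worker: fetch and parse pages start..stop, then store results."""
    # Real code would call urllib2.urlopen(...) and BeautifulSoup here.
    return list(range(start, stop + 1))


if __name__ == "__main__":
    queued = []
    # Stand-in for deferred.defer: just record each call.
    count = enqueue_batches(
        lambda fn, a, b: queued.append((fn.__name__, a, b)),
        fetch_batch, 1, 160, batch_size=20)
    print(count, queued[0], queued[-1])
```

Each batch should be small enough to finish well inside the task deadline. The batch size of 20 is only a guess based on the roughly 30 pages per minute reported in the question.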


2014-10-28 14:34:51

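How can I detect the moment when the work is finished, so that I can show the results to the user? – 2014-10-28 17:57:59

On the initial request, send them to a URL where you will later make the results available (e.g. /result/job1001). Then write some JavaScript that refreshes that page from the client every 10 seconds. Then, when the results become available, they will be shown to the user within 10 seconds. There are other options, of course. – 2014-10-28 19:12:52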
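The server side of that polling scheme could look roughly like the sketch below. It is only an illustration: `start_job`, `finish_job`, and `poll_job` are hypothetical names, and the module-level dictionary is a stand-in for real shared storage (on App Engine, state kept in one instance's memory is not visible to other instances, so the datastore or memcache would be needed instead).

```python
import json

# In-memory stand-in for shared storage (datastore, memcache, ...).
# Assumption for illustration only: real App Engine code must not rely on this.
_jobs = {}


def start_job(job_id):
    """Register a job as pending before the background task starts."""
    _jobs[job_id] = {"status": "pending", "results": None}


def finish_job(job_id, results):
    """Called by the background task once all pages are parsed."""
    _jobs[job_id] = {"status": "done", "results": results}


def poll_job(job_id):
    """Return (http_status, json_body) for a request to /result/<job_id>."""
    job = _jobs.get(job_id)
    if job is None:
        return 404, json.dumps({"status": "unknown"})
    if job["status"] != "done":
        # 202 Accepted: the client should ask again later.
        return 202, json.dumps({"status": "pending"})
    return 200, json.dumps({"status": "done", "results": job["results"]})
```

The handler behind /result/job1001 would call `poll_job("job1001")` and write the status and body to the response. The client-side script keeps refreshing until it sees `"done"`.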

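Edit: This probably comes from App Engine quotas and limits. Sorry for my previous answer:

This looks like protection on the server side to prevent DDoS or scraping from a single client. You have a few options:

Wait after a certain number of queries before continuing.
Make the requests from several clients that have different IP addresses and send the information back to your main script (renting different servers for this may be expensive).
You could also check whether the website has an API that gives access to the data you need.

You should also be careful, as the site owner could block or blacklist your IP if he decides your requests are not welcome.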
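The first option, pausing after a certain number of queries, might be sketched as follows. `fetch_throttled` is a hypothetical helper; the `fetch` and `sleep` callables are parameters so that the sketch does not depend on any particular HTTP library, and the batch size and pause length are arbitrary assumptions.

```python
import time


def fetch_throttled(urls, fetch, batch_size=10, pause_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing after every `batch_size` requests.

    `urls` must be a list; `fetch` is any callable taking a URL and
    returning the page content.
    """
    results = []
    for index, url in enumerate(urls, start=1):
        results.append(fetch(url))
        # Pause between batches, but not after the final request.
        if index % batch_size == 0 and index < len(urls):
            sleep(pause_seconds)
    return results
```

Note that the pauses count toward the request deadline as well, so throttling like this fits better in a background task than in a handler that has to answer the user directly.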


2014-10-28 14:09:30 jDourlens
