Fetching a large number of URLs on Google App Engine: in my Python RequestHandler subclass I am trying to fetch a range of URLs:
import urllib2

import webapp2

class GetStats(webapp2.RequestHandler):
    def post(self):
        # url is defined elsewhere in the application (elided from the question)
        lastpage = 50
        heap = []
        for page in range(1, lastpage):
            tmpurl = url + str(page)
            response = urllib2.urlopen(tmpurl, timeout=5)
            html = response.read()
            # ... some parsing of html into result_of_parsing ...
            heap.append(result_of_parsing)
        self.response.write(heap)
But it works with roughly 30 URLs (the page takes a long time to load, but it works). With more than 30 I get an error:
Error: Server Error
The server encountered an error and could not complete your request.
Please try again in 30 seconds.
Is there any way to fetch a large number of URLs, say up to several hundred pages? Perhaps something more optimized, or is this approach simply unsuitable?
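One common approach on App Engine (not taken from the question itself) is to issue the fetches in parallel with the asynchronous urlfetch API, so the total wall-clock time is roughly that of the slowest single fetch rather than the sum of all of them. A minimal sketch, where base_url and the 200-only status check are illustrative assumptions:

from google.appengine.api import urlfetch

def fetch_all(base_url, lastpage):
    # Kick off every fetch at once; each RPC carries its own deadline.
    rpcs = []
    for page in range(1, lastpage):
        rpc = urlfetch.create_rpc(deadline=10)
        urlfetch.make_fetch_call(rpc, base_url + str(page))
        rpcs.append(rpc)
    # Block on the results; total time is roughly one slow round-trip.
    pages = []
    for rpc in rpcs:
        result = rpc.get_result()
        if result.status_code == 200:
            pages.append(result.content)
    return pages

Note that this still has to finish within the overall request deadline, so for hundreds of pages it is best combined with background processing (see the task-queue sketch after the comments below).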
Update:
I am using BeautifulSoup to parse each page. I found this traceback in the GAE logs:
Traceback (most recent call last):
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 267, in Handle
result = handler(dict(self._environ), self._StartResponse)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1529, in __call__
rv = self.router.dispatch(request, response)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1278, in default_dispatcher
return route.handler_adapter(request, response)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 1102, in __call__
return handler.dispatch()
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/webapp2-2.5.2/webapp2.py", line 570, in dispatch
return method(*args, **kwargs)
File "/base/data/home/apps/s~gae/1.379703839015039430/main.py", line 68, in post
heap = get_times(tmp_url, 160)
File "/base/data/home/apps/s~gae/1.379703839015039430/main.py", line 106, in get_times
soup = BeautifulSoup(html)
File "libs/bs4/__init__.py", line 168, in __init__
self._feed()
File "libs/bs4/__init__.py", line 181, in _feed
self.builder.feed(self.markup)
File "libs/bs4/builder/_htmlparser.py", line 56, in feed
super(HTMLParserTreeBuilder, self).feed(markup)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/HTMLParser.py", line 114, in feed
self.goahead(0)
File "/base/data/home/runtimes/python27/python27_dist/lib/python2.7/HTMLParser.py", line 155, in goahead
startswith = rawdata.startswith
DeadlineExceededError
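For context, DeadlineExceededError (from google.appengine.runtime) is raised when a frontend request runs past App Engine's 60-second response deadline, which is why the failure surfaces mid-parse inside BeautifulSoup rather than in urlopen. One mitigation is to catch it and return whatever was collected so far; a minimal sketch, where parse_page is a hypothetical per-page helper and webapp2, url, and lastpage are as in the question's snippet:

from google.appengine.runtime import DeadlineExceededError

class GetStats(webapp2.RequestHandler):
    def post(self):
        heap = []
        try:
            for page in range(1, lastpage):
                heap.append(parse_page(url + str(page)))  # hypothetical helper
        except DeadlineExceededError:
            # The 60-second request deadline fired; return partial results.
            self.response.write('partial results: %r' % heap)
            return
        self.response.write(heap)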
Are all the requests going to the same server/domain? – jDourlens 2014-10-28 13:59:26
@jDourlens Yes. – 2014-10-28 14:01:25
Do all the requests finish within 60 seconds? You only have 60 seconds to return a response. Try moving this into a task or something similar. – 2014-10-28 14:02:11
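The task-queue suggestion in the last comment would look roughly like this: the user-facing handler only enqueues the job, and a worker handler, which on automatic scaling gets a 10-minute deadline instead of 60 seconds, does the actual fetching. The routes, parameter names, and base URL below are illustrative assumptions:

import urllib2

import webapp2
from google.appengine.api import taskqueue

URL = 'http://example.com/page/'  # stand-in for the question's base url

class GetStats(webapp2.RequestHandler):
    def post(self):
        # Enqueue the slow scraping job and respond immediately.
        taskqueue.add(url='/worker', params={'lastpage': '300'})
        self.response.write('job queued')

class Worker(webapp2.RequestHandler):
    def post(self):
        # Push-task handlers get up to 10 minutes of processing time.
        lastpage = int(self.request.get('lastpage'))
        for page in range(1, lastpage):
            html = urllib2.urlopen(URL + str(page), timeout=5).read()
            # ... parse html and store the results (e.g. in the datastore) ...

app = webapp2.WSGIApplication([('/stats', GetStats), ('/worker', Worker)])

Since the task runs after the original request has returned, the results have to be stored somewhere (datastore, memcache) and fetched by a follow-up request rather than written directly into the response.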