2014-12-30 86 views

I used both F12 (Chrome DevTools) and Postman to inspect the request, but my Scrapy spider gets a 403 when it POSTs for the same data on this site. Log in at

http://www.zhihu.com/

(login details: email jianguo.bai@hirebigdata.cn, password wsc111111), then go to

http://www.zhihu.com/people/hynuza/columns/followed

I want to collect all the columns Hynuza follows, currently 105 of them. When the page first opens only 20 are shown, and I have to scroll down to load more. Each time I scroll down, the request looks like this:
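Since each scroll loads 20 more entries, covering all 105 columns means repeating this POST with the offset stepping by 20 (a quick sketch of the arithmetic, assuming the endpoint keeps accepting larger offsets until the list is exhausted):

```python
# Offsets needed to page through 105 entries, 20 at a time.
total_columns = 105
page_size = 20
offsets = list(range(0, total_columns, page_size))
print(offsets)  # → [0, 20, 40, 60, 80, 100]
```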

Remote Address:60.28.215.70:80 
Request URL:http://www.zhihu.com/node/ProfileFollowedColumnsListV2 
Request Method:POST 
Status Code:200 OK 
Request Headersview source 
Accept:*/* 
Accept-Encoding:gzip,deflate 
Accept-Language:en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4 
Connection:keep-alive 
Content-Length:157 
Content-Type:application/x-www-form-urlencoded; charset=UTF-8 
Cookie:_xsrf=f1460d2580fbf34ccd508eb4489f1097; q_c1=867d4a58013241b7b5f15b09bbe7dc79|1419217763000|1413335199000; c_c=2a45b1cc8f3311e4bc0e52540a3121f7; q_c0="MTE2NmYwYWFlNmRmY2NmM2Q4OWFkNmUwNjU4MDQ1OTN8WXdNUkVxRDVCMVJaODNpOQ==|1419906156|cb0859ab55258de9ea95332f5ac02717fcf224ea"; __utma=51854390.1575195116.1419486667.1419902703.1419905647.11; __utmb=51854390.7.10.1419905647; __utmc=51854390; __utmz=51854390.1419905647.11.9.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/people/hynuza/columns/followed; __utmv=51854390.100--|2=registration_date=20141222=1^3=entry_date=20141015=1 
Host:www.zhihu.com 
Origin:http://www.zhihu.com 
Referer:http://www.zhihu.com/people/hynuza/columns/followed 
User-Agent:Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36 
X-Requested-With:XMLHttpRequest 
Form Dataview sourceview URL encoded 
method:next 
params:{"offset":20,"limit":20,"hash_id":"18c79c6cc76ce8db8518367b46353a54"} 
_xsrf:f1460d2580fbf34ccd508eb4489f1097 
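One detail worth noting in this capture: params is sent as a JSON string inside an ordinary urlencoded form, not as nested form fields. A minimal sketch of building the same body in Python (token and hash_id copied from the capture above):

```python
import json

try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

# The nested dict must be JSON-serialized first; urlencoding a raw dict
# would send its Python repr instead.  separators=(",", ":") matches the
# compact form seen in the captured request.
params = json.dumps(
    {"offset": 20, "limit": 20, "hash_id": "18c79c6cc76ce8db8518367b46353a54"},
    separators=(",", ":"),
)
body = urlencode([
    ("method", "next"),
    ("params", params),
    ("_xsrf", "f1460d2580fbf34ccd508eb4489f1097"),
])
```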

Then I used Postman to simulate the request like this:

(screenshot: the request recreated in Postman, returning the expected data)

As you can see, it returned exactly what I wanted, even after I had logged out of the site.

Based on all this, I wrote my spider like this:

# -*- coding: utf-8 -*-
import scrapy
import urllib
from scrapy.http import Request


class PostSpider(scrapy.Spider):
    name = "post"
    allowed_domains = ["zhihu.com"]
    start_urls = (
        'http://www.zhihu.com',
    )

    def __init__(self):
        super(PostSpider, self).__init__()

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'email': 'jianguo.bai@hirebigdata.cn', 'password': 'wsc111111'},
            callback=self.login,
        )

    def login(self, response):
        yield Request("http://www.zhihu.com/people/hynuza/columns/followed",
                      callback=self.parse_followed_columns)

    def parse_followed_columns(self, response):
        # here deal with the first 20 divs
        params = {"offset": "20", "limit": "20", "hash_id": "18c79c6cc76ce8db8518367b46353a54"}
        method = 'next'
        _xsrf = 'f1460d2580fbf34ccd508eb4489f1097'
        data = {
            'params': params,
            'method': method,
            '_xsrf': _xsrf,
        }
        r = Request(
            "http://www.zhihu.com/node/ProfileFollowedColumnsListV2",
            method='POST',
            body=urllib.urlencode(data),
            headers={
                'Accept': '*/*',
                'Accept-Encoding': 'gzip,deflate',
                'Accept-Language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
                'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
                'Cache-Control': 'no-cache',
                'Cookie': '_xsrf=f1460d2580fbf34ccd508eb4489f1097; '
                          'c_c=2a45b1cc8f3311e4bc0e52540a3121f7; '
                          '__utmt=1; '
                          '__utma=51854390.1575195116.1419486667.1419855627.1419902703.10; '
                          '__utmb=51854390.2.10.1419902703; '
                          '__utmc=51854390; '
                          '__utmz=51854390.1419855627.9.8.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/;'
                          '__utmv=51854390.100--|2=registration_date=20141222=1^3=entry_date=20141015=1;',
                'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) '
                              'Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36',
                'host': 'www.zhihu.com',
                'Origin': 'http://www.zhihu.com',
                'Connection': 'keep-alive',
                'X-Requested-With': 'XMLHttpRequest',
            },
            callback=self.parse_more)
        r.headers['Cookie'] += response.request.headers['Cookie']
        print r.headers
        yield r
        print "after"

    def parse_more(self, response):
        # here is where I want to get the returned divs
        print response.url
        followers = response.xpath("//div[@class='zm-profile-card "
                                   "zm-profile-section-item zg-clear no-hovercard']")
        print len(followers)

Then I got a 403 like this:

2014-12-30 10:34:18+0800 [post] DEBUG: Crawled (403) <POST http://www.zhihu.com/node/ProfileFollowedColumnsListV2> (referer: http://www.zhihu.com/people/hynuza/columns/followed) 
2014-12-30 10:34:18+0800 [post] DEBUG: Ignoring response <403 http://www.zhihu.com/node/ProfileFollowedColumnsListV2>: HTTP status code is not handled or not allowed 

So it never reaches parse_more.
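As a side note, that "Ignoring response" line is Scrapy's default handling of non-2xx statuses. While debugging, the 403 can be whitelisted on the spider so the response body at least reaches the callback (handle_httpstatus_list is a standard Scrapy spider attribute; this fragment only illustrates where it goes):

```python
import scrapy


class PostSpider(scrapy.Spider):
    name = "post"
    # Let 403 responses reach the callbacks instead of being dropped,
    # so their bodies can be inspected while debugging.
    handle_httpstatus_list = [403]
```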

I have been at this for two days with nothing to show for it; any help or advice would be appreciated.


I don't think you should post your credentials here. – NaingLinAung


@NaingLinAung It's fine, this account is just for testing. Using this test account will save you some time. – shellbye

Answer


The login sequence is correct. However, the parse_followed_columns() method completely breaks the session.

You cannot hardcode those values. You should find a way to read this information directly from the HTML content of the previous page and inject the values of data['_xsrf'] and params['hash_id'] dynamically, instead of using hardcoded ones.

Also, I suggest you drop the headers parameter from this request; it can only cause trouble.
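A sketch of that dynamic extraction, assuming the _xsrf token sits in a hidden form input and the hash_id appears in HTML-escaped JSON embedded in the page (both regexes are guesses about zhihu's markup at the time, not verified):

```python
import re


def extract_tokens(html):
    """Pull _xsrf and hash_id out of the profile page's raw HTML.

    The two patterns are assumptions about the markup, not verified:
    _xsrf in a hidden <input>, hash_id in HTML-escaped JSON config.
    """
    xsrf = re.search(r'name="_xsrf" value="([^"]+)"', html)
    hash_id = re.search(r'hash_id&quot;:&quot;([0-9a-f]+)&quot;', html)
    return (
        xsrf.group(1) if xsrf else None,
        hash_id.group(1) if hash_id else None,
    )
```

In the spider this would run inside parse_followed_columns on the response body, and the extracted values would replace the hardcoded ones; it is also simpler to build the POST with scrapy.FormRequest and let the built-in cookie middleware carry the session instead of copying the Cookie header by hand.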


I tried what you said; '_xsrf' and 'hash_id' are in fact read from the previous page, and I only hardcoded them here for simplicity. – shellbye