2017-02-06

I'm practicing web crawling with Python 3. jsessionid interferes with crawling

I am crawling this site:

http://www.keri.org/web/www/research_0201?p_p_id=EXT_BBS&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=1&_EXT_BBS_struts_action=%2Fext%2Fbbs%2Fview&_EXT_BBS_sCategory=&_EXT_BBS_sKeyType=&_EXT_BBS_sKeyword=&_EXT_BBS_curPage=1&_EXT_BBS_optKeyType1=&_EXT_BBS_optKeyType2=&_EXT_BBS_optKeyword1=&_EXT_BBS_optKeyword2=&_EXT_BBS_sLayoutId=0 

I want to extract the PDF addresses from the HTML code.

For example, in the HTML the PDF download URL is:

http://www.keri.org/web/www/research_0201?p_p_id=EXT_BBS&p_p_lifecycle=1&p_p_state=exclusive&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=1&_EXT_BBS_struts_action=%2Fext%2Fbbs%2Fget_file&_EXT_BBS_extFileId=5326 

However, my crawler returns:

http://www.keri.org/web/www/research_0201**;jsessionid=3875698676A3025D8877C4EEBA67D6DF**p_p_id=EXT_BBS&p_p_lifecycle=1&p_p_state=exclusive&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=1&_EXT_BBS_struts_action=%2Fext%2Fbbs%2Fget_file&_EXT_BBS_extFileId=5306 

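I can't download the file from that address either.

Where does the jsessionid come from?

I can remove it, but I don't know why it appears.

** Why is the URL so long? lol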

Answers


I tested this in my own code, and the jsessionid does not affect downloading the files:

import requests, bs4

# `headers` was undefined in the original snippet; this browser-like value is an assumption.
headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get('http://www.keri.org/web/www/research_0201?p_p_id=EXT_BBS&p_p_lifecycle=0&p_p_state=normal&p_p_mode=view&p_p_col_id=column-1&p_p_col_count=1&_EXT_BBS_struts_action=%2Fext%2Fbbs%2Fview&_EXT_BBS_sCategory=&_EXT_BBS_sKeyType=&_EXT_BBS_sKeyword=&_EXT_BBS_curPage=1&_EXT_BBS_optKeyType1=&_EXT_BBS_optKeyType2=&_EXT_BBS_optKeyword1=&_EXT_BBS_optKeyword2=&_EXT_BBS_sLayoutId=0', headers=headers)
soup = bs4.BeautifulSoup(r.text, 'lxml')
# Pair each download link with the text of the <a> that precedes it (the post title).
down_links = [(a.get('href'), a.find_previous('a').text) for a in soup('a', class_="download")]
for link, title in down_links:
    filename = title + '.pdf'
    # Stream the response so large PDFs are written to disk in chunks.
    r = requests.get(link, stream=True, headers=headers)
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            f.write(chunk)
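As for where the jsessionid comes from: the URL parameters suggest a Java-based portal, and Java servlet containers commonly rewrite links to carry the session ID as a `;jsessionid=...` path parameter when the client has not sent back a session cookie. That is most likely what happens here, though it has not been verified against this server. If a clean URL is wanted, the segment can be stripped before use. A minimal sketch follows; the helper name `strip_jsessionid` is hypothetical, and the sample URL assumes the usual servlet form in which the session ID sits just before the `?`:

```python
import re

# Matches a ";jsessionid=..." path parameter up to the next "?", "#", ";" or "/".
_JSESSIONID = re.compile(r';jsessionid=[^?#;/]*', re.IGNORECASE)

def strip_jsessionid(url):
    """Return url with any ';jsessionid=...' path parameter removed."""
    return _JSESSIONID.sub('', url)

# Illustrative URL (assumed form: session id placed before the query string).
crawled = ('http://www.keri.org/web/www/research_0201'
           ';jsessionid=3875698676A3025D8877C4EEBA67D6DF'
           '?p_p_id=EXT_BBS&_EXT_BBS_extFileId=5306')
print(strip_jsessionid(crawled))
# prints http://www.keri.org/web/www/research_0201?p_p_id=EXT_BBS&_EXT_BBS_extFileId=5306
```

URLs without a jsessionid pass through unchanged, so the helper could be applied to every `href` collected in `down_links`.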

Really? I entered the download URL directly into my web browser, but I couldn't download the file... Anyway, thanks~! – StackQ
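When a download URL behaves differently in the browser and in a script, it may help to check whether the server actually returned a PDF rather than an HTML error or login page. The sketch below is one possible check, not something from the thread: the helper name `looks_like_pdf` is hypothetical, and it relies only on the fact that PDF files begin with the bytes `%PDF-` and are usually served as `application/pdf`.

```python
def looks_like_pdf(content_type, head):
    """Return True if the leading bytes or the Content-Type suggest a PDF.

    content_type: value of the Content-Type header (may be None).
    head: the first bytes of the response body.
    """
    # The file signature is the stronger signal, so check it first.
    if head.startswith(b'%PDF-'):
        return True
    return 'application/pdf' in (content_type or '').lower()

# An HTML page served in place of the file is rejected.
print(looks_like_pdf('text/html;charset=UTF-8', b'<!DOCTYPE html>'))  # prints False
```

With requests, the arguments could be `r.headers.get('Content-Type')` and the first chunk yielded by `r.iter_content()`, checked before anything is written to disk.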