2014-10-06 38 views

How to trigger the download of a .txt file after logging in to a website, using Python

Code for logging in:

import requests 
from bs4 import BeautifulSoup as bs 
import urllib 

payload = {  
    'email' : '[email protected]', 
    'password' : 'xxx' 
} 

with requests.Session() as s: 
    m = s.get('https://www.free-ebooks.net',headers={'User-agent': 'Mozilla/5.0'}) 
    t = s.post('https://www.free-ebooks.net',data = payload) 
    r = s.get('https://www.free-ebooks.net/ebook/The-Best-Scandal-Ever') 
    print r.content 

Judging by the output of print r.content, I think my login is successful.
Code to trigger the download:

<<<code same as above>>> 
with requests.Session() as s: 
    m = s.get('https://www.free-ebooks.net',headers={'User-agent': 'Mozilla/5.0'}) 
    t = s.post('https://www.free-ebooks.net',data = payload) 
    r = s.get('https://www.free-ebooks.net/ebook/The-Best-Scandal-Ever') 
    urllib.urlretrieve("https://www.free-ebooks.net/ebook/The-Best-Scandal-Ever/txt", "myfile007.pdf") 

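In my output PDF I get the page's HTML source instead of the actual content of the file.
I have a feeling that I should be using the session I have already started, but I don't know how to implement that.
Anyone?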


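How did you confirm that the login was successful? Can you see a session ID in 's.cookies'? Adding to @falsetru's answer: the actual URL that triggers the text download is '../The-Best-Scandal-Ever/txt?dl'. '..The-Best-Scandal-Ever/txt' just opens a web page that internally triggers the actual download. – srj 2014-10-07 21:03:52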
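The check suggested in this comment can be sketched roughly as follows. This is only an illustration: the helper names `cookie_names` and `looks_logged_in` are made up here, and the cookie name passed to `looks_logged_in` is an assumption, since the thread does not say which cookie the site sets after login.

```python
def cookie_names(session):
    """Return the sorted names of the cookies stored on a session.

    Works with requests.Session, whose .cookies jar supports .keys().
    """
    return sorted(session.cookies.keys())


def looks_logged_in(session, expected_cookie):
    """Heuristic only: True if a cookie with the given name is present.

    expected_cookie is a hypothetical name; inspect cookie_names() after
    a real login to find out what the site actually sets.
    """
    return expected_cookie in cookie_names(session)


def demo(payload):
    # requests is third-party; it is imported here so the helpers above
    # have no dependencies of their own.
    import requests

    with requests.Session() as s:
        s.get('https://www.free-ebooks.net',
              headers={'User-agent': 'Mozilla/5.0'})
        s.post('https://www.free-ebooks.net', data=payload)
        # Print the cookie names so they can be compared with what a
        # browser shows after a successful login.
        print(cookie_names(s))
```

A cookie being present does not prove the login worked, so it is probably worth also checking that the returned page contains something only a logged-in user would see.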

Answer


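requests and urllib are different libraries. They do not share state (in particular, cookies), so the session you logged in with is not used by urllib.urlretrieve.

Use requests consistently.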

with requests.Session() as s: 
    m = s.get('https://www.free-ebooks.net', headers={'User-agent': 'Mozilla/5.0'}) 
    t = s.post('https://www.free-ebooks.net', data=payload) 
    r = s.get('https://www.free-ebooks.net/ebook/The-Best-Scandal-Ever') 
    resp = s.get("https://www.free-ebooks.net/ebook/The-Best-Scandal-Ever/txt",
                 stream=True)
    with open("myfile007.pdf", "wb") as f:
        f.writelines(resp.iter_content())
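A slightly more defensive variant of the download step might look like the sketch below. It is not part of the answer above: `save_download` is a hypothetical helper, and the Content-Type check assumes the server labels HTML pages as text/html. It writes the body in larger chunks (`iter_content()` defaults to a chunk size of 1 byte) and refuses to save the response when it looks like a web page rather than the file.

```python
def save_download(resp, path, chunk_size=8192):
    """Write a streamed response to path and return the bytes written.

    Raises ValueError if the response declares itself as HTML, which
    usually means a web page came back instead of the file.
    """
    content_type = resp.headers.get("Content-Type", "")
    if "text/html" in content_type.lower():
        raise ValueError("got an HTML page, not the file: %s" % content_type)

    written = 0
    with open(path, "wb") as f:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


def demo(payload):
    # requests is third-party; imported here so save_download itself
    # only relies on the response object's headers and iter_content().
    import requests

    base = "https://www.free-ebooks.net"
    with requests.Session() as s:
        s.get(base, headers={"User-agent": "Mozilla/5.0"})
        s.post(base, data=payload)
        resp = s.get(base + "/ebook/The-Best-Scandal-Ever/txt", stream=True)
        save_download(resp, "myfile007.pdf")
```

If `save_download` raises here, the URL is probably returning a page that triggers the download rather than the file itself, and a different URL would need to be requested with the same session.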

No, I tried the code above, and the page source is still being stored in "myfile007". – dreamer 2014-10-06 17:31:54