Downloading web content: from bash/curl to Python


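I wrote a bash script to download the content of certain web pages. To make it work, I have to grab a cookie, then send a request with some special data, and only then can I get the proper links and download their content. My script looks like this: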

#!/bin/bash
for ((i=1;i<=$NB;++i)); do
    # catch a cookie
    cookie=$(curl -sI "http://example.com/index.php" | grep Set-Cookie: | awk '{print $2}' | cut -d ';' -f 1)
    curl -b "$cookie" --data "a_piece_of_data" "http://example.com/index.php"
    curl -b "$cookie" "http://example.com/proper_link_$i" &> "$i.html"
    sleep 3
done

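Since I later need to parse the result, stripping all HTML/XHTML tags (extracting only the plain text) and then converting it to XML, I figured Python and its libs would be perfect for this.
So I'm asking for hints on how to rewrite my script in Python.
Here is what I have come up with so far, but it is still far from my bash example: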

import mechanize 
import urllib2 
import BeautifulSoup 
import lxml 

request = mechanize.Request("http://example.com/index.php") 
response = mechanize.urlopen(request) 
cj = mechanize.CookieJar() 
cj.extract_cookies(response, request) 
print cj

Any help or hints appreciated.

Source

2012-10-06 modzello86

If you are already familiar with *curl*, maybe it would be easier to use the [pycurl](http://pypi.python.org/pypi/pycurl2/7.20.0.a1) module.

I would use the requests library:

import requests 
session = requests.Session()
r = session.get('http://example.com/index.php') 
# session.cookies now contains any relevant cookies which will be 
# used in following requests of the session 
page = session.get('http://example.com/your_other_page')

Then use lxml to parse your HTML:

import lxml.html 
page = lxml.html.fromstring(page.text)
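
The question also asks for stripping the tags and converting the plain text to XML. A minimal sketch of how that might look with lxml follows. The helper names `html_to_text` and `text_to_xml`, the sample markup, and the `page`/`text` element names are all made up for illustration; the real output schema is not given in the question.

```python
import lxml.html
from lxml import etree


def html_to_text(html):
    """Return the text of an HTML document with all tags removed."""
    doc = lxml.html.fromstring(html)
    # Drop script and style elements so their contents do not leak into the text.
    for element in doc.xpath("//script | //style"):
        element.drop_tree()
    # text_content() joins every text node; split() + join() collapses whitespace.
    return " ".join(doc.text_content().split())


def text_to_xml(text, number):
    """Wrap the text in a small XML document (element names are hypothetical)."""
    root = etree.Element("page", number=str(number))
    etree.SubElement(root, "text").text = text
    return etree.tostring(root, encoding="unicode")


# Hypothetical sample page standing in for a downloaded document.
sample = """<html>
<head><style>p {color: red}</style></head>
<body>
<h1>Title</h1>
<p>Some <b>bold</b> text.</p>
</body>
</html>"""

plain = html_to_text(sample)
xml = text_to_xml(plain, 1)
print(xml)
```

In the download loop, `html_to_text` would be fed the body of each fetched page instead of `sample`.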

Source

2012-10-06 11:25:40

Thanks mate, I'll certainly try this, but how do I send the additional data? – modzello86
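
On sending the additional data: with requests, the equivalent of `curl --data` is normally `session.post(url, data=...)`, and the session reuses the cookie it picked up earlier. Below is a rough sketch of the whole bash loop under that assumption. The function name `download_pages`, the `form_data` fields, and the page count are placeholders, not values from the question.

```python
import time


def download_pages(session, base_url, count, form_data, delay=3.0):
    """Mirror the bash loop: pick up a cookie, POST the data, fetch page i."""
    pages = {}
    for i in range(1, count + 1):
        # The session remembers any cookie set by this response.
        session.get(base_url + "/index.php")
        # data= sends the fields as a form-encoded POST body, like curl --data.
        session.post(base_url + "/index.php", data=form_data)
        response = session.get("%s/proper_link_%d" % (base_url, i))
        pages[i] = response.text
        time.sleep(delay)  # be polite to the server, as in the bash script
    return pages


def main():
    import requests  # third-party package: pip install requests

    with requests.Session() as session:
        # {"field": "value"} is a placeholder for the real form data.
        pages = download_pages(session, "http://example.com", 3, {"field": "value"})
    for number, html in pages.items():
        with open("%d.html" % number, "w", encoding="utf-8") as handle:
            handle.write(html)
```

Calling `main()` runs it against the real site. Passing the session in as a parameter keeps the loop easy to try out with a stand-in object before hitting the server.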
