urllib2.urlopen fails while urllib.urlopen works on the same URL

I have a script that uses both urllib and urllib2 to scrape some data from a particular website. urllib2.urlopen fails while urllib.urlopen works on the same URL.

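At the moment the urllib part of the code is used mainly to read and process data, while the urllib2 part is used mainly to read and store data.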

The external website went through some changes, and while the urllib part of the code keeps working, the urllib2 part simply keeled over.

So I did some checking and found that urllib2.urlopen(URL) always returns an empty string, while urllib.urlopen(URL) always works correctly.

I dug further and enabled debug logging for both the urllib and urllib2 modules:

>>> response2 =urllib2.urlopen('http://www.xxxxxxxxltd.com/web/guest/attendancelist') 
send: 'GET /web/guest/attendancelist HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.xxxxxxxxltd.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n' 
reply: 'HTTP/1.1 302 Moved Temporarily\r\n' 
header: Server: nginx/0.7.67 
header: Date: Thu, 28 Nov 2013 19:12:28 GMT 
header: Transfer-Encoding: chunked 
header: Connection: close 
header: Location: http://www.xxxxxxxxplc.com/web/guest/attendancelist 
send: 'GET /web/guest/attendancelist HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.xxxxxxxxplc.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n' 
reply: 'HTTP/1.1 301 Moved Permanently\r\n' 
header: Server: Apache-Coyote/1.1 
header: Location: /home/new/attendancelist.jsp 
header: Content-Length: 0 
header: Date: Thu, 28 Nov 2013 19:12:26 GMT 
header: Connection: close 
send: 'GET /home/new/attendancelist.jsp HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: www.xxxxxxxxplc.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.6\r\n\r\n' 
reply: 'HTTP/1.1 200 OK\r\n' 
header: Server: Apache-Coyote/1.1 
header: Set-Cookie: JSESSIONID=F02B1F76CCCF6F41BE48951F6E1A6205; Path=/home 
header: Content-Type: text/html;charset=utf-8 
header: Content-Length: 0 
header: Date: Thu, 28 Nov 2013 19:12:26 GMT 
header: Connection: close

And....

>>> html3=urllib.urlopen('http://www.xxxxxxxxltd.com/web/guest/attendancelist') 
send: 'GET /web/guest/attendancelist HTTP/1.0\r\nHost: www.xxxxxxxxltd.com\r\nUser-Agent: Python-urllib/1.17\r\n\r\n' 
reply: 'HTTP/1.1 302 Moved Temporarily\r\n' 
header: Server: nginx/0.7.67 
header: Date: Thu, 28 Nov 2013 19:10:36 GMT 
header: Connection: close 
header: Location: http://www.xxxxxxxxplc.com/web/guest/attendancelist 
send: 'GET /web/guest/attendancelist HTTP/1.0\r\nHost: www.xxxxxxxxplc.com\r\nUser-Agent: Python-urllib/1.17\r\n\r\n' 
reply: 'HTTP/1.1 301 Moved Permanently\r\n' 
header: Server: Apache-Coyote/1.1 
header: Location: /home/new/attendancelist.jsp 
header: Content-Length: 0 
header: Date: Thu, 28 Nov 2013 19:10:34 GMT 
header: Connection: close 
send: 'GET /home/new/attendancelist.jsp HTTP/1.0\r\nHost: www.xxxxxxxxplc.com\r\nUser-Agent: Python-urllib/1.17\r\n\r\n' 
reply: 'HTTP/1.1 200 OK\r\n' 
header: Server: Apache-Coyote/1.1 
header: Set-Cookie: JSESSIONID=8CFB903B80C42CA3DA37EDF90D84FF99; Path=/home 
header: Content-Type: text/html;charset=utf-8 
header: Date: Thu, 28 Nov 2013 19:10:35 GMT 
header: Connection: close
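For reference, traces in the `send:` / `reply:` / `header:` format shown above are what the standard library's HTTP handler prints when its debug level is raised. The question does not say how logging was switched on, so the following is only a minimal sketch of one common way to do it. The helper name `build_debug_opener` is hypothetical, and the import falls back to `urllib.request`, which is where the urllib2 API lives on Python 3.

```python
try:
    import urllib.request as urllib2  # Python 3 home of the urllib2 API
except ImportError:
    import urllib2  # Python 2, as used in the question


def build_debug_opener():
    """Return an opener whose plain-HTTP connections print their traffic.

    Hypothetical helper: debuglevel=1 makes the underlying HTTP connection
    print the request it sends and the status line and headers it receives.
    """
    return urllib2.build_opener(urllib2.HTTPHandler(debuglevel=1))
```

Calling `build_debug_opener().open(url)` in place of `urllib2.urlopen(url)` should then print the whole redirect chain, one request and response at a time.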

As can be seen, the urllib2 connection flow sends noticeably more headers (one of which is the Connection header, with the value close).
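To make that comparison concrete, here is a small sketch that parses the first `send:` line from each trace and lists what differs. The two request strings are copied from the logs above. The function names `parse_request` and `diff_requests` are made up for this illustration. It only reports differences and does not show which one the server reacts to.

```python
# First request sent by each module, copied from the debug traces.
URLLIB2_SEND = ("GET /web/guest/attendancelist HTTP/1.1\r\n"
                "Accept-Encoding: identity\r\n"
                "Host: www.xxxxxxxxltd.com\r\n"
                "Connection: close\r\n"
                "User-Agent: Python-urllib/2.6\r\n\r\n")
URLLIB_SEND = ("GET /web/guest/attendancelist HTTP/1.0\r\n"
               "Host: www.xxxxxxxxltd.com\r\n"
               "User-Agent: Python-urllib/1.17\r\n\r\n")


def parse_request(raw):
    """Split a raw HTTP request into (request_line, {header: value})."""
    head = raw.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return lines[0], headers


def diff_requests(a, b):
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    line_a, headers_a = parse_request(a)
    line_b, headers_b = parse_request(b)
    report = {}
    if line_a != line_b:
        report["request-line"] = (line_a, line_b)
    for name in sorted(set(headers_a) | set(headers_b)):
        if headers_a.get(name) != headers_b.get(name):
            report[name] = (headers_a.get(name), headers_b.get(name))
    return report


if __name__ == "__main__":
    for field, (v2, v1) in diff_requests(URLLIB2_SEND, URLLIB_SEND).items():
        print("%-16s urllib2=%r urllib=%r" % (field, v2, v1))
```

Besides the extra Accept-Encoding and Connection headers, the report also flags the request line (HTTP/1.1 versus HTTP/1.0) and the User-Agent string, so the Connection header is only one of several candidates.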

Can anyone help figure out why urllib2 fails to retrieve the data while the urllib module works fine?

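I am fairly sure it has something to do with the Connection header, but I would like some kind of confirmation and an explanation of the thought process.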

Thanks.

Source

2013-11-28 Kris Ogirri

The only difference I see in the logs is the Accept-Encoding header. What content does urllib return? E.g. is it plain HTML or gzipped? – alko

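The real issue is that while urllib returns the actual content of the page (plain text that is fetched and formatted correctly), the urllib2 response returns no data (this is confirmed by the 'Content-Length' value of 0 in the urllib2 response headers). –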

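I would suggest using curl to debug the headers sent by the two versions of urllib. With some trial and error you should be able to find the header that causes the problem and go from there.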
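The same trial-and-error idea can also be tried from Python instead of curl. The sketch below is an assumption-laden illustration, not a verified fix: it replays the final request from the trace with the urllib2 header set, then again with one header dropped at a time, and reports the status and body length for each. The names `header_variants`, `probe` and `run_probes` are hypothetical, and the host is the redacted placeholder from the question, so it has to be replaced with the real one. Note that this only varies headers. It always speaks HTTP/1.1, so it cannot show whether the HTTP/1.0 request line is what matters.

```python
try:
    import http.client as httplib  # Python 3
except ImportError:
    import httplib  # Python 2

# Placeholder values taken from the trace; the host is redacted in the question.
HOST = "www.xxxxxxxxplc.com"
PATH = "/home/new/attendancelist.jsp"

# Headers urllib2 sent for the final request in the trace.
BASE_HEADERS = [
    ("Accept-Encoding", "identity"),
    ("Host", HOST),
    ("Connection", "close"),
    ("User-Agent", "Python-urllib/2.6"),
]


def header_variants(base, keep=("Host",)):
    """Yield (label, headers): the full set, then the set minus one header."""
    yield "all headers", list(base)
    for name, _ in base:
        if name in keep:
            continue  # Host stays in every variant so the request remains valid
        yield "without " + name, [(n, v) for n, v in base if n != name]


def probe(host, path, headers):
    """Send one GET with exactly these headers; return (status, body length)."""
    conn = httplib.HTTPConnection(host, timeout=10)
    try:
        # skip_* stops the library from adding Host/Accept-Encoding itself.
        conn.putrequest("GET", path, skip_host=True, skip_accept_encoding=True)
        for name, value in headers:
            conn.putheader(name, value)
        conn.endheaders()
        response = conn.getresponse()
        return response.status, len(response.read())
    finally:
        conn.close()


def run_probes():
    """Try every variant against HOST and print what came back."""
    for label, headers in header_variants(BASE_HEADERS):
        print(label, probe(HOST, PATH, headers))
```

Calling `run_probes()` after setting `HOST` to the real site would show whether any single header changes the body length from 0. If none does, the protocol version or the session cookie set by the server are the next things to vary.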

Source

2013-11-29 07:05:05 Phil

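Thanks for the info, I will try it. Do you have any links that would help me recreate the requests with curl? I am a little unsure whether we need a command-line tool such as curl ('wget' or something similar), or whether we can use a browser-based solution (e.g. 'Fiddler'). –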
