Mechanize and BeautifulSoup error: httplib.InvalidURL: nonnumeric port: '' (Python)

I am going through a list of URLs and opening them with my script, which uses Mechanize and BeautifulSoup.

However, I get this error:

File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 718, in _set_hostport 
    raise InvalidURL("nonnumeric port: '%s'" % host[i+1:]) 
httplib.InvalidURL: nonnumeric port: ''

It happens at this line of code:

page = mechanize.urlopen(req)

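My code is below. Any insight into what I am doing wrong? Many of the URLs work, and I only get this error message when it hits certain ones, so I am not sure why.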

from mechanize import Browser 
from BeautifulSoup import BeautifulSoup 
import re, os 
import shutil 
import mechanize 
import urllib2 
import sys 
reload(sys) 
sys.setdefaultencoding("utf-8") 

mech = Browser() 
linkfile = open ("links.txt") 
urls = [] 
while 1: 
    url = linkfile.readline() 
    urls.append("%s" % linkfile.readline()) 
    if not url: 
     break 

for url in urls: 
    if "http://" or "https://" not in url: 
     url = "http://" + url 
    elif "..." in url: 
     continue 
    elif ".pdf" in url: 
     #print "this is a pdf -- at some point we should save/log these" 
     continue 
    elif len (url) < 8: 
     continue 
    req = mechanize.Request(url) 
    req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8') 
    req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:17.0) Gecko/20100101 Firefox/17.0') 
    req.add_header('Accept-Language', 'Accept-Language en-US,en;q=0.5') 
    try: 
     page = mechanize.urlopen(req) 
    except urllib2.HTTPError, e: 
     print "there was an error opening the URL, logging it" 
     print e.code 
     logfile = open ("log/urlopenlog.txt", "a") 
     logfile.write(url + "," + "couldn't open this page" + "\n") 
     pass

Source

2013-01-01 user1328021

Show us the URL that fails. – Thomas

http://blog.21ic.com/more.asp?id=27916 – user1328021

Works for me... 'http://blog.21ic.com/more.asp?id=27916', that is. – Thomas

I think this piece of code

if "http://" or "https://" not in url:

is not doing what you want (or what you think it does).

if "http://"

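always evaluates to true, because a non-empty string is truthy and the condition parses as "http://" or ("https://" not in url). As a result the prefix is added to every URL, including ones that already start with http://, and the elif branches never run. You need to rewrite it, for example, as: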

if "https://" not in url and "http://" not in url:
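To make the precedence issue concrete, here is a minimal sketch. It assumes the failing URL from the comments already carried an http:// prefix in links.txt, which the thread does not confirm. It shows that the original condition is always truthy, that prefixing an already-prefixed URL leaves "http:" as the network location (so the port after the colon is an empty string, which would match the '' in the error message), and one possible replacement. The helper name ensure_scheme is hypothetical, and using startswith instead of the `not in` test above is my own choice, not part of the original answer.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2, as used in the question

url = "http://blog.21ic.com/more.asp?id=27916"

# `or` binds looser than `not in`, so the original condition is parsed as
# "http://" or ("https://" not in url). A non-empty string is truthy,
# so the whole expression is truthy for every url.
always_true = bool("http://" or "https://" not in url)

# The question's loop therefore prefixes a URL that already has a scheme.
doubled = "http://" + url
netloc = urlsplit(doubled).netloc        # the part between "//" and the next "/"
host, _, port = netloc.rpartition(":")   # the port is whatever follows the last ":"


def ensure_scheme(url):
    """Hypothetical helper: prepend http:// only when no scheme is present."""
    url = url.strip()  # readline() keeps the trailing newline, so remove it
    if url.startswith(("http://", "https://")):
        return url
    return "http://" + url


print(always_true, repr(netloc), repr(port))
print(ensure_scheme("blog.21ic.com/more.asp?id=27916\n"))
```

Checking the start of the string rather than using `in` avoids treating a URL that merely contains "http://" somewhere in its query string as already prefixed.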

Also, now that I have started testing your snippet:

urls = [] 
while 1: 
    url = linkfile.readline() 
    urls.append("%s" % linkfile.readline()) 
    if not url: 
     break

This actually guarantees that your URL file is read incorrectly: only every second line ends up in the list. You probably want it to read:

urls = [] 
while 1: 
    url = linkfile.readline() 
    if not url: 
     break 
    urls.append("%s" % url)

The reason is that you call linkfile.readline() twice per iteration, which reads two lines and saves only every second one to your list.

Also, you want the if clause before the append, to prevent an empty entry at the end of the list.
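As an alternative to the while loop, iterating over the file object directly reads each line exactly once, so there is no second readline() call to get wrong. The sketch below is only one way to do it: read_urls is a hypothetical helper name, it takes any iterable of lines so it can be tried without a file, and the sample entry example.com is made up for illustration.

```python
def read_urls(lines):
    """Hypothetical helper: return the non-blank lines, stripped of whitespace."""
    urls = []
    for line in lines:        # one iteration per line, so nothing is skipped
        line = line.strip()   # drop the trailing newline each line carries
        if line:              # ignore blank lines, including one at the end
            urls.append(line)
    return urls


# With the question's file this would be used roughly as:
#     with open("links.txt") as linkfile:
#         urls = read_urls(linkfile)
sample = ["http://blog.21ic.com/more.asp?id=27916\n", "\n", "example.com\n"]
print(read_urls(sample))
```

Stripping each line also matters for the later checks in the loop, since a trailing newline would otherwise be sent along as part of the URL.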

That said, your particular example URL works for me. To help further, I would probably need your links file.

Source

2013-01-01 16:22:46 favoretti

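I think you are right, but I am not sure that is what is causing the error... The URLs are prefixed by the time it tries to open them. I added a print statement to make sure. – user1328021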

See my edit. This particular URL works fine for me, so to help you more I would probably need your links file. – favoretti
