2017-02-24 266 views

Python: urllib.error.HTTPError: HTTP Error 404: Not Found

I wrote a script to find typos in the titles of SO questions. I have been using it for about a month, and it was working fine.

But now, when I try to run it, I get this:

Traceback (most recent call last): 
    File "copyeditor.py", line 32, in <module> 
    find_bad_qn(i) 
    File "copyeditor.py", line 15, in find_bad_qn 
    html = urlopen(url) 
    File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen 
    return opener.open(url, data, timeout) 
    File "/usr/lib/python3.4/urllib/request.py", line 469, in open 
    response = meth(req, response) 
    File "/usr/lib/python3.4/urllib/request.py", line 579, in http_response 
    'http', request, response, code, msg, hdrs) 
    File "/usr/lib/python3.4/urllib/request.py", line 507, in error 
    return self._call_chain(*args) 
    File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain 
    result = func(*args) 
    File "/usr/lib/python3.4/urllib/request.py", line 587, in http_error_default 
    raise HTTPError(req.full_url, code, msg, hdrs, fp) 
urllib.error.HTTPError: HTTP Error 404: Not Found 

Here is my code:

import json 
from urllib.request import urlopen 
from bs4 import BeautifulSoup 
from enchant import DictWithPWL 
from enchant.checker import SpellChecker 

my_dict = DictWithPWL("en_US", pwl="terms.dict") 
chkr = SpellChecker(lang=my_dict) 
result = [] 


def find_bad_qn(a): 
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active" 
    html = urlopen(url) 
    bsObj = BeautifulSoup(html, "html5lib") 
    que = bsObj.find_all("div", class_="question-summary") 
    for div in que: 
        link = div.a.get('href') 
        name = div.a.text 
        chkr.set_text(name.lower()) 
        list1 = [] 
        for err in chkr: 
            list1.append(chkr.word) 
        if len(list1) > 1: 
            str1 = ' '.join(list1) 
            result.append({'link': link, 'name': name, 'words': str1}) 


print("Please Wait.. it will take some time") 
for i in range(298314,298346): 
    find_bad_qn(i) 
for qn in result: 
    qn['link'] = "https://stackoverflow.com" + qn['link'] 
for qn in result: 
    print(qn['link'], " Error Words:", qn['words']) 
    url = qn['link'] 

UPDATE

This is the URL that causes the problem, even though the URL exists:

https://stackoverflow.com/questions?page=298314&sort=active

I tried changing the range to lower values, and it works fine now.

Why does this happen for the above URL?
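One way to see exactly what came back is to remember that urllib's HTTPError is itself a response-like object, so the status the server sent can be inspected instead of letting the script crash. A minimal sketch (the 404 is simulated here so no network is needed):

```python
from urllib.error import HTTPError

url = "https://stackoverflow.com/questions?page=298314&sort=active"
try:
    # Simulate the exception urlopen raises when the server answers 404
    raise HTTPError(url, 404, "Not Found", None, None)
except HTTPError as err:
    # The exception carries the HTTP status the server actually sent back
    print("status:", err.code)    # 404
    print("reason:", err.reason)  # Not Found
```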


Could you print the requested URL? – LoicM


It is this one: https://stackoverflow.com/questions?page=298314&sort=active – jophab


This is actually strange: I can reproduce exactly the same problem for every page above roughly 270000. The page exists, but I get an error when requesting it with Python. – LoicM

Answer


Apparently the default number of questions displayed per page is 50, so the range you defined in your loop goes beyond the number of pages available at 50 questions each. The range should be adjusted to stay within the total number of pages.
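As a sanity check on the arithmetic (the total question count below is a hypothetical figure chosen for illustration, not a measured one): with 50 questions per page, the last valid page number is the total count divided by 50, rounded up, which lines up with the ~270000 cutoff mentioned in the comments:

```python
import math

per_page = 50                 # default page size on /questions
total_questions = 13_500_000  # hypothetical total, for illustration only
last_page = math.ceil(total_questions / per_page)
print(last_page)              # 270000; page=298314 is far beyond this
```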

This code will catch the 404 error, which is the cause of your error, and ignore it in case you go out of range again:

from urllib.error import HTTPError 
from urllib.request import urlopen 

def find_bad_qn(a): 
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active" 
    try: 
        urlopen(url) 
    except HTTPError: 
        pass  # page does not exist; skip it 

print("Please Wait.. it will take some time") 
for i in range(298314,298346): 
    find_bad_qn(i) 
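A slightly tighter variant of the same idea catches only the 404 and reports the skipped page, rather than silently swallowing every error; `fetch_page` and the injectable `opener` parameter are names chosen here for illustration:

```python
from urllib.error import HTTPError
from urllib.request import urlopen

def fetch_page(url, opener=urlopen):
    """Return the open response for url, or None when the server answers 404."""
    try:
        return opener(url)
    except HTTPError as err:
        if err.code == 404:
            print("Skipping missing page:", url)
            return None
        raise  # anything other than 404 is a real problem
```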

But the URL exists. – jophab


No, it returns a 404 error code, which means the URL was not found. That is your error: urllib.error.HTTPError: HTTP Error 404: Not Found – Atirag


Yes, but the URL exists. You can try it yourself. My range values are not question IDs; they are page numbers under active questions. – jophab