2013-07-27

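Python: mechanize randomly hangs the program indefinitely

I'm writing some code that uses mechanize to access a website, but often when I run my Python code it hangs indefinitely on the line where I call mechanize.ParseResponse. It doesn't give me an error; I have to interrupt it with CTRL+C. I believe I'm passing the correct arguments to the method, so I'm confused about why my program suddenly stops. Any ideas?

As extra background, I'm running on a Mac.

Any help would be greatly appreciated!

Edit: here is my code.

Note: I run python bikes.py, and it occasionally hangs on the following line: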

form = mechanize.ParseResponse(response, backwards_compat=False) 

Sometimes it also stops at:

text = response.read() 

# bikes.py 
import re 
import webbrowser 
import mechanize 
import urllib 

brands = ["cannondale", "felt", "fuji", "giant", "specialized", "trek"] 
keywords = ["52", "53", "54", "shimano", "sora", "tiagra", "105", "ultegra",
            "road", "allez", "defy"]
avoid = ["bmx", "mountain", "kids", "fixie", "jacket", "clothing", "fixed gear",
         "hybrid", "mtb"]

def openLink(text):
    # True if the text mentions a wanted keyword and none of the avoided ones.
    text = text.lower()
    shouldOpen = False
    for word in avoid:
        if word in text:
            return False
    for word in keywords:
        if word in text:
            shouldOpen = True

    return shouldOpen

def scourPage(text, fileRead, fileWrite):
    # Each search result row links to one posting.
    links = re.findall(r'class="row".+?href="(.+?)"', text)

    for link in links:
        # Relative links need the site root prepended.
        if "http:" in link:
            url = link
        else:
            url = homePage + link

        page = urllib.urlopen(url)
        pageText = page.read()
        title = re.search(
            r'"postingtitle">.{0,10}<span.+?>[\s\'"]+(.+?)[\s\'"]{0,10}</h2>',
            pageText, re.DOTALL)
        body = re.search(r'"postingbody">(.+?)</section>', pageText, re.DOTALL)
        openBody = False
        openTitle = False

        if body is not None:
            body = body.group(1)
            openBody = openLink(body)

        if title is not None:
            title = title.group(1)
            openTitle = openLink(title)

        if (openTitle and openBody) and (url not in fileRead) and (title not in fileRead):
            fileWrite.write(title + "\n" + url + "\n")

    # fileWrite is closed by the caller; closing it inside the loop would
    # make the next write fail.

homePage = "http://sfbay.craigslist.org" 
request = mechanize.Request(homePage) 
response = mechanize.urlopen(request) 
forms = mechanize.ParseResponse(response, backwards_compat=False) 
form = forms[0] 

request = form.click() 
response = mechanize.urlopen(request) 
emptySearch = response.geturl() 
request = mechanize.Request(emptySearch) 
response = mechanize.urlopen(request) 
forms = mechanize.ParseResponse(response, backwards_compat=False) 
form = forms[0] 

form["catAbb"] = ["bik"] 
form["maxAsk"] = "500" 
form.find_control("hasPic").items[0].selected = True 

for brand in brands:
    form["query"] = brand

    request = form.click()
    response = mechanize.urlopen(request)
    text = response.read()

    fileR = open('bikes.txt', 'r').read()
    fileA = open('bikes.txt', 'a')

    scourPage(text, fileR, fileA)

    fileA.close()

    next = re.findall(r'class="nplink next".{0,50}<a href=\'(.+?)\'>', text, re.DOTALL)

    # Keep following the "next" link until the page has none.
    while len(next) != 0:
        text = urllib.urlopen(next[0]).read()

        fileR = open('bikes.txt', 'r').read()
        fileA = open('bikes.txt', 'a')

        scourPage(text, fileR, fileA)

        fileA.close()

        next = re.findall(r'class="nplink next".{0,50}<a href=\'(.+?)\'>', text, re.DOTALL)

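This code combs through Craigslist ads and tries to weed out the ones I don't want. In this case I'm looking for a road bike and want to avoid mountain bikes and other items.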

UPDATE:

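After waiting quite a long time, I finally interrupted the run from the keyboard again, and it had stopped at the form = mechanize.ParseResponse(response, backwards_compat=False) line. I tried running it again and got this error: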

Traceback (most recent call last): 
    File "bikes.py", line 97, in <module> 
    forms = mechanize.ParseResponse(response, backwards_compat=False) 
    File "build/bdist.macosx-10.8-intel/egg/mechanize/_form.py", line 945, in ParseResponse 
    File "build/bdist.macosx-10.8-intel/egg/mechanize/_form.py", line 981, in _ParseFileEx 
    File "build/bdist.macosx-10.8-intel/egg/mechanize/_form.py", line 758, in feed 
    File "build/bdist.macosx-10.8-intel/egg/mechanize/_sgmllib_copy.py", line 110, in feed 
    File "build/bdist.macosx-10.8-intel/egg/mechanize/_sgmllib_copy.py", line 192, in goahead 
    File "build/bdist.macosx-10.8-intel/egg/mechanize/_form.py", line 654, in handle_charref 
    File "build/bdist.macosx-10.8-intel/egg/mechanize/_form.py", line 149, in unescape_charref 
ValueError: unichr() arg not in range(0x10000) (narrow Python build) 
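
For context on this traceback: on a narrow Python 2 build, unichr() raises this ValueError for any code point above 0xFFFF, so the page most likely contained a numeric character reference outside that range. This is a separate failure from the hang. Below is a minimal sketch of a tolerant decoder, assuming it is acceptable to substitute U+FFFD for anything that cannot be decoded. The name safe_unescape_charref is hypothetical and is not part of mechanize.

```python
# Hypothetical helper, not part of mechanize: decode the body of a numeric
# character reference ("169" from "&#169;", "x1F6B2" from "&#x1F6B2;").
try:
    _to_char = unichr  # Python 2
except NameError:
    _to_char = chr     # Python 3

REPLACEMENT = u"\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def safe_unescape_charref(ref):
    """Return the character for a reference body, or U+FFFD if undecodable."""
    try:
        if ref[:1] in ("x", "X"):
            code = int(ref[1:], 16)
        else:
            code = int(ref, 10)
        return _to_char(code)
    except (ValueError, OverflowError):
        # Malformed digits, or a code point the interpreter cannot represent
        # (for example anything above 0xFFFF on a narrow Python 2 build).
        return REPLACEMENT
```

On an interpreter that supports the full Unicode range, references such as x1F6B2 decode normally; only values the interpreter rejects fall back to the replacement character.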

Can you give us the code? – svineet


Added. Hope that helps. :X – Zhouster

Answers


The while loop can run forever, which would explain the behavior. Have you checked that it doesn't?

You will get a runtime error when you CTRL-C your code, which doesn't necessarily mean the code is broken.
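
One way to rule out an endless pagination loop is to cap the number of pages and stop when a "next" URL repeats. The sketch below is only an illustration under those assumptions: crawl_pages, fetch, find_next and max_pages are hypothetical names, and fetch and find_next stand in for the urlopen call and the "nplink next" regex from the question.

```python
def crawl_pages(first_url, fetch, find_next, max_pages=50):
    """Follow 'next' links until none is found, a URL repeats,
    or max_pages pages have been fetched. Returns the page texts."""
    seen = set()
    pages = []
    url = first_url
    while url is not None and url not in seen and len(pages) < max_pages:
        seen.add(url)
        text = fetch(url)       # download one page
        pages.append(text)
        url = find_next(text)   # None when there is no further page
    return pages
```

If the loop ever ends because of the cap or a repeated URL rather than a missing "next" link, that is a sign the pagination regex is matching something unexpected.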


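I've put print statements in my for loop, and it keeps stopping at the "text = response.read()" line. I'm pretty sure the while loop is working properly, because if it were infinite it would print a ton of statements, and it doesn't. I think it has something to do with urllib, but that's only a guess.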
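
If the network is the cause, a blocking read() with no timeout can wait forever on a stalled connection. Python 2's urllib.urlopen takes no timeout argument, so one common approach is a process-wide default via socket.setdefaulttimeout. The following is a minimal sketch, assuming a 30-second limit and three attempts are reasonable. read_with_retry is a hypothetical helper, and open_url stands for whichever opener is in use (for example urllib.urlopen).

```python
import socket

# Assumption: 30 seconds is long enough for a normal page load. New sockets
# created after this call raise socket.timeout instead of blocking forever.
socket.setdefaulttimeout(30)


def read_with_retry(open_url, url, attempts=3):
    """Return open_url(url).read(), retrying when the socket times out.

    Re-raises the last socket.timeout if every attempt fails."""
    last_error = None
    for _ in range(attempts):
        try:
            return open_url(url).read()
        except socket.timeout as exc:
            last_error = exc
    raise last_error
```

With this in place, a stalled request surfaces as a socket.timeout exception with a traceback instead of a silent hang, which at least shows which call was waiting.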