Why doesn't my Python script return the page source correctly?


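I just wrote a script that is meant to go through the alphabet and find all unclaimed four-letter Twitter names (really just for practice, since I'm new to Python). I've written a few scripts before that use 'urllib2' to fetch a site's HTML from a URL, but this time it doesn't seem to work. Here is my script: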

import urllib2 

src='' 
url='' 
print "finding four-letter @usernames on twitter..." 
d_one='' 
d_two='' 
d_three='' 
d_four='' 
n_one=0 
n_two=0 
n_three=0 
n_four=0 
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'] 

while (n_one > 26): 
    while(n_two > 26): 
     while (n_three > 26): 
      while (n_four > 26): 
       d_one=letters[n_one] 
       d_two=letters[n_two] 
       d_three=letters[n_three] 
       d_four=letters[n_four] 
       url = "twitter.com/" + d_one + d_two + d_three + d_four 

       src=urllib2.urlopen(url) 
       src=src.read() 
       if (src.find('Sorry, that page doesn’t exist!') >= 0): 
        print "nope" 
        n_four+=1 
       else: 
        print url 
        n_four+=1 
      n_three+=1 
      n_four=0 
     n_two+=1 
     n_three=0 
     n_four=0 
    n_one+=1  
    n_two=0 
    n_three=0 
    n_four=0

Running this code returns the following error:

SyntaxError: Non-ASCII character '\xe2' in file name.py on line 29, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

After visiting that link and doing some additional searching, I added the following line to the top of the document:

# coding: utf-8

Now it no longer returns an error, but nothing seems to happen. I added the line

print src

which should print the HTML of each URL, but nothing happens when I run it. Any suggestions would be appreciated.

Source

2012-08-13 zch

What is line 29? Obviously the code above does not represent your real code, otherwise we would see the special character in it. Downvote ... – 2012-08-13 04:12:55

Line 29 is "print 'nope'"... I swear I wrote this script just 5 minutes ago... – zch 2012-08-13 04:14:30

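Just for your information, this script will take a long time to run. There are '26 * 26 * 26 * 26 = 456976' possible four-letter words. Even if you could process two per second, your script would still take 456976 * 0.5 seconds * (1 minute / 60 seconds) * (1 hour / 60 minutes) = approximately 63.47 hours. – 2012-08-13 04:28:07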
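The estimate in this comment can be reproduced with a few lines of arithmetic. This is only a sketch: the rate of two requests per second is the comment's own assumption, and the constant names are made up for illustration.

```python
ALPHABET_SIZE = 26          # letters a-z
NAME_LENGTH = 4             # four-letter usernames
REQUESTS_PER_SECOND = 2.0   # assumed rate, taken from the comment

# Every position can hold any of the 26 letters.
total_names = ALPHABET_SIZE ** NAME_LENGTH

# Seconds needed at the assumed rate, converted to hours.
total_hours = total_names / REQUESTS_PER_SECOND / 3600

print(total_names)            # 456976
print(round(total_hours, 2))  # 63.47
```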

Well, you initialize n_one=0 and then loop with while (n_one > 26). The first time Python encounters it, it sees while (0 > 26), which is obviously false, so it skips the entire loop.

As gnibbler's answer tells you, there are cleaner ways to loop anyway.
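To see the effect of the comparison in isolation, here is a minimal sketch with one counter instead of four. The variable names `runs_with_gt` and `runs_with_lt` are illustrative only and do not appear in the question.

```python
# Condition as written in the question: 0 > 26 is False on the first
# check, so the body never executes.
n_one = 0
runs_with_gt = 0
while n_one > 26:
    runs_with_gt += 1
    n_one += 1

# Condition with "<": the body runs for indices 0..25, which are exactly
# the valid indices of a 26-element letters list.
n_one = 0
runs_with_lt = 0
while n_one < 26:
    runs_with_lt += 1
    n_one += 1

print(runs_with_gt)  # 0
print(runs_with_lt)  # 26
```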

Source

2012-08-13 04:16:05 Blair

Wow. You're absolutely right: they should be "<", not ">". Thank you very much for pointing that out and for the quick help. – zch 2012-08-13 04:17:42

You can get rid of the excessive nesting by using itertools.product

from itertools import product 
for d_one, d_two, d_three, d_four in product(letters, repeat=4): 
    ...

Instead of defining the list of letters, you can just use string.ascii_lowercase
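Put together, the two suggestions so far might look like the sketch below. Note that the standard-library module is named `string`; the names `names` and `first_three` are chosen here only for illustration.

```python
from itertools import product
import string

# string.ascii_lowercase is the string 'abcdefghijklmnopqrstuvwxyz'.
letters = string.ascii_lowercase

# product(letters, repeat=4) yields 4-tuples in the same order the
# nested loops would visit them; join each tuple into a candidate name.
names = ("".join(combo) for combo in product(letters, repeat=4))

# Peek at the first few candidates without generating all 456976.
first_three = [next(names) for _ in range(3)]
print(first_three)  # ['aaaa', 'aaab', 'aaac']
```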

You should tell urlopen which protocol you are using (http)

url = "http://twitter.com/" + d_one + d_two + d_three + d_four

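Also, when you request a page that doesn't exist, urlopen raises an HTTPError for the 404, so you should check for that instead of looking at the page text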
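A rough sketch of checking for the 404 instead of searching the page text follows. The helper name `is_unclaimed` and its `opener` parameter are hypothetical, added so the logic can be exercised without network access. The import fallback is there so the sketch runs on both Python 2 (`urllib2`, as in the question) and Python 3. It takes the answer's claim that a missing page yields a 404 as given; whether the site still behaves that way is an assumption.

```python
try:  # Python 2, as used in the question
    from urllib2 import urlopen, HTTPError
except ImportError:  # Python 3 equivalents
    from urllib.request import urlopen
    from urllib.error import HTTPError


def is_unclaimed(name, opener=urlopen):
    """Return True if requesting the profile URL raises a 404.

    `opener` is a hypothetical hook (defaults to urlopen) so the
    function can be tried out with a stand-in instead of real requests.
    """
    url = "http://twitter.com/" + name  # include the protocol
    try:
        opener(url).close()
    except HTTPError as err:
        if err.code == 404:
            return True   # page does not exist
        raise             # any other HTTP error is not handled here
    return False          # page loaded, so the name is taken
```

With the default opener, `is_unclaimed("abcd")` would perform a real request; any error other than a 404 is re-raised rather than guessed at.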

Source

2012-08-13 04:13:10

Great! Thanks for the tips; I'll implement this. – zch 2012-08-13 04:15:43
