Scraping data with Python

I want to scrape a website's content with Python. For example, the page contains:

Apple’s stock continued to dominate the news over the weekend, with Barron’s placing it on the top of its favorite 2013 stock list.

But the printed result is wrong:

Apple âs stock continued to dominate the news over the weekend, with Barronâs placing it on the top of its favorite 2013 stock list.

The symbol "’" is not displayed correctly. Here is my code:

    # -*- coding: utf-8 -*-

    import sys 
    reload(sys) 
    sys.setdefaultencoding('utf-8') 
    import urllib 
    from lxml import * 
    import urllib 
    import lxml.html as HTML 

    url = "http://www.forbes.com/sites/panosmourdoukoutas/2012/12/09/apple-tops-barrons-10-favorite-stocks-for-2013/?partner=yahootix"
    sock = urllib.urlopen(url) 
    htmlSource = sock.read() 
    sock.close() 

    root = HTML.document_fromstring(htmlSource) 
    contents = ' '.join([x.strip() for x in root.xpath("//div[@class='body']/descendant::text()")]) 

    print contents 

    f = open('C:/Users/yinyao/Desktop/Python Code/data.txt','w') 
    f.write(contents) 
    f.close()

However, after setting the default encoding, printf no longer works. Why, and what should I do? I am on Windows, where the default encoding is GBK.

Source

2012-12-18 yinyao

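Can you post the code that performs the scraping? –

How are you printing this statement? Please post the exact command you run to print it. There is no printf function in Python, is there? – stackoverflowery

Try [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) –

First, make sure you know The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Second, always use Unicode internally. Decode early, encode late: when you fetch the website, decode it to unicode and handle it as unicode everywhere inside your script. Otherwise your code will crash at random, for example because it hits an unexpected character in the comments of some Chinese web page. Only when you pass the text on somewhere later (for example, to a writable stream) should you encode it (preferably as "UTF-8").

Third, use BeautifulSoup 4.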
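The "decode early, encode late" advice can be sketched roughly as follows. This is only an illustration, not the asker's actual fix: `raw_bytes` is a hypothetical stand-in for what `sock.read()` returns, the page is assumed to be UTF-8, and `decode_page` / `save_text` are made-up helper names. It is written to run on Python 3 (the question's code is Python 2).

```python
import io
import os
import tempfile

# Hypothetical stand-in for the bytes returned by sock.read().
# \xe2\x80\x99 is the UTF-8 encoding of U+2019 (right single quotation mark).
raw_bytes = b"Apple\xe2\x80\x99s stock continued to dominate the news"


def decode_page(raw, encoding="utf-8"):
    """Decode early: turn raw bytes into text as soon as they arrive.

    The encoding is an assumption here; a real page may declare another one.
    """
    return raw.decode(encoding)


def save_text(text, path, encoding="utf-8"):
    """Encode late: choose the output encoding only at the point of writing."""
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)


# Everything between these two calls works on text, never on raw bytes.
text = decode_page(raw_bytes)
out_path = os.path.join(tempfile.gettempdir(), "data.txt")
save_text(text, out_path)
```

Passing the encoding explicitly when opening the file means the result does not depend on the platform default (GBK on the asker's Windows machine).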

Source

2012-12-18 08:56:19

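Thanks! But I don't know when and how to decode the website data to unicode. I have re-edited my question to show my code; could you give me more advice on it? – yinyao

First, format your question *properly* (http://meta.stackexchange.com/questions/22186/how-do-i-format-my-code-blocks) so the code is readable. Second, BeautifulSoup will handle unicode for you –

Thanks! BeautifulSoup is useful, but I have already fixed it by decoding htmlSource to unicode. – yinyao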
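For anyone wondering where the "â" came from, here is one plausible reading, assuming the page was served as UTF-8 and its bytes were interpreted as Latin-1 somewhere along the way. The byte string below is a hypothetical sample, not the actual page content.

```python
# Hypothetical sample: "Barron’s" as UTF-8 bytes, as it would arrive from the server.
raw = b"Barron\xe2\x80\x99s"

# Reading the three UTF-8 bytes of U+2019 as Latin-1 yields three separate
# characters: "â" (U+00E2) plus two invisible C1 control characters.
wrong = raw.decode("latin-1")

# Decoding with the encoding the bytes were actually written in gives one character.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

In the question's code, the equivalent step would be decoding `htmlSource` (for example `htmlSource.decode('utf-8')`, if the page really is UTF-8) before handing it to the parser.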
