2014-03-01

Parsing HTML and JS in Python using lxml

I'm having trouble parsing JS with lxml in Python. When I run the code below, my output is:

<Element div at 0x10cec4e10>

import urllib2
import lxml.html
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaner.javascript = True  # strip script content

text = urllib2.urlopen("URL").read().decode("utf-8")
test = lxml.html.fromstring(cleaner.clean_html(text))
print test  # prints the element's repr, not its text

What I want is the parsed text without the JS. Can someone shed some light on this? Thanks.

Answers

import urllib2

import lxml.etree
import lxml.html
import lxml.html.clean

URL = "http://www.google.com/"
ENCODING = "latin1"

args = {
    "javascript": True,       # strip javascript
    "page_structure": False,  # leave page structure alone
    "style": True             # remove CSS styling
}
cleaner = lxml.html.clean.Cleaner(**args)

# get the page source
html = urllib2.urlopen(URL).read().decode(ENCODING)
# clean it up
clean = cleaner.clean_html(html)

# print unformatted html dump
print(clean)

# print properly indented html
tree = lxml.html.fromstring(clean)
print(lxml.etree.tostring(tree, pretty_print=True))
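
To see how the two serializers treat pretty_print, it may help to run them side by side on a small fragment. This is only a sketch in Python 3 (the code above is Python 2); SAMPLE is a made-up fragment, and encoding="unicode" is passed so that both calls return str rather than bytes. The exact layout of the output may vary with the lxml/libxml2 version, so none is shown here.

```python
import lxml.html
from lxml import etree

# Hypothetical sample fragment, used only for illustration.
SAMPLE = "<div><p>one</p><p>two</p></div>"

tree = lxml.html.fromstring(SAMPLE)

# XML-style serialization; encoding="unicode" returns str instead of bytes.
as_xml = etree.tostring(tree, pretty_print=True, encoding="unicode")

# HTML-style serialization of the same tree.
as_html = lxml.html.tostring(tree, pretty_print=True, encoding="unicode")

print("etree.tostring:")
print(as_xml)
print("lxml.html.tostring:")
print(as_html)
```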

Note that pretty-printing works properly with lxml.etree.tostring(), but not so well with lxml.html.tostring(): it inserts line breaks but does not indent. Go figure.
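
As for the original goal of getting the text without the JS: printing an element only shows its repr (something like <Element div at 0x...>), while text_content() returns the text inside it. Below is a minimal sketch in Python 3, not the method used in the answer above. It swaps the Cleaner for lxml.etree.strip_elements(), partly because, as far as I know, recent lxml releases ship the Cleaner in the separate lxml_html_clean package. SAMPLE and visible_text are hypothetical names used only for illustration.

```python
import lxml.html
from lxml import etree

# Hypothetical sample page standing in for the downloaded HTML.
SAMPLE = """<html>
<head>
<title>Demo</title>
<script>var hidden = 1;</script>
<style>p { color: red; }</style>
</head>
<body>
<p>Hello <b>world</b></p>
<script>alert("x");</script>
<p>Bye</p>
</body>
</html>"""


def visible_text(html):
    """Return the text of an HTML string with <script>/<style> content removed."""
    tree = lxml.html.fromstring(html)
    # Drop script and style elements together with their contents,
    # but keep any text that follows them (with_tail=False).
    etree.strip_elements(tree, "script", "style", with_tail=False)
    # text_content() concatenates all remaining text nodes;
    # split/join collapses runs of whitespace into single spaces.
    return " ".join(tree.text_content().split())


text = visible_text(SAMPLE)
print(text)
```

For a live page, the string passed to visible_text would be the decoded response body instead of SAMPLE.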