2012-02-02 52 views
0

What is going on with this html5lib script? I am trying to process a very simple HTML5 document using html5lib.

import html5lib 

html = '''<!DOCTYPE html> 
<html lang="en"> 
    <head> 
     <title>Hi</title> 
    </head> 
    <body> 
     <script src="a.js"></script> 
     <script src="b.js"></script> 
    </body> 
</html> 
''' 

parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"))
walker = html5lib.treewalkers.getTreeWalker("lxml")
serializer = html5lib.serializer.HTMLSerializer()

document = parser.parse(html)
stream = walker(document)
theHTML = serializer.render(stream)

print(theHTML)

The output looks like this:

<!DOCTYPE html><html lang=en><head> 
     <title>Hi</title> 
    </head> 
    <body> 
     <script src=a.js></script> 
     <script src=b.js></script> 

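Yeah. It just cuts off partway through. Changing the tree builder from lxml to dom makes no difference. Tweaking the HTML changes the output, but it still comes out badly mangled.

Answers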

1

So the key seems to be omit_optional_tags=False. Somehow, leaving it out eats the end of the output.

import html5lib
from html5lib.serializer import HTMLSerializer

# `html` is the document string from the question.
parser = html5lib.HTMLParser(tree=html5lib.treebuilders.getTreeBuilder("lxml"))
document = parser.parse(html)
walker = html5lib.treewalkers.getTreeWalker("lxml")
stream = walker(document)
# Keep optional tags such as </body> and </html> in the output.
s = HTMLSerializer(omit_optional_tags=False)
output_generator = s.serialize(stream)
for item in output_generator:
    print(item)


<!DOCTYPE html> 
<html lang=en> 
<head> 


<title> 
Hi 
</title> 


</head> 


<body> 


<script src=a.js> 
</script> 


<script src=b.js> 
</script> 




</body> 
</html> 
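
A likely explanation for the "cut off" output: HTML5 lets some end tags be left out, `</body>` and `</html>` among them, and the serializer drops such tags when omit_optional_tags is left at its default. The sketch below compares both settings on the question's document. It is only an illustration under a few assumptions: the helper name `render_html` is made up here, the "etree" tree builder is used instead of "lxml" purely to avoid the extra dependency, and `HTMLSerializer` is imported from `html5lib.serializer`.

```python
import html5lib
from html5lib.serializer import HTMLSerializer

HTML = """<!DOCTYPE html>
<html lang="en">
    <head>
     <title>Hi</title>
    </head>
    <body>
     <script src="a.js"></script>
     <script src="b.js"></script>
    </body>
</html>
"""


def render_html(html, omit_optional_tags):
    """Parse `html` and serialize it back to a single string.

    `render_html` is a hypothetical helper for this example, not part of
    html5lib. The "etree" builder is an assumption made to keep the
    example free of the lxml dependency.
    """
    document = html5lib.parse(html, treebuilder="etree")
    walker = html5lib.getTreeWalker("etree")
    serializer = HTMLSerializer(omit_optional_tags=omit_optional_tags)
    # render() joins the serialized chunks into one string.
    return serializer.render(walker(document))


if __name__ == "__main__":
    # Optional end tags omitted (the behaviour seen in the question).
    print(render_html(HTML, omit_optional_tags=True))
    print("-" * 40)
    # Every tag written out explicitly.
    print(render_html(HTML, omit_optional_tags=False))
```

Using `render()` rather than printing each item from `serialize()` also avoids the extra blank lines in the output above, since each chunk is no longer printed on its own line.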

@schwa: Please edit my answer and add an appropriate explanation. – RanRag 2012-02-02 05:55:53

I can't reproduce this with your code. 's' isn't even defined in it. Would you mind editing your answer with error-free code? – schwa 2012-02-02 06:05:59

@schwa See my edited code. – RanRag 2012-02-02 06:21:35