2012-12-02 49 views
4

我想使用beautifulsoup来解析html,但是每当我用内联脚本标记打开页面时,beautifulsoup都会对内容进行编码,但最终不会解码它。如何使beautifulsoup编码和解码脚本标记的内容

这是我使用的代码:

from bs4 import BeautifulSoup 

if __name__ == '__main__': 

    htmlData = '<html> <head> <script type="text/javascript"> console.log("< < not able to write these & also these >> "); </script> </head> <body> <div> start of div </div> </body> </html>' 
    soup = BeautifulSoup(htmlData) 
    #... using BeautifulSoup ... 
    print(soup.prettify()) 

我想这样的输出:

<html> 
<head> 
    <script type="text/javascript"> 
    console.log("< < not able to write these & also these >> "); 
    </script> 
</head> 
<body> 
    <div> 
    start of div 
    </div> 
</body> 
</html> 

但我得到这样的输出:

<html> 
<head> 
    <script type="text/javascript"> 
    console.log("&lt; &lt; not able to write these &amp; also these &gt;&gt; "); 
    </script> 
</head> 
<body> 
    <div> 
    start of div 
    </div> 
</body> 
</html> 
+0

有一个[提交的bug(https://bugs.launchpad.net/beautifulsoup/+bug/950459)为这在美丽的汤3.看起来像美丽的汤4错误依然存在。你可能想[文件](https://bugs.launchpad.net/beautifulsoup/)一个错误报告。 –

回答

-1

你可以做这样的事情:

htmlCodes = (
('&', '&amp;'), 
('<', '&lt;'), 
('>', '&gt;'), 
('"', '&quot;'), 
("'", '&#39;'), 
) 

for i in htmlCodes: 
    soup.prettify().replace(i[1], i[0]) 
+1

-1。这很多错误。首先,您为每次迭代调用美化,丢弃之前替换的结果。其次,你可以摧毁任何不在javascript部分的字符实体引用。 –

1

你可能想尝试lxml

import lxml.html as LH 

if __name__ == '__main__': 
    htmlData = '<html> <head> <script type="text/javascript"> console.log("< < not able to write these & also these >> "); </script> </head> <body> <div> start of div </div> </body> </html>' 
    doc = LH.fromstring(htmlData) 
    print(LH.tostring(doc, pretty_print = True)) 

产生

<html> 
<head><script type="text/javascript"> console.log("< < not able to write these & also these >> "); </script></head> 
<body> <div> start of div </div> </body> 
</html>