How to make BeautifulSoup encode and decode the contents of a script tag

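I want to use BeautifulSoup to parse HTML, but whenever I open a page that contains an inline script tag, BeautifulSoup encodes the script's contents and never decodes them again. How can I make BeautifulSoup both encode and decode the contents of a script tag?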

Here is the code I am using:

from bs4 import BeautifulSoup

if __name__ == '__main__':
    htmlData = '<html> <head> <script type="text/javascript"> console.log("< < not able to write these & also these >> "); </script> </head> <body> <div> start of div </div> </body> </html>'
    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = BeautifulSoup(htmlData, 'html.parser')
    #... using BeautifulSoup ...
    print(soup.prettify())

I want output like this:

<html> 
<head> 
    <script type="text/javascript"> 
    console.log("< < not able to write these & also these >> "); 
    </script> 
</head> 
<body> 
    <div> 
    start of div 
    </div> 
</body> 
</html>

But I get output like this:

<html> 
<head> 
    <script type="text/javascript"> 
    console.log("&lt; &lt; not able to write these &amp; also these &gt;&gt; "); 
    </script> 
</head> 
<body> 
    <div> 
    start of div 
    </div> 
</body> 
</html>
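The escaped output matters because an HTML parser does not decode character references inside a script element, so the JavaScript would see the literal text `&lt;`. The following is a minimal sketch of that behaviour using only the standard library's `html.parser`; the `TextByTag` class and the sample markup are made up for illustration and are not part of the question.

```python
from html.parser import HTMLParser


class TextByTag(HTMLParser):
    """Hypothetical helper: collects the text seen inside each element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.current = None  # name of the element we are currently inside
        self.text = {}       # element name -> concatenated text

    def handle_starttag(self, tag, attrs):
        self.current = tag

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.text[self.current] = self.text.get(self.current, "") + data


parser = TextByTag()
parser.feed('<script>a &lt; b</script><div>a &lt; b</div>')
parser.close()

print(parser.text["script"])  # a &lt; b  (left as-is: script content is raw text)
print(parser.text["div"])     # a < b     (decoded: ordinary element content)
```

The same character reference is decoded in the div but left untouched in the script, which is why a serializer that escapes script contents changes what the script means.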

Source

2012-12-02 user1557858

There is a filed bug (https://bugs.launchpad.net/beautifulsoup/+bug/950459) for this in Beautiful Soup 3. It looks like the bug still exists in Beautiful Soup 4. You may want to file a bug report (https://bugs.launchpad.net/beautifulsoup/).

-1

You could do something like this:

htmlCodes = (
('&', '&amp;'), 
('<', '&lt;'), 
('>', '&gt;'), 
('"', '&quot;'), 
("'", '&#39;'), 
) 

for i in htmlCodes: 
    soup.prettify().replace(i[1], i[0])

Source

2012-12-02 18:23:28 rofls

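-1. There is a lot wrong with this. First, you call prettify on every iteration, discarding the result of the previous replacements. Second, you may destroy character entity references that are not in the JavaScript section.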
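One way to address both points in the comment above is to work on the serialized string once and decode character references only inside script elements. The sketch below uses only the standard library; `unescape_script_bodies` and `SCRIPT_RE` are hypothetical names, and the regular expression assumes the script body never contains a literal `</script>`. It would also decode any reference the script's author wrote on purpose, so treat it as a workaround rather than a general fix.

```python
import html
import re

# Groups: opening <script ...> tag, script body, closing </script> tag.
SCRIPT_RE = re.compile(r"(<script\b[^>]*>)(.*?)(</script>)",
                       re.DOTALL | re.IGNORECASE)


def unescape_script_bodies(markup: str) -> str:
    """Decode character references inside <script> elements only."""
    def fix(match):
        opening, body, closing = match.groups()
        return opening + html.unescape(body) + closing

    # sub() returns the new string; nothing outside the matches is altered.
    return SCRIPT_RE.sub(fix, markup)


escaped = (
    '<script type="text/javascript"> console.log("&lt; &amp; &gt;&gt; "); </script>'
    '<div> 1 &lt; 2 </div>'
)
result = unescape_script_bodies(escaped)
print(result)
# <script type="text/javascript"> console.log("< & >> "); </script><div> 1 &lt; 2 </div>
```

In the question's setting the input would be the string returned by `soup.prettify()`; the `&lt;` in the div is left alone because it lies outside any script element.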

You might want to try lxml:

import lxml.html as LH

if __name__ == '__main__':
    htmlData = '<html> <head> <script type="text/javascript"> console.log("< < not able to write these & also these >> "); </script> </head> <body> <div> start of div </div> </body> </html>'
    doc = LH.fromstring(htmlData)
    # encoding='unicode' returns str rather than bytes, so print() shows the markup itself on Python 3.
    print(LH.tostring(doc, pretty_print=True, encoding='unicode'))

which produces

<html> 
<head><script type="text/javascript"> console.log("< < not able to write these & also these >> "); </script></head> 
<body> <div> start of div </div> </body> 
</html>

Source

2012-12-02 18:37:19 unutbu
