BeautifulSoup4 get_text still has javascript

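I'm trying to strip all the HTML and JavaScript from a page using bs4, but it doesn't get rid of the JavaScript; I still see it in the extracted text. How can I get around this?

I tried nltk, which works fine, but clean_html and clean_url are going to be removed in future releases. Is there a way to use soup's get_text and get the same result?

I have looked at these other pages: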

BeautifulSoup get_text does not strip all tags and JavaScript

I'm currently using nltk's deprecated functions.

EDIT

Here's an example:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())

I still see the following in the output for CNN:

$j(function() { 
"use strict"; 
if (window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv()) { 
var pushLib = window.safaripushLib, 
current = pushLib.currentPermissions(); 
if (current === "default") { 
pushLib.checkPermissions("helloClient", function() {}); 
} 
} 
}); 

/*globals MainLocalObj*/ 
$j(window).load(function() { 
'use strict'; 
MainLocalObj.init(); 
});

How can I remove the JS?

The only other option I found is:

https://github.com/aaronsw/html2text

The problem with html2text is that it is really slow at times and creates noticeable lag, which is one thing nltk always handled very well.


2014-04-02 KVISH

It would really help if we could see (part of) the HTML, including the JavaScript.

Added an example. – KVISH

Based partly on "Can I remove script tags with BeautifulSoup?":

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.decompose()  # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each (split on runs of two spaces)
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
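
To try the same idea without fetching a live page, here is a minimal offline sketch. The sample HTML and the helper name `visible_text` are made up for illustration; the only assumption is that bs4 is installed and that the built-in "html.parser" backend is acceptable.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; any HTML string can be used here.
SAMPLE_HTML = """
<html>
  <head>
    <title>Example page</title>
    <style>body { color: red; }</style>
    <script>var tracker = "should not appear";</script>
  </head>
  <body>
    <h1>Headline</h1>
    <p>First paragraph.</p>
    <script>console.log("inline script");</script>
  </body>
</html>
"""


def visible_text(html):
    """Return the page text with <script> and <style> elements removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove the elements themselves, so their contents never reach get_text()
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text()
    # Strip each line and drop the blank ones
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


result = visible_text(SAMPLE_HTML)
print(result)
```

The script and style contents never show up in `result` because the elements are removed from the tree before `get_text()` is called.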


2014-04-02 02:15:39

Instead of 'script.extract()', it is better to use 'script.decompose()', which simply removes the tag without returning the tag object.
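
A small sketch of the difference between the two calls, using a made-up HTML fragment (bs4 with the built-in "html.parser" backend is assumed): `extract()` hands back the detached tag, while `decompose()` destroys it and returns nothing.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with one script and one style element
html = "<div><p>keep me</p><script>var x = 1;</script><style>p {}</style></div>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag from the tree and returns it
removed = soup.script.extract()

# decompose() detaches the tag, destroys it, and returns None
result = soup.style.decompose()

remaining = soup.get_text()

print(removed.name)   # the extracted tag is still usable
print(result)
print(remaining)
```

If the removed tag is never used afterwards, either call gives the same cleaned tree; `decompose()` just avoids keeping an object nobody needs.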

To prevent encoding errors at the end:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"  # same page as in the question
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each (split on runs of two spaces)
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

# encode explicitly so that non-ASCII characters do not raise UnicodeEncodeError
# (on Python 3 this prints a bytes literal)
print(text.encode('utf-8'))
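
As a rough illustration of why the explicit encode helps, the sketch below uses a made-up string with non-ASCII characters. Encoding it to a narrow codec such as ASCII fails, while UTF-8 can represent all of it. No network or bs4 is needed for this part.

```python
# Hypothetical text with accented characters and an en dash
text = "Caf\u00e9 \u2013 d\u00e9j\u00e0 vu"

# A narrow codec cannot represent the non-ASCII characters
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can encode any Python string, and the bytes decode back unchanged
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(ascii_ok)
print(round_trip == text)
```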


2014-07-26 06:51:10 bumpkin
