2014-04-02 25 views
29

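BeautifulSoup4 get_text still has javascript

I am trying to strip all the HTML and JavaScript from a page using bs4, but it does not get rid of the JavaScript. I still see it in the extracted text. How can I fix this?

I tried nltk, which works fine, but clean_html and clean_url are going to be removed in a future release. Is there a way to use soup's get_text and get the same result?

I looked at this other page: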

BeautifulSoup get_text does not strip all tags and JavaScript

For now I am using NLTK's deprecated functions.

EDIT

Here is an example:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
soup = BeautifulSoup(html)
print(soup.get_text())

For CNN I still see the following in the output:

$j(function() { 
"use strict"; 
if (window.hasOwnProperty('safaripushLib') && window.safaripushLib.checkEnv()) { 
var pushLib = window.safaripushLib, 
current = pushLib.currentPermissions(); 
if (current === "default") { 
pushLib.checkPermissions("helloClient", function() {}); 
} 
} 
}); 

/*globals MainLocalObj*/ 
$j(window).load(function() { 
'use strict'; 
MainLocalObj.init(); 
}); 

How can I remove the JS?

The only other option I found is:

https://github.com/aaronsw/html2text

The problem with html2text is that it is really, really slow at times and creates noticeable lag, which is one thing NLTK was always very good at.

It would really help if we could see (part of) the HTML, including the JavaScript.

Added an example. – KVISH

Answers

55

Partly based on "Can I remove script tags with BeautifulSoup?"

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.decompose()  # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each (split on double spaces)
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)
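
The whitespace clean-up at the end of that snippet can be tried on its own, without fetching a page. This is only a sketch: the function name `tidy_text` and the sample string are made up for illustration, and it assumes the "multi-headlines" step is meant to split on two consecutive spaces so that ordinary single-spaced sentences stay on one line.

```python
def tidy_text(text):
    """Turn get_text()-style output into one non-empty chunk per line."""
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in text.splitlines())
    # Split on two consecutive spaces, so headlines that shared a line are
    # separated while single-spaced sentences are left intact.
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Drop the empty strings left behind by blank lines and long space runs.
    return "\n".join(chunk for chunk in chunks if chunk)


# Hypothetical sample resembling text extracted from a news front page.
sample = "  Top story  \n\n\n   World news    Sports scores  \n  A full sentence here.  "
print(tidy_text(sample))
```

Splitting on a single space instead would put every word on its own line, which is probably not what you want for running text.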

Instead of 'script.extract()', it is better to use 'script.decompose()', which just removes the tag without returning the tag object.
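
A small sketch of the difference mentioned in this comment, run against an inline HTML string rather than a live page. The HTML snippet and the variable names are invented for the example, and it assumes the bs4 package is installed and the built-in "html.parser" is used.

```python
from bs4 import BeautifulSoup

# Hypothetical page with one script and one style element.
html = (
    "<html><head><script>var a = 1;</script>"
    "<style>p {color: red}</style></head>"
    "<body><p>Hello</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag from the tree and hands it back to the caller.
removed = soup.find("script").extract()
print(removed.name)  # the detached tag is still usable

# decompose() detaches the tag, destroys it, and returns nothing.
result = soup.find("style").decompose()
print(result)

# Neither element is left in the tree, so only the visible text remains.
text = soup.get_text()
print(text)
```

If you never look at the removed tag again, both calls leave the tree in the same state; decompose() simply avoids keeping the removed object around.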

7

To prevent encoding errors at the end...

import sys
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.cnn.com"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each (split on double spaces)
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

# write UTF-8 bytes directly so the console encoding cannot raise an error
sys.stdout.buffer.write(text.encode('utf-8') + b"\n")
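
To illustrate the kind of encoding error this guards against, here is a minimal sketch that needs no network access. The sample string and the simulated ASCII-only output stream are assumptions for the demonstration; a real console may well use a different encoding.

```python
import io

# Hypothetical extracted text containing non-ASCII characters.
text = "Caf\u00e9 \u2013 r\u00e9sum\u00e9"

# Simulate an output stream that can only represent ASCII.
ascii_console = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    ascii_console.write(text)
    ascii_console.flush()
    failed = False
except UnicodeEncodeError:
    # The stream could not encode the accented characters.
    failed = True

# Encoding to UTF-8 bytes yourself and writing them to a binary stream
# sidesteps the stream's own text encoding.
raw = io.BytesIO()
raw.write(text.encode("utf-8"))
round_trip = raw.getvalue().decode("utf-8")

print(failed)
print(round_trip == text)
```

Whether you hit this in practice depends on how the terminal or the output file is configured.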