Extracting only words from HTML pages

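I am using Python 2.7, and I have a folder containing a list of HTML pages from which I want to extract only the words. Currently, my process is to open each HTML file, run it through the Beautiful Soup library, get the text, and write it to a new file. The problem is that the output still contains JavaScript, CSS (body, color, #000000, etc.), symbols (|, `, ~, [], etc.), and random numbers.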

How can I get rid of the unwanted output and keep only the text?

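import os
from bs4 import BeautifulSoup

path = "*folder path*"  # placeholder; the actual folder was not given
raw = open(path + "/raw.txt", "w")
files = os.listdir(path)
for name in files:
    fname = os.path.join(path, name)
    try:
        with open(fname) as f:
            b = f.read()
            soup = BeautifulSoup(b)
            txt = soup.body.getText().encode("UTF-8")
            raw.write(txt)
    except (IOError, AttributeError):
        # The except clause is missing from the post; skipping files that
        # cannot be read or that have no <body> is an assumption.
        continue
raw.close()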

2014-12-29 user3702643

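What do you mean by "word"? To extract words from a string, you need a workable definition of "word", one that can be turned into an algorithm. For example, is a compound term one word, or two words joined by a separator? And what about "F1", "i18n", and "α"?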

In this case, a word is defined as anything that can be found in an English dictionary. – user3702643

So you need a dictionary lookup? (Using some dictionary that you consider to be "the dictionary".)

You can strip out the script and style tags:

import requests
from bs4 import BeautifulSoup

session = requests.session()

soup = BeautifulSoup(session.get('http://stackoverflow.com/questions/27684020/extracting-only-words-from-html-pages').text)

# This part strips out the script and style tags.
for script in soup(["script", "style"]):
    script.extract()

print soup.get_text()
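
Removing the script and style tags handles the JavaScript and CSS, but the symbols and random numbers mentioned in the question can still appear in the result of get_text(). A minimal sketch of one way to filter them follows, assuming the text has already been extracted into a string. The names `extract_words`, `WORD_RE`, and `ENGLISH_WORDS` are hypothetical, and the tiny word set is only a stand-in for whatever real English word list you treat as "the dictionary".

```python
import re

# Hypothetical stand-in for a real English word list (an assumption for
# illustration; load a proper dictionary file in practice).
ENGLISH_WORDS = {"the", "quick", "brown", "fox", "page", "title"}

# Runs of ASCII letters, optionally with one internal apostrophe ("don't").
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def extract_words(text, dictionary=None):
    """Return the alphabetic tokens in text.

    If dictionary is given, keep only tokens whose lowercase form is in it.
    """
    words = WORD_RE.findall(text)
    if dictionary is not None:
        words = [w for w in words if w.lower() in dictionary]
    return words
```

For example, `extract_words(soup.get_text(), ENGLISH_WORDS)` would keep only tokens found in the word set. Note that this pattern splits mixed tokens such as "i18n" into separate letter runs, so the dictionary filter is what discards those fragments.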

2014-12-29 06:14:54 mnjeremiah

Works perfectly. Thanks! – user3702643
