
Stemming with NLTK (python)

I am new to text processing in Python. I am trying to stem the words in a text file that has about 5,000 lines.

I wrote the following script:

import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords # Import the stop word list
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english') 

def Description_to_words(raw_Description): 
    # 1. Remove HTML 
    Description_text = BeautifulSoup(raw_Description).get_text() 
    # 2. Remove non-letters   
    letters_only = re.sub("[^a-zA-Z]", " ", Description_text) 
    # 3. Convert to lower case, split into individual words 
    words = letters_only.lower().split()      

    stops = set(stopwords.words("english"))     
    # 4. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    # 5. Stem words
    words = [stemmer.stem(w) for w in words]

    # 6. Join the words back into one string separated by space, 
    # and return the result. 
    return " ".join(meaningful_words)

clean_Description = Description_to_words(train["Description"][15]) 

But when I test it, the words in the result are not stemmed. Can anyone help me figure out what the problem is? Am I doing something wrong in the Description_to_words function?

Also, when I run the stemming commands separately, as below, they work:

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> words = word_tokenize("MOBILE APP - Unable to add reading") 
>>> 
>>> for w in words: 
...  print(stemmer.stem(w)) 
... 
mobil 
app 
- 
unabl 
to 
add 
read 

Answer


Here is your function fixed, step by step.

  1. Remove the HTML:

    Description_text = BeautifulSoup(raw_Description).get_text() 
    
  2. Remove non-letters, but don't strip out the whitespace just yet. You can also simplify your regex:

    letters_only = re.sub(r"[^\w\s]", " ", Description_text)
    
  3. Convert to lower case and split into individual words; I recommend using word_tokenize again here:

    from nltk.tokenize import word_tokenize 
    words = word_tokenize(letters_only.lower())     
    
  4. Remove the stop words:

    stops = set(stopwords.words("english")) 
    meaningful_words = [w for w in words if w not in stops]
    
  5. Stem the words. This is the other problem: stem meaningful_words, not words:

    return ' '.join(stemmer.stem(w) for w in meaningful_words)
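
For reference, putting all five steps together gives something like the following minimal sketch (the explicit "html.parser" argument to BeautifulSoup and building the stops set once at module level are my additions, not part of the original answer):

    import re
    from bs4 import BeautifulSoup
    from nltk.corpus import stopwords
    from nltk.stem.snowball import SnowballStemmer
    from nltk.tokenize import word_tokenize

    stemmer = SnowballStemmer('english')
    stops = set(stopwords.words("english"))  # build once, not on every call

    def Description_to_words(raw_Description):
        # 1. Remove HTML
        Description_text = BeautifulSoup(raw_Description, "html.parser").get_text()
        # 2. Remove punctuation, keeping word characters and whitespace
        letters_only = re.sub(r"[^\w\s]", " ", Description_text)
        # 3. Convert to lower case and tokenize
        words = word_tokenize(letters_only.lower())
        # 4. Remove stop words
        meaningful_words = [w for w in words if w not in stops]
        # 5. Stem the remaining words and rejoin into one string
        return ' '.join(stemmer.stem(w) for w in meaningful_words)

On the example from the question, Description_to_words("<p>MOBILE APP - Unable to add reading</p>") returns 'mobil app unabl add read'.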
    

That was simple. Thank you so much for your reply, it works. I'm so happy :) – user3734568


Just one question: can we use the same logic for lemmatizing words with .lemmatize(), correct? – user3734568


@user3734568 Yes, you can; just change 'stemmer.stem(w)' to 'lemmatizer.lemmatize(w)' –
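
For what it's worth, a minimal sketch of that swap using NLTK's WordNetLemmatizer (the lemmatizer variable and example words are mine; note that .lemmatize() treats every word as a noun unless you pass a pos tag, so its output can differ from the stemmer's):

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    # Drop-in replacement for the stemming step in the answer above:
    # return ' '.join(lemmatizer.lemmatize(w) for w in meaningful_words)
    print(lemmatizer.lemmatize("readings"))         # -> reading
    print(lemmatizer.lemmatize("unable", pos="a"))  # -> unable (tagged as adjective)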