How to convert words into sentence strings - text classification

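I'm currently working with the Brown corpus and have run into a small problem. To apply tokenization features, I first need to join the Brown corpus words back into sentences. Here is what I have so far: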

from nltk.corpus import brown

# File IDs treated as the positive class
target_text = [s for s in brown.fileids()
               if s.startswith('ca01') or s.startswith('ca02')]

data = []

# All file IDs used here (positive and negative)
total_text = [s for s in brown.fileids()
              if s.startswith('ca01') or s.startswith('ca02')
              or s.startswith('cp01') or s.startswith('cp02')]

for text in total_text:
    if text in target_text:
        tag = "pos"
    else:
        tag = "neg"
    # Fetch the sentences of this file only; passing total_text here
    # would pair every file's sentences with the current tag
    words = list(brown.sents(text))
    data.extend([(tag, word) for word in words])

data

When I do this, I get data that looks like this:

[('pos', 
    ['The', 
    'Fulton', 
    'County', 
    'Grand', 
    'Jury', 
    'said', 
    'Friday', 
    'an', 
    'investigation', 
    'of', 
    "Atlanta's", 
    'recent', 
    'primary', 
    'election', 
    'produced', 
    '``', 
    'no', 
    'evidence', 
    "''", 
    'that', 
    'any', 
    'irregularities', 
    'took', 
    'place', 
    '.']), 
('pos', 
    ['The', 
    'jury', 
    'further', 
    'said', 
    'in', 
    'term-end', 
    'presentments', 
    'that', 
    'the', 
    'City', 
    'Executive', 
    'Committee', 
    ',', 
    'which', 
    'had', 
    'over-all', 
    'charge', 
    'of', 
    'the', 
    'election', 
    ',', 
    '``', 
    'deserves', 
    'the', 
    'praise', 
    'and', 
    'thanks', 
    'of', 
    'the', 
    'City', 
    'of', 
    'Atlanta', 
    "''", 
    'for', 
    'the', 
    'manner', 
    'in', 
    'which', 
    'the', 
    'election', 
    'was', 
    'conducted', 
    '.'])

What I need is something that looks like:

[('pos', 'The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election ....'), ('pos', The jury further said in term-end presentments that the City...)]

Is there any way to fix this? This project is taking far longer than I expected.

Source

2017-04-13 Elizabeth

According to the docs, the sents() method returns a list of sentences, where each sentence is itself a list of word strings.

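If you want to reconstruct the sentences, you can try joining the words with spaces. However, this won't work perfectly because of punctuation: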

data.extend([(tag, ' '.join(word)) for word in words])

You would end up with something like this:

'the', 
'election', 
',', 
'``', 
'deserves', 
'the',

which maps to:

the election , `` deserves the

This happens because join knows nothing about punctuation. Does nltk include some kind of punctuation-aware formatter?
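
One way around this, as a minimal sketch rather than an nltk feature, is a small hand-written joiner that skips the space before closing punctuation and after opening quotes. The name join_tokens and the punctuation sets below are assumptions made for illustration; they are not part of nltk and only cover the common Brown-style tokens.

```python
# Hypothetical helper (not part of nltk): join tokens into a sentence string
# without stray spaces around common punctuation.

# Tokens that should attach to the preceding word (assumed, not exhaustive)
NO_SPACE_BEFORE = {".", ",", ";", ":", "!", "?", "''", ")"}
# Tokens that should attach to the following word (assumed, not exhaustive)
NO_SPACE_AFTER = {"``", "("}
# Brown-style quote tokens rewritten as plain double quotes
QUOTES = {"``": '"', "''": '"'}


def join_tokens(tokens):
    """Join a list of word tokens into one string, punctuation-aware."""
    out = []
    prev = None
    for tok in tokens:
        # Insert a space unless this token hugs the previous one
        if prev is not None and tok not in NO_SPACE_BEFORE and prev not in NO_SPACE_AFTER:
            out.append(" ")
        out.append(QUOTES.get(tok, tok))
        prev = tok
    return "".join(out)


tokens = ['the', 'election', ',', '``', 'deserves', 'the']
print(join_tokens(tokens))  # the election, "deserves the
```

In the loop from the question, this would mean replacing ' '.join(word) with join_tokens(word). More recent nltk releases also include TreebankWordDetokenizer in nltk.tokenize.treebank, which may handle more cases; check that your installed version provides it.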

Source

2017-04-13 02:59:43
