2017-04-13

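How to convert words into sentence strings - text classification

I'm currently working with the Brown corpus and have run into a small problem. To apply tokenization features, I first need to turn the Brown corpus into sentences. This is what I have so far: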

from nltk.corpus import brown
import nltk

# Files that make up the positive class
target_text = [s for s in brown.fileids()
               if s.startswith('ca01') or s.startswith('ca02')]

data = []

# All files used: positive (ca01, ca02) plus negative (cp01, cp02)
total_text = [s for s in brown.fileids()
              if s.startswith('ca01') or s.startswith('ca02')
              or s.startswith('cp01') or s.startswith('cp02')]

for text in total_text:
    if text in target_text:
        tag = "pos"
    else:
        tag = "neg"
    # Each sentence of the current file, as a list of tokens
    words = list(brown.sents(text))
    data.extend([(tag, word) for word in words])

data

When I do this, I get data that looks like this:

[('pos', 
    ['The', 
    'Fulton', 
    'County', 
    'Grand', 
    'Jury', 
    'said', 
    'Friday', 
    'an', 
    'investigation', 
    'of', 
    "Atlanta's", 
    'recent', 
    'primary', 
    'election', 
    'produced', 
    '``', 
    'no', 
    'evidence', 
    "''", 
    'that', 
    'any', 
    'irregularities', 
    'took', 
    'place', 
    '.']), 
('pos', 
    ['The', 
    'jury', 
    'further', 
    'said', 
    'in', 
    'term-end', 
    'presentments', 
    'that', 
    'the', 
    'City', 
    'Executive', 
    'Committee', 
    ',', 
    'which', 
    'had', 
    'over-all', 
    'charge', 
    'of', 
    'the', 
    'election', 
    ',', 
    '``', 
    'deserves', 
    'the', 
    'praise', 
    'and', 
    'thanks', 
    'of', 
    'the', 
    'City', 
    'of', 
    'Atlanta', 
    "''", 
    'for', 
    'the', 
    'manner', 
    'in', 
    'which', 
    'the', 
    'election', 
    'was', 
    'conducted', 
    '.']) 

What I need is something that looks like:

[('pos', "The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election ...."), ('pos', 'The jury further said in term-end presentments that the City...')]

Is there any way to fix this? This project is taking much longer than I expected.

Answer

According to the docs, brown.sents() returns a list of sentences, where each sentence is itself a list of strings (words).
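As a rough illustration of that shape, the sketch below hard-codes the first few tokens of the two sentences from the question instead of loading the corpus. The name sents_like is made up for this example, and the sentences are deliberately truncated.

```python
# Stand-in for the value returned by brown.sents(...):
# an outer list of sentences, each sentence a list of token strings.
# Hard-coded and truncated here; nothing is loaded from the corpus.
sents_like = [
    ['The', 'Fulton', 'County', 'Grand', 'Jury'],
    ['The', 'jury', 'further', 'said'],
]

for sentence in sents_like:
    # Each element of the outer list is one sentence (a list of tokens).
    print(len(sentence), sentence)
```

So pairing a tag with each element of that outer list gives (tag, list-of-tokens) tuples, which is exactly what the question's output shows.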

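If you want to reconstruct the sentences, you can try joining the words with spaces. However, this won't work cleanly because of the punctuation: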

data.extend([(tag, ' '.join(word)) for word in words]) 

You will have token sequences like this:

'the', 
'election', 
',', 
'``', 
'deserves', 
'the', 

which map to:

the election , `` deserves the 

because join knows nothing about punctuation. Does nltk include some kind of punctuation-aware formatter?
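In case it helps, here is a minimal sketch of what a punctuation-aware join could look like in plain Python. The helper join_tokens and the two token sets are hypothetical names invented for this example, not part of NLTK, and the rules only cover the punctuation that appears in the question's output (commas, periods, and the Brown-style `` and '' quote tokens).

```python
# Tokens that should attach to the word before them (no space in front).
NO_SPACE_BEFORE = {'.', ',', ';', ':', '!', '?', "''"}
# Tokens that should attach to the word after them (no space behind).
NO_SPACE_AFTER = {'``'}


def join_tokens(tokens):
    """Join tokens into a string, gluing punctuation to adjacent words.

    Hypothetical helper for illustration only; not an NLTK function.
    """
    pieces = []
    prev = None
    for tok in tokens:
        need_space = (
            prev is not None
            and tok not in NO_SPACE_BEFORE
            and prev not in NO_SPACE_AFTER
        )
        if need_space:
            pieces.append(' ')
        # Render the Brown-style quote tokens as plain double quotes.
        pieces.append('"' if tok in ('``', "''") else tok)
        prev = tok
    return ''.join(pieces)


# A contiguous run of tokens copied from the second sentence in the question.
tokens = ['the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and',
          'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for',
          'the', 'manner']

print(' '.join(tokens))     # naive join: spaces around every token
print(join_tokens(tokens))  # punctuation attached to neighbouring words
```

With this helper, the list comprehension would become data.extend([(tag, join_tokens(word)) for word in words]). Newer NLTK releases also ship nltk.tokenize.treebank.TreebankWordDetokenizer, which may cover this case, though it is worth checking how it handles Brown's quote tokens before relying on it.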
