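Creating an information content corpus to be used by WordNet from a custom dump

I am using the Brown corpus file ic-brown.dat to compute the information content of words with the NLTK WordNet library, but the results are not good. I would like to know how to build my own custom.dat (information content file).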

custom_ic = wordnet_ic.ic('custom.dat')

Source

2015-09-07 aman

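In (...)/nltk_data/corpora/wordnet_ic/ you will find IC-compute.sh, which contains calls to some Perl scripts that generate the IC .dat files from a given corpus. I found the instructions tricky and I did not have the required Perl scripts, so I decided to write a Python script instead by analyzing the structure of the .dat files and the wordnet.ic() function.

You can compute your own IC counts by calling the wordnet.ic() function on a corpus reader object. In fact, all you need is an object with a words() function that returns all the words in the corpus. For more details, check the ic function in the file ..../nltk/corpus/reader/wordnet.py (lines 1729 to 1789).

For example, for the XML version of the BNC corpus (2007):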

reader_bnc = nltk.corpus.reader.BNCCorpusReader(root='../Corpus/2554/2554/download/Texts/', fileids=r'[A-K]/\w*/\w*\.xml') 
bnc_ic = wn.ic(reader_bnc, False, 0.0)
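If you do not have a ready-made corpus reader, a minimal sketch of the "object with a words() function" idea might look like the following. The class TokenListCorpus and the helper compute_ic are hypothetical names introduced only for illustration; the only NLTK call used is wn.ic(corpus, weight_senses_equally, smoothing), and it assumes the WordNet data is installed.

```python
class TokenListCorpus:
    """Hypothetical stand-in for an NLTK corpus reader (not part of NLTK)."""

    def __init__(self, tokens):
        self._tokens = list(tokens)

    def words(self):
        # wn.ic() only needs to iterate over corpus.words().
        return list(self._tokens)


def compute_ic(corpus, weight_senses_equally=False, smoothing=1.0):
    # Imported here so the class above can be tried without WordNet data.
    from nltk.corpus import wordnet as wn
    return wn.ic(corpus, weight_senses_equally, smoothing)


# Any iterable of tokens will do; this sentence is just sample input.
corpus = TokenListCorpus("the dog chased the cat across the garden".split())
# my_ic = compute_ic(corpus)  # requires nltk and the WordNet corpus
```

A real corpus should of course be much larger than one sentence, otherwise most synsets keep a count of zero (or only the smoothing value).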

To generate the .dat file I wrote the following functions:

def is_root(synset_x):
    # A synset is a root if it is its own top-level hypernym.
    return synset_x.root_hypernyms()[0] == synset_x

def generate_ic_file(IC, output_filename):
    """Dump in output_filename the IC counts.
    The expected format of IC is a dict
    {'v':defaultdict, 'n':defaultdict, 'a':defaultdict, 'r':defaultdict}"""
    with codecs.open(output_filename, 'w', encoding='utf-8') as fid:
        # Hash code of WordNet 3.0
        fid.write("wnver::eOS9lXC6GvMWznF1wkZofDdtbBU" + "\n")

        # We only store nouns and verbs because those are the only POS tags
        # supported by the wordnet.ic() function
        for tag_type in ['v', 'n']:
            for key, value in IC[tag_type].items():
                # Key 0 holds the accumulated root count, not a synset offset.
                if key != 0:
                    synset_x = wn.of2ss(of="{:08d}".format(key) + tag_type)
                    if is_root(synset_x):
                        fid.write(str(key) + tag_type + " " + str(value) + " ROOT\n")
                    else:
                        fid.write(str(key) + tag_type + " " + str(value) + "\n")
    print("Done")

generate_ic_file(bnc_ic, "../custom.dat")
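To make the file layout more concrete, here is a rough sketch of how such a file is read back, based on my reading of the .dat structure: a header line, then one "offset+POS count [ROOT]" entry per line, with ROOT counts accumulated under key 0. The function parse_ic_lines is a hypothetical helper, not an NLTK API, and the sample counts are made up.

```python
from collections import defaultdict


def parse_ic_lines(lines):
    """Hypothetical helper: rebuild the IC dict from the lines of a .dat file."""
    ic = {"n": defaultdict(float), "v": defaultdict(float)}
    for num, line in enumerate(lines):
        if num == 0:
            continue  # first line is the "wnver::..." header
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines
        offset, pos = int(fields[0][:-1]), fields[0][-1]
        value = float(fields[1])
        if len(fields) == 3 and fields[2] == "ROOT":
            ic[pos][0] += value  # root counts add up to the total under key 0
        if value != 0:
            ic[pos][offset] = value
    return ic


# Illustrative input only: the counts below are invented.
sample = [
    "wnver::eOS9lXC6GvMWznF1wkZofDdtbBU",
    "1740n 100.0 ROOT",
    "2084071n 4.0",
    "2121620n 2.0",
]
parsed = parse_ic_lines(sample)
```

Offsets are written without zero padding here, which seems to be fine because they are converted back to integers when loaded.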

Then just call the function:

custom_ic = wordnet_ic.ic('../custom.dat')
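The resulting dict can then be passed to the IC-based measures such as res_similarity, jcn_similarity and lin_similarity. As a rough illustration of what the stored counts mean, the sketch below computes the information content of one synset as the negative log of its count divided by the total kept under key 0. This is my understanding of the usual definition rather than a copy of the library code, and the counts are invented.

```python
import math


def information_content(counts, offset):
    """Sketch: IC = -log(count / total), with the total stored under key 0."""
    count = counts.get(offset, 0)
    if count == 0:
        return float("inf")  # unseen synsets carry infinite information
    return -math.log(count / counts[0])


# Made-up noun counts in the same shape as custom_ic['n'].
noun_counts = {0: 100.0, 2084071: 4.0}
ic_value = information_content(noun_counts, 2084071)
```

This also shows why smoothing matters: with a smoothing value of 0.0, any synset that never occurs in the corpus ends up with an infinite information content.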

Required imports:

import nltk 
from nltk.corpus import wordnet as wn 
from nltk.corpus import wordnet_ic 
import codecs

Source

2017-06-05 18:03:58
