Training n-gram NER with Stanford NLP

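I have recently been trying to train n-gram entities with Stanford CoreNLP. I followed this tutorial: http://nlp.stanford.edu/software/crf-faq.shtml#b

With this, I can only specify unigram tokens and the class each one belongs to. Can anyone guide me on how to extend it to n-grams? I am trying to extract known entities, such as movie names, from a chat dataset.

If I have misinterpreted the Stanford tutorial and it can in fact be used for n-gram training, please point me in the right direction.

What I am stuck on is the following property: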

#structure of your training file; this tells the classifier 
#that the word is in column 0 and the correct answer is in 
#column 1 
map = word=0,answer=1

Here the first column is the word (a unigram) and the second column is the entity, for example:

CHAPTER O 
I O 
Emma PERS 
Woodhouse PERS

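Now, training known single-word entities (say, movie names) such as Hulk or Titanic as MOVIE is easy with this approach. But what is the best approach if I need to train multi-word titles such as I Know What You Did Last Summer or Baby's Day Out?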

Source

2013-03-25 Arun A K

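Dear @Arun, did you succeed in training NER for n-grams? I want to train entries such as Master of Science: EDUCATION, PhD in Electronics: EDUCATION. Can you guide me? Thanks – 2017-01-19 13:43:27

@KhalidUsman, thanks for your support. I achieved this with LingPipe, as described in the answer below. The training dataset was fairly large. Any model can work well; it depends on how good the dataset you provide is. – 2017-01-19 16:48:32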

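I waited a long time for an answer here. I have not been able to work out a way to do this with Stanford CoreNLP. The task did get done, however: I used the LingPipe NLP library. I am posting the answer here because I think others may benefit from it.

If you are a developer or researcher, or are implementing this in any setting, please check the LingPipe licensing first.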

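LingPipe provides several NER approaches:

1) Dictionary-based NER

2) Statistical NER (HMM-based)

3) Rule-based NER, etc.

I have used both the dictionary-based and the statistical approach.

The first is a direct lookup method; the second is based on training.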

An example of dictionary-based NER can be found here
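To illustrate what "direct lookup" means for multi-word entities, here is a minimal plain-Java sketch. It is not LingPipe's API: the class name `DictionaryLookupSketch`, the method `findEntities`, and the sample dictionary are all made up for illustration, and the matching is a simple case-insensitive substring search with word-boundary checks (assuming plain ASCII text).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java illustration of dictionary lookup; this is NOT LingPipe's API.
class DictionaryLookupSketch {

    // Returns "matched text >> type" for every dictionary phrase found in the text.
    static List<String> findEntities(String text, Map<String, String> dictionary) {
        List<String> found = new ArrayList<>();
        String lowerText = text.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : dictionary.entrySet()) {
            String phrase = entry.getKey().toLowerCase(Locale.ROOT);
            int from = 0;
            int start;
            while ((start = lowerText.indexOf(phrase, from)) >= 0) {
                int end = start + phrase.length();
                // Only accept matches that begin and end on a word boundary.
                if (isBoundary(lowerText, start - 1) && isBoundary(lowerText, end)) {
                    found.add(text.substring(start, end) + " >> " + entry.getValue());
                }
                from = start + 1;
            }
        }
        return found;
    }

    // True when the index is outside the text or is not a letter or digit.
    static boolean isBoundary(String s, int index) {
        return index < 0 || index >= s.length()
                || !Character.isLetterOrDigit(s.charAt(index));
    }

    public static void main(String[] args) {
        Map<String, String> movies = new LinkedHashMap<>();
        movies.put("Titanic", "MOVIE");
        movies.put("I Know What You Did Last Summer", "MOVIE");
        String chat = "we watched i know what you did last summer and then Titanic";
        for (String hit : findEntities(chat, movies)) {
            System.out.println(hit);
        }
    }
}
```

Because the whole phrase is matched as one unit, multi-word titles need no special handling here; the price is that only entities already in the dictionary can be found.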

The statistical approach requires a training file. I used a file in the following format:

<root> 
<s> data line with the <ENAMEX TYPE="myentity">entity1</ENAMEX> to be trained</s> 
... 
<s> with the <ENAMEX TYPE="myentity">entity2</ENAMEX> annotated </s> 
</root>
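If the entity names are already known, lines in this format can be produced mechanically. The following is only a sketch under assumptions: the class `Muc6LineBuilder` and its methods are hypothetical helpers, matching is case-sensitive, and the entity names are assumed not to overlap or contain one another.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that wraps known entity names in <ENAMEX> tags.
class Muc6LineBuilder {

    // Escapes the characters that would otherwise break the XML.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Tags every occurrence of each entity and wraps the sentence in <s>...</s>.
    // Assumes entity names do not overlap or contain one another.
    static String annotate(String sentence, List<String> entities, String type) {
        String result = escapeXml(sentence);
        for (String entity : entities) {
            String escaped = escapeXml(entity);
            result = result.replace(escaped,
                    "<ENAMEX TYPE=\"" + type + "\">" + escaped + "</ENAMEX>");
        }
        return "<s>" + result + "</s>";
    }

    public static void main(String[] args) {
        List<String> movies = Arrays.asList("I Know What You Did Last Summer", "Titanic");
        System.out.println("<root>");
        System.out.println(annotate("have you seen I Know What You Did Last Summer", movies, "MOVIE"));
        System.out.println(annotate("I loved Titanic", movies, "MOVIE"));
        System.out.println("</root>");
    }
}
```

A multi-word title ends up inside a single `ENAMEX` element, which is how this format expresses an n-gram entity.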

Then I used the following code to train the entities.

import java.io.File; 
import java.io.IOException; 

import com.aliasi.chunk.CharLmHmmChunker; 
import com.aliasi.corpus.parsers.Muc6ChunkParser; 
import com.aliasi.hmm.HmmCharLmEstimator; 
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory; 
import com.aliasi.tokenizer.TokenizerFactory; 
import com.aliasi.util.AbstractExternalizable; 

@SuppressWarnings("deprecation") 
public class TrainEntities { 

    // Maximum character n-gram length used by the HMM's character language models
    static final int MAX_N_GRAM = 50;
    // Number of distinct characters assumed by the language models
    static final int NUM_CHARS = 300;
    static final double LM_INTERPOLATION = MAX_N_GRAM; // default behavior

    public static void main(String[] args) throws IOException {
        File corpusFile = new File("inputfile.txt"); // my annotated file
        File modelFile = new File("outputmodelfile.model");

        System.out.println("Setting up Chunker Estimator");
        TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
        HmmCharLmEstimator hmmEstimator
            = new HmmCharLmEstimator(MAX_N_GRAM, NUM_CHARS, LM_INTERPOLATION);
        CharLmHmmChunker chunkerEstimator
            = new CharLmHmmChunker(factory, hmmEstimator);

        System.out.println("Setting up Data Parser");
        // The parser reads the <s>/<ENAMEX> file and feeds each chunking to the estimator
        Muc6ChunkParser parser = new Muc6ChunkParser();
        parser.setHandler(chunkerEstimator);

        System.out.println("Training with Data from File=" + corpusFile);
        parser.parse(corpusFile);

        System.out.println("Compiling and Writing Model to File=" + modelFile);
        AbstractExternalizable.compileTo(chunkerEstimator, modelFile);
    }

}

To test the NER, I used the following class:

import java.io.File;
import java.util.Set;

import com.aliasi.chunk.Chunk;
import com.aliasi.chunk.Chunker;
import com.aliasi.chunk.Chunking;
import com.aliasi.util.AbstractExternalizable;

public class Recognition {
    public static void main(String[] args) throws Exception {
        // Load the model written by TrainEntities
        File modelFile = new File("outputmodelfile.model");
        Chunker chunker = (Chunker) AbstractExternalizable.readObject(modelFile);

        String testString = "my test string";
        Chunking chunking = chunker.chunk(testString);
        Set<Chunk> chunks = chunking.chunkSet();
        // Print each recognized span together with its entity type
        for (Chunk c : chunks) {
            System.out.println(testString + " : "
                + testString.substring(c.start(), c.end()) + " >> "
                + c.type());
        }
    }
}

Code courtesy: Google :)

Source

2013-04-15 14:15:43

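http://tech.groups.yahoo.com/group/LingPipe/message/68 provides more information on corpus preparation. – 2013-05-10 05:50:20

I tried the same code as well. Could you explain how you prepared the training set? I added it as a text file and tried to add my own entities, but it does not work... please help me. I don't know whether I have misunderstood the training set – lulu 2014-04-19 17:28:53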

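The American Airlines flight attendant, on a short flight to Charlotte, NC, kept playing peekaboo from a seat in row 21 at the back of the plane, turning the 9-month-old's laughter into a 9-month-old's smile. – lulu 2014-04-19 17:32:23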

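The answer is basically given in the example you quoted, where "Emma Woodhouse" is a single name. The default models we supply use IO encoding and assume that adjacent tokens of the same class are part of the same entity. In many circumstances this is almost always true, and it keeps the models simpler. However, if you don't want to do that, you can train NER models with other label encodings, such as the commonly used IOB encoding, where you would instead label things as: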

Emma B-PERSON 
Woodhouse I-PERSON

Then adjacent tokens that are of the same category but not the same entity can be represented.

Source
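The difference between the two encodings can be seen by decoding the same tokens both ways. This is a plain-Java sketch for illustration only, not Stanford NER's implementation; the class `LabelDecodingSketch` and its methods are hypothetical, and the IOB decoder assumes every non-O label has the form `B-TYPE` or `I-TYPE`.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative decoders for IO and IOB label sequences (not Stanford NER code).
class LabelDecodingSketch {

    // IO: a maximal run of adjacent tokens with the same non-O label is one entity.
    static List<String> decodeIo(String[] tokens, String[] labels) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentLabel = "O";
        for (int i = 0; i < tokens.length; i++) {
            if (!labels[i].equals(currentLabel)) {
                flush(entities, current, currentLabel);
                currentLabel = labels[i];
            }
            if (!labels[i].equals("O")) {
                if (current.length() > 0) current.append(' ');
                current.append(tokens[i]);
            }
        }
        flush(entities, current, currentLabel);
        return entities;
    }

    // IOB: B-X always starts a new entity; I-X continues the current entity of type X.
    static List<String> decodeIob(String[] tokens, String[] labels) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentType = "O";
        for (int i = 0; i < tokens.length; i++) {
            String label = labels[i];
            String type = label.equals("O") ? "O" : label.substring(2);
            boolean continues = label.startsWith("I-") && type.equals(currentType);
            if (!continues) {
                flush(entities, current, currentType);
                currentType = type;
            }
            if (!type.equals("O")) {
                if (current.length() > 0) current.append(' ');
                current.append(tokens[i]);
            }
        }
        flush(entities, current, currentType);
        return entities;
    }

    // Emits the pending entity, if any, as "text >> type".
    static void flush(List<String> out, StringBuilder current, String label) {
        if (current.length() > 0) {
            out.add(current + " >> " + label);
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        String[] tokens = {"Hulk", "Titanic", "rocks"};
        // IO cannot separate two adjacent movie titles; IOB can.
        System.out.println(decodeIo(tokens, new String[] {"MOVIE", "MOVIE", "O"}));
        System.out.println(decodeIob(tokens, new String[] {"B-MOVIE", "B-MOVIE", "O"}));
    }
}
```

With IO labels the two adjacent titles collapse into one entity, while the IOB labels keep them apart; a multi-word name such as `Emma B-PERSON`, `Woodhouse I-PERSON` still decodes to a single entity.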

2013-07-10 03:40:49

Thanks @Chris, let me try creating a new model with this encoding format. – 2013-07-11 06:19:13

@ChristopherManning How do I enable IOB encoding in NER? Thx – 2014-01-30 21:54:17

I discussed the IOB encoding options in my answer to this question: http://stackoverflow.com/questions/21469082/how-do-i-use-iob-tags-with-stanford-ner – 2014-02-23 03:58:46

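I faced the same challenge of tagging n-gram phrases for the automotive domain. I was looking for an efficient keyword mapping that could be used to create training files at a later stage. I ended up using regexner in the NLP pipeline, supplying a mapping file with regular expressions (the n-gram component terms) and their corresponding labels. Note that no machine-learned NER is involved in this case. Hope this information helps someone!

Source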
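A mapping file of this kind can be generated from a phrase list. The sketch below assumes the usual RegexNER layout of one entry per line, with the space-separated token patterns and the label separated by a tab; the class `RegexNerMappingSketch`, the label `AUTO_PART`, and the sample phrases are made up for illustration, and the phrases are assumed to contain no regex metacharacters. As far as I know, the resulting file is passed to the pipeline through the `regexner.mapping` property, but check the RegexNER documentation for your CoreNLP version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that builds RegexNER-style mapping lines from phrases.
class RegexNerMappingSketch {

    // One line per phrase: token patterns separated by single spaces, a tab, then the label.
    static List<String> mappingLines(List<String> phrases, String label) {
        List<String> lines = new ArrayList<>();
        for (String phrase : phrases) {
            // Collapse runs of whitespace so each token is separated by exactly one space.
            String pattern = phrase.trim().replaceAll("\\s+", " ");
            lines.add(pattern + "\t" + label);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("anti lock braking system", "cruise  control");
        for (String line : mappingLines(phrases, "AUTO_PART")) {
            System.out.println(line);
        }
    }
}
```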

2016-10-04 04:40:34
