Stanford CoreNLP - understanding coreference resolution

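I'm having some trouble understanding the changes made to the coref resolver in the latest version of the Stanford NLP tools. As an example, below is a sentence and the corresponding CorefChainAnnotation: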

The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons. 

{1=[1 1, 1 2], 5=[1 3], 7=[1 4], 9=[1 5]}

I'm not sure I understand the meaning of these numbers. Looking at the source code didn't help either.

Thanks

Source

2011-07-04 pnsilva

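The first number is a cluster ID (representing tokens that stand for the same entity); see the source code of SieveCoreferenceSystem#coref(Document). The number pairs are the output of CorefChain#toString():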

public String toString(){ 
    return position.toString(); 
}

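where position is a set of position pairs for the mentions of the entity (to get them, use CorefChain.getCorefMentions()). Here is a complete code example (in Groovy) that shows how to get from positions to tokens: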

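// Package names as in the dcoref-era CoreNLP releases; adjust for your version.
import edu.stanford.nlp.dcoref.CorefChain
import edu.stanford.nlp.dcoref.CorefChain.CorefMention
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP

class Example {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        props.put("dcoref.score", true);
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // The raw input text, reused below when printing mentions.
        String aText = "The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.";
        Annotation document = new Annotation(aText);

        pipeline.annotate(document);
        Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);

        println aText

        for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
            CorefChain c = entry.getValue();
            println "ClusterId: " + entry.getKey();
            CorefMention cm = c.getRepresentativeMention();
            // Caveat: startIndex/endIndex are token indices (1-based), but
            // subSequence treats them as character offsets, so the printed
            // substrings are not the actual mentions.
            println "Representative Mention: " + aText.subSequence(cm.startIndex, cm.endIndex);

            List<CorefMention> cms = c.getCorefMentions();
            println "Mentions: ";
            cms.each { it ->
                print aText.subSequence(it.startIndex, it.endIndex) + "|";
            }
        }
    }
}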

Output (I don't understand where the "s" comes from):

The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons. 
ClusterId: 1 
Representative Mention: he 
Mentions: he|atom |s| 
ClusterId: 6 
Representative Mention: basic unit 
Mentions: basic unit | 
ClusterId: 8 
Representative Mention: unit 
Mentions: unit | 
ClusterId: 10 
Representative Mention: it 
Mentions: it |

Source

2011-07-06 12:42:35 Skarab

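P.S. I think the default settings (models) are not a good fit for your domain. Stanford CoreNLP seems better suited to extracting semantics from news, articles, and so on. For example, Stanford NER, which is part of CoreNLP, was trained and tested on the CoNLL 2002 and 2003 corpora. – Skarab

This answer was partially useful and led me to the right algorithm, but the output shown here is incorrect for the sentence: there is no "he" or "s" in the sentence, and "it" just maps to itself, which defeats the point of coreference resolution. – user1084563

I think you are treating 'startIndex' and 'endIndex' as if they were character indices (starting from 0), but they are token indices (starting from 1). Also, you did not define 'aText'. Assuming you mean the text in the annotation, instead of "he" (characters 1 and 2) you should get "The atom" (words 1 and 2), and so on. –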
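To make the character-versus-token distinction from the last comment concrete, here is a minimal sketch that runs without CoreNLP. The token list is hand-written and only stands in for what TokensAnnotation would return, `spanText` is a hypothetical helper, and the end index is assumed to be exclusive, which matches the "words 1 and 2" reading for the span (1, 3).

```java
import java.util.Arrays;
import java.util.List;

public class TokenSpanDemo {
    // Joins the tokens of a mention span. startIndex is 1-based and
    // endIndex is exclusive (an assumption of this sketch).
    static String spanText(List<String> tokens, int startIndex, int endIndex) {
        return String.join(" ", tokens.subList(startIndex - 1, endIndex - 1));
    }

    public static void main(String[] args) {
        String text = "The atom is a basic unit of matter, it consists";
        // Hand-written tokenization of the text above (illustrative only).
        List<String> tokens = Arrays.asList(
            "The", "atom", "is", "a", "basic", "unit", "of", "matter", ",", "it", "consists");

        // The same pair of numbers, read two different ways.
        System.out.println("as token span:   " + spanText(tokens, 1, 3));
        System.out.println("as char offsets: " + text.substring(1, 3));
    }
}
```

Read as a token span, (1, 3) gives "The atom"; read as character offsets, the same numbers give "he", which is the stray text in the output above.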

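I've been working with the coreference dependency graph, and I started out by using the other answer to this question. After a while, though, I realized that the algorithm above is not quite correct. The output it produces is not even close to what my modified version gives.

For anyone else who uses this post, here is the algorithm I ended up with. It also filters out self-references, because every representative mention also mentions itself, and many mentions only refer to themselves.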

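// Package names as in the dcoref-era CoreNLP releases; adjust for your version.
import java.util.List;
import java.util.Map;
import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefChain.CorefMention;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;

// `document` is an Annotation that has already been run through the pipeline.
Map<Integer, CorefChain> coref = document.get(CorefChainAnnotation.class);

for (Map.Entry<Integer, CorefChain> entry : coref.entrySet()) {
    CorefChain c = entry.getValue();

    // Skip chains with a single mention: they are self-references and not that useful.
    if (c.getCorefMentions().size() <= 1)
        continue;

    // Rebuild the representative mention from its tokens.
    // sentNum and startIndex are 1-based; endIndex is exclusive.
    CorefMention cm = c.getRepresentativeMention();
    String clust = "";
    List<CoreLabel> tks = document.get(SentencesAnnotation.class).get(cm.sentNum - 1).get(TokensAnnotation.class);
    for (int i = cm.startIndex - 1; i < cm.endIndex - 1; i++)
        clust += tks.get(i).get(TextAnnotation.class) + " ";
    clust = clust.trim();
    System.out.println("representative mention: \"" + clust + "\" is mentioned by:");

    for (CorefMention m : c.getCorefMentions()) {
        // Rebuild the text of each mention in the chain the same way.
        String clust2 = "";
        tks = document.get(SentencesAnnotation.class).get(m.sentNum - 1).get(TokensAnnotation.class);
        for (int i = m.startIndex - 1; i < m.endIndex - 1; i++)
            clust2 += tks.get(i).get(TextAnnotation.class) + " ";
        clust2 = clust2.trim();
        // Don't print the representative mention as a mention of itself.
        if (clust.equals(clust2))
            continue;

        System.out.println("\t" + clust2);
    }
}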

For your example sentence, the final output is as follows:

representative mention: "a basic unit of matter" is mentioned by: 
The atom 
it

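Usually "The atom" ends up being the representative mention, but surprisingly it does not in this case. Another example, with slightly more accurate output, is the following sentence:

The Revolutionary War occurred during the 1700s and it was the first war in the United States.

which produces the following output: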

representative mention: "The Revolutionary War" is mentioned by: 
it 
the first war in the United States
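The two filters used in this answer (skip chains with only one mention, and skip the mention whose text equals the representative's) can be tried without CoreNLP. The sketch below works on plain strings; `otherMentions` is a hypothetical helper, and the single-mention chain "the 1700s" is made up for illustration rather than taken from real annotator output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SelfReferenceFilter {
    // Returns the mentions worth printing for one chain: nothing for a
    // single-mention chain, otherwise every mention whose text differs
    // from the representative mention.
    static List<String> otherMentions(String representative, List<String> mentions) {
        List<String> result = new ArrayList<>();
        if (mentions.size() <= 1) {
            return result; // the chain only refers to itself
        }
        for (String mention : mentions) {
            if (!mention.equals(representative)) {
                result.add(mention);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> war = Arrays.asList(
            "The Revolutionary War", "it", "the first war in the United States");
        System.out.println(otherMentions("The Revolutionary War", war));

        // Hypothetical single-mention chain: filtered out entirely.
        System.out.println(otherMentions("the 1700s", Arrays.asList("the 1700s")));
    }
}
```

Comparing by text, as the answer does, also drops any other mention that happens to have the same wording as the representative; comparing mention positions instead would avoid that, at the cost of a little more bookkeeping.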

Source

2011-12-16 13:43:58 user1084563

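These are the most recent results from the annotator.

[1,1] 1 The atom
[1,2] 1 a basic unit of matter
[1,3] 1 it
[1,6] 6 negatively charged electrons
[1,5] 5 a cloud of negatively charged electrons

The notation is as follows: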

[Sentence number,'id'] Cluster_no Text_Associated

Texts that belong to the same cluster refer to the same context.
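As a rough illustration of reading this notation, the sketch below groups lines of the form "[sentence,id] cluster text" by cluster number. It is plain string processing, not a CoreNLP API: the class, the `groupByCluster` helper and the regular expression are assumptions of this example, and the sample lines are the ones from the listing above, written in English.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClusterGrouping {
    // Matches "[sentence,id] cluster text", e.g. "[1,2] 1 a basic unit of matter".
    private static final Pattern LINE =
        Pattern.compile("\\[(\\d+)\\s*,\\s*(\\d+)\\]\\s+(\\d+)\\s+(.+)");

    // Maps each cluster number to the mention texts listed under it,
    // keeping the order in which clusters first appear.
    static Map<Integer, List<String>> groupByCluster(List<String> lines) {
        Map<Integer, List<String>> clusters = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line.trim());
            if (!m.matches()) {
                continue; // ignore lines that are not in the expected notation
            }
            int cluster = Integer.parseInt(m.group(3));
            clusters.computeIfAbsent(cluster, k -> new ArrayList<>()).add(m.group(4));
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "[1,1] 1 The atom",
            "[1,2] 1 a basic unit of matter",
            "[1,3] 1 it",
            "[1,6] 6 negatively charged electrons",
            "[1,5] 5 a cloud of negatively charged electrons");
        for (Map.Entry<Integer, List<String>> e : groupByCluster(lines).entrySet()) {
            System.out.println("cluster " + e.getKey() + ": " + e.getValue());
        }
    }
}
```

Grouped this way, cluster 1 holds "The atom", "a basic unit of matter" and "it", while clusters 6 and 5 each hold a single mention.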

Source

2017-07-18 07:00:50 Purvanshi
