
I need to find the most frequent terms in a piece of text. Looking around, I created my own Analyzer subclass and overrode its createComponents method. The analyzer's TokenStream throws a StackOverflowError:

@Override 
protected TokenStreamComponents createComponents(String fieldName, Reader reader) { 

    Tokenizer source = new NGramTokenizer(Version.LUCENE_47, reader, 12, 12); 
    TokenStream filter = new LowerCaseFilter(Version.LUCENE_47, source); 

    try { 

     // this goes back into Analyzer.tokenStream, which calls createComponents again: 
     TokenStream tokenStream = tokenStream(fieldName, reader); 
     OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class); 
     CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class); 
     tokenStream.reset(); 
     System.out.println("tokenStream " + tokenStream); 
     while (tokenStream.incrementToken()) { 
      //int startOffset = offsetAttribute.startOffset(); 
      //int endOffset = offsetAttribute.endOffset(); 
      String term = charTermAttribute.toString(); 
      System.out.println("term = " + term); 
     }     

    } catch(Exception e) { 
     e.printStackTrace(); 
    } 

    return new TokenStreamComponents(source, filter); 
} 

This is how I call it:

Directory index = new RAMDirectory(); 
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_47, rma); // rma is an instance of the custom Analyzer above 

StringReader sr = new StringReader(descProd1); 
IndexWriter w = new IndexWriter(index, config); 
LuceneUtil.addDoc(w, descProd1, "193398817"); 

rma.createComponents("content", sr); 
w.close(); 
rma.close(); 

The addDoc method:

public static void addDoc(IndexWriter w, String title, String isbn) throws IOException { 
    Document doc = new Document(); 
    doc.add(new TextField("title", title, Field.Store.YES)); 

    doc.add(new StringField("isbn", isbn, Field.Store.YES)); 
    w.addDocument(doc); 
} 

When I run this, it blows up with a java.lang.StackOverflowError on this line:

TokenStream tokenStream = tokenStream(fieldName, reader); 

I'm new to Lucene, so I don't know whether I'm on the right path. Am I?

Answers


tokenStream calls createComponents, and your implementation of createComponents calls tokenStream! So you are in infinite recursion!

Why are you reading the stream inside createComponents? Just do:

@Override 
protected TokenStreamComponents createComponents(String fieldName, Reader reader) { 

    Tokenizer source = new NGramTokenizer(Version.LUCENE_47, reader, 12, 12); 
    TokenStream filter = new LowerCaseFilter(Version.LUCENE_47, source); 

    return new TokenStreamComponents(source, filter); 
} 

Then configure your writer config to use your analyzer, and everything will happen behind the scenes.
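
For reference, here is a minimal sketch of both halves: handing the analyzer to the writer so indexing tokenizes behind the scenes, and inspecting tokens by calling the analyzer's public tokenStream method from the outside rather than from inside createComponents. MyNGramAnalyzer is a placeholder name for the custom subclass above:

Analyzer rma = new MyNGramAnalyzer(); // placeholder name for the custom Analyzer 

// 1) let the writer use the analyzer; tokenizing happens during indexing: 
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_47, rma); 
IndexWriter w = new IndexWriter(index, config); 

// 2) to inspect tokens directly, ask the analyzer for a stream from the outside: 
TokenStream ts = rma.tokenStream("content", new StringReader(descProd1)); 
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class); 
ts.reset(); // mandatory before the first incrementToken() 
while (ts.incrementToken()) { 
    System.out.println("term = " + term.toString()); 
} 
ts.end(); 
ts.close(); 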


Cool, that gets rid of the infinite loop, but how do I now get the most frequent terms out of the TokenStreamComponents? – Eddy


I'm the OP, and being new to Lucene I was not on the right track with the code in my question. After more searching I pieced together code that finds the highest-frequency terms. Here it is:

// create an analyzer: 
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_47); 

// create an index and add the text (strings) you want to analyze: 
Directory index = new RAMDirectory(); 
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_47, analyzer); 
IndexWriter w = new IndexWriter(index, config);   
addDoc(w, text1, ""); 
addDoc(w, text2, ""); 
addDoc(w, text3, ""); 
w.close(); 

// a comparator is needed for the HighFreqTerms.getHighFreqTerms method: 
Comparator<TermStats> comparator = new Comparator<TermStats>() {     
    @Override 
    public int compare(TermStats o1, TermStats o2) { 
     if(o1.totalTermFreq > o2.totalTermFreq) { 
      return 1; 
     } else if(o2.totalTermFreq > o1.totalTermFreq) { 
      return -1; 
     } 
     return 0; 
    } 
}; 

// find the highest frequency terms: 
try { 
    // open a reader on the index written above; "title" is the field 
    // that addDoc stores the text in: 
    IndexReader reader = DirectoryReader.open(index); 
    TermStats ts[] = HighFreqTerms.getHighFreqTerms(reader, 50, "title", comparator); 
    for (int i = 0; i < ts.length; i++) { 
        System.out.println(ts[i]); 
    } 
    reader.close(); 
} catch (Exception e) { 
    e.printStackTrace(); 
}
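
Note that HighFreqTerms and TermStats live in the lucene-misc module (package org.apache.lucene.misc), so the lucene-misc 4.7 jar must be on the classpath alongside lucene-core.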