I need to find the most common terms in a text. Looking around, I created my own Analyzer subclass and overrode its createComponents method. The analyzer's token stream throws a StackOverflowError:
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new NGramTokenizer(Version.LUCENE_47, reader, 12, 12);
    TokenStream filter = new LowerCaseFilter(Version.LUCENE_47, source);
    try {
        TokenStream tokenStream = tokenStream(fieldName, reader);
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();
        System.out.println("tokenStream " + tokenStream);
        while (tokenStream.incrementToken()) {
            //int startOffset = offsetAttribute.startOffset();
            //int endOffset = offsetAttribute.endOffset();
            String term = charTermAttribute.toString();
            System.out.println("term = " + term);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return new TokenStreamComponents(source, filter);
}
This is how I call it:
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_47, rma);
StringReader sr = new StringReader(descProd1);
IndexWriter w = new IndexWriter(index, config);
LuceneUtil.addDoc(w, descProd1, "193398817");
rma.createComponents("content", sr);
w.close();
rma.close();
The addDoc method:
public static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
    Document doc = new Document();
    doc.add(new TextField("title", title, Field.Store.YES));
    doc.add(new StringField("isbn", isbn, Field.Store.YES));
    w.addDocument(doc);
}
When I run this, it blows up with a java.lang.StackOverflowError on this line:

TokenStream tokenStream = tokenStream(fieldName, reader);

I'm new to Lucene, so I don't know whether I'm on the right track. Am I?
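The StackOverflowError comes from recursion: Analyzer.tokenStream() internally calls createComponents(), so invoking tokenStream() from inside the override re-enters createComponents() forever. A minimal sketch of the override with the recursive call removed (it only wires the tokenizer to the filter; consuming the tokens happens outside the analyzer):

```java
@Override
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Only build the component chain here; never call tokenStream() from
    // inside createComponents(), because tokenStream() calls this method.
    Tokenizer source = new NGramTokenizer(Version.LUCENE_47, reader, 12, 12);
    TokenStream filter = new LowerCaseFilter(Version.LUCENE_47, source);
    return new TokenStreamComponents(source, filter);
}
```

The reset()/incrementToken() loop then belongs in the calling code, operating on the TokenStream returned by analyzer.tokenStream(fieldName, reader).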
Cool, that got rid of the infinite loop, but how do I now get the most common terms out of the TokenStreamComponents? – Eddy
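Regarding the follow-up: the most common terms don't come from the TokenStreamComponents itself; you consume analyzer.tokenStream(field, reader), read each CharTermAttribute value in the incrementToken() loop, and tally the strings. The tallying part is plain Java and can be sketched without Lucene (the term list below is hypothetical stand-in data for the tokens the analyzer would emit):

```java
import java.util.*;
import java.util.stream.*;

public class TermCount {
    // Count term frequencies and return the n most common terms.
    // With Lucene, each term would come from charTermAttribute.toString()
    // inside the incrementToken() loop; the counting logic is identical.
    static List<Map.Entry<String, Long>> topTerms(List<String> terms, int n) {
        Map<String, Long> counts = terms.stream()
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("lucene", "index", "lucene", "term", "lucene", "index");
        for (Map.Entry<String, Long> e : topTerms(terms, 2)) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
        // prints:
        // lucene = 3
        // index = 2
    }
}
```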