Lucene.NET: camel case tokenizer?

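I have started using Lucene.NET and wrote a simple test method that indexes and searches source code files. The problem is that the standard analyzers/tokenizers treat a whole camel case identifier name from the source code as a single token.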

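I am looking for a way to split a camel case identifier such as MaxWidth into three tokens: maxwidth, max and width. I have looked for such a tokenizer but could not find one. Before I write my own: is there anything that already does this? Or is there a better approach than writing a tokenizer from scratch?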
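To make the desired output concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class name `CamelCaseTokens`, the method `tokens` and the boundary regex are assumptions made for illustration; they are not part of Lucene or Lucene.NET. It only shows which tokens an identifier such as MaxWidth would yield, not how to wire them into an analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CamelCaseTokens {

    // Hypothetical boundary pattern (an assumption, not a Lucene API).
    // Zero-width matches: lowercase/digit before uppercase ("max|Width"),
    // and the end of an acronym before a capitalized word ("XML|Http").
    private static final String BOUNDARY =
            "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    // Returns the lowercased identifier followed by its lowercased parts.
    static List<String> tokens(String identifier) {
        List<String> result = new ArrayList<>();
        result.add(identifier.toLowerCase(Locale.ROOT));
        String[] parts = identifier.split(BOUNDARY);
        if (parts.length > 1) { // a single-word identifier yields one token only
            for (String part : parts) {
                result.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("MaxWidth")); // [maxwidth, max, width]
    }
}
```

In a real analyzer the parts would probably be emitted as extra tokens at the same position as the full identifier, so that a search for either max or maxwidth matches.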

UPDATE: in the end I decided to get my hands dirty and wrote a CamelCaseTokenFilter myself. I will write a blog post about it and update this question.


2010-09-10 Igor Brejc

Solr has a WordDelimiterFilterFactory, which produces a token filter similar to what you need. Maybe you can translate its source code to C#.


2010-09-10 21:23:17

Yes, I had already noticed that, although it does not really do what I am looking for. In the end I wrote a CamelCaseTokenFilter myself. But I will accept your answer. – 2010-09-11 06:13:19

The following link may be helpful for writing a custom tokenizer:

http://karticles.com/NoSql/lucene_custom_tokenizer.html


2012-02-27 16:10:44 vrluckyin

Here is my implementation:

package corp.sap.research.indexing;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Rewrites each camel case term as space-separated words, e.g. "MaxWidth" -> "Max Width".
// Note: the result is still a single token whose text contains spaces.
public class CamelCaseFilter extends TokenFilter {

    private final CharTermAttribute termAtt;

    // The constructor name must match the class name.
    public CamelCaseFilter(TokenStream input) {
        super(input);
        this.termAtt = addAttribute(CharTermAttribute.class);
    }

    // Lucene expects incrementToken() (or the whole class) to be final.
    @Override
    public final boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String splitString = splitCamelCase(termAtt.toString());
        termAtt.setEmpty();
        termAtt.append(splitString);
        return true;
    }

    // Inserts a space at every zero-width word boundary.
    static String splitCamelCase(String s) {
        return s.replaceAll(
            String.format("%s|%s|%s",
                "(?<=[A-Z])(?=[A-Z][a-z])",   // acronym followed by a capitalized word
                "(?<=[^A-Z])(?=[A-Z])",       // non-uppercase followed by uppercase
                "(?<=[A-Za-z])(?=[^A-Za-z])"  // letter followed by a non-letter
            ),
            " "
        );
    }
}
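To see what the three regex alternatives in splitCamelCase do, the method can be tried on its own, outside Lucene. The following is a small sketch; the class name `SplitCamelCaseDemo` is hypothetical and the pattern is copied from the answer above.

```java
class SplitCamelCaseDemo {

    // Same three zero-width alternatives as in the answer's splitCamelCase.
    static String splitCamelCase(String s) {
        return s.replaceAll(
            "(?<=[A-Z])(?=[A-Z][a-z])"        // acronym followed by a word: "FOR|Me"
            + "|(?<=[^A-Z])(?=[A-Z])"         // non-uppercase followed by uppercase: "Camel|Case"
            + "|(?<=[A-Za-z])(?=[^A-Za-z])",  // letter followed by a non-letter: "width|2"
            " ");
    }

    public static void main(String[] args) {
        System.out.println(splitCamelCase("CamelCaseAWordFORMe")); // Camel Case A Word FOR Me
        System.out.println(splitCamelCase("MaxWidth"));            // Max Width
    }
}
```

Note that the filter stores the result as one term containing spaces ("Max Width"), not as separate tokens. To get separate max and width tokens in the index, a later stage would presumably have to split on whitespace and lowercase the parts.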


2012-03-19 16:48:54

Adir, this seems fine. Here is the core of my implementation of it in Python: re.sub('((?<=[A-Z])(?=[A-Z][a-z])|(?<=[^A-Z])(?=[A-Z])|(?<=[A-Za-z])(?=[^A-Za-z]))', ' ', "CamelCaseAWordFORMe") – 2012-08-23 11:33:48
