2014-09-28 16 views

How do I write a Lucene filter that normalizes text?

So, I have my basic code:

public static final Pattern DIACRITICS_AND_FRIENDS 
     = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+"); 


private static String stripDiacritics(String str) { 
    str = Normalizer.normalize(str, Normalizer.Form.NFD); 
    str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll(""); 
    return str; 
} 

But how do I turn this into a TokenFilter? I used NormalizeCharMap before, but that only replaces literal strings. I'm using Lucene 4.
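For reference, the helper above can be exercised on its own, outside Lucene. The following is a minimal sketch: the class name StripDiacriticsDemo and the sample words are assumptions made for illustration, and only the pattern and stripDiacritics come from the question.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern and stripDiacritics come from the question.
final class StripDiacriticsDemo {
    static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    static String stripDiacritics(String str) {
        // NFD splits a precomposed letter such as U+00E9 into 'e' + U+0301 (combining acute).
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        // The combining marks are now separate code points and can be removed.
        return DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("caf\u00e9"));   // prints "cafe"
        System.out.println(stripDiacritics("na\u00efve"));  // prints "naive"
        // U+00F8 has no canonical decomposition, so NFD leaves it untouched.
        System.out.println(stripDiacritics("s\u00f8t").equals("s\u00f8t")); // prints "true"
    }
}
```

Note that letters without a canonical decomposition (the last case) pass through unchanged, so this approach only removes marks that NFD can separate from their base letter.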

Answer

You need to override the incrementToken() method and update the CharTermAttribute inside it:

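import java.io.IOException;
import java.text.Normalizer;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class DiacriticFilter extends TokenFilter {
    private static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    // TokenFilter has no no-arg constructor, so the wrapped stream is passed up.
    public DiacriticFilter(TokenStream input) {
        super(input);
    }

    @Override
    public final boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            // Build the string from the valid prefix only: buffer() is reused, so it
            // can still hold the tail of an earlier, longer token (the bug noted in
            // the comment below).
            String result = stripDiacritics(new String(termAtt.buffer(), 0, termAtt.length()));
            char[] newBuffer = result.toCharArray();
            // copyBuffer replaces the term text and sets its length.
            termAtt.copyBuffer(newBuffer, 0, newBuffer.length);
            return true;
        } else {
            return false;
        }
    }

    private static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }
}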

Thank you very much – 2014-09-30 09:05:50

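Actually, there is a bug in the call to strip: the string has to be created with the correct length, otherwise it can contain leftover characters from a previous, longer token, i.e. String result = stripDiacritics(new String(termAtt.buffer()).substring(0, termAtt.length())); – 2014-09-30 09:59:08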
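The length problem described in the comment above can be reproduced without Lucene. The sketch below is a plain-Java stand-in, not Lucene's CharTermAttribute implementation: the class StaleBufferDemo and its methods are hypothetical, and it only assumes that the term buffer is reused between tokens and that just the first length() characters are valid.

```java
// Hypothetical stand-in for a reused term buffer; it does not use Lucene itself.
final class StaleBufferDemo {
    private final char[] buffer = new char[16];
    private int length;

    // Mimics writing a new token into the existing buffer without clearing the tail.
    void setTerm(String term) {
        term.getChars(0, term.length(), buffer, 0);
        length = term.length();
    }

    // Equivalent of new String(termAtt.buffer()): includes stale characters.
    String wholeBuffer() {
        return new String(buffer);
    }

    // Equivalent of new String(termAtt.buffer(), 0, termAtt.length()).
    String validPart() {
        return new String(buffer, 0, length);
    }

    public static void main(String[] args) {
        StaleBufferDemo att = new StaleBufferDemo();
        att.setTerm("normalize");
        att.setTerm("text"); // shorter token overwrites only the first four chars
        System.out.println(att.validPart());                           // prints "text"
        System.out.println(att.wholeBuffer().startsWith("textalize")); // prints "true"
    }
}
```

Reading only the valid prefix avoids both the leftover tail of the earlier token and any unused padding at the end of the array.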
