2011-02-11 49 views

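I need to search for the term "I+D", but my analyzer does not handle the + (plus) and - (minus) symbols. How can I search for it? How can I index and search the + and - symbols in Lucene?

My custom analyzer: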

/** 
* Copyright (c) 2006 Hugo Zaragoza and Jose R. Pérez-Agüera 
* All rights reserved. 
* 
* Redistribution and use in source and binary forms, with or without 
* modification, are permitted provided that the following conditions 
* are met: 
* 1. Redistributions of source code must retain the above copyright 
* notice, this list of conditions and the following disclaimer. 
* 2. Redistributions in binary form must reproduce the above copyright 
* notice, this list of conditions and the following disclaimer in the 
* documentation and/or other materials provided with the distribution. 
* 3. Neither the name of copyright holders nor the names of its 
* contributors may be used to endorse or promote products derived 
* from this software without specific prior written permission. 
* 
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 
* ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED 
* TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR 
* PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL COPYRIGHT HOLDERS OR CONTRIBUTORS 
* BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 
* CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 
* SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 
* INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 
* CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 
* ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 
* POSSIBILITY OF SUCH DAMAGE. 
*/ 
import java.io.BufferedReader; 
import java.io.FileInputStream; 
import java.io.IOException; 
import java.io.InputStream; 
import java.io.InputStreamReader; 
import java.io.Reader; 
import java.util.ArrayList; 
import java.util.Set; 


import org.apache.lucene.analysis.Analyzer; 
import org.apache.lucene.analysis.LowerCaseFilter; 
import org.apache.lucene.analysis.StopFilter; 
import org.apache.lucene.analysis.TokenStream; 
import org.apache.lucene.analysis.standard.StandardFilter; 
import org.apache.lucene.analysis.standard.StandardTokenizer; 

/** 
* Spanish Lucene analyzer 
* @author Hugo Zaragoza and Jose R. Pérez-Agüera 
*/ 
public class SpanishAnalyzer extends Analyzer { 

    private Set stopSet; 

    /** 
    * Creates the Lucene Spanish Analyzer 
    * @throws IOException 
    */ 
    public SpanishAnalyzer() throws IOException { 
     super(); 
     stopSet = StopFilter.makeStopSet(loadStopWords()); 
    } 

    /** Constructs a {@link StandardTokenizer} filtered by a {@link 
    StandardFilter}, a {@link LowerCaseFilter} and a {@link StopFilter}. */ 
    public TokenStream tokenStream(String fieldName, Reader reader) { 
     TokenStream result = new StandardTokenizer(reader); 
     result = new StandardFilter(result); 
     result = new LowerCaseFilter(result); 
     result = new StopFilter(result, stopSet); 
     result = new SpanishStemmerFilter(result); 
     return result; 
    } 

    /** 
    * Loads the spanish stop-words list 
    * @throws IOException 
    */ 
    private static String[] loadStopWords() throws IOException { 

     // The path is relative to the working directory; the platform default charset is used. 
     InputStream inputStream = new FileInputStream("stopwords-spanish.txt"); 
     Reader reader = new InputStreamReader(inputStream); 
     BufferedReader br = new BufferedReader(reader); 
     ArrayList<String> list = new ArrayList<String>(); 
     try { 
      // One stop word per line; surrounding whitespace is trimmed. 
      String line = br.readLine(); 
      while (line != null) { 
       list.add(line.trim()); 
       line = br.readLine(); 
      } 
     } finally { 
      // Close the reader (and the underlying file) even if reading fails. 
      br.close(); 
     } 

     return list.toArray(new String[list.size()]); 
    } 
} 

Where in your code do you perform the search? – maks 2011-02-11 12:14:48

Answers


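What do you mean by "it doesn't work"? The analyzer should handle these characters fine. Do you mean the QueryParser? If so, you can bypass it and build the query by hand, for example with a TermQuery: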

Query q = new TermQuery(new Term("field", "I+D")); 
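If the query string does have to go through QueryParser, note that + and - are operators in its syntax (required and prohibited clauses), so they would need a backslash in front of them. Lucene provides the static method QueryParser.escape(String) for this. The sketch below only mimics that idea in plain Java so it can run without Lucene on the classpath; the class name QueryEscapeSketch and its character list are assumptions made for illustration, not Lucene's own implementation.

```java
/** Plain-Java illustration only; in real code, prefer Lucene's QueryParser.escape(String). */
class QueryEscapeSketch {

    // Characters treated as query syntax in this sketch (an assumed list, not taken from Lucene's source).
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    /** Returns the input with a backslash placed before every character listed in SPECIAL. */
    static String escape(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 4);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("I+D")); // prints I\+D
    }
}
```

Escaping only stops the parser from reading + as an operator. The escaped text is still passed to the analyzer, so a tokenizer that splits at + would still break the term apart.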

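Or do you mean the fact that StandardTokenizer splits tokens at non-word characters (such as "+" or "-")? If so, you can simply use a different tokenizer (e.g. WhitespaceTokenizer) or implement your own.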
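To make the difference concrete, here is a rough plain-Java sketch of the two splitting strategies. It is not Lucene code, and it only approximates what StandardTokenizer and WhitespaceTokenizer do; the class and method names are hypothetical. Tokens are lowercased, as the LowerCaseFilter in the question's analyzer would do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Plain-Java illustration only; this approximates tokenizer behaviour and is not Lucene code. */
class TokenSplitSketch {

    /** Splits at every run of characters that are neither letters nor digits, so '+' is lost. */
    static List<String> splitOnNonWordChars(String text) {
        return collect(text.split("[^\\p{L}\\p{N}]+"));
    }

    /** Splits at whitespace only, so punctuation such as '+' stays inside the token. */
    static List<String> splitOnWhitespace(String text) {
        return collect(text.trim().split("\\s+"));
    }

    /** Drops empty fragments and lowercases the rest. */
    private static List<String> collect(String[] parts) {
        List<String> tokens = new ArrayList<String>();
        for (String part : parts) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Proyectos de I+D";
        System.out.println(splitOnNonWordChars(text)); // prints [proyectos, de, i, d]
        System.out.println(splitOnWhitespace(text));   // prints [proyectos, de, i+d]
    }
}
```

Two consequences seem worth checking. With whitespace splitting plus lowercasing, the indexed term would be "i+d" rather than "I+D", and a hand-built TermQuery is not analyzed, so its text may need to match that form (the stemmer could change it further). Also, documents indexed with the old tokenizer would presumably need to be reindexed before the new tokens can be found.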


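Sorry for the delay. Yes, I want tokens to be split on whitespace, and I now do that in the analyzer, but it still doesn't work. I changed this line of my analyzer: TokenStream result = new StandardTokenizer(reader); to this: TokenStream result = new WhitespaceTokenizer(reader); For example, a search for "I+D" still finds nothing. Thanks for your reply. – bonsai 2011-02-15 08:40:25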