2012-01-06 18 views

Lucene query (shingles?)

I have a Lucene index containing documents like this:

_id | Name                    | Alternate Names     | Population

123 | Bosc de Planavilla      | (some names here in | 5000
345 | Planavilla              |  other languages)   | 20000
456 | Bosc de la Planassa     |                     | 1000
567 | Bosc de Plana en Blanca |                     | 100000

What is the best type of Lucene query to use, and how should I structure it, given that I need the following:

  1. If a user queries "italian restaurant near Bosc de Planavilla", I want the document with id 123 returned, because the query contains an exact match of its name.

  2. If a user queries "italian restaurant near Planavilla", I want the document with id 345, because the query contains an exact match of its name and it has the highest population.

  3. If a user queries "italian restaurant near Bosc", I want 567, because the query contains "Bosc" and, of the 3 documents containing "Bosc", it has the highest population.

There are probably many other use cases... but you get a feel for what I need...

What kind of query would do this for me? Should I generate word n-grams (shingles) and create an ORed boolean query using the shingles, then apply custom scoring? Or would a regular phrase query do? I have also seen DisjunctionMaxQuery, but have no idea whether it is what I am looking for...

The idea, as you have probably understood by now, is to find the exact location the user implied in his query. From there I can start my geo search and add some further queries around it.
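To pin down the ranking I am after, here is a plain-Java sketch (no Lucene, using the names and populations from the table above): find the place whose name tokens have the longest contiguous run present in the query, and break ties by population. This is only meant as a specification of the desired behavior, not as an implementation.

```java
import java.util.*;

// Plain-Java specification of the desired matching: longest contiguous run
// of (stopword-free) name tokens found in the query wins; ties go to the
// place with the larger population.
public class LocationMatcher {
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("de", "la", "en"));

    // Lower-case the text and drop stopwords.
    static List<String> tokens(String s) {
        List<String> out = new ArrayList<>();
        for (String t : s.toLowerCase().split("\\s+")) {
            if (!t.isEmpty() && !STOPWORDS.contains(t)) out.add(t);
        }
        return out;
    }

    // True if `name` occurs as a contiguous token sequence inside `query`.
    static boolean contains(List<String> query, List<String> name) {
        outer:
        for (int i = 0; i + name.size() <= query.size(); i++) {
            for (int j = 0; j < name.size(); j++) {
                if (!query.get(i + j).equals(name.get(j))) continue outer;
            }
            return true;
        }
        return false;
    }

    // Length of the longest contiguous run of name tokens present in the query.
    static int matchLen(List<String> query, List<String> name) {
        int best = 0;
        for (int start = 0; start < name.size(); start++) {
            for (int end = name.size(); end > start + best; end--) {
                if (contains(query, name.subList(start, end))) {
                    best = end - start;
                    break;
                }
            }
        }
        return best;
    }

    // Longest match wins; population breaks ties.
    static String bestMatch(String query, Map<String, Integer> places) {
        List<String> q = tokens(query);
        String best = null;
        int bestLen = 0, bestPop = -1;
        for (Map.Entry<String, Integer> e : places.entrySet()) {
            int len = matchLen(q, tokens(e.getKey()));
            if (len == 0) continue;
            if (len > bestLen || (len == bestLen && e.getValue() > bestPop)) {
                best = e.getKey();
                bestLen = len;
                bestPop = e.getValue();
            }
        }
        return best;
    }

    static Map<String, Integer> samplePlaces() {
        Map<String, Integer> places = new LinkedHashMap<>();
        places.put("Bosc de Planavilla", 5000);
        places.put("Planavilla", 20000);
        places.put("Bosc de la Planassa", 1000);
        places.put("Bosc de Plana en Blanca", 100000);
        return places;
    }

    public static void main(String[] args) {
        Map<String, Integer> places = samplePlaces();
        System.out.println(bestMatch("italian restaurant near Bosc de Planavilla", places));
        System.out.println(bestMatch("italian restaurant near Planavilla", places));
        System.out.println(bestMatch("italian restaurant near Bosc", places));
    }
}
```

Use case 1 matches the full name, use case 2 wins the tie on population, and use case 3 falls back to the single shared token "Bosc".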

What is the best way to approach this?

Thanks in advance.

Answers


Here is the code with the sorting as well. Although I think it would make more sense to add a custom score that takes city size into account, instead of forcing a sort on population. Also note that this uses the FieldCache, which may not be the best solution with regard to memory usage.

// Requires Lucene 3.0.3 (core + the analyzers contrib for ShingleFilter),
// Guava and JUnit on the classpath.
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.FieldComparator;
import org.apache.lucene.search.FieldComparatorSource;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableSet;

public class ShingleFilterTests { 
    private Analyzer analyzer; 
    private IndexSearcher searcher; 
    private IndexReader reader; 
    private QueryParser qp; 
    private Sort sort; 

    public static Analyzer createAnalyzer(final int shingles) { 
     return new Analyzer() { 
      @Override 
      public TokenStream tokenStream(String fieldName, Reader reader) { 
       TokenStream tokenizer = new WhitespaceTokenizer(reader); 
       tokenizer = new StopFilter(false, tokenizer, ImmutableSet.of("de", "la", "en")); 
       if (shingles > 0) { 
        tokenizer = new ShingleFilter(tokenizer, shingles); 
       } 
       return tokenizer; 
      } 
     }; 
    } 

    public class PopulationComparatorSource extends FieldComparatorSource { 
     @Override 
     public FieldComparator newComparator(String fieldname, int numHits, int sortPos, boolean reversed) throws IOException { 
      return new PopulationComparator(fieldname, numHits); 
     } 

     private class PopulationComparator extends FieldComparator { 
      private final String fieldName; 
      private Integer[] values; 
      private int[] populations; 
      private int bottom; 

      public PopulationComparator(String fieldname, int numHits) { 
       values = new Integer[numHits]; 
       this.fieldName = fieldname; 
      } 

      @Override 
      public int compare(int slot1, int slot2) { 
       if (values[slot1] > values[slot2]) return -1; 
       if (values[slot1] < values[slot2]) return 1; 
       return 0; 
      } 

      @Override 
      public void setBottom(int slot) { 
       bottom = values[slot]; 
      } 

      @Override 
      public int compareBottom(int doc) throws IOException { 
       int value = populations[doc]; 
       if (bottom > value) return -1; 
       if (bottom < value) return 1; 
       return 0; 
      } 

      @Override 
      public void copy(int slot, int doc) throws IOException { 
       values[slot] = populations[doc]; 
      } 

      @Override 
      public void setNextReader(IndexReader reader, int docBase) throws IOException { 
       /* XXX uses field cache */ 
       populations = FieldCache.DEFAULT.getInts(reader, "population"); 
      } 

      @Override 
      public Comparable value(int slot) { 
       return values[slot]; 
      } 
     } 
    } 

    @Before 
    public void setUp() throws Exception { 
     Directory dir = new RAMDirectory(); 
     analyzer = createAnalyzer(3); 

     IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED); 
     ImmutableList<String> cities = ImmutableList.of("Bosc de Planavilla", "Planavilla", "Bosc de la Planassa", 
                   "Bosc de Plana en Blanca"); 
     ImmutableList<Integer> populations = ImmutableList.of(5000, 20000, 1000, 100000); 

     for (int id = 0; id < cities.size(); id++) { 
      Document doc = new Document(); 
      doc.add(new Field("id", String.valueOf(id), Field.Store.YES, Field.Index.NOT_ANALYZED)); 
      doc.add(new Field("city", cities.get(id), Field.Store.YES, Field.Index.ANALYZED)); 
      doc.add(new Field("population", String.valueOf(populations.get(id)), 
            Field.Store.YES, Field.Index.NOT_ANALYZED)); 
      writer.addDocument(doc); 
     } 
     writer.close(); 

     qp = new QueryParser(Version.LUCENE_30, "city", createAnalyzer(0)); 
     sort = new Sort(new SortField("population", new PopulationComparatorSource())); 
     searcher = new IndexSearcher(dir); 
     searcher.setDefaultFieldSortScoring(true, true); 
     reader = searcher.getIndexReader(); 
    } 

    @After 
    public void tearDown() throws Exception { 
     searcher.close(); 
    } 

    @Test 
    public void testShingleFilter() throws Exception { 
     System.out.println("shingle filter"); 

     printSearch("city:\"Bosc de Planavilla\""); 
     printSearch("city:Planavilla"); 
     printSearch("city:Bosc"); 
    } 

    private void printSearch(String query) throws ParseException, IOException { 
     Query q = qp.parse(query); 
     System.out.println("query " + q); 
     TopDocs hits = searcher.search(q, null, 4, sort); 
     System.out.println("results " + hits.totalHits); 
     int i = 1; 
     for (ScoreDoc dc : hits.scoreDocs) { 
      Document doc = reader.document(dc.doc); 
      System.out.println(i++ + ". " + dc + " \"" + doc.get("city") + "\" population: " + doc.get("population")); 
     } 
     System.out.println(); 
    } 
} 

This gives the following results:

query city:"Bosc Planavilla" 
results 1 
1. doc=0 score=1.143841[5000] "Bosc de Planavilla" population: 5000 

query city:Planavilla 
results 2 
1. doc=1 score=1.287682[20000] "Planavilla" population: 20000 
2. doc=0 score=0.643841[5000] "Bosc de Planavilla" population: 5000 

query city:Bosc 
results 3 
1. doc=3 score=0.375[100000] "Bosc de Plana en Blanca" population: 100000 
2. doc=0 score=0.5[5000] "Bosc de Planavilla" population: 5000 
3. doc=2 score=0.5[1000] "Bosc de la Planassa" population: 1000 
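To sketch the custom-score idea mentioned above (blending the text score with the population instead of forcing a population sort), here is some plain Java. The score * log10(population) formula is purely an illustrative assumption on my part; in Lucene you would express something like it with a CustomScoreQuery or a function query rather than re-ranking by hand.

```java
import java.util.*;

// Illustrative blend of text relevance and population. The formula
// score * log10(population) is an assumption for demonstration only,
// not anything Lucene does out of the box.
public class PopulationBoost {
    // One search hit: the city name, its raw text score, and its population.
    static class Hit {
        final String city;
        final double score;
        final int population;

        Hit(String city, double score, int population) {
            this.city = city;
            this.score = score;
            this.population = population;
        }

        double boosted() {
            return score * Math.log10(population);
        }
    }

    // Re-rank hits by the blended score, highest first.
    static List<Hit> rerank(List<Hit> hits) {
        List<Hit> out = new ArrayList<>(hits);
        out.sort((a, b) -> Double.compare(b.boosted(), a.boosted()));
        return out;
    }

    public static void main(String[] args) {
        // Raw scores taken from the city:Bosc run above.
        List<Hit> hits = Arrays.asList(
                new Hit("Bosc de Planavilla", 0.5, 5000),
                new Hit("Bosc de la Planassa", 0.5, 1000),
                new Hit("Bosc de Plana en Blanca", 0.375, 100000));
        for (Hit h : rerank(hits)) {
            System.out.printf("%s boosted=%.3f%n", h.city, h.boosted());
        }
    }
}
```

With the raw scores from the city:Bosc run above, this blend puts Bosc de Plana en Blanca first while still letting a much better text match outweigh a marginally bigger city.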

Thanks a lot! Your approach is similar to the one I ended up with, and it gives good results. But it is not perfect... on a 3 million document index I get response times of up to 1 second (on a single machine). Also I regularly run into quirky cases, like a search for "indian pub paris" returning "Rich Bar Indian Reserve", which is really not what I was after :). If possible I will try to improve this with scoring and index-time boosts depending on the feature type. Thanks for your kind help! – azpublic 2012-01-13 02:31:56


1 second for 3 million documents sounds like far too much. How do you sort? You could use a profiler to check where the CPU time is going. I am searching a 40 million document index with complex queries, faceting and custom sorting in about 70 ms. – wesen 2012-01-13 08:31:06


How do you tokenize the field? Do you store the names as full strings? Also, how do you parse the query?

OK, so I have been playing with this. I use a StopFilter to remove la, en, de, then a ShingleFilter to get the various combinations needed for the "exact matches". So for example Bosc de Planavilla is tokenized as [Bosc] [Bosc Planavilla], and Bosc de Plana en Blanca as [Bosc] [Bosc Plana] [Plana Blanca] [Bosc Plana Blanca]. This way parts of the query can produce an "exact match".
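The StopFilter + ShingleFilter chain described here can be approximated in plain Java. Note that Lucene's ShingleFilter also emits every unigram and keeps track of token positions (and can insert filler tokens for removed stopwords), which this sketch ignores:

```java
import java.util.*;

// Plain-Java approximation of StopFilter + ShingleFilter: drop stopwords,
// then emit every contiguous word n-gram up to maxShingleSize words.
public class ShingleSketch {
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList("de", "la", "en"));

    static List<String> shingles(String text, int maxShingleSize) {
        // Whitespace-tokenize and remove stopwords.
        List<String> words = new ArrayList<>();
        for (String w : text.split("\\s+")) {
            if (!w.isEmpty() && !STOPWORDS.contains(w)) words.add(w);
        }
        // Emit all n-grams of 1..maxShingleSize words starting at each position.
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            StringBuilder sb = new StringBuilder();
            for (int n = 0; n < maxShingleSize && i + n < words.size(); n++) {
                if (n > 0) sb.append(' ');
                sb.append(words.get(i + n));
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shingles("Bosc de Plana en Blanca", 3));
        // [Bosc, Bosc Plana, Bosc Plana Blanca, Plana, Plana Blanca, Blanca]
    }
}
```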

I then query with the exact string the user typed, although some adaptation might be needed there. I went with the simple cases to get results matching what you need.

Here is the code I am using (Lucene 3.0.3):

public class ShingleFilterTests { 
    private Analyzer analyzer; 
    private IndexSearcher searcher; 
    private IndexReader reader; 

    public static Analyzer createAnalyzer(final int shingles) { 
     return new Analyzer() { 
      @Override 
      public TokenStream tokenStream(String fieldName, Reader reader) { 
       TokenStream tokenizer = new WhitespaceTokenizer(reader); 
       tokenizer = new StopFilter(false, tokenizer, ImmutableSet.of("de", "la", "en")); 
       if (shingles > 0) { 
        tokenizer = new ShingleFilter(tokenizer, shingles); 
       } 
       return tokenizer; 
      } 
     }; 
    } 

    @Before 
    public void setUp() throws Exception { 
     Directory dir = new RAMDirectory(); 
     analyzer = createAnalyzer(3); 

     IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED); 
     ImmutableList<String> cities = ImmutableList.of("Bosc de Planavilla", "Planavilla", "Bosc de la Planassa", 
                   "Bosc de Plana en Blanca"); 
     ImmutableList<Integer> populations = ImmutableList.of(5000, 20000, 1000, 100000); 

     for (int id = 0; id < cities.size(); id++) { 
      Document doc = new Document(); 
      doc.add(new Field("id", String.valueOf(id), Field.Store.YES, Field.Index.NOT_ANALYZED)); 
      doc.add(new Field("city", cities.get(id), Field.Store.YES, Field.Index.ANALYZED)); 
      doc.add(new Field("population", String.valueOf(populations.get(id)), 
            Field.Store.YES, Field.Index.NOT_ANALYZED)); 
      writer.addDocument(doc); 
     } 
     writer.close(); 

     searcher = new IndexSearcher(dir); 
     reader = searcher.getIndexReader(); 
    } 

    @After 
    public void tearDown() throws Exception { 
     searcher.close(); 
    } 

    @Test 
    public void testShingleFilter() throws Exception { 
     System.out.println("shingle filter"); 

     QueryParser qp = new QueryParser(Version.LUCENE_30, "city", createAnalyzer(0)); 

     printSearch(qp, "city:\"Bosc de Planavilla\""); 
     printSearch(qp, "city:Planavilla"); 
     printSearch(qp, "city:Bosc"); 
    } 

    private void printSearch(QueryParser qp, String query) throws ParseException, IOException { 
     Query q = qp.parse(query); 

     System.out.println("query " + q); 
     TopDocs hits = searcher.search(q, 4); 
     System.out.println("results " + hits.totalHits); 
     int i = 1; 
     for (ScoreDoc dc : hits.scoreDocs) { 
      Document doc = reader.document(dc.doc); 
      System.out.println(i++ + ". " + dc + " \"" + doc.get("city") + "\" population: " + doc.get("population")); 
     } 
     System.out.println(); 
    } 
} 

I am now looking into sorting by population.

This prints out:

query city:"Bosc Planavilla" 
results 1 
1. doc=0 score=1.143841 "Bosc de Planavilla" population: 5000 

query city:Planavilla 
results 2 
1. doc=1 score=1.287682 "Planavilla" population: 20000 
2. doc=0 score=0.643841 "Bosc de Planavilla" population: 5000 

query city:Bosc 
results 3 
1. doc=0 score=0.5 "Bosc de Planavilla" population: 5000 
2. doc=2 score=0.5 "Bosc de la Planassa" population: 1000 
3. doc=3 score=0.375 "Bosc de Plana en Blanca" population: 100000 

Thanks for the reply wesen. Actually the name field is indexed with a standard tokenizer plus standard, lowercase and stop token filters. But that is easy to change. My question is really also: how should I index, and how should I parse the query? – azpublic 2012-01-12 23:14:35