Adding fields to a Lucene document

Hello, I have a 32MB file. It is a simple dictionary file, encoded in Windows-1250 (CP1250), with 2.8 million lines. Each line contains exactly one unique word:

cat 
dog 
god 
...

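I want to use Lucene to search the dictionary for every anagram of a given word. For example:

I want to find every anagram of the word dog, so Lucene should search my dictionary and return dog and god. In my webapp I have a Word entity: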

public class Word { 
    private Long id; 
    private String word; 
    private String baseLetters; 
    private String definition; 
}

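and baseLetters holds the word's letters sorted alphabetically, which is what the anagram search uses [e.g. the words god and dog have the same baseLetters: dgo]. I have already managed to search my database for anagrams using this baseLetters variable in a different service, but I am having trouble creating an index from my dictionary file. I know I have to add two fields:

word and baseLetters, but I don't know how to do it :( Can someone point me in the right direction?

For now I only have something like this: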

public class DictionaryIndexer { 

private static final Logger logger = LoggerFactory.getLogger(DictionaryIndexer.class); 

@Value("${dictionary.path}") 
private String dictionaryPath; 

@Value("${lucene.search.indexDir}") 
private String indexPath; 

public void createIndex() throws CorruptIndexException, LockObtainFailedException { 
    try { 
     IndexWriter indexWriter = getLuceneIndexer(); 
     createDocument();   
    } catch (IOException e) { 
     logger.error(e.getMessage(), e); 
    }  
} 

private IndexWriter getLuceneIndexer() throws CorruptIndexException, LockObtainFailedException, IOException { 
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36); 
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_36, analyzer); 
    indexWriterConfig.setOpenMode(OpenMode.CREATE_OR_APPEND); 
    Directory directory = new SimpleFSDirectory(new File(indexPath)); 
    return new IndexWriter(directory, indexWriterConfig); 
} 

private void createDocument() throws FileNotFoundException { 
    File sjp = new File(dictionaryPath); 
    Reader reader = new FileReader(sjp); 

    Document dictionary = new Document(); 
    dictionary.add(new Field("word", reader)); 
} 

}

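PS: One more question. If I register DocumentIndexer as a Spring bean, will the index be created/appended every time I redeploy my webapp? And will the same happen with the future DictionarySearcher?

Source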

2012-12-21 Mariusz Grodek

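Lucene knows nothing about files; it indexes strings. So you need to read the file line by line and create one "Document" object per line, each with two fields. Each document also has to be added to the index writer.

The createDocument() function should be: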

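// Needs java.io.* plus org.apache.lucene.document.Document and Field.
// indexWriter is passed in by the caller (an assumption: the original
// snippet used it without declaring it).
private void createDocument(IndexWriter indexWriter) throws IOException {
    File sjp = new File(dictionaryPath);
    // The dictionary is encoded in Windows-1250, so decode it explicitly.
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(sjp), "windows-1250"));
    try {
        String readLine;
        while ((readLine = reader.readLine()) != null) {
            readLine = readLine.trim();
            Document dictionary = new Document();
            // NOT_ANALYZED keeps each value as a single exact-match term.
            dictionary.add(new Field("word", readLine,
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            // toAnagram sorts the letters in the word. Also makes it
            // case insensitive.
            dictionary.add(new Field("anagram", toAnagram(readLine),
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            indexWriter.addDocument(dictionary);
        }
    } finally {
        reader.close();
    }
}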
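
The snippet above calls toAnagram without showing it. A minimal sketch of such a helper could look like the following; the class name AnagramKey and the choice of Locale.ROOT for lower-casing are assumptions of mine, not part of the original answer.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper: the answer refers to toAnagram but does not define it.
final class AnagramKey {
    private AnagramKey() {}

    // Lower-cases the word and sorts its characters, so that every
    // anagram of a word produces the same key.
    static String toAnagram(String word) {
        char[] letters = word.toLowerCase(Locale.ROOT).toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    public static void main(String[] args) {
        System.out.println(toAnagram("dog")); // prints "dgo"
        System.out.println(toAnagram("God")); // prints "dgo"
    }
}
```

Because dog and god both map to dgo, an exact-match query on the anagram field for that key would return both words.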

You could also model your index with one entry per anagram group.

{"anagram" : "scare", "words":["cares", "acres"]} 
{"anagram" : "shoes", "words":["hoses"]} 
{"anagram" : "spore", "words":["pores", "prose", "ropes"]}

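This would require updating existing documents in the index while you process the dictionary file. In that case Solr's higher-level API would help. For example, IndexWriter does not support updating documents in place (its updateDocument call deletes the old document and adds the new one), whereas Solr supports updates.

Such an index would return a single result document for every anagram search.

Hope it helps.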

Source

2013-01-01 11:11:36 krishnakumarp

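Thank you very much. I just want to get to know Lucene, so I am going with your solution. My project is at an early stage, and in the future I will very likely use more of Apache Lucene's features.

Lucene is not the best tool for this, because you are not doing a search: you are doing a lookup. All the real work happens in the "indexer", and after that you only store the results of that work. The lookup can be O(1) in any hash-based storage mechanism.

Here is what your indexer should do: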

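Read the entire dictionary into a simple structure such as a SortedSet or String[]
Create an empty HashMap<String,List<String>> (probably sized like the dictionary, for performance) to store the results
Iterate through the dictionary alphabetically (really, any order will work, just make sure you hit every entry)
    1. Sort the letters in the word
    2. Look up the sorted letters in the storage collection
    3. If the lookup succeeds, add the current word to the list; otherwise, create a new list containing the word and put it into the storage Map
If you need this map later, store it on disk; otherwise, keep it in memory
Discard the dictionary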
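
The indexing steps above can be sketched roughly as follows. This is only an illustration under my own assumptions: the names AnagramIndexSketch, sortLetters and buildIndex are hypothetical, the words come from an in-memory list rather than the 32MB file, and persisting the map to disk is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the indexer described above.
final class AnagramIndexSketch {
    private AnagramIndexSketch() {}

    // Step 1: sort the letters in the word (lower-cased first).
    static String sortLetters(String word) {
        char[] letters = word.toLowerCase(Locale.ROOT).toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // Groups every dictionary word under its sorted-letter key.
    static Map<String, List<String>> buildIndex(Iterable<String> words) {
        Map<String, List<String>> index = new HashMap<String, List<String>>();
        for (String word : words) {
            String key = sortLetters(word);
            // Step 2: look up the sorted letters in the storage map.
            List<String> group = index.get(key);
            if (group == null) {
                // Step 3 (miss): start a new list for this key.
                group = new ArrayList<String>();
                index.put(key, group);
            }
            // Step 3 (hit or miss): record the current word.
            group.add(word);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<String>> index =
                buildIndex(Arrays.asList("cat", "dog", "god"));
        System.out.println(index.get("dgo")); // prints [dog, god]
        System.out.println(index.get("act")); // prints [cat]
    }
}
```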

Here is what your lookup process should do:

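Sort the letters of the sample word
Look up the sorted letters in your storage collection
Print the List returned by the lookup (or null), taking care to omit the sample word from the output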
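
A minimal sketch of that lookup, again with hypothetical names (AnagramLookupSketch, lookup) and a tiny hand-built map standing in for the real one. As a design choice of mine, it returns an empty list rather than null when nothing matches, which saves callers a null check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the lookup described above.
final class AnagramLookupSketch {
    private AnagramLookupSketch() {}

    static String sortLetters(String word) {
        char[] letters = word.toLowerCase(Locale.ROOT).toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // Returns the anagrams of sample, leaving out sample itself.
    // The result is empty when the key is not in the map.
    static List<String> lookup(Map<String, List<String>> index, String sample) {
        List<String> group = index.get(sortLetters(sample));
        List<String> result = new ArrayList<String>();
        if (group != null) {
            for (String word : group) {
                if (!word.equalsIgnoreCase(sample)) {
                    result.add(word);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> index = new HashMap<String, List<String>>();
        index.put("dgo", Arrays.asList("dog", "god"));
        index.put("act", Arrays.asList("cat"));
        System.out.println(lookup(index, "dog")); // prints [god]
        System.out.println(lookup(index, "xyz")); // prints []
    }
}
```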

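If you want to save heap space, consider using a DAWG. You will find that you can represent an entire dictionary of English words in a few hundred kilobytes instead of 32MiB. I will leave that as an exercise for the reader.

Good luck with your homework. If you end up using many of Lucene's features, consider Apache Solr, a search platform built on top of Lucene.

Source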

2012-12-28 16:35:19

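Hello, and thank you for your solution. First of all, this is not homework but a real-life problem in my project, and I am only weighing which way solves it best. The dictionary file is not mine; it is just a resource from the internet. I want to solve this following the [KISS](http://en.wikipedia.org/wiki/KISS_principle) principle, and I thought I could use Lucene for the search or lookup. Do you really think your solution has an advantage over Lucene? This lookup will be a basic feature of my project and will be used heavily.

Lucene's strength lies in building complex indexes from complex text-based data. Your data is not complex (single words), and your index is not complex (a variation of a single word). You can certainly use Lucene, but it is far more general than what you need: you want an exact match on a single field. You could also use an RDBMS, but you already know that would be silly... and if your needs are fully captured in your question, then using Lucene is equally silly.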
