Indexing files in a folder

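How can I index all the document files in a particular folder? Suppose I have a mydocuments folder that contains doc and docx files. I need to index every file in that folder so that I can search them efficiently. What would you suggest for indexing a folder of doc files? Note: I looked at Sphinx, but it seems to index only XML and MSSQL.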

Source

2013-03-08 torayeff

Which version of Solr are you using? Have you looked at https://wiki.apache.org/solr/ExtractingRequestHandler or SolrCell? With them you can index doc files. – jpee 2013-03-08 19:40:50

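My answer applies to Lucene.

Lucene does not "directly" provide an API for indexing the contents of a file or a folder. What we have to do is:

Parse the file. You can use Apache Tika, which supports parsing many file formats.
Populate a Lucene Document object with that information.
Pass that document to IndexWriter.addDocument().
Repeat the steps above for each file (that is, for each distinct entry in the index).

Even if direct indexing existed, its drawback would be the loss of flexibility in creating fields and in choosing which content goes into each field for a given document.

Here is a good tutorial where you can find sample code: Lucene in 5 minutes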

Source

2013-03-08 19:45:06 phani

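I think your problem is indexing a list of text files in some folder, so here is sample code that indexes them. However, if you want to index Word documents, you need to change the getDocument method so that it parses them and populates the Lucene document.

The key points are:

Create an IndexWriter.
Use the dir.listFiles() method to get the list of files in the folder.
Iterate over the files and create a Lucene document for each, one at a time.
Add the Lucene document to the index.
Once you have finished adding documents, commit the changes and close the indexWriter.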
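The file-listing step can be narrowed to Word documents before anything is handed to the indexer. The following is a minimal sketch using only the Java standard library; the class name WordFileLister and the method listWordFiles are hypothetical helpers, not part of the answer's code. It assumes that only files directly inside the folder matter (no recursion into subfolders) and that the file extension is a good enough test.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the answer's code.
class WordFileLister {

    // Returns the .doc and .docx files directly inside dir, sorted by path.
    static List<File> listWordFiles(File dir) throws IOException {
        File[] entries = dir.listFiles();
        if (entries == null) {
            // listFiles() returns null when dir is not a directory or cannot be read.
            throw new IOException("Not a readable directory: " + dir);
        }
        Arrays.sort(entries); // listFiles() makes no ordering guarantee
        List<File> result = new ArrayList<File>();
        for (File entry : entries) {
            String name = entry.getName().toLowerCase(Locale.ROOT);
            // Skip subdirectories, even if their names end in .doc or .docx.
            if (entry.isFile() && (name.endsWith(".doc") || name.endsWith(".docx"))) {
                result.add(entry);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (File file : listWordFiles(dir)) {
            System.out.println(file.getPath());
        }
    }
}
```

The returned list could then replace the bare dir.listFiles() loop, so that unrelated files in the folder never reach getDocument.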

If you want to parse and read Word documents or PDF files, you need to use the Apache POI and PDFBox libraries.
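To give a rough idea of what such parsing involves, here is a sketch that does not use Apache POI at all. It relies on the fact that a .docx file is a ZIP archive whose main body is stored in the entry word/document.xml, and it strips the XML tags with regular expressions. The class DocxTextSketch is hypothetical, the code assumes the XML is UTF-8 encoded, and it ignores headers, footnotes, tabs and numeric character references. It cannot read the older binary .doc format, which is where POI or Tika is really needed.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration; real code should prefer Apache POI or Tika.
class DocxTextSketch {

    // Returns a rough plain-text rendering of the body of a .docx file.
    static String extractText(File docx) throws IOException {
        ZipFile zip = new ZipFile(docx);
        try {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("No word/document.xml in " + docx);
            }
            // Read the whole entry into memory (assumed to be UTF-8 XML).
            InputStream in = zip.getInputStream(entry);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            String xml = out.toString("UTF-8");
            return xml.replaceAll("</w:p>", "\n")   // paragraph ends become line breaks
                      .replaceAll("<[^>]+>", "")    // drop all remaining tags
                      .replace("&lt;", "<")
                      .replace("&gt;", ">")
                      .replace("&quot;", "\"")
                      .replace("&apos;", "'")
                      .replace("&amp;", "&")        // last, so "&amp;lt;" becomes "&lt;"
                      .trim();
        } finally {
            zip.close(); // also closes the entry's input stream
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocxTextSketch <file.docx>");
            return;
        }
        System.out.println(extractText(new File(args[0])));
    }
}
```

The extracted string could be placed in the "text" field of a Lucene document in the same way the plain-text contents are.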

Note that I use the RAMDirectory class only for demonstration; you need to use FSDirectory instead.

I hope this solves your problem.

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Scanner;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class IndexFolders {

    public static void main(String[] args) throws IOException {
        // Folder to index, passed as the first command-line argument.
        File dir = new File(args[0]);
        File[] files = dir.listFiles();
        if (files == null) {
            // listFiles() returns null if the path is not a readable directory.
            throw new IOException("Not a readable directory: " + dir);
        }

        // In-memory index, for demonstration only; use FSDirectory to persist it.
        Directory indexDir = new RAMDirectory();
        Version version = Version.LUCENE_40;
        Analyzer analyzer = new StandardAnalyzer(version);
        IndexWriterConfig config = new IndexWriterConfig(version, analyzer);
        IndexWriter indexWriter = new IndexWriter(indexDir, config);

        // One Lucene document per regular file; subdirectories are skipped.
        for (File file : files) {
            if (file.isFile()) {
                indexWriter.addDocument(getDocument(file));
            }
        }

        indexWriter.commit();
        indexWriter.close();
    }

    // Reads a plain-text file into a single stored, tokenized "text" field.
    public static Document getDocument(File file) throws FileNotFoundException {
        StringBuilder builder = new StringBuilder();
        Scanner input = new Scanner(file);
        try {
            while (input.hasNextLine()) {
                // Keep the line break so words on adjacent lines are not merged.
                builder.append(input.nextLine()).append('\n');
            }
        } finally {
            input.close();
        }

        Document document = new Document();
        document.add(new Field("text", builder.toString(), TextField.TYPE_STORED));
        return document;
    }
}

Source

2013-03-08 20:18:40 ameertawfik

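Rather than posting only code, try to include at least a sentence of explanation. The answer is meant as a reference not just for the OP but also for others who come here with the same problem. Without an explanation it helps fewer people. Thanks! – Jason 2013-03-08 20:38:06

@Jason Thanks for your comment. I have done that. – ameertawfik 2013-03-09 05:29:16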
