2014-04-22

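Updating Apache Lucene index files

I am using the Apache Lucene library to build the search function for my website. The site gets all of its content from SharePoint RSS feeds, so each search would otherwise have to go through every RSS feed URL and read its content. To make searching faster, I created a scheduled task that runs the indexing every hour: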

<bean id="rssIndexerService" class="com.lloydsbanking.webmi.service.RSSIndexerService" />
<task:scheduled-tasks>
    <task:scheduled ref="rssIndexerService" method="indexUrls" cron="0 0 * * * MON-FRI" />
</task:scheduled-tasks>

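The problem is that if I create new content, the search does not show it while the server is running, even after the scheduled task has been invoked. Likewise, if I delete an entry, it is not removed from the index files. Here is the indexing code: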

@Service
public class RSSIndexerService extends RSSReader {

    @Autowired
    private RSSFeedUrl rssFeedUrl;

    private IndexWriter indexWriter = null;

    private String indexPath = "C:\\MI\\index";

    Logger log = Logger.getLogger(RSSIndexerService.class.getName());

    public void indexUrls() throws IOException {
        Date start = new Date();
        IndexWriter writer = getIndexWriter();
        log.info("Reading all the Urls in the Sharepoint");
        Iterator<Entry<String, String>> entries = rssFeedUrl.getUrlMap().entrySet().iterator();
        try {
            while (entries.hasNext()) {
                Entry<String, String> mapEntry = entries.next();
                String url = mapEntry.getValue();
                SyndFeed feed = rssReader(url);
                for (Object entry : feed.getEntries()) {
                    SyndEntry syndEntry = (SyndEntry) entry;
                    SyndContent desc = syndEntry.getDescription();
                    if (desc != null) {
                        String text = desc.getValue();
                        if ("text/html".equals(desc.getType())) {
                            Document doc = new Document();
                            text = extractText(text);
                            Field fieldTitle = new StringField("title", syndEntry.getTitle(), Field.Store.YES);
                            doc.add(fieldTitle);
                            Field pathField = new StringField("path", url, Field.Store.YES);
                            doc.add(pathField);
                            doc.add(new TextField("contents", text, Field.Store.YES));

                            // New index, so we just add the document (no old document can be there):
                            writer.addDocument(doc);
                        }
                    }
                }
            }
        } finally {
            // closeIndexWriter();
        }
        Date end = new Date();
        log.info(end.getTime() - start.getTime() + " total milliseconds");
    }

    public IndexWriter getIndexWriter() throws IOException {
        if (indexWriter == null) {
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);

            log.info("Indexing to directory '" + indexPath + "'...");
            Directory dir = FSDirectory.open(new File(indexPath));
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_47, analyzer);

            config.setOpenMode(OpenMode.CREATE_OR_APPEND);
            indexWriter = new IndexWriter(dir, config);
        }
        return indexWriter;
    }

    @PreDestroy
    public void closeIndexWriter() throws IOException {
        if (indexWriter != null) {
            System.out.println("Done with indexing ...");
            indexWriter.close();
        }
    }

}

I know the problem is probably caused by config.setOpenMode(OpenMode.CREATE_OR_APPEND); but I don't know how to fix it.

Answer

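OK, I came up with the idea of checking whether the index directory is empty or not. If it is not, I delete the previous index files, and then I build the index with OpenMode.CREATE every time: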

File path = new File(System.getProperty("java.io.tmpdir") + "\\index");
Directory dir = FSDirectory.open(path);

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_47, analyzer);

if (path.list() != null) {
    log.info("Delete previous indexes ...");
    FileUtils.cleanDirectory(path);
}
config.setOpenMode(OpenMode.CREATE);
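
Note that path.list() != null is true for any existing directory, including an empty one, so it is not really an "is it empty" test. Here is a minimal JDK-only sketch of an explicit non-empty check plus cleanup, for anyone without Commons IO on the classpath. The names IndexDirCleaner, hasEntries and deleteFiles are made up for illustration, and the sketch assumes the index directory is flat (only regular files), which is how Lucene normally lays it out.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class IndexDirCleaner {

    // True only if dir is an existing directory that contains at least one entry.
    static boolean hasEntries(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            return stream.iterator().hasNext();
        }
    }

    // Deletes the regular files directly inside dir and returns how many were removed.
    // Subdirectories are left alone (assumption: a Lucene index directory has none).
    static int deleteFiles(Path dir) throws IOException {
        int deleted = 0;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                if (Files.isRegularFile(entry)) {
                    Files.delete(entry);
                    deleted++;
                }
            }
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        // Throwaway directory with two fake index files, just for the demo.
        Path dir = Files.createTempDirectory("index-sketch");
        Files.createFile(dir.resolve("segments_1"));
        Files.createFile(dir.resolve("_0.cfs"));

        System.out.println(hasEntries(dir));   // true
        System.out.println(deleteFiles(dir));  // 2
        System.out.println(hasEntries(dir));   // false
        Files.delete(dir);
    }
}
```

The cleanup should run before any IndexWriter is opened on the directory, since deleting files under an open writer is likely to fail or corrupt the index.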

Then I simply use addDocument():

if ("text/html".equals(desc.getType())) {
    ...
    // New index, so we just add the document (no old document can be there):
    indexWriter.addDocument(doc);
}
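
A possible alternative to deleting the files on disk, sketched under assumptions rather than taken from the code above: keep one long-lived IndexWriter, call deleteAll() at the start of each run, re-add the documents, and then call commit(). A reader opened on the directory only sees the last commit, so a run that adds documents without committing (or closing the writer) probably stays invisible to searches. The names ReindexSketch, reindex and countDocs, the dummy field values, and the in-memory RAMDirectory are placeholders for illustration only.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class ReindexSketch {

    // Replaces the whole index content with one document per title, then commits
    // so that readers opened afterwards can see the result.
    static void reindex(IndexWriter writer, String... titles) throws IOException {
        writer.deleteAll(); // drop everything from the previous run
        for (String title : titles) {
            Document doc = new Document();
            doc.add(new StringField("title", title, Field.Store.YES));
            doc.add(new TextField("contents", "body of " + title, Field.Store.YES));
            writer.addDocument(doc);
        }
        writer.commit(); // without this, new readers keep seeing the old commit
    }

    // Opens a fresh reader on the directory and returns the number of live documents.
    static int countDocs(Directory dir) throws IOException {
        DirectoryReader reader = DirectoryReader.open(dir);
        try {
            return reader.numDocs();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Directory dir = new RAMDirectory(); // stand-in for FSDirectory.open(...)
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
        config.setOpenMode(OpenMode.CREATE_OR_APPEND);
        IndexWriter writer = new IndexWriter(dir, config);
        try {
            reindex(writer, "a", "b", "c");     // first scheduled run
            System.out.println(countDocs(dir)); // prints 3
            reindex(writer, "a", "d");          // later run: "b" and "c" were deleted, "d" is new
            System.out.println(countDocs(dir)); // prints 2
        } finally {
            writer.close();
        }
    }
}
```

If the search side keeps a DirectoryReader open between requests, it would still have to be reopened after each commit, for example with DirectoryReader.openIfChanged(...), because an open reader is a point-in-time snapshot of the index.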