Solr: indexing an XML file that contains HTML tags (with DataImportHandler)

I have Solr 4.10.4 and I want to index an XML file. Some of the XML tags contain HTML tags.

<?xml version='1.0' encoding='UTF-8' standalone='no' ?> 
<root> 
    <info> 
     <text> 
      <p>text 1</p> 
      <p>text 2</p> 
      <p>text 3</p> 
     </text> 
    </info> 
</root>

I am using this:

<charFilter class="solr.HTMLStripCharFilterFactory"/>

but it does not work, and I don't know what is wrong.


2016-09-27 Medley

**solr.HTMLStripCharFilterFactory** strips the HTML tags from the indexed data, not from the stored value. Do you also want to transform the stored value? –

HTMLStripCharFilterFactory strips the HTML tags from the indexed data, not from the stored value.
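To illustrate the indexed-versus-stored distinction, here is a minimal schema.xml sketch. The type name "text_html", the field name "text", and the choice of tokenizer and lowercase filter are assumptions for illustration; they do not come from the question.

```xml
<!-- schema.xml sketch; "text_html" and "text" are illustrative names -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- char filters run before the tokenizer, so markup is removed
         from the text that gets tokenized and indexed -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- stored="true" keeps the original input verbatim, markup included;
     the analyzer above only affects the indexed terms -->
<field name="text" type="text_html" indexed="true" stored="true"/>
```

With a setup like this, searches match the text without tags, but the value returned in query results still shows whatever was sent to Solr.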
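To strip the HTML tags at index time, so that the stored value is clean as well, you can use HTMLStripTransformer in the DataImportHandler. Below is a sample DIH configuration for this.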

<dataConfig> 
<dataSource name="fDS" type="FileDataSource" /> 
<document> 
    <entity name="tika-test" processor="XPathEntityProcessor" 
      url="${solr.install.dir}/example/exampledocs/content.xml" forEach="/root" dataSource="fDS"> 
      <field column="text" xpath="/root/info/text/p" /> 
    </entity> 
</document>
</dataConfig>

This transformer has one attribute, stripHTML, a boolean (true/false) that signals whether HTMLStripTransformer should process the field or not.
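As a sketch of how the transformer and the stripHTML attribute described above might be wired into the entity, assuming the same file path, entity name, and field name as the sample configuration. Stripping only has an effect when the extracted value actually contains markup (for example HTML that is escaped or wrapped in CDATA in the source file), which is an assumption here.

```xml
<dataConfig>
  <dataSource name="fDS" type="FileDataSource" />
  <document>
    <!-- transformer="HTMLStripTransformer" enables the transformer for this entity -->
    <entity name="tika-test" processor="XPathEntityProcessor"
            transformer="HTMLStripTransformer"
            url="${solr.install.dir}/example/exampledocs/content.xml"
            forEach="/root" dataSource="fDS">
      <!-- stripHTML="true" asks the transformer to process this field;
           fields without the attribute are left untouched -->
      <field column="text" xpath="/root/info/text/p" stripHTML="true" />
    </entity>
  </document>
</dataConfig>
```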


2016-09-27 12:47:20

What about the xpath? Is xpath="/root/info/text" correct? – Medley

Yes, the xpath will be the same as the one you mentioned. –

When I run a query from the web interface, the field is filled only with "\n" characters. – Medley
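A possible explanation for the "\n" result, offered as an assumption rather than something confirmed in this thread: with xpath="/root/info/text", XPathEntityProcessor appears to collect only the text that sits directly inside the matched element, and in the sample file the <text> element directly contains nothing but line breaks between its <p> children. The DataImportHandler field attribute flatten="true" is documented as gathering the text under all nested tags into the one field, so a sketch along these lines may help:

```xml
<entity name="tika-test" processor="XPathEntityProcessor"
        url="${solr.install.dir}/example/exampledocs/content.xml"
        forEach="/root" dataSource="fDS">
  <!-- flatten="true": collect the text of every tag nested under /root/info/text -->
  <field column="text" xpath="/root/info/text" flatten="true" />
</entity>
```

With flatten the nested tags themselves are dropped, so a separate HTML-stripping step should not be needed for this file. Pointing the xpath at /root/info/text/p, as in the sample configuration above, is the alternative and can yield one value per <p>.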
