2015-12-30 46 views
1

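Split an XML file on the command line based on an element attribute

I have a huge XML file with many different nodes and attributes. I use grep -c to count the products of a particular type. Here is what I have done so far: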

grep -c "</products>" products.xml # output: 200023

grep -c '<product type="cloths"' products.xml # output: 8039

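So I need to extract all products of type cloths into a new.xml file with the tree shown below, without all the other attributes, so that I can import new.xml into a database: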

<?xml version="1.0"?> 
<!DOCTYPE catalog SYSTEM "catalog.dtd"> 
<catalog> 
    <product type="cloths" product_image="cardigan.jpg"> 
     <catalog_item gender="Men's"> 
     <item_number>QWZ5671</item_number> 
     <price>39.95</price> 
     <size description="Medium"> 
      <color_swatch image="red_cardigan.jpg">Red</color_swatch> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
     </size> 
     <size description="Large"> 
      <color_swatch image="red_cardigan.jpg">Red</color_swatch> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
     </size> 
     </catalog_item> 
     <catalog_item gender="Women's"> 
     <item_number>RRX9856</item_number> 
     <price>42.50</price> 
     <size description="Small"> 
      <color_swatch image="red_cardigan.jpg">Red</color_swatch> 
      <color_swatch image="navy_cardigan.jpg">Navy</color_swatch> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
     </size> 
     <size description="Medium"> 
      <color_swatch image="red_cardigan.jpg">Red</color_swatch> 
      <color_swatch image="navy_cardigan.jpg">Navy</color_swatch> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
      <color_swatch image="black_cardigan.jpg">Black</color_swatch> 
     </size> 
     <size description="Large"> 
      <color_swatch image="navy_cardigan.jpg">Navy</color_swatch> 
      <color_swatch image="black_cardigan.jpg">Black</color_swatch> 
     </size> 
     <size description="Extra Large"> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
      <color_swatch image="black_cardigan.jpg">Black</color_swatch> 
     </size> 
     </catalog_item> 
    </product> 
</catalog> 
+1

XSLT would be ideal for this. Is that an option for you? – kjhughes

+0

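Unfortunately, I don't have an XSLT for the huge file that I have. I don't know whether there is any way to generate such a file! Sorry, I'm new to the XML world. Thanks – Mtaly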

+0

Writing an XSLT for your task would be simple. If you had the XSLT code, would you be able to run it? – kjhughes

Answers

0

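Judging from the sample shown and the command-line tag applied to the question, it looks like you want a pure command-line solution. One option is to use xmllint to run an XPath query that selects only the products whose type is "cloths".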

> xmllint --xpath '/catalog/product[@type="cloths"]' products.xml
<product type="cloths" product_image="cardigan.jpg"> 
     <catalog_item gender="Men's"> 
     <item_number>QWZ5671</item_number> 
     <price>39.95</price> 
     <size description="Medium"> 
      <color_swatch image="red_cardigan.jpg">Red</color_swatch> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
     </size> 
     <size description="Large"> 
      <color_swatch image="red_cardigan.jpg">Red</color_swatch> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
     </size> 
     </catalog_item> 
     <catalog_item gender="Women's"> 
     <item_number>RRX9856</item_number> 
     <price>42.50</price> 
     <size description="Small"> 
      <color_swatch image="red_cardigan.jpg">Red</color_swatch> 
      <color_swatch image="navy_cardigan.jpg">Navy</color_swatch> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
     </size> 
     <size description="Medium"> 
      <color_swatch image="red_cardigan.jpg">Red</color_swatch> 
      <color_swatch image="navy_cardigan.jpg">Navy</color_swatch> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
      <color_swatch image="black_cardigan.jpg">Black</color_swatch> 
     </size> 
     <size description="Large"> 
      <color_swatch image="navy_cardigan.jpg">Navy</color_swatch> 
      <color_swatch image="black_cardigan.jpg">Black</color_swatch> 
     </size> 
     <size description="Extra Large"> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
      <color_swatch image="black_cardigan.jpg">Black</color_swatch> 
     </size> 
     </catalog_item> 
    </product> 

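Note, however, that this does not produce a well-formed XML document. It is just a node set containing whatever the XPath query selected. You can, however, wrap it in a little extra scripting to generate a complete XML document.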

printf '<?xml version="1.0"?>\n' > cloths.xml 
printf '<!DOCTYPE catalog SYSTEM "catalog.dtd">\n' >> cloths.xml 
printf '<catalog>\n' >> cloths.xml 
xmllint --xpath '/catalog/product[@type="cloths"]' products.xml >> cloths.xml
printf '\n</catalog>\n' >> cloths.xml 

I omitted error checking between these commands to keep the code example simple.
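If you do want the error checking, one possible way to add it is sketched below. The function name extract_cloths and the temporary file name are made up for illustration; the only assumption about xmllint is that it exits with a non-zero status when it cannot read or parse the input.

```shell
#!/bin/sh
# Sketch only: extract_cloths is a hypothetical helper, not part of xmllint.
# It writes to a temporary file and replaces the target only on success.
extract_cloths() {
    in=$1
    out=$2
    tmp="$out.tmp.$$"

    # The && chain stops at the first failing step; the whole group
    # shares a single redirection into the temporary file.
    {
        printf '<?xml version="1.0"?>\n' &&
        printf '<!DOCTYPE catalog SYSTEM "catalog.dtd">\n' &&
        printf '<catalog>\n' &&
        xmllint --xpath '/catalog/product[@type="cloths"]' "$in" &&
        printf '\n</catalog>\n'
    } > "$tmp" || { rm -f "$tmp"; return 1; }

    # Reached only if every step above succeeded.
    mv "$tmp" "$out"
}

# Usage: extract_cloths products.xml cloths.xml
```

Writing to a temporary file first means a failed run never leaves a half-written cloths.xml behind.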

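You also mentioned that the input XML file is large. Depending on how large it is, this approach may not scale well in terms of memory consumption. If that is a problem, you may need a more streaming-oriented approach that reads a small part of the input document at a time and processes it incrementally. That may take you out of command-line territory and into custom coding. One example of a streaming XML API is StAX in Java. Here is a tutorial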
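If you would rather stay on the command line, a rough line-based stand-in for streaming is sketched below. To be clear, this is not StAX and not a real XML parser: it is plain awk, and it assumes the file is laid out like the sample in the question, with each `<product ...>` start tag (including its type attribute) on a single line and no `</product>` sharing a line with the next start tag. The function name stream_cloths is made up for illustration.

```shell
#!/bin/sh
# Sketch only: a line-oriented filter, not an XML parser.
# Holds at most one <product> element in memory at a time.
stream_cloths() {
    awk '
        # A line that opens a <product ...> element starts a new buffer.
        /<product[ >]/ { inside = 1; keep = 0; buf = "" }

        inside {
            buf = buf $0 "\n"
            # Remember whether this product has type="cloths".
            if ($0 ~ /<product[^>]*type="cloths"/) keep = 1
            # At the closing tag, emit the buffer only if it qualified.
            if ($0 ~ /<\/product>/) {
                if (keep) printf "%s", buf
                inside = 0
            }
            next
        }

        # Everything outside a product (prolog, DOCTYPE, <catalog>,
        # </catalog>) is passed through unchanged.
        { print }
    ' "$1"
}

# Usage: stream_cloths products.xml > cloths.xml
```

Because the lines outside the products are passed through, the output already has the prolog and the enclosing catalog element. If the real file does not follow the layout assumed here, a proper streaming parser is the safer route.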

+0

Thanks. But xmllint doesn't work; it shows 'XPath set is empty'. I have been trying to find out what the problem is! No luck. – Mtaly

+0

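@Mtaly, 'XPath set is empty' means that the XPath query executed but did not match anything, so the result set is empty. That would not happen if your XML were exactly what you gave in the question. See this [GitHub gist](https://gist.github.com/cnauroth/ffc95773560612e5bcb1) for a successful demonstration of the command. The script embeds the XML I am expecting as a heredoc, so you will get predictable results. If your XML actually differs from what you described, you may need to adjust the XPath query. –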
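A few quick checks that might help narrow down how the real file differs from the sample are sketched below. The helper name diagnose_xpath is made up; the XPath functions used (name, count, local-name) are standard XPath 1.0. This loads the whole file, so the memory caveat from the answer still applies.

```shell
#!/bin/sh
# Sketch only: diagnose_xpath is a hypothetical helper for inspecting
# the shape of the input before adjusting the XPath query.
diagnose_xpath() {
    f=$1

    # Name of the document element; the sample assumes "catalog".
    printf 'root element: '
    xmllint --xpath 'name(/*)' "$f"

    # Matching products anywhere in the tree, not only under /catalog.
    printf 'cloths products at any depth: '
    xmllint --xpath 'count(//product[@type="cloths"])' "$f"

    # Same count ignoring namespaces, in case the file declares a
    # default namespace that the plain element name does not match.
    printf 'cloths products ignoring namespaces: '
    xmllint --xpath 'count(//*[local-name()="product"][@type="cloths"])' "$f"
}

# Usage: diagnose_xpath products.xml
```

If the root element is not catalog, or the second count is non-zero while the original query is empty, the products are probably nested differently from the sample and the path in the query needs to change accordingly.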