Excluding certain child nodes when the data structure is unknown

EDIT: I have worked out a solution to my problem and posted it as a Q&A here.

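I'm working with XML that conforms to the Library of Congress EAD standard (found here). Unfortunately, the standard is very loose about the structure of the XML.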

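For example, the <bioghist> tag can appear directly inside the <archdesc> tag, or inside a <descgrp> tag, or nested inside another <bioghist> tag, or in any combination of the above, or it can be left out entirely. I've found it very difficult to select just the <bioghist> tag I'm looking for without also selecting the others.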

Here are a few different EAD XML documents that my XSLT may have to process:

First example

<ead> 
<eadheader> 
    <archdesc> 
     <bioghist>one</bioghist> 
     <dsc> 
      <c01> 
       <descgrp> 
        <bioghist>two</bioghist> 
       </descgrp> 
       <c02> 
        <descgrp> 
         <bioghist> 
          <bioghist>three</bioghist> 
         </bioghist> 
        </descgrp> 
       </c02> 
      </c01> 
     </dsc> 
    </archdesc> 
</eadheader> 
</ead>

Second example

<ead> 
<eadheader> 
    <archdesc> 
     <descgrp> 
      <bioghist> 
       <bioghist>one</bioghist> 
      </bioghist> 
     </descgrp> 
     <dsc> 
      <c01> 
       <c02> 
        <descgrp> 
         <bioghist>three</bioghist> 
        </descgrp> 
       </c02> 
       <bioghist>two</bioghist> 
      </c01> 
     </dsc> 
    </archdesc> 
</eadheader> 
</ead>

Third example

<ead> 
<eadheader> 
    <archdesc> 
     <descgrp> 
      <bioghist>one</bioghist> 
     </descgrp> 
     <dsc> 
      <c01> 
       <c02> 
        <bioghist>three</bioghist> 
       </c02> 
      </c01> 
     </dsc> 
    </archdesc> 
</eadheader> 
</ead>

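As you can see, an EAD XML file can have a <bioghist> tag almost anywhere. The actual output I need to produce is too complicated to post here. A simplified version of the output for the three EAD examples above might look like this: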

Output for the first example

<records> 
<primary_record> 
    <biography_history>first</biography_history> 
</primary_record> 
<child_record> 
    <biography_history>second</biography_history> 
</child_record> 
<grandchild_record> 
    <biography_history>third</biography_history> 
</grandchild_record> 
</records>

Output for the second example

<records> 
<primary_record> 
    <biography_history>first</biography_history> 
</primary_record> 
<child_record> 
    <biography_history>second</biography_history> 
</child_record> 
<grandchild_record> 
    <biography_history>third</biography_history> 
</grandchild_record> 
</records>

Output for the third example

<records> 
<primary_record> 
    <biography_history>first</biography_history> 
</primary_record> 
<child_record> 
    <biography_history></biography_history> 
</child_record> 
<grandchild_record> 
    <biography_history>third</biography_history> 
</grandchild_record> 
</records>

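If I want to pull the value of the "first" <bioghist> and put it in <primary_record>, I can't simply use <xsl:apply-templates select="/ead/eadheader/archdesc/bioghist"/>, because that tag may not be a direct child of the <archdesc> tag. It may be wrapped in a <descgrp> or a <bioghist> or some combination of them. And I can't use select="//bioghist", because that would pull all of the <bioghist> tags. I can't even use select="//bioghist[1]", because there might not actually be a <bioghist> tag at that level, and then I would be pulling the value below the <c01>, which is the "second" value and should be processed later.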

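This is already a long post, but one more wrinkle is that there can be an unlimited number of <cxx> nodes, nested up to 12 levels deep. I'm currently processing them recursively. I tried saving the node I'm currently processing (<c01>, for example) in a variable named 'RN' and then running <xsl:apply-templates select=".//bioghist [name(..)=name($RN) or name(../..)=name($RN)]"/>. This works for some forms of EAD, where the <bioghist> tag isn't nested too deeply, but it fails on an EAD file created by someone who loves wrapping tags inside other tags (which the EAD standard allows).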

What I'd like is some way to say:

get any <bioghist> tag anywhere below the current node, but
don't dig any deeper if you hit a <c??> tag
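
One way to express that rule in XSLT 1.0 is a dedicated template mode that walks down through wrapper elements but has an empty template for the component elements, so the walk stops there. The following is only a minimal sketch under several assumptions: the mode name `collect` and the named template `emit-bio` are illustrative, components are assumed to be named c01 through c12, only the three record levels from the simplified output are handled, and each innermost <bioghist> is assumed to hold plain text.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <records>
      <!-- Visit each level in document order; deeper levels (c03 and
           below) are left out of this sketch. -->
      <xsl:apply-templates select="//archdesc | //c01 | //c02"/>
    </records>
  </xsl:template>

  <!-- One record per level, using the element names from the
       simplified output in the question. -->
  <xsl:template match="archdesc">
    <primary_record><xsl:call-template name="emit-bio"/></primary_record>
  </xsl:template>
  <xsl:template match="c01">
    <child_record><xsl:call-template name="emit-bio"/></child_record>
  </xsl:template>
  <xsl:template match="c02">
    <grandchild_record><xsl:call-template name="emit-bio"/></grandchild_record>
  </xsl:template>

  <!-- Hypothetical helper: start the search below the current node. -->
  <xsl:template name="emit-bio">
    <biography_history>
      <xsl:apply-templates select="*" mode="collect"/>
    </biography_history>
  </xsl:template>

  <!-- Any wrapper (descgrp, dsc, an outer bioghist, ...): keep descending. -->
  <xsl:template match="*" mode="collect">
    <xsl:apply-templates select="*" mode="collect"/>
  </xsl:template>

  <!-- A component element: stop here, it belongs to a deeper record. -->
  <xsl:template mode="collect"
      match="c01|c02|c03|c04|c05|c06|c07|c08|c09|c10|c11|c12"/>

  <!-- An innermost bioghist (no nested bioghist child): emit its text. -->
  <xsl:template match="bioghist[not(bioghist)]" mode="collect">
    <xsl:value-of select="normalize-space(.)"/>
  </xsl:template>
</xsl:stylesheet>
```

The sketch relies on default template priorities: the `bioghist[not(bioghist)]` pattern and the component names both outrank the `*` wrapper rule, so no explicit `priority` attribute should be needed. If a level contains several innermost <bioghist> elements, their text is concatenated without a separator, which a real stylesheet would probably want to handle differently.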

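I hope I've made the situation clear. Please let me know if I've left anything ambiguous. Any assistance you can provide would be greatly appreciated. Thanks.

Source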

2012-06-27 aarondev

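I worked out a solution myself and posted it as a Q&A, because the solution is very specific to one particular XML standard and seemed to be outside the scope of this question. If people feel it would be best to post it here as well, I can update this answer with a copy.

Source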

2012-07-11 18:38:08 aarondev

Because the requirements are rather vague, any answer can only reflect the guesses its author has made.

Here is mine:

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
xmlns:my="my:my" exclude-result-prefixes="my"> 
<xsl:output omit-xml-declaration="yes" indent="yes"/> 
<xsl:strip-space elements="*"/> 

<my:names> 
    <n>primary_record</n> 
    <n>child_record</n> 
    <n>grandchild_record</n> 
</my:names> 

<xsl:variable name="vNames" select="document('')/*/my:names/*"/> 

<xsl:template match="/"> 
    <xsl:apply-templates select= 
    "//bioghist[following-sibling::node()[1] 
           [self::descgrp] 
       ]"/> 
</xsl:template> 

<xsl:template match="bioghist"> 
    <xsl:variable name="vPos" select="position()"/> 

    <xsl:element name="{$vNames[position() = $vPos]}"> 
    <xsl:value-of select="."/> 
    </xsl:element> 
</xsl:template> 

<xsl:template match="text()"/> 
</xsl:stylesheet>

When this transformation is applied to the following XML document:

<ead> 
    <eadheader> 
     <archdesc> 
      <bioghist>first</bioghist> 
      <descgrp> 
       <bioghist>first</bioghist> 
       <bioghist> 
        <bioghist>first</bioghist></bioghist> 
      </descgrp> 
      <dsc> 
       <c01> 
        <bioghist>second</bioghist> 
        <descgrp> 
         <bioghist>second</bioghist> 
         <bioghist> 
          <bioghist>second</bioghist></bioghist> 
        </descgrp> 
        <c02> 
         <bioghist>third</bioghist> 
         <descgrp> 
          <bioghist>third</bioghist> 
          <bioghist> 
           <bioghist>third</bioghist></bioghist> 
         </descgrp> 
        </c02> 
       </c01> 
      </dsc> 
     </archdesc> 
    </eadheader> 
</ead>

the wanted result is produced:

<primary_record>first</primary_record> 
<child_record>second</child_record> 
<grandchild_record>third</grandchild_record>

Source

2012-06-28 03:39:44

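My apologies that the requirements are vague. A proper EAD XML document contains 30 or 40 different kinds of information, each with its own tag. The output I'm generating uses all of these different tags, and I thought a simplified input/output would best convey the essence of the problem. Your XSLT is a bit more advanced than what I'm familiar with, but I think I've figured out a few pieces. The template matching bioghist only runs three times, each time creating an element with a different name, correct? Now my question is why the template only runs 3 times. – aarondev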

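@aarondev: The answer is simple: only three elements in the provided XML document are selected for processing. The xsl:apply-templates instruction selects any 'bioghist' in the XML document whose first following sibling node is a 'descgrp' element, and the provided XML document contains exactly three such 'bioghist' elements. –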

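So following-sibling matches all of the sibling nodes that come after. Then you use [1] to select the first of those siblings, right? The self::descgrp bit still confuses me. Does that make the current node a descgrp node? – aarondev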
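
For readers with the same question, the predicate can be unpacked into separate steps. In XPath 1.0, `[self::descgrp]` does not change which node is current; it is a filter that keeps the node it is applied to only if that node is itself a `descgrp` element. The sketch below is illustrative only (the variable names `sibs` and `next` are made up here) and spells out the same test as the `select` expression in the answer, printing the text of each `bioghist` that passes:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- Drop whitespace-only text nodes so that the nearest following
       sibling is an element rather than indentation. -->
  <xsl:strip-space elements="*"/>

  <xsl:template match="/">
    <xsl:for-each select="//bioghist">
      <!-- Step 1: every sibling that comes after this bioghist. -->
      <xsl:variable name="sibs" select="following-sibling::node()"/>
      <!-- Step 2: only the nearest of those siblings. -->
      <xsl:variable name="next" select="$sibs[1]"/>
      <!-- Step 3: keep that sibling only if it is a descgrp element;
           an empty result makes the test false. -->
      <xsl:if test="$next[self::descgrp]">
        <xsl:value-of select="normalize-space(.)"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Run against the sample document from the answer, this should report the same three elements that the answer's stylesheet selects. Note that the `xsl:strip-space` declaration matters in both stylesheets: without it, the nearest following sibling of an indented `bioghist` would be a whitespace text node, and the `self::descgrp` test would fail.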
