2013-08-22 50 views
2

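XSLT: splitting one large parent node into smaller child nodes

I asked this question recently but realized I had not explained it very clearly. I have a large .csv file (8000+ rows) made up of invoices, each with multiple lines. I parse it into an XML structure like the following (simplified).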

Input 1 - $XMLInput

<?xml version="1.0" encoding="UTF-8"?> 
<root> 
    <row> 
     <invoiceNumber>1</invoiceNumber> 
     <invoiceText>invoice 1-1</invoiceText> 
     <position>1</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>1</invoiceNumber> 
     <invoiceText>invoice 1-2</invoiceText> 
     <position>2</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>2</invoiceNumber> 
     <invoiceText>invoice 2-1</invoiceText> 
     <position>3</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>2</invoiceNumber> 
     <invoiceText>invoice 2-2</invoiceText> 
     <position>4</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>3</invoiceNumber> 
     <invoiceText>invoice 3-1</invoiceText> 
     <position>5</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>3</invoiceNumber> 
     <invoiceText>invoice 3-2</invoiceText> 
     <position>6</position> 
     ... 
    </row> 
</root> 

Input 2 - $maxBatchSize. Description: start a new batch once the current one would grow larger than this size (a constant).

Input 3 - $listOfInvoices. Description: a repeating variable holding the unique invoice numbers in the document. For example:

<root> 
    <row> 
     <invoiceNumber>1</invoiceNumber> 
    </row> 
    <row> 
     <invoiceNumber>2</invoiceNumber> 
    </row> 
    <row> 
     <invoiceNumber>3</invoiceNumber> 
    </row> 
</root> 

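To improve processing time, I need to group these elements by invoiceNumber into batches of no more than x nodes each (x is a variable that is passed in). From there I will send each batch to a sub-processor instead of processing the whole original document at once. For example, for the XML document above, if the batch size may be no greater than 3, I need the following XML output: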

Output 1 - $XMLOutput

<root> 
    <batch> 
     <row> 
      <invoiceNumber>1</invoiceNumber> 
      <invoiceText>invoice 1-1</invoiceText> 
      <position>1</position> 
      ... 
     </row> 
     <row> 
      <invoiceNumber>1</invoiceNumber> 
      <invoiceText>invoice 1-2</invoiceText> 
      <position>2</position> 
      ... 
     </row> 
     <row> 
      <invoiceNumber>2</invoiceNumber> 
      <invoiceText>invoice 2-1</invoiceText> 
      <position>3</position> 
      ... 
     </row> 
     <row> 
      <invoiceNumber>2</invoiceNumber> 
      <invoiceText>invoice 2-2</invoiceText> 
      <position>4</position> 
      ... 
     </row> 
    </batch> 
    <batch> 
     <row> 
      <invoiceNumber>3</invoiceNumber> 
      <invoiceText>invoice 3-1</invoiceText> 
      <position>5</position> 
      ... 
     </row> 
     <row> 
      <invoiceNumber>3</invoiceNumber> 
      <invoiceText>invoice 3-2</invoiceText> 
      <position>6</position> 
      ... 
     </row> 
    </batch> 
</root> 

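It is a requirement that all lines of an invoice are sent in the same batch. My initial XSLT (2.0) attempt is below. I tried to simulate a while loop by calling a template recursively and appending invoice groups to the current node. When the maximum batch size was reached, I recursively called the batch template to create a new batch. I passed the invoice and batch counters along with each recursive call.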

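Edit: thanks to Ken's help I am getting closer. I do need to split the invoices by number of lines each time, not by the number of distinct invoices. In theory the following should work, but I don't know how to make sure an invoice number has not already appeared in a preceding sibling node.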

<?xml version="1.0" encoding="UTF-8"?> 
<xsl:stylesheet version="2.0" xmlns:bpws="http://schemas.xmlsoap.org/ws/2003/03/business-process/" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"> 
<xsl:variable name="batch-size" select="40" as="xs:integer"/> 
<xsl:variable name="input" select="bpws:getVariableData('sortedInvoicesByBU')"/> 
<xsl:key name="invoice-lines-by-invoice-number" match="row" use="invoiceNumber4z"/> 

<xsl:template match="/"> 
    <xsl:element name="batches"> 
     <!--establish batches from possible non-contiguous invoice numbers--> 
     <xsl:for-each-group select="$input/*:UPSData/*:row" group-by="(position() - 1) idiv $batch-size"> 
      <xsl:for-each select="distinct-values($input/*:UPSData/*:row/*:invoiceNumber4z)[not(.=preceding-sibling::item)]"> 
       <xsl:element name="UPSData"> 
        <xsl:for-each select="current()"> 
         <xsl:for-each select="key('invoice-lines-by-invoice-number',.,$input)"> 
          <!--copy rows as they are--> 
          <xsl:copy-of select="."/> 
         </xsl:for-each> 
        </xsl:for-each> 
       </xsl:element> 
      </xsl:for-each> 
     </xsl:for-each-group> 
    </xsl:element> 
</xsl:template> 
</xsl:stylesheet> 

Answers

4

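I tell my students that you can torture a stylesheet as much as necessary to finally get it working, but that does not make it maintainable, or even the right way to do things. I hope you will accept this analysis: you are treating XSLT as an imperative programming language, which does the language no justice and will only convince you that the things you would attempt in C or Java are more difficult, verbose and awkward in XSLT.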

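But if you use XSLT the way it was designed, it is easier than an imperative language and, to boot, it is all based on XML, in which you can show the result you want. Because it is shorter, it is easier to maintain. When you understand the declarative instructions being used, you don't have to unravel an imperative algorithm. An XSLT processor can optimize a declarative approach, but if it has to follow an imperative approach as written, with no opportunity to optimize it, it is obliged to work slowly.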

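The solution below produces exactly your Output 1 result. Note how I determine the unique invoice numbers and then filter them against the valid ones. I then batch these according to the batch size (which is a parameter). There are no called templates and no counters of any kind... just a solution using the built-in facilities of XSLT 2.0.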

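And not counting the declarations of global parameters and variables, and the comments, it is only 5 elements long: <root>, <xsl:for-each-group>, <batch>, <xsl:for-each> and <xsl:copy-of>.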

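As for your question of why yours doesn't work, I don't know... the approach you have taken doesn't "feel" like XSLT... it feels like an XSLT expression of some procedural, imperative approach.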
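
For readers who think more easily in a general-purpose language, the grouping idea described above can be sketched in Python. This is only an illustration of the logic, not part of the stylesheet: the function name `batch_by_invoice_count` and the dictionary-shaped rows are assumptions made up for this sketch, and the zero-based index divided by the batch size stands in for `(position() - 1) idiv $batch-size`.

```python
from itertools import groupby


def batch_by_invoice_count(rows, valid_numbers, batch_size):
    """Batch rows so each batch holds at most batch_size distinct invoices."""
    valid = set(valid_numbers)
    # Distinct invoice numbers in first-seen order, keeping only valid ones.
    seen = dict.fromkeys(row["invoiceNumber"] for row in rows)
    distinct = [number for number in seen if number in valid]

    batches = []
    # enumerate() is zero-based, so index // batch_size plays the role of
    # (position() - 1) idiv $batch-size in the stylesheet.
    indexed = enumerate(distinct)
    for _, group in groupby(indexed, key=lambda pair: pair[0] // batch_size):
        numbers = {number for _, number in group}
        # Copy every row whose invoice number belongs to this group.
        batches.append([row for row in rows if row["invoiceNumber"] in numbers])
    return batches


# Hypothetical sample data shaped like the rows of invoices.xml.
rows = [
    {"invoiceNumber": n, "invoiceText": f"invoice {n}-{i}"}
    for n in ("1", "2", "3")
    for i in (1, 2)
]
batches = batch_by_invoice_count(rows, ["1", "2", "3"], batch_size=2)
for batch in batches:
    print([row["invoiceText"] for row in batch])
```

With a batch size of 2 the sample yields two batches: the four lines of invoices 1 and 2, then the two lines of invoice 3, which is the same shape as Output 1.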

t:\ftemp>type numbers.xml 
<root> 
    <row> 
     <invoiceNumber>1</invoiceNumber> 
    </row> 
    <row> 
     <invoiceNumber>2</invoiceNumber> 
    </row> 
    <row> 
     <invoiceNumber>3</invoiceNumber> 
    </row> 
</root> 

t:\ftemp>type invoices.xml 
<?xml version="1.0" encoding="UTF-8"?> 
<root> 
    <row> 
     <invoiceNumber>1</invoiceNumber> 
     <invoiceText>invoice 1-1</invoiceText> 
     <position>1</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>1</invoiceNumber> 
     <invoiceText>invoice 1-2</invoiceText> 
     <position>2</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>2</invoiceNumber> 
     <invoiceText>invoice 2-1</invoiceText> 
     <position>3</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>2</invoiceNumber> 
     <invoiceText>invoice 2-2</invoiceText> 
     <position>4</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>3</invoiceNumber> 
     <invoiceText>invoice 3-1</invoiceText> 
     <position>5</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>3</invoiceNumber> 
     <invoiceText>invoice 3-2</invoiceText> 
     <position>6</position> 
     ... 
    </row> 
</root> 

t:\ftemp>call xslt2 invoices.xml invoices.xsl 
<?xml version="1.0" encoding="UTF-8"?> 
<root> 
    <batch> 
     <row> 
     <invoiceNumber>1</invoiceNumber> 
     <invoiceText>invoice 1-1</invoiceText> 
     <position>1</position> 
     ... 
    </row> 
     <row> 
     <invoiceNumber>1</invoiceNumber> 
     <invoiceText>invoice 1-2</invoiceText> 
     <position>2</position> 
     ... 
    </row> 
     <row> 
     <invoiceNumber>2</invoiceNumber> 
     <invoiceText>invoice 2-1</invoiceText> 
     <position>3</position> 
     ... 
    </row> 
     <row> 
     <invoiceNumber>2</invoiceNumber> 
     <invoiceText>invoice 2-2</invoiceText> 
     <position>4</position> 
     ... 
    </row> 
    </batch> 
    <batch> 
     <row> 
     <invoiceNumber>3</invoiceNumber> 
     <invoiceText>invoice 3-1</invoiceText> 
     <position>5</position> 
     ... 
    </row> 
     <row> 
     <invoiceNumber>3</invoiceNumber> 
     <invoiceText>invoice 3-2</invoiceText> 
     <position>6</position> 
     ... 
    </row> 
    </batch> 
</root> 

t:\ftemp>type invoices.xsl 
<?xml version="1.0" encoding="US-ASCII"?> 
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
       version="2.0"> 

<xsl:output indent="yes"/> 

<xsl:param name="batch-size" select="2"/> 

<xsl:variable name="valid-numbers" 
       select="doc('numbers.xml')/root/row/invoiceNumber"/> 

<xsl:template match="/"> 
    <xsl:variable name="invoiceLines" select="root/row"/> 
    <root> 
    <!--establish batches from possible non-contiguous invoice numbers--> 
    <xsl:for-each-group group-by="(position() - 1) idiv $batch-size" 
     select="distinct-values($invoiceLines/invoiceNumber)[.=$valid-numbers]"> 
     <!--create a batch using all invoice lines for all numbers in group--> 
     <batch> 
     <xsl:for-each select="$invoiceLines[invoiceNumber=current-group()]"> 
      <!--copy rows as they are--> 
      <xsl:copy-of select="."/> 
     </xsl:for-each> 
     </batch> 
    </xsl:for-each-group> 
    </root> 
</xsl:template> 

</xsl:stylesheet> 
t:\ftemp>rem Done! 

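I am editing this answer to add the alternative below because, since you state that you have 8 million input records, I think a key lookup table will perform better than my simple variable predicate. It produces the same result with one additional XSLT instruction in the template (it could be done without adding it, but I think this is more readable) and removes the variable that is no longer needed.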

<?xml version="1.0" encoding="US-ASCII"?> 
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
       version="2.0"> 

<xsl:output indent="yes"/> 

<xsl:param name="batch-size" select="2"/> 

<xsl:variable name="valid-numbers" 
       select="doc('numbers.xml')/root/row/invoiceNumber"/> 

<xsl:key name="invoice-lines-by-invoice-number" 
     match="row" use="invoiceNumber"/> 

<xsl:variable name="input" select="/"/> 

<xsl:template match="/"> 
    <root> 
    <!--establish batches from possible non-contiguous invoice numbers--> 
    <xsl:for-each-group group-by="(position() - 1) idiv $batch-size" 
     select="distinct-values(root/row/invoiceNumber)[.=$valid-numbers]"> 
     <!--create a batch using all invoice lines for all numbers in group--> 
     <batch> 
     <xsl:for-each select="current-group()"> 
      <xsl:for-each 
        select="key('invoice-lines-by-invoice-number',.,$input)"> 
      <!--copy rows as they are--> 
      <xsl:copy-of select="."/> 
      </xsl:for-each> 
     </xsl:for-each> 
     </batch> 
    </xsl:for-each-group> 
    </root> 
</xsl:template> 

</xsl:stylesheet> 
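
As a rough analogy for why a key can help, the Python sketch below contrasts scanning every row with looking rows up in an index that is built once. It is an assumption-laden illustration only: the helper names `build_key`, `rows_by_scan` and `rows_by_key` are invented for this sketch, and how a given XSLT processor actually implements `xsl:key` is up to that processor.

```python
from collections import defaultdict


def build_key(rows, field):
    """Index rows by a field once, similar in spirit to an xsl:key table."""
    index = defaultdict(list)
    for row in rows:
        index[row[field]].append(row)
    return index


def rows_by_scan(rows, numbers):
    """Predicate style: test every row against the wanted numbers."""
    return [row for row in rows if row["invoiceNumber"] in numbers]


def rows_by_key(index, numbers):
    """Key style: fetch the rows for each number straight from the index."""
    return [row for number in numbers for row in index.get(number, [])]


# Hypothetical sample data shaped like the rows of invoices.xml.
rows = [
    {"invoiceNumber": n, "invoiceText": f"invoice {n}-{i}"}
    for n in ("1", "2", "3")
    for i in (1, 2)
]
index = build_key(rows, "invoiceNumber")
same = rows_by_key(index, ["1", "2"]) == rows_by_scan(rows, ["1", "2"])
print(same)
```

The scan touches every row for every batch, while the index is built in a single pass and then consulted per invoice number. Note that the key-style lookup returns rows grouped by the order of the requested numbers, which matches document order only when each invoice's lines are contiguous, as they are in the sample.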

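Thanks again. I absolutely agree that I was trying to take a procedural approach and force XSLT to fit it; I just need to start learning how to think in a functional language. As for why my programmatic approach didn't work, I would say it is because that is not how the language was designed to be used. – rwolters3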


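Now that I think this is two questions you have helped me with, I will definitely download and study your book and see whether you have any lectures/courses in my area, or cached versions of previous lectures. Also, I made a small typo: I meant to say 8000, or 8 thousand, records, not 8 million, so the processing time is much faster. – rwolters3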


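I keep my StackOverflow profile updated with upcoming lecture series, or see the information at http://www.CraneSoftwrights.com/schedule.htm#calendar. At http://www.CraneSoftwrights.com/links/udemy-ptux-online.htm there are 5 hours of free streaming video lectures on XSLT/XPath, and you don't even need to set up a user name to watch them for free. Udemy has a streaming version of the DVD that can be purchased through the http://www.CraneSoftwrights.com/training/ptux/ptux-video.htm page. Both have exercises with complete answers. The standalone book does not have exercises. –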

0

Please don't mark this as the answer, because my previous answer answered the original question.

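The code below answers the secondary question of how to batch by the total number of invoice lines without splitting an invoice across two batches.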

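I couldn't think of a declarative way to do this, so the answer below is an imperative, recursive solution, but it is written so that an XSLT processor that implements tail recursion will not consume stack space. I also take advantage of native XSLT features (key tables and sequences) that are awkward to use in other languages.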

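The code is quite compact, with only one place that actually writes out a batch of invoices... there are no other blocks of batch-writing code. I am pleased with how this turned out.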

I welcome any suggestions for improvement, or alternative solutions more compact than this.
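
The line-count batching described above can be sketched in Python as a plain loop. This is an illustrative stand-in, not a translation of the stylesheet: the function name `batch_by_line_count` and the `line_counts` mapping are assumptions for this sketch, and an ordinary loop with an accumulator replaces the tail-recursive template. One deliberate difference is that this sketch never emits an empty batch, even when the very first invoice alone exceeds the limit.

```python
def batch_by_line_count(invoice_numbers, line_counts, max_lines):
    """Greedily pack whole invoices into batches of at most max_lines lines.

    An invoice is never split; an invoice larger than max_lines gets a
    batch of its own.
    """
    batches = []
    current, current_lines = [], 0
    for number in invoice_numbers:
        lines = line_counts[number]
        # Close the running batch if adding this invoice would exceed the limit.
        if current and current_lines + lines > max_lines:
            batches.append(current)
            current, current_lines = [], 0
        current.append(number)
        current_lines += lines
    # Flush whatever is left once the invoices run out.
    if current:
        batches.append(current)
    return batches


# Hypothetical line counts: invoice 4 has six lines, the others two each.
line_counts = {"1": 2, "2": 2, "3": 2, "4": 6, "5": 2}
result = batch_by_line_count(["1", "2", "3", "4", "5"], line_counts, max_lines=5)
print(result)
```

With a limit of 5 lines the sample groups invoices 1 and 2 together and gives invoices 3, 4 and 5 a batch each; invoice 4 stays whole even though its six lines exceed the limit.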

t:\ftemp>type numbers.xml 
<root> 
    <row> 
     <invoiceNumber>1</invoiceNumber> 
    </row> 
    <row> 
     <invoiceNumber>2</invoiceNumber> 
    </row> 
    <row> 
     <invoiceNumber>3</invoiceNumber> 
    </row> 
    <row> 
     <invoiceNumber>4</invoiceNumber> 
    </row> 
    <row> 
     <invoiceNumber>5</invoiceNumber> 
    </row> 
</root> 

t:\ftemp>type invoices.xml 
<?xml version="1.0" encoding="UTF-8"?> 
<root> 
    <row> 
     <invoiceNumber>1</invoiceNumber> 
     <invoiceText>invoice 1-1</invoiceText> 
     <position>1</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>1</invoiceNumber> 
     <invoiceText>invoice 1-2</invoiceText> 
     <position>2</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>2</invoiceNumber> 
     <invoiceText>invoice 2-1</invoiceText> 
     <position>3</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>2</invoiceNumber> 
     <invoiceText>invoice 2-2</invoiceText> 
     <position>4</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>3</invoiceNumber> 
     <invoiceText>invoice 3-1</invoiceText> 
     <position>5</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>3</invoiceNumber> 
     <invoiceText>invoice 3-2</invoiceText> 
     <position>6</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>4</invoiceNumber> 
     <invoiceText>invoice 4-1</invoiceText> 
     <position>7</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>4</invoiceNumber> 
     <invoiceText>invoice 4-2</invoiceText> 
     <position>8</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>4</invoiceNumber> 
     <invoiceText>invoice 4-3</invoiceText> 
     <position>9</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>4</invoiceNumber> 
     <invoiceText>invoice 4-4</invoiceText> 
     <position>10</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>4</invoiceNumber> 
     <invoiceText>invoice 4-5</invoiceText> 
     <position>11</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>4</invoiceNumber> 
     <invoiceText>invoice 4-6</invoiceText> 
     <position>12</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>5</invoiceNumber> 
     <invoiceText>invoice 5-1</invoiceText> 
     <position>13</position> 
     ... 
    </row> 
    <row> 
     <invoiceNumber>5</invoiceNumber> 
     <invoiceText>invoice 5-2</invoiceText> 
     <position>14</position> 
     ... 
    </row> 
</root> 

t:\ftemp>call xslt2 invoices.xml invoices.xsl 
<?xml version="1.0" encoding="UTF-8"?> 
<root> 
    <!--Batch max lines: 5--> 
    <batch> 
    <!--invoice numbers: 1 2--> 
    <!--total line count: 4--> 
    <row> 
     <invoiceNumber>1</invoiceNumber> 
     <invoiceText>invoice 1-1</invoiceText> 
     <position>1</position> 
     ... 
    </row> 
     <row> 
     <invoiceNumber>1</invoiceNumber> 
     <invoiceText>invoice 1-2</invoiceText> 
     <position>2</position> 
     ... 
    </row> 
     <row> 
     <invoiceNumber>2</invoiceNumber> 
     <invoiceText>invoice 2-1</invoiceText> 
     <position>3</position> 
     ... 
    </row> 
     <row> 
     <invoiceNumber>2</invoiceNumber> 
     <invoiceText>invoice 2-2</invoiceText> 
     <position>4</position> 
     ... 
    </row> 
    </batch> 
    <batch> 
    <!--invoice numbers: 3--> 
    <!--total line count: 2--> 
    <row> 
     <invoiceNumber>3</invoiceNumber> 
     <invoiceText>invoice 3-1</invoiceText> 
     <position>5</position> 
     ... 
    </row> 
     <row> 
     <invoiceNumber>3</invoiceNumber> 
     <invoiceText>invoice 3-2</invoiceText> 
     <position>6</position> 
     ... 
    </row> 
    </batch> 
    <batch> 
    <!--invoice numbers: 4--> 
    <!--total line count: 6--> 
    <row> 
     <invoiceNumber>4</invoiceNumber> 
     <invoiceText>invoice 4-1</invoiceText> 
     <position>7</position> 
     ... 
    </row> 
     <row> 
     <invoiceNumber>4</invoiceNumber> 
     <invoiceText>invoice 4-2</invoiceText> 
     <position>8</position> 
     ... 
    </row> 
     <row> 
     <invoiceNumber>4</invoiceNumber> 
     <invoiceText>invoice 4-3</invoiceText> 
     <position>9</position> 
     ... 
    </row> 
     <row> 
     <invoiceNumber>4</invoiceNumber> 
     <invoiceText>invoice 4-4</invoiceText> 
     <position>10</position> 
     ... 
    </row> 
     <row> 
     <invoiceNumber>4</invoiceNumber> 
     <invoiceText>invoice 4-5</invoiceText> 
     <position>11</position> 
     ... 
    </row> 
     <row> 
     <invoiceNumber>4</invoiceNumber> 
     <invoiceText>invoice 4-6</invoiceText> 
     <position>12</position> 
     ... 
    </row> 
    </batch> 
    <batch> 
    <!--invoice numbers: 5--> 
    <!--total line count: 2--> 
    <row> 
     <invoiceNumber>5</invoiceNumber> 
     <invoiceText>invoice 5-1</invoiceText> 
     <position>13</position> 
     ... 
    </row> 
     <row> 
     <invoiceNumber>5</invoiceNumber> 
     <invoiceText>invoice 5-2</invoiceText> 
     <position>14</position> 
     ... 
    </row> 
    </batch> 
</root> 

t:\ftemp>type invoices.xsl 
<?xml version="1.0" encoding="US-ASCII"?> 
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
       version="2.0"> 

<xsl:output indent="yes"/> 

<xsl:param name="batch-size" select="5"/> 

<xsl:variable name="valid-numbers" 
       select="doc('numbers.xml')/root/row/invoiceNumber"/> 

<xsl:key name="invoice-lines-by-invoice-number" 
     match="row" use="invoiceNumber"/> 

<xsl:variable name="input" select="/"/> 

<xsl:template match="/"> 
    <root> 
    <xsl:text>&#xa; </xsl:text> 
    <xsl:comment select="'Batch max lines:',$batch-size"/> 
    <xsl:text>&#xa; </xsl:text> 
    <xsl:call-template name="next-batch"> 
     <xsl:with-param name="remaining-numbers" 
     select="distinct-values(root/row/invoiceNumber)[.=$valid-numbers]"/> 
    </xsl:call-template> 
    </root> 
</xsl:template> 

<xsl:template name="next-batch"> 
    <xsl:param name="this-batch-lines" select="0"/> 
    <xsl:param name="this-batch-numbers" select="()"/> 
    <xsl:param name="remaining-numbers" required="yes"/> 
    <xsl:variable name="this-invoice" select="$remaining-numbers[1]"/> 
    <xsl:variable name="this-invoice-lines" 
    select="count(key('invoice-lines-by-invoice-number',$this-invoice,$input))"/> 

    <xsl:choose> 
    <xsl:when test="not($this-invoice) and not($this-batch-lines)"> 
     <!--nothing to clean up and nothing more to do--> 
    </xsl:when> 
    <xsl:when test="not($this-invoice) (:last invoice complete:) or 
        ($this-batch-lines + $this-invoice-lines > $batch-size) 
         (:this invoice exceeds limit:)"> 
     <!--clean up previous unfinished batch--> 
     <batch> 
     <xsl:text>&#xa; </xsl:text> 
     <xsl:comment select="'invoice numbers:',$this-batch-numbers"/> 
     <xsl:text>&#xa; </xsl:text> 
     <xsl:comment select="'total line count:',$this-batch-lines"/> 
     <xsl:text>&#xa; </xsl:text> 
     <xsl:copy-of select="for $num in $this-batch-numbers return 
         key('invoice-lines-by-invoice-number',$num,$input)"/> 
     </batch> 
     <xsl:if test="$this-invoice"> 
     <!--continue with the next batch comprised of this invoice only--> 
     <xsl:call-template name="next-batch"> 
      <xsl:with-param name="this-batch-lines" 
          select="$this-invoice-lines"/> 
      <xsl:with-param name="this-batch-numbers" 
          select="$this-invoice"/> 
      <xsl:with-param name="remaining-numbers" 
          select="$remaining-numbers[position()>1]"/> 
     </xsl:call-template> 
     </xsl:if> 
     <!--the cleaned up batch was the last batch, template recursion ends--> 
    </xsl:when> 
    <xsl:otherwise> 
     <!--a batch limit has not been exceeded; add this invoice to batch--> 
     <xsl:call-template name="next-batch"> 
     <xsl:with-param name="this-batch-lines" 
         select="$this-batch-lines + $this-invoice-lines"/> 
     <xsl:with-param name="this-batch-numbers" 
         select="($this-batch-numbers,$this-invoice)"/> 
     <xsl:with-param name="remaining-numbers" 
          select="$remaining-numbers[position()>1]"/> 
     </xsl:call-template> 
    </xsl:otherwise> 
    </xsl:choose> 
</xsl:template> 

</xsl:stylesheet> 