2015-08-17

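Capturing recursive groups of a regular expression using backreferences in Java

I am trying to capture multiple groups recursively in a regular expression that uses a backreference to one of its own groups. Even though I am using Pattern and Matcher with a "while(matcher.find())" loop, it still captures only the last instance instead of all instances. In my case, the only possible tags are <sm>, <po>, <pof>, <pos>, <poi>, <pol>, <poif>, <poil>. Since these are formatting tags, I need to capture: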

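  1. any text outside of a tag, so that I can format it as "normal" text (I do this by capturing any text before a tag into one group and the tag itself into another group; as I iterate over the occurrences, I remove everything that was captured from the original string, and if any text is left over at the end, I format it as "normal" text)
  2. the "name" of the tag, so that I know how I have to format the text inside the tag
  3. the text content of the tag, which will be formatted according to the tag name and its associated rules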

Here is my sample code:

 String currentText = "the man said:<pof>“This one, at last, is bone of my bones</pof><poi>and flesh of my flesh;</poi><po>This one shall be called ‘woman,’</po><poil>for out of man this one has been taken.”</poil>"; 
     String remainingText = currentText; 

     //first check if our string even has any kind of xml tag, because if not we will just format the whole string as "normal" text 
     if(currentText.matches("(?su).*<[/]{0,1}(?:sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1}>.*")) 
     {     
      //an opening or closing tag has been found, so let us start our pattern captures 
      //I am using a backreference \\2 to make sure the closing tag is the same as the opening tag 
      Pattern pattern1 = Pattern.compile("(.*)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",Pattern.UNICODE_CHARACTER_CLASS); 
      Matcher matcher1 = pattern1.matcher(currentText);     
      int iteration = 0; 
      while(matcher1.find()){ 
       System.out.print("Iteration "); 
       System.out.println(++iteration); 
       System.out.println("group1:"+matcher1.group(1)); 
       System.out.println("group2:"+matcher1.group(2)); 
       System.out.println("group3:"+matcher1.group(3)); 
       System.out.println("group4:"+matcher1.group(4)); 

       if(matcher1.group(1) != null && matcher1.group(1).isEmpty() == false) 
       { 
        m_xText.insertString(xTextRange, matcher1.group(1), false); 
        remainingText = remainingText.replaceFirst(matcher1.group(1), ""); 
       } 
       if(matcher1.group(4) != null && matcher1.group(4).isEmpty() == false) 
       { 
        switch (matcher1.group(2)) { 
         case "pof": [...] 
         case "pos": [...] 
         case "poif": [...] 
         case "po": [...] 
         case "poi": [...] 
         case "pol": [...] 
         case "poil": [...] 
         case "sm": [...] 
        } 
        remainingText = remainingText.replaceFirst("<"+matcher1.group(2)+">"+matcher1.group(4)+"</"+matcher1.group(2)+">", ""); 
       } 
      } 
     } 

The System.out.println calls print to my console only once, with these results:

Iteration 1: 
    group1:the man said:<pof>“This one, at last, is bone of my bones</pof><poi>and flesh of my flesh;</poi><po>This one shall be called ‘woman,’</po>; 
    group2:poil 
    group3:po 
    group4:for out of man this one has been taken.” 

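Group 3 is to be ignored; the only useful groups are 1, 2 and 4 (group 3 is part of group 2). Why does this capture only the last tag instance, "poil", and not the preceding "pof", "poi" and "po" tags?

The output I would like to see is this: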

Iteration 1: 
    group1:the man said: 
    group2:pof 
    group3:po 
    group4:“This one, at last, is bone of my bones 

Iteration 2: 
    group1: 
    group2:poi 
    group3:po 
    group4:and flesh of my flesh; 

Iteration 3: 
    group1: 
    group2:po 
    group3:po 
    group4:This one shall be called ‘woman,’ 

Iteration 4: 
    group1: 
    group2:poil 
    group3:po 
    group4:for out of man this one has been taken.” 

Answers

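I just found the answer to this problem: the first capture group simply needed a non-greedy quantifier, just like the one I already had in the fourth capture group. With the greedy (.*), the first group swallows as much text as possible and only backtracks as far as the last opening tag, so a single match consumes the whole string. This works exactly as needed: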

Pattern pattern1 = Pattern.compile("(.*?)<((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})>(.*?)</\\2>",Pattern.UNICODE_CHARACTER_CLASS); 
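To make the difference easy to check, here is a minimal, self-contained sketch that runs both variants of the pattern side by side. The class name TagCaptureDemo, the helper methods findAll and trailingText, and the shortened ASCII sample string (including its trailing " The end.") are assumptions made for illustration and are not part of the original code. The trailingText helper also shows one possible alternative to the replaceFirst bookkeeping: it reads the leftover "normal" text from Matcher.end(), which avoids passing captured text to replaceFirst, where it would be interpreted as a regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample text are illustrative assumptions.
class TagCaptureDemo {
    // Tag sub-pattern kept verbatim from the post (groups 2 and 3).
    static final String TAG = "((sm|po)[f|l|s|i|3]{0,1}[f|l]{0,1})";
    // Greedy first group, as in the question.
    static final String GREEDY = "(.*)<" + TAG + ">(.*?)</\\2>";
    // Reluctant first group, as in the fix.
    static final String LAZY = "(.*?)<" + TAG + ">(.*?)</\\2>";

    // Shortened ASCII stand-in for the sample text, with assumed trailing text.
    static final String SAMPLE = "the man said:<pof>bone of my bones</pof>"
            + "<poi>and flesh of my flesh;</poi><po>called woman,</po>"
            + "<poil>out of man.</poil> The end.";

    /** Returns one {textBefore, tagName, tagContent} triple per match. */
    static List<String[]> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS).matcher(input);
        List<String[]> out = new ArrayList<>();
        while (m.find()) {
            // Group 3 is only the "sm|po" prefix of the tag name, so it is skipped.
            out.add(new String[] { m.group(1), m.group(2), m.group(4) });
        }
        return out;
    }

    /** Returns the text after the last match, using Matcher.end() instead of replaceFirst. */
    static String trailingText(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS).matcher(input);
        int end = 0;
        while (m.find()) {
            end = m.end(); // offset just past the closing tag of this match
        }
        return input.substring(end);
    }

    public static void main(String[] args) {
        for (String regex : new String[] { GREEDY, LAZY }) {
            System.out.println("Pattern: " + regex);
            int iteration = 0;
            for (String[] match : findAll(regex, SAMPLE)) {
                System.out.println("Iteration " + (++iteration));
                System.out.println("group1:" + match[0]);
                System.out.println("group2:" + match[1]);
                System.out.println("group4:" + match[2]);
            }
        }
        System.out.println("leftover:" + trailingText(LAZY, SAMPLE));
    }
}
```

On this sample, the greedy pattern produces a single match whose tag is "poil", while the reluctant pattern produces four matches (pof, poi, po, poil), with "the man said:" as group 1 of the first one.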