2016-05-13 69 views
1

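RegExp: replacing matching parentheses in nested structures

How can I replace a pair of matching opening/closing parentheses when the opening parenthesis directly follows the keyword array? Can regular expressions help with this kind of problem?

To be more specific, I would like to solve this using JavaScript or PHP.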

// input 
$data = array(
    'id' => nextId(), 
    'profile' => array(
     'name' => 'Hugo Hurley', 
     'numbers' => (4 + 8 + 15 + 16 + 23 + 42)/108 
    ) 
); 

// desired output 
$data = [ 
    'id' => nextId(), 
    'profile' => [ 
     'name' => 'Hugo Hurley', 
     'numbers' => (4 + 8 + 15 + 16 + 23 + 42)/108 
    ] 
]; 
+0

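As far as regex is concerned, this is not language-agnostic, and that matters a great deal. Not many regex flavors support the recursion you need. So, which regex engine are you using?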

+0

Unfortunately, regex syntax and features are not language-agnostic.

+0

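Balanced text via function calls (subroutine recursion) can be done in PCRE/Perl. Balancing groups (counting) can be done in .NET. Those are your options. With PCRE (and, I think, .NET) you only get the nested text as a whole. To break it down further, you have to parse its contents recursively. – sln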

Answers

3

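Tim Pietzcker gave the .NET counting version.
It has the same elements as the PCRE (PHP) version below.

All the same caveats apply. In particular, non-array parentheses must
be balanced, because they use the same closing parenthesis as the delimiter.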
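
To make the balance caveat concrete, here is a minimal JavaScript sketch of a pre-check that could be run before attempting any replacement. The function name parensBalanced is hypothetical and not part of either answer, and the check is deliberately naive: it counts every parenthesis, including those inside string literals and comments.

```javascript
// Hypothetical pre-check (an assumption, not from the answers):
// returns true when every '(' has a matching ')' in the right order.
function parensBalanced(text) {
  let depth = 0;
  for (const ch of text) {
    if (ch === '(') {
      depth += 1;
    } else if (ch === ')') {
      depth -= 1;
      // A closer with no open parenthesis before it is unbalanced.
      if (depth < 0) return false;
    }
  }
  // Anything still open at the end is unbalanced as well.
  return depth === 0;
}

console.log(parensBalanced("array('id' => nextId())"));
console.log(parensBalanced("array('name' => 'Hugo (((Hurley')"));
```

If the check fails, the regex-based approaches discussed here are likely to mis-pair parentheses, so the input would need a string-aware parser instead.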

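All of the text has to be parsed (or should be).
The outer groups 1, 2, 3, 4 let you get the parts:
CONTENT
CORE-1 array()
CORE-2 any ()
EXCEPTIONS

Each match gets you exactly one of these outer parts; they are mutually exclusive.

The trick is to define a PHP function parse(core) that parses a CORE.
Inside that function is a while (regex.search(core)) { .. } loop.

Each time either the CORE-1 or the CORE-2 group matches, call parse(core), passing
the contents of that core's group to it.

Inside the loop, just take off the content and assign it to the hash.

Obviously, the group 1 construct that calls (?&content) should be replaced with constructs that extract your hash-like variable data.

At a detailed level, this can get very tedious.
Usually you have to account for every single character to parse the whole thing correctly.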
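
The parse(core) loop described above can be sketched in JavaScript. JavaScript regexes have no recursion, so this sketch swaps in a different technique: a small token regex plus a depth counter finds the matching closing parenthesis, and the function then recurses into the body. The names TOKEN, findClose and the node shape { content, arrays, groups, errors } are assumptions made for illustration; they do not come from the answer.

```javascript
// Tokens of interest: an 'array(' opener, or a lone parenthesis.
const TOKEN = /\barray\s*\(|[()]/gi;

// Returns the index of the ')' that closes a body starting at `from`,
// or -1 when no matching closer exists.
function findClose(text, from) {
  const re = new RegExp(TOKEN.source, 'gi');
  re.lastIndex = from;
  let depth = 1;
  let m;
  while ((m = re.exec(text)) !== null) {
    depth += m[0] === ')' ? -1 : 1;
    if (depth === 0) return m.index;
  }
  return -1;
}

// Splits `core` into CONTENT, CORE-1 (array bodies), CORE-2 (plain
// parenthesized bodies) and EXCEPTIONS (unbalanced tokens), recursively.
function parse(core) {
  const node = { content: '', arrays: [], groups: [], errors: [] };
  const re = new RegExp(TOKEN.source, 'gi');
  let pos = 0;
  let m;
  while ((m = re.exec(core)) !== null) {
    node.content += core.slice(pos, m.index);          // CONTENT
    const bodyStart = m.index + m[0].length;
    const close = m[0] === ')' ? -1 : findClose(core, bodyStart);
    if (close === -1) {                                 // EXCEPTIONS
      node.errors.push({ token: m[0], position: m.index });
      pos = bodyStart;
    } else {                                            // CORE-1 or CORE-2
      const child = parse(core.slice(bodyStart, close));
      (m[0] === '(' ? node.groups : node.arrays).push(child);
      pos = close + 1;
    }
    re.lastIndex = pos;
  }
  node.content += core.slice(pos);
  return node;
}

const tree = parse("$data = array('id' => nextId(), 'profile' => array('n' => (4 + 8)/108));");
console.log(JSON.stringify(tree, null, 2));
```

As in the answer, every step yields exactly one of content, an array core, a plain core, or an exception. The sketch does not look inside string literals or comments.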

(?is)(?:((?&content))|(?>\barray\s*\()((?=.)(?&core)|)\)|\(((?=.)(?&core)|)\)|(\barray\s*\(|[()]))(?(DEFINE)(?<core>(?>(?&content)|(?>\barray\s*\()(?:(?=.)(?&core)|)\)|\((?:(?=.)(?&core)|)\))+)(?<content>(?>(?!\barray\s*\(|[()]).)+)) 

Expanded

# 1: CONTENT 
# 2: CORE-1 
# 3: CORE-2 
# 4: EXCEPTIONS 

(?is) 

(?: 
     (         # (1), Take off CONTENT 
      (?&content) 
    ) 
    |         # OR ----------------------------- 
     (?>        # Start 'array(' 
      \b array \s* \(
    ) 
     (         # (2), Take off 'array(CORE-1)' 
      (?= .) 
      (?&core) 
     | 
    ) 
     \)         # End ')' 
    |         # OR ----------------------------- 
     \(        # Start '(' 
     (         # (3), Take off '(any CORE-2)' 
      (?= .) 
      (?&core) 
     | 
    ) 
     \)         # End ')' 
    |         # OR ----------------------------- 
     (         # (4), Take off Unbalanced or Exceptions 
      \b array \s* \(
     | [()] 
    ) 
) 

# Subroutines 
# --------------- 

(?(DEFINE) 

     # core 
     (?<core> 
      (?> 
       (?&content) 
      | 
       (?> \b array \s* \() 
       # recurse core of array() 
       (?: 
        (?= .) 
        (?&core) 
        | 
       ) 
       \) 
      | 
       \(
       # recurse core of any () 
       (?: 
        (?= .) 
        (?&core) 
        | 
       ) 
       \) 
      )+ 
    ) 

     # content 
     (?<content> 
      (?> 
       (?! 
        \b array \s* \(
        | [()] 
       ) 
       . 
      )+ 
    ) 
) 

Output

** Grp 0   - (pos 0 , len 11) 
some_var = 
** Grp 1   - (pos 0 , len 11) 
some_var = 
** Grp 2   - NULL 
** Grp 3   - NULL 
** Grp 4 [core] - NULL 
** Grp 5 [content] - NULL 

----------------------- 

** Grp 0   - (pos 11 , len 153) 
array(
    'id' => nextId(), 
    'profile' => array(
     'name' => 'Hugo Hurley', 
     'numbers' => (4 + 8 + 15 + 16 + 23 + 42)/108 
    ) 
) 
** Grp 1   - NULL 
** Grp 2   - (pos 17 , len 146) 

    'id' => nextId(), 
    'profile' => array(
     'name' => 'Hugo Hurley', 
     'numbers' => (4 + 8 + 15 + 16 + 23 + 42)/108 
    ) 

** Grp 3   - NULL 
** Grp 4 [core] - NULL 
** Grp 5 [content] - NULL 

------------------------------------- 

** Grp 0   - (pos 164 , len 3) 
; 

** Grp 1   - (pos 164 , len 3) 
; 

** Grp 2   - NULL 
** Grp 3   - NULL 
** Grp 4 [core] - NULL 
** Grp 5 [content] - NULL 

Here is an earlier incarnation, written for something else, to give an idea of the usage:

# Perl code: 
# 
#  use strict; 
#  use warnings; 
#  
#  use Data::Dumper; 
#  
#  $/ = undef; 
#  my $content = <DATA>; 
#  
#  # Set the error mode on/off here .. 
#  my $BailOnError = 1; 
#  my $IsError = 0; 
#  
#  my $href = {}; 
#  
#  ParseCore($href, $content); 
#  
#  #print Dumper($href); 
#  
#  print "\n\n"; 
#  print "\nBase======================\n"; 
#  print $href->{content}; 
#  print "\nFirst======================\n"; 
#  print $href->{first}->{content}; 
#  print "\nSecond======================\n"; 
#  print $href->{first}->{second}->{content}; 
#  print "\nThird======================\n"; 
#  print $href->{first}->{second}->{third}->{content}; 
#  print "\nFourth======================\n"; 
#  print $href->{first}->{second}->{third}->{fourth}->{content}; 
#  print "\nFifth======================\n"; 
#  print $href->{first}->{second}->{third}->{fourth}->{fifth}->{content}; 
#  print "\nSix======================\n"; 
#  print $href->{six}->{content}; 
#  print "\nSeven======================\n"; 
#  print $href->{six}->{seven}->{content}; 
#  print "\nEight======================\n"; 
#  print $href->{six}->{seven}->{eight}->{content}; 
#  
#  exit; 
#  
#  
#  sub ParseCore 
#  { 
#   my ($aref, $core) = @_; 
#   my ($k, $v); 
#   while ($core =~ /(?is)(?:((?&content))|(?><!--block:(.*?)-->)((?&core)|)<!--endblock-->|(<!--(?:block:.*?|endblock)-->))(?(DEFINE)(?<core>(?>(?&content)|(?><!--block:.*?-->)(?:(?&core)|)<!--endblock-->)+)(?<content>(?>(?!<!--(?:block:.*?|endblock)-->).)+))/g) 
#   { 
#   if (defined $1) 
#   { 
#    # CONTENT 
#    $aref->{content} .= $1; 
#   } 
#   elsif (defined $2) 
#   { 
#    # CORE 
#    $k = $2; $v = $3; 
#    $aref->{$k} = {}; 
#  #   $aref->{$k}->{content} = $v; 
#  #   $aref->{$k}->{match} = $&; 
#     
#    my $curraref = $aref->{$k}; 
#    my $ret = ParseCore($aref->{$k}, $v); 
#    if ($BailOnError && $IsError) { 
#     last; 
#    } 
#    if (defined $ret) { 
#     $curraref->{'#next'} = $ret; 
#    } 
#   } 
#   else 
#   { 
#    # ERRORS 
#    print "Unbalanced '$4' at position = ", $-[0]; 
#    $IsError = 1; 
#  
#    # Decide to continue here .. 
#    # If BailOnError is set, just unwind recursion. 
#    # ------------------------------------------------- 
#    if ($BailOnError) { 
#     last; 
#    } 
#   } 
#   } 
#   return $k; 
#  } 
#  
#  #================================================ 
#  __DATA__ 
#  some html content here top base 
#  <!--block:first--> 
#   <table border="1" style="color:red;"> 
#   <tr class="lines"> 
#    <td align="left" valign="<--valign-->"> 
#   <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a> 
#   <!--hello--> <--again--><!--world--> 
#   some html content here 1 top 
#   <!--block:second--> 
#    some html content here 2 top 
#    <!--block:third--> 
#     some html content here 3 top 
#     <!--block:fourth--> 
#      some html content here 4 top 
#      <!--block:fifth--> 
#       some html content here 5a 
#       some html content here 5b 
#      <!--endblock--> 
#     <!--endblock--> 
#     some html content here 3a 
#     some html content here 3b 
#    <!--endblock--> 
#    some html content here 2 bottom 
#   <!--endblock--> 
#   some html content here 1 bottom 
#  <!--endblock--> 
#  some html content here1-5 bottom base 
#  
#  some html content here 6-8 top base 
#  <!--block:six--> 
#   some html content here 6 top 
#   <!--block:seven--> 
#    some html content here 7 top 
#    <!--block:eight--> 
#     some html content here 8a 
#     some html content here 8b 
#    <!--endblock--> 
#    some html content here 7 bottom 
#   <!--endblock--> 
#   some html content here 6 bottom 
#  <!--endblock--> 
#  some html content here 6-8 bottom base 
# 
# Output >> 
# 
#  Base====================== 
#  some html content here top base 
#  
#  some html content here1-5 bottom base 
#  
#  some html content here 6-8 top base 
#  
#  some html content here 6-8 bottom base 
#  
#  First====================== 
#  
#   <table border="1" style="color:red;"> 
#   <tr class="lines"> 
#    <td align="left" valign="<--valign-->"> 
#   <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a> 
#   <!--hello--> <--again--><!--world--> 
#   some html content here 1 top 
#   
#   some html content here 1 bottom 
#  
#  Second====================== 
#  
#    some html content here 2 top 
#    
#    some html content here 2 bottom 
#   
#  Third====================== 
#  
#     some html content here 3 top 
#     
#     some html content here 3a 
#     some html content here 3b 
#    
#  Fourth====================== 
#  
#      some html content here 4 top 
#      
#     
#  Fifth====================== 
#  
#       some html content here 5a 
#       some html content here 5b 
#      
#  Six====================== 
#  
#   some html content here 6 top 
#   
#   some html content here 6 bottom 
#  
#  Seven====================== 
#  
#    some html content here 7 top 
#    
#    some html content here 7 bottom 
#   
#  Eight====================== 
#  
#     some html content here 8a 
#     some html content here 8b 
#   
+0

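There is a lot more to this, but it is contrived for the sake of the example. Building a proper parser on top of a regex solution is a lot of work and not for the faint of heart. – sln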

2

How about the following (using the .NET regex engine):

resultString = Regex.Replace(subjectString, 
    @"\barray\(   # Match 'array(' 
    (      # Capture in group 1: 
    (?>     # Start a possessive group: 
     (?:     # Either match 
     (?!\barray\(|[()]) # only if we're not before another array or parens 
     .     # any character 
    )+     # once or more 
    |      # or 
     \((?<Depth>)  # match '(' (and increase the nesting counter) 
    |      # or 
     \) (?<-Depth>)  # match ')' (and decrease the nesting counter). 
    )*     # Repeat as needed. 
    (?(Depth)(?!))  # Assert that the nesting counter is at zero. 
    )      # End of capturing group. 
    \)      # Then match ')'.", 
    "[$1]", RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline); 

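This regex matches array(...) where ... may contain anything except another array(...), so it only matches the most deeply nested occurrences. It does allow other nested (and correctly balanced) parentheses within ..., but it performs no check on whether they are semantic parentheses or whether they sit inside strings or comments.

In other words, something like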

array(
    'name' => 'Hugo (((Hurley', 
    'numbers' => (4 + 8 + 15 + 16 + 23 + 42)/108 
) 

would fail to match (correctly).
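
For inputs like this one, a scanner that skips quoted strings is one possible alternative. The following JavaScript sketch is not the .NET regex; it is a single-pass replacement with a stack, under the assumptions that strings are single- or double-quoted with backslash escapes and that comments and heredocs never contain parentheses. The name arrayToBrackets is made up for illustration.

```javascript
// Hypothetical single-pass converter: array( ... ) becomes [ ... ].
function arrayToBrackets(src) {
  // Tokens: a quoted string, an 'array(' opener, or a lone parenthesis.
  const token = /'(?:[^'\\]|\\[\s\S])*'|"(?:[^"\\]|\\[\s\S])*"|\barray\s*\(|[()]/gi;
  const stack = []; // the closer expected for each currently open parenthesis
  return src.replace(token, (m) => {
    if (m === '(') {
      stack.push(')');   // plain parenthesis: keep it as it is
      return m;
    }
    if (m === ')') {
      // Emit whichever closer the matching opener asked for;
      // a stray ')' with nothing open is left untouched.
      return stack.length ? stack.pop() : m;
    }
    if (m[0] === "'" || m[0] === '"') {
      return m;          // string literal: copy verbatim
    }
    stack.push(']');     // 'array(' opener
    return '[';
  });
}

console.log(arrayToBrackets("array(\n    'name' => 'Hugo (((Hurley',\n    'numbers' => (4 + 8 + 15 + 16 + 23 + 42)/108\n)"));
```

Because strings are consumed as whole tokens, the parentheses inside 'Hugo (((Hurley' never reach the stack.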

The regex has to be applied iteratively until it no longer modifies its input; in your example, two iterations are enough.
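
The repeat-until-stable idea can be sketched in JavaScript as well. JavaScript has no balancing groups, so the pattern below is a simplified stand-in rather than a port of the .NET regex: it assumes plain parentheses inside an array(...) nest at most one level deep, which holds for the sample in the question. The name convertArrays is hypothetical.

```javascript
// Hypothetical helper: rewrites innermost array( ... ) occurrences to [ ... ]
// and repeats until a pass changes nothing.
function convertArrays(src) {
  // Body: any non-parenthesis character that does not start another 'array(',
  // or one flat (non-nested) parenthesized group.
  const innermost = /\barray\s*\(((?:(?!\barray\s*\()[^()]|\([^()]*\))*)\)/g;
  let previous;
  do {
    previous = src;
    src = src.replace(innermost, '[$1]');
  } while (src !== previous);
  return src;
}

const input = [
  "$data = array(",
  "    'id' => nextId(),",
  "    'profile' => array(",
  "        'name' => 'Hugo Hurley',",
  "        'numbers' => (4 + 8 + 15 + 16 + 23 + 42)/108",
  "    )",
  ");",
].join("\n");

console.log(convertArrays(input));
```

On the question's sample, the first pass rewrites the inner array and the second pass rewrites the outer one; a third pass detects that nothing changed and ends the loop.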
