2016-05-17 21 views

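I have two regular expressions, one that matches [value] and another that matches HTML attributes, but I need to combine them into a single regex. PHP preg_replace: find a match in HTML, but not if it is inside an HTML attribute

This is my working regex that finds [value]: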

$tagregexp = '[a-zA-Z_\-][0-9a-zA-Z_\-\+]{2,}'; 

    $pattern = 
      '\\['        // Opening bracket 
     . '(\\[?)'       // 1: Optional second opening bracket for escaping shortcodes: [[tag]] 
     . "($tagregexp)"      // 2: Shortcode name 
     . '(?![\\w-])'      // Not followed by word character or hyphen 
     . '('        // 3: Unroll the loop: Inside the opening shortcode tag 
     .  '[^\\]\\/]*'     // Not a closing bracket or forward slash 
     .  '(?:' 
     .   '\\/(?!\\])'    // A forward slash not followed by a closing bracket 
     .   '[^\\]\\/]*'    // Not a closing bracket or forward slash 
     .  ')*?' 
     . ')' 
     . '(?:' 
     .  '(\\/)'      // 4: Self closing tag ... 
     .  '\\]'       // ... and closing bracket 
     . '|' 
     .  '\\]'       // Closing bracket 
     .  '(?:' 
     .   '('      // 5: Unroll the loop: Optionally, anything between the opening and closing shortcode tags 
     .    '[^\\[]*+'    // Not an opening bracket 
     .    '(?:' 
     .     '\\[(?!\\/\\2\\])' // An opening bracket not followed by the closing shortcode tag 
     .     '[^\\[]*+'   // Not an opening bracket 
     .    ')*+' 
     .   ')' 
     .   '\\[\\/\\2\\]'    // Closing shortcode tag 
     .  ')?' 
     . ')' 
     . '(\\]?)';       // 6: Optional second closing bracket for escaping shortcodes: [[tag]] 

example here

This regex, (\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?, matches attributes and their values. example here

I want the regex to find a match in the following examples:
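For reference, a minimal sketch of what that attribute regex captures on the problematic input. The ~ delimiter and the nowdoc wrapper are my own choices for illustration, not part of the question.

```php
<?php
// Attribute regex from the question, wrapped in ~ delimiters (an arbitrary choice).
$attrPattern = <<<'REGEX'
~(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?~
REGEX;

$html = '<input attr="attribute[value]"/>';

if (preg_match($attrPattern, $html, $m)) {
    // Group 1 is the attribute name, group 2 the attribute value without quotes.
    echo $m[1], ' => ', $m[2], PHP_EOL; // attr => attribute[value]
}
```

The bracketed text ends up inside group 2, which is exactly the case the combined regex is supposed to skip.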

  • <div [value] ></div>
  • <div>[value]</div>

but does not find a [value] match in this example:

  • <input attr="attribute[value]"/>

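I just need to turn this into a single regex that I can use in my preg_replace_callback:

preg_replace_callback("/$pattern/", 'replace_matches', $html); // pattern needs delimiters; callback name is passed as a string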

Have you considered using a parser? – chris85


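It's a PHP string, not a Java string; you don't need to escape everything. Use the x modifier (and a nowdoc string if you can) instead of concatenation. If you want to process HTML (or XML), forget regex and use DOMDocument (and eventually DOMXPath). –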
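A minimal sketch of the nowdoc plus x modifier suggestion. The pattern below is a deliberately simplified, hypothetical stand-in for the asker's shortcode regex, and the ~ delimiter is an arbitrary choice; the point is only that comments can live inside the pattern and no backslash has to be doubled.

```php
<?php
// Hypothetical, simplified pattern for illustration; not the asker's full shortcode regex.
$pattern = <<<'REGEX'
~
\[                     # opening bracket
( [a-zA-Z_-]           # 1: first character of the tag name
  [0-9a-zA-Z_+-]{2,}   #    at least two more name characters
)
]                      # closing bracket, no escape needed here
~x
REGEX;

preg_match_all($pattern, 'a [value] and [[tag]]', $m);
print_r($m[1]); // value, tag
```

With the x modifier, whitespace outside character classes is ignored and # starts a comment, so the comments must not contain the delimiter character.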


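Another thing: the closing square bracket isn't a special character, so you don't need to escape it. An opening square bracket inside a character class is nothing special either, so you can write '[^[]' instead of '[^\\[]'. *(You can even write '[^]]' and '[]]', since in the first position the closing square bracket is treated as a literal character.)* –

Answers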
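A quick sketch to check that claim in PHP. The subject string and variable names are made up for the example.

```php
<?php
$subject = 'a[b]c';

// Three spellings of "one or more characters that are not an opening bracket".
$variants = [
    '/[^\\[]+/', // doubled backslash, as in the question
    '/[^\[]+/',  // single backslash: single-quoted PHP strings keep \[ as-is
    '/[^[]+/',   // no backslash: [ is literal inside a character class
];

$results = [];
foreach ($variants as $pattern) {
    preg_match_all($pattern, $subject, $m);
    $results[] = $m[0];
    echo $pattern, ' => ', implode(' | ', $m[0]), PHP_EOL;
}

// A closing bracket needs no escape outside a class, or as the first character of a class.
preg_match('/\[b]/', $subject, $outside);      // matches "[b]"
preg_match('/[]]/', $subject, $firstInClass);  // matches "]"
```

All three variants should split the subject at the opening bracket in the same way.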

1

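Foreword

On the surface it looks like you are trying to parse HTML with a regular expression. I feel obliged to point out that using a regex to parse HTML is inadvisable because of all the obscure edge cases that can crop up, but it seems you have some control over the HTML, so you should be able to avoid many of the edge cases the regex police cry about.

Description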

<\w+\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\[(?<DesiredValue>[^\]]*)\]) 
| 
<\w+\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*> 
(?:(?!<\/div>)(?!\[).)*\[(?<DesiredValue>[^\]]*)\] 

Regular expression visualization

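This regular expression will do the following:

  • finds substrings wrapped in square brackets, such as [some value]

    • captures the substring when it sits directly inside a tag, as in <div [value] >
    • captures the substring when [value] is not inside a tag, as in <div>[value]</div>
    • provided the substring is in the attribute area of the tag and not nested inside another attribute's value, as in <input attrib=" [value] ">
  • the captured substring does not include the wrapping square brackets
  • allows any tag name; replace \w+ with the desired tag name to restrict it
  • allows value to be any string
  • avoids difficult edge cases

Note: this expression is best used with the following flags:

  • global
  • dot matches new line
  • ignore whitespace in the expression
  • allow duplicate named capture groups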
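In PHP these flags could be expressed roughly as follows; this is a sketch under my own assumptions, not the answerer's code. preg_match_all stands in for "global", the s and x modifiers cover dot-matches-newline and ignored whitespace, and the inline (?J) option allows the duplicate group name. The ~ delimiter and the variable names are arbitrary.

```php
<?php
// The answer's pattern, unchanged, kept in a nowdoc so no backslash needs doubling.
$pattern = <<<'REGEX'
~(?J)
<\w+\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\[(?<DesiredValue>[^\]]*)\])
|
<\w+\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>
(?:(?!<\/div>)(?!\[).)*\[(?<DesiredValue>[^\]]*)\]
~sx
REGEX;

$html = <<<'HTML'
<div [value 1] ></div>
<div>[value 2]</div>
but not find a match in this example

<div attr="attribute[value 3]"/>
<img [value 4]>
<a href="http://[value 5]">[value 6]</a>
HTML;

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$values = [];
foreach ($matches as $m) {
    // Group 1 belongs to the first branch, group 2 to the second. Reading them by
    // number avoids depending on how duplicate names are reported.
    $values[] = ($m[1] ?? '') !== '' ? $m[1] : ($m[2] ?? '');
}

print_r($values); // value 1, value 2, value 4, value 6
```

PREG_SET_ORDER keeps each match's groups together, which makes it easy to pick whichever branch actually captured something.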

Examples

Live Demo

https://regex101.com/r/tT0bN5/1

Sample text

<div [value 1] ></div> 
<div>[value 2]</div> 
but not find a match in this example 

<div attr="attribute[value 3]"/> 
<img [value 4]> 
<a href="http://[value 5]">[value 6]</a> 

Sample Matches

MATCH 1 
DesiredValue [6-13] `value 1` 
MATCH 2 
DesiredValue [29-36] `value 2` 
MATCH 3 
DesiredValue [121-128] `value 4` 
MATCH 4 
DesiredValue [159-166] `value 6` 
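Since the question is ultimately about replacing the values, note that the first branch captures [value] inside a lookahead, so the bracketed text is not part of the overall match and a plain preg_replace_callback on this pattern would not replace it. One possible workaround, sketched here under my own assumptions, is to collect byte offsets with PREG_OFFSET_CAPTURE and splice the replacements in. The helper name replace_bracket_values and its signature are hypothetical.

```php
<?php
// Hypothetical helper (not from the answer): replaces every [value] the pattern captures.
function replace_bracket_values(string $html, callable $replacer): string
{
    // The answer's pattern; (?J) permits the duplicate group name, s and x as recommended.
    $pattern = <<<'REGEX'
~(?J)
<\w+\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\[(?<DesiredValue>[^\]]*)\])
|
<\w+\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>
(?:(?!<\/div>)(?!\[).)*\[(?<DesiredValue>[^\]]*)\]
~sx
REGEX;

    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    // Work from the last match to the first so earlier byte offsets stay valid.
    foreach (array_reverse($matches) as $m) {
        // Group 2 is set when the second branch matched; otherwise group 1 holds the value.
        $group = (isset($m[2]) && $m[2][1] >= 0) ? $m[2] : $m[1];
        [$value, $offset] = $group;
        // The captured text excludes the brackets, so widen the span by one byte on each side.
        $html = substr_replace($html, $replacer($value), $offset - 1, strlen($value) + 2);
    }

    return $html;
}

$input = '<div [a] ></div><div>[b]</div><input attr="x[c]"/>';
$output = replace_bracket_values($input, 'strtoupper');
echo $output, PHP_EOL; // <div A ></div><div>B</div><input attr="x[c]"/>
```

This only replaces what the pattern captures, so it inherits the pattern's limits, for example at most one bracketed value per opening tag.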

Explanation

NODE      EXPLANATION 
---------------------------------------------------------------------- 
    <div      '<div' 
---------------------------------------------------------------------- 
    \s      whitespace (\n, \r, \t, \f, and " ") 
---------------------------------------------------------------------- 
    (?=      look ahead to see if there is: 
---------------------------------------------------------------------- 
    (?:      group, but do not capture (0 or more 
          times (matching the least amount 
          possible)): 
---------------------------------------------------------------------- 
     [^>=]     any character except: '>', '=' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
     ='      '=\'' 
---------------------------------------------------------------------- 
     [^']*     any character except: ''' (0 or more 
           times (matching the most amount 
           possible)) 
---------------------------------------------------------------------- 
     '      '\'' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
     ="      '="' 
---------------------------------------------------------------------- 
     [^"]*     any character except: '"' (0 or more 
           times (matching the most amount 
           possible)) 
---------------------------------------------------------------------- 
     "      '"' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
     =      '=' 
---------------------------------------------------------------------- 
     [^'"]     any character except: ''', '"' 
---------------------------------------------------------------------- 
     [^\s>]*     any character except: whitespace (\n, 
           \r, \t, \f, and " "), '>' (0 or more 
           times (matching the most amount 
           possible)) 
---------------------------------------------------------------------- 
    )*?      end of grouping 
---------------------------------------------------------------------- 
    \[      '[' 
---------------------------------------------------------------------- 
    (      group and capture to \1: 
---------------------------------------------------------------------- 
     [^\]]*     any character except: '\]' (0 or more 
           times (matching the most amount 
           possible)) 
---------------------------------------------------------------------- 
    )      end of \1 
---------------------------------------------------------------------- 
    \]      ']' 
---------------------------------------------------------------------- 
)      end of look-ahead 
---------------------------------------------------------------------- 
|      OR 
---------------------------------------------------------------------- 
    <div      '<div' 
---------------------------------------------------------------------- 
    \s?      whitespace (\n, \r, \t, \f, and " ") 
          (optional (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
    (?:      group, but do not capture (0 or more times 
          (matching the most amount possible)): 
---------------------------------------------------------------------- 
    [^>=]     any character except: '>', '=' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
    ='      '=\'' 
---------------------------------------------------------------------- 
    [^']*     any character except: ''' (0 or more 
          times (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
    '      '\'' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
    ="      '="' 
---------------------------------------------------------------------- 
    [^"]*     any character except: '"' (0 or more 
          times (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
    "      '"' 
---------------------------------------------------------------------- 
    |      OR 
---------------------------------------------------------------------- 
    =      '=' 
---------------------------------------------------------------------- 
    [^'"]     any character except: ''', '"' 
---------------------------------------------------------------------- 
    [^\s>]*     any character except: whitespace (\n, 
          \r, \t, \f, and " "), '>' (0 or more 
          times (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
)*      end of grouping 
---------------------------------------------------------------------- 
    >      '>' 
---------------------------------------------------------------------- 
    (?:      group, but do not capture (0 or more times 
          (matching the most amount possible)): 
---------------------------------------------------------------------- 
    (?!      look ahead to see if there is not: 
---------------------------------------------------------------------- 
     <      '<' 
---------------------------------------------------------------------- 
     \/      '/' 
---------------------------------------------------------------------- 
     div>      'div>' 
---------------------------------------------------------------------- 
    )      end of look-ahead 
---------------------------------------------------------------------- 
    (?!      look ahead to see if there is not: 
---------------------------------------------------------------------- 
     \[      '[' 
---------------------------------------------------------------------- 
    )      end of look-ahead 
---------------------------------------------------------------------- 
    .      any character 
---------------------------------------------------------------------- 
)*      end of grouping 
---------------------------------------------------------------------- 
    \[      '[' 
---------------------------------------------------------------------- 
    (      group and capture to \2: 
---------------------------------------------------------------------- 
    [^\]]*     any character except: '\]' (0 or more 
          times (matching the most amount 
          possible)) 
---------------------------------------------------------------------- 
)      end of \2 
---------------------------------------------------------------------- 
    \]      ']' 

Incredible answer, I appreciate the time and effort put into it. I still haven't solved it completely, but this should help a lot. – TarranJones


Let me know what this answer is missing, or where I can help. –