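Foreword
It looks like you are trying to parse HTML with a regular expression. I feel compelled to point out that using regular expressions to parse HTML is inadvisable because of all the obscure edge cases that can crop up, but it seems you have some control over the HTML, so you should be able to avoid many of the edge cases the regex police cry about.
Description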
<\w+\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\[(?<DesiredValue>[^\]]*)\])
|
<\w+\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>
(?:(?!<\/div>)(?!\[).)*\[(?<DesiredValue>[^\]]*)\]
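This regular expression will do the following:
- captures the substring inside square brackets, as in [some value]
- matches when [value] sits in the attribute area of a tag
- matches when [value] sits in the text inside a tag
- requires that a substring in the attribute area is not nested inside another attribute's value, as in <input attrib=" [value] ">
- the captured substring does not include the wrapping square brackets
- allows any tag name; replace \w+ with the desired tag name to restrict it
- allows value to be any string
- avoids difficult edge cases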
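Note: this expression works best with the following flags:
- global
- dot matches new line
- ignore whitespace in the expression
- allow duplicate named capture groups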
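As a rough sketch of how those flags map onto a concrete engine, here is a port to Python's re module. Python is my assumption here, not something the question uses. Two adaptations are needed: Python spells named groups (?P<name>...), and it does not allow the same group name twice, so the two DesiredValue groups are renamed to the hypothetical attr_value and text_value and merged afterwards. re.finditer stands in for the global flag, re.DOTALL for "dot matches new line", and re.VERBOSE for "ignore whitespace".

```python
import re

# Port of the expression above. attr_value / text_value are hypothetical names
# replacing the duplicated DesiredValue group, which Python's re rejects.
PATTERN = re.compile(
    r"""
    <\w+\s(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\[(?P<attr_value>[^\]]*)\])
    |
    <\w+\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>
    (?:(?!<\/div>)(?!\[).)*\[(?P<text_value>[^\]]*)\]
    """,
    re.VERBOSE | re.DOTALL,
)

SAMPLE = """<div [value 1] ></div>
<div>[value 2]</div>
but not find a match in this example
<div attr="attribute[value 3]"/>
<img [value 4]>
<a href="http://[value 5]">[value 6]</a>
"""


def extract_values(html):
    """Return every bracketed value the pattern accepts, in document order."""
    values = []
    for match in PATTERN.finditer(html):  # finditer plays the role of "global"
        value = match.group("attr_value")
        if value is None:  # the second alternative matched instead
            value = match.group("text_value")
        values.append(value)
    return values


print(extract_values(SAMPLE))  # ['value 1', 'value 2', 'value 4', 'value 6']
```

In an engine that accepts duplicate group names, the merge step is unnecessary and the single DesiredValue group can be read directly.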
Example
Live Demo
https://regex101.com/r/tT0bN5/1
Sample text
<div [value 1] ></div>
<div>[value 2]</div>
but not find a match in this example
<div attr="attribute[value 3]"/>
<img [value 4]>
<a href="http://[value 5]">[value 6]</a>
Sample matches
MATCH 1
DesiredValue [6-13] `value 1`
MATCH 2
DesiredValue [29-36] `value 2`
MATCH 3
DesiredValue [121-128] `value 4`
MATCH 4
DesiredValue [159-166] `value 6`
Explanation
NODE                     EXPLANATION
----------------------------------------------------------------------
  <div                     '<div'
----------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the least amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ='                       '=\''
----------------------------------------------------------------------
      [^']*                    any character except: ''' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      '                        '\''
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      ="                       '="'
----------------------------------------------------------------------
      [^"]*                    any character except: '"' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
      "                        '"'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
      [^\s>]*                  any character except: whitespace (\n,
                               \r, \t, \f, and " "), '>' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )*?                      end of grouping
----------------------------------------------------------------------
    \[                       '['
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
      [^\]]*                   any character except: '\]' (0 or more
                               times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
    \]                       ']'
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  <div                     '<div'
----------------------------------------------------------------------
  \s?                      whitespace (\n, \r, \t, \f, and " ")
                           (optional (matching the most amount
                           possible))
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ='                       '=\''
----------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    '                        '\''
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    ="                       '="'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    =                        '='
----------------------------------------------------------------------
    [^'"]                    any character except: ''', '"'
----------------------------------------------------------------------
    [^\s>]*                  any character except: whitespace (\n,
                             \r, \t, \f, and " "), '>' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  >                        '>'
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
      <                        '<'
----------------------------------------------------------------------
      \/                       '/'
----------------------------------------------------------------------
      div>                     'div>'
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
      \[                       '['
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    .                        any character
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
  \[                       '['
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [^\]]*                   any character except: '\]' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  \]                       ']'
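The guarded loop near the end of the explanation, (?:(?!<\/div>)(?!\[).)*, is easier to see in isolation. Below is a minimal sketch in Python (my choice of language, not the question's) using a cut-down, hypothetical pattern with a literal <div> opener. It compares the guarded loop with a plain lazy .*? to show that the guard refuses to walk past the closing tag.

```python
import re

# Hypothetical cut-down patterns, for illustration only.
# GUARDED keeps the two negative look-aheads from the full expression.
GUARDED = re.compile(r"<div>(?:(?!</div>)(?!\[).)*\[([^\]]*)\]", re.DOTALL)
# UNGUARDED replaces the loop with a plain lazy dot.
UNGUARDED = re.compile(r"<div>.*?\[([^\]]*)\]", re.DOTALL)


def first_value(pattern, html):
    """Return the first captured bracket value, or None if nothing matches."""
    match = pattern.search(html)
    return match.group(1) if match else None


inside = "<div>some text [x]</div>"
outside = "<div>no brackets</div><p>[y]</p>"

print(first_value(GUARDED, inside))     # x
print(first_value(GUARDED, outside))    # None: the loop stops at </div>
print(first_value(UNGUARDED, outside))  # y: the lazy dot runs past </div>
```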
Have you considered using a parser? – chris85
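It's a PHP string, not a Java string, so you don't need to escape everything. Use the x modifier (and a nowdoc string if you can) instead of concatenation. If you want to handle HTML (or XML), forget regex and use DOMDocument (and eventually DOMXPath). –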
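To illustrate the parser route suggested in the comments, here is a minimal sketch. It uses Python's standard html.parser as a stand-in for DOMDocument, which is an assumption on my part, and the class and function names are hypothetical. It only collects bracketed values found in text nodes, so values placed in the attribute area of a tag, which the regex above also accepts, are deliberately not covered.

```python
import re
from html.parser import HTMLParser

# A simple bracket matcher; the parser decides where it may be applied.
BRACKETED = re.compile(r"\[([^\]]*)\]")


class BracketTextCollector(HTMLParser):
    """Collects [bracketed] values that appear in text nodes only."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_data(self, data):
        # Called for text between tags; attribute values never arrive here.
        self.values.extend(BRACKETED.findall(data))


def values_in_text(html):
    collector = BracketTextCollector()
    collector.feed(html)
    collector.close()
    return collector.values


html = '<div>[value 2]</div>\n<a href="http://[value 5]">[value 6]</a>'
print(values_in_text(html))  # ['value 2', 'value 6']
```

Because the parser separates text from attributes, the bracket pattern itself can stay trivial.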
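Another thing: the closing square bracket is not a special character, so you don't need to escape it. An opening square bracket inside a character class is nothing special either, so you can write '[^[]' instead of '[^\\[]'. *(You can even write '[^]]' and '[]]', because in the first position the closing square bracket is treated as a literal character.)* –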
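A quick way to check the escaping point is to run both spellings side by side. The comment is about PHP; the sketch below uses Python's re module instead, on the assumption (true as far as I know) that it treats a closing bracket in the first position of a character class the same way.

```python
import re

text = "a [one] b [two]"

# Fully escaped form, as used in the expression above.
escaped = re.findall(r"\[([^\]]*)\]", text)

# Same pattern with the closing brackets left unescaped: "]" is literal when
# it is the first character of a class, and literal outside a class.
unescaped = re.findall(r"\[([^]]*)]", text)

print(escaped)    # ['one', 'two']
print(unescaped)  # ['one', 'two']
```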