2013-03-19 17 views
1

PHP regex and preg_replace question

I am going through someone else's old code and am having some trouble understanding it.

He has:

explode(' ', strtolower(preg_replace('/[^a-z0-9-]+/i', ' ', preg_replace('/\&#?[a-z0-9]{2,4}\;/', ' ', preg_replace('/<[^>]+>/', ' ', $texts))))); 

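I think the first regex excludes a-z0-9, but I don't know what the second regex does. The third matches anything inside '< >' except '>'.

The result is an array of every word in the $texts variable, but I just don't understand how the code produces it. I understand what preg_replace and the other functions do; I just don't know how the process as a whole works.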

+1

That many nested preg_replace calls is just going to lead to confusion. – Scuzzy 2013-03-19 23:30:51

+1

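Break it apart into three separate statements, using temporary variables, in the order they are processed. Then it becomes easier to follow. – mario 2013-03-19 23:31:15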

Answers

4

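The expression /[^a-z0-9-]+/i matches (and the call then replaces with a space) any run of characters other than a-z, 0-9 and the hyphen. The ^ in [^...] negates the character set it contains.

  • [^a-z0-9-] is any character that is not a letter, a digit or a hyphen
  • + means one or more of the preceding
  • /i makes the match case-insensitive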
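To see the effect of this pattern in isolation, here is a minimal sketch. The sample string is made up for illustration and does not come from the original code.

```php
<?php
// Hypothetical sample input, chosen to include punctuation, a hyphen and an underscore.
$sample = "Hello, World! It's a well-known fact_2013";

// Each run of characters outside a-z, 0-9 and "-" (case-insensitive) becomes one space.
$cleaned = preg_replace('/[^a-z0-9-]+/i', ' ', $sample);

echo $cleaned, PHP_EOL; // Hello World It s a well-known fact 2013
```

Note that the hyphen in "well-known" survives, while the apostrophe and the underscore are both turned into word breaks.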

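The expression /\&#?[a-z0-9]{2,4}\;/ matches an & optionally followed by #, then two to four letters or digits, and finally a ;. This will match HTML entities like &nbsp; or &#39;. Because the pattern has no /i flag it only matches lowercase names, and entities longer than four characters, such as &hellip;, are not matched.

  • &#? matches either & or &#, because the ? makes the preceding # optional. The & doesn't actually need to be escaped.
  • [a-z0-9]{2,4} matches between two and four alphanumeric characters
  • ; is a literal semicolon. It doesn't actually need to be escaped either.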
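A quick way to check which entities this pattern catches is to test a few candidates one by one. This is only a sketch; the list of candidates is an assumption picked to show both hits and misses.

```php
<?php
// Same pattern as in the question, without the unnecessary backslashes.
$pattern = '/&#?[a-z0-9]{2,4};/';

// Hypothetical test candidates (not taken from the original post).
$candidates = ['&amp;', '&nbsp;', '&#39;', '&#8217;', '&hellip;', '&AMP;', '&#x2019;'];

$results = [];
foreach ($candidates as $entity) {
    // preg_match() returns 1 when the pattern is found, 0 otherwise.
    $results[$entity] = preg_match($pattern, $entity) === 1;
    printf("%-10s %s\n", $entity, $results[$entity] ? 'matches' : 'no match');
}
```

The first four candidates match. `&hellip;` and `&#x2019;` have more than four characters between the prefix and the semicolon, and `&AMP;` is uppercase, so those three are left for the final pattern to break apart.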

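As you partly suspected, the last one replaces any tag such as <tagname>, <tagname attr='value'> or </tagname> with a single space. Note that it matches the whole tag, not just the contents between < and >.

  • < is a literal character
  • [^>]+ is every character (at least one) up to, but not including, the next >
  • > is a literal character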
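The following sketch shows the tag pattern on a small, made-up HTML fragment, plus one case where it matches something that is not a tag at all. Both input strings are assumptions for illustration.

```php
<?php
$pattern = '/<[^>]+>/';

// Hypothetical HTML fragment: every tag, including its attributes, becomes one space.
$html = "<p class='intro'>Hello <b>world</b></p>";
$withoutTags = preg_replace($pattern, ' ', $html);
var_export($withoutTags); // ' Hello  world  '
echo PHP_EOL;

// The pattern knows nothing about HTML: any "<" ... ">" pair counts as a "tag".
$comparison = "1 < 2 and 3 > 2";
preg_match($pattern, $comparison, $m);
var_export($m[0]); // '< 2 and 3 >'
echo PHP_EOL;
```

The second case is one reason a purpose-built function such as strip_tags() can be preferable to a regex, although strip_tags() does not insert a space where a tag was removed.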

I would really recommend rewriting this as three separate calls to preg_replace() rather than nesting them.

// Strip tags.
// This would be better done with strip_tags()!
$texts = preg_replace('/<[^>]+>/', ' ', $texts);
// Remove HTML entities.
$texts = preg_replace('/&#?[a-z0-9]{2,4};/', ' ', $texts);
// Replace remaining runs of characters other than letters, digits and hyphens.
$texts = preg_replace('/[^a-z0-9-]+/i', ' ', $texts);
// Lowercase and split on spaces, as the original one-liner does.
$array = explode(' ', strtolower($texts));
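One side effect worth knowing about: when the cleaned text begins or ends with a space, explode(' ', ...) returns empty strings at the start or end of the array. The sketch below wraps the three steps in a function and splits with preg_split() and PREG_SPLIT_NO_EMPTY instead. The function name extract_words and the sample input are hypothetical, and dropping the empty strings is a deliberate change from the original behaviour.

```php
<?php
// Hypothetical helper; the original code has no such function.
function extract_words(string $texts): array
{
    $texts = preg_replace('/<[^>]+>/', ' ', $texts);           // tags
    $texts = preg_replace('/&#?[a-z0-9]{2,4};/', ' ', $texts); // short lowercase entities
    $texts = preg_replace('/[^a-z0-9-]+/i', ' ', $texts);      // everything else

    // PREG_SPLIT_NO_EMPTY drops the empty strings explode(' ', ...) would return
    // for a leading or trailing space.
    return preg_split('/ /', strtolower($texts), -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical sample input.
$words = extract_words("<p>Fish &amp; Chips, half-price!</p>");
print_r($words); // fish, chips, half-price
```

With plain explode(' ', ...), the same input would produce five elements, the first and last being empty strings.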
+0

...matches an '&', which can optionally be followed by a '#'? – 2013-03-19 23:32:43

+0

@JanTuroň It has been clarified. – 2013-03-19 23:33:16

2

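This code looks like it...

  1. strips HTML/XML tags (anything between < and >)
  2. then strips anything that starts with & or &#, followed by 2-4 alphanumeric characters and a ;
  3. then strips anything that is not alphanumeric or a dash

In order of nesting: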

/<[^>]+>/ 

Match the character “<” literally «<» 
Match any character that is NOT a “>” «[^>]+» 
    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+» 
Match the character “>” literally «>» 


/\&#?[a-z0-9]{2,4}\;/ 

Match the character “&” literally «\&» 
Match the character “#” literally «#?» 
    Between zero and one times, as many times as possible, giving back as needed (greedy) «?» 
Match a single character present in the list below «[a-z0-9]{2,4}» 
    Between 2 and 4 times, as many times as possible, giving back as needed (greedy) «{2,4}» 
    A character in the range between “a” and “z” «a-z» 
    A character in the range between “0” and “9” «0-9» 
Match the character “;” literally «\;» 


/[^a-z0-9-]+/i 

Options: case insensitive 

Match a single character NOT present in the list below «[^a-z0-9-]+» 
    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+» 
    A character in the range between “a” and “z” «a-z» 
    A character in the range between “0” and “9” «0-9» 
    The character “-” «-»
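
The three patterns above are listed in the order the nested calls actually run, innermost first. As a rough trace under an assumed input (the sample string and the $step variable names are made up for illustration), the intermediate strings look like this:

```php
<?php
// Hypothetical sample input.
$texts = "<h1>PHP &amp; Regex</h1>";

// Innermost call first: tags become spaces.
$step1 = preg_replace('/<[^>]+>/', ' ', $texts);
var_export($step1); // ' PHP &amp; Regex '
echo PHP_EOL;

// Next: short lowercase entities become spaces.
$step2 = preg_replace('/\&#?[a-z0-9]{2,4}\;/', ' ', $step1);
var_export($step2); // ' PHP   Regex '
echo PHP_EOL;

// Outermost: any remaining run outside a-z, 0-9 and "-" collapses to one space.
$step3 = preg_replace('/[^a-z0-9-]+/i', ' ', $step2);
var_export($step3); // ' PHP Regex '
echo PHP_EOL;
```

Because spaces are themselves outside the allowed set, the last step also collapses the runs of spaces left behind by the first two.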