Why does a working regex fail with PHP's preg_match_all()?

I have the following regex in a PHP script:

$total_matches = preg_match_all('{ 

     <a\shref=" 
     (?<link>[^"]+) 
     "(?:(?!src=).)+src=" 
     (?<image>[^"]+) 
     (?:(?!designer-name">).)+designer-name"> 
     (?<brand>[^<]+) 
     (?:(?!title=).)+title=" 
     (?<title>((?!">).)+) 
     (?:(?!"price">).)+"price">\$ 
     (?<price>[\d.,]+) 

}xsi',$output,$all_matches,PREG_SET_ORDER);

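This regex seems to parse the following just fine, either through PHP or with the parser at regexr.com (using the same options: case-insensitive, extended, and treating line breaks as whitespace):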

<a href="http://www.mytheresa.com/us_en/dordogne-120-sandals.html" title= 
    "DORDOGNE 120 PLATEAU SANDALEN" class="product-image"> 
    <img class="image1st" src= "http://mytheresaimages.s3.amazonaws.com/catalog/product/cache/common/product_114114/small_ image/230x260/9df78eab33525d08d6e5fb8d27136e95/P/0/P00027794-DORDOGNE-120-PLATEAU-SANDALEN-STANDARD.jpg" 
    width="230" height="260" 
    alt= "Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" 
    title= "Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" /> 
<img class="image2nd" src= "http://mytheresaimages.s3.amazonaws.com/catalog/product/cache/common/product_114114/image/230x260/9df78eab33525d08d6e5fb8d27136e95/P/0/P00027794-DORDOGNE-120-PLATEAU-SANDALEN-DETAIL_2.jpg" 
width="230" height="260" alt= 
"Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" title= 
"Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" /> <span class= 
"availability"><strong>available sizes</strong><br /></span></a> 

<div style="margin-left: 2em" class="available-sizes"> 
<h2 class="designer-name">Christian Louboutin</h2> 

<div class="product-buttons"> 
    <div class="product-button"> 
    NEW ARRIVAL 
    </div> 

    <div class="clearer"></div> 
</div> 

<h3 class="product-name"><a href= 
"http://www.mytheresa.com/us_en/dordogne-120-sandals.html" title= 
"DORDOGNE 120 SANDALS">DORDOGNE 120 SANDALS</a></h3> 

<div class="price-box"> 
    <span class="regular-price" id="product-price-114114"><span class= 
    "price">$805.00</span></span> 
</div>

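If I try to parse several of these matches in a row, it also works fine. But when I try to parse the full web page that these matches come from (I have permission to parse this)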

http://www.mytheresa.com/us_en/new-arrivals/what-s-new-this-week-1.html?limit=12

the regex fails (I actually get a 500 error). I have tried increasing the backtrack limit using

ini_set('pcre.backtrack_limit',100000000); 
ini_set('pcre.recursion_limit',100000000);

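but this does not fix the problem. I would like to know what I am doing wrong that makes the regex fail under PHP when it appears to be valid and matches the code on the page in question. Fiddling with it suggests that the negative lookaheads (together with the length of the page) are causing the problem, but I am not sure how I messed them up. I am running PHP 5.2.17.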
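One way to narrow this down is to ask PCRE what went wrong after the call. The sketch below is not from the original post: match_all_checked() is a hypothetical helper that wraps preg_match_all() and reports preg_last_error(). It can only surface errors that PCRE itself reports. The PHP manual warns that a very high pcre.recursion_limit can exhaust the process stack and crash PHP, and a crash of that kind would show up as a 500 error rather than as an error code here.

```php
<?php
// Hypothetical helper, not from the original post: run preg_match_all() and
// report preg_last_error() instead of silently returning no matches.
function match_all_checked($pattern, $subject)
{
    // Names for the error codes available on PHP 5.2.
    $names = array(
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
    );

    $matches = array();
    $count   = preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);
    $code    = preg_last_error();

    return array(
        'ok'      => ($count !== false && $code === PREG_NO_ERROR),
        'error'   => isset($names[$code]) ? $names[$code] : 'unknown error ' . $code,
        'matches' => $matches,
    );
}

// Small demonstration on the price fragment from the sample markup.
$result = match_all_checked(
    '{"price">\$(?<price>[\d.,]+)}xsi',
    '<span class="price">$805.00</span>'
);
echo $result['error'], ': ', count($result['matches']), " match(es)\n";
```

Running the full pattern through the same helper against the whole page should show whether PCRE hit the backtrack or recursion limit, assuming the process survives the call.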

Source

2011-08-10 jela

And do you have a license to use the content? – 2011-08-10 03:17:09

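Also check the 'PCRE_VERSION' constant. If it is reasonably outdated, try installing a newer 'libpcre'. The '(?:(?!..).)+' assertions can be expensive. Unless you want to rewrite the regex or break it up with preg_replace_callback, consider using an HTML toolkit such as phpQuery or QueryPath for the extraction (easier, and usually not significantly slower). – mario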

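@mario My PCRE_VERSION is 8.02 2010-03-19; I am not sure whether that counts as outdated (it is 4 versions behind). I think I may have to rework this regex. I am surprised the lookaheads are that expensive, but I think you may be right. If I cannot rewrite the regex, I will look into phpQuery and QueryPath. – jela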

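You've made a classic blunder! Don't use regex to parse HTML! It breaks regex! (It's right up there with "never get involved in a land war in Asia" and "never go in against a Sicilian when death is on the line".)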

You should use SimpleXML or DOMDocument to parse this:

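$dom = new DOMDocument(); 
// loadHTMLFile() parses a URL or path; loadHTML() expects a string of markup. 
// The @ hides warnings about malformed real-world HTML. 
@$dom->loadHTMLFile('http://www.mytheresa.com/us_en/new-arrivals/'. 
       'what-s-new-this-week-1.html?limit=12'); 

$path = new DOMXPath($dom); 
// this query is based on the link you provided, not your regex 
// (attributes need the @ prefix in XPath) 
$nodes = $path->query('//ul[@class="products-grid first odd"]/li'); 
foreach($nodes as $node) 
{ 
    // the first <a> inside the <li> is the anchor tag you're looking for initially. 
    $anchor = $node->getElementsByTagName('a')->item(0); 
    if ($anchor !== null) { 
        echo $anchor->getAttribute("href"), "\n"; 
    } 
    // reach the other elements inside the <li> the same way 
}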
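As a rough illustration of how the remaining fields could be read the same way, here is a sketch that runs XPath queries against a shortened copy of the markup quoted in the question. Several things in it are assumptions rather than facts from the thread: the surrounding ul/li wrapper, the placeholder image URL, and the choice of class names to query on. The real page may be structured differently.

```php
<?php
// Shortened copy of the quoted product markup. The <ul>/<li> wrapper and the
// image URL are assumptions made for this sketch.
$html = '<ul class="products-grid"><li>
<a href="http://www.mytheresa.com/us_en/dordogne-120-sandals.html" title="DORDOGNE 120 PLATEAU SANDALEN" class="product-image">
<img class="image1st" src="http://example.invalid/P00027794-STANDARD.jpg" alt="" /></a>
<h2 class="designer-name">Christian Louboutin</h2>
<h3 class="product-name"><a href="http://www.mytheresa.com/us_en/dordogne-120-sandals.html" title="DORDOGNE 120 SANDALS">DORDOGNE 120 SANDALS</a></h3>
<div class="price-box"><span class="regular-price" id="product-price-114114"><span class="price">$805.00</span></span></div>
</li></ul>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$products = array();
foreach ($xpath->query('//li') as $li) {
    // evaluate() with string(...) returns a plain PHP string relative to $li.
    $products[] = array(
        'link'  => $xpath->evaluate('string(.//a[@class="product-image"]/@href)', $li),
        'image' => $xpath->evaluate('string(.//img[@class="image1st"]/@src)', $li),
        'brand' => trim($xpath->evaluate('string(.//h2[@class="designer-name"])', $li)),
        'title' => $xpath->evaluate('string(.//h3[@class="product-name"]/a/@title)', $li),
        'price' => ltrim(trim($xpath->evaluate('string(.//span[@class="price"])', $li)), '$'),
    );
}
print_r($products);
```

Each field is looked up independently, so a missing element yields an empty string for that field instead of making the whole product fail to match.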

Source

2011-08-10 03:49:51 cwallenpoole

We need a new "Inconceivable" badge! – Phil

Come on, it is *certainly conceivable*. Sometimes it's your only option, if you have huge legacy FrontPage cruft to put up with. – ZJR

@ZJR You missed the chance to say: "That word, I do not think it means what you think it means." – cwallenpoole

Those negative lookaheads are clever, but then... slightly too clever.

I agree: you're using too many of them not to run into side effects.

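I can't see right now which one is running wild, but putting a repetition on a . like that is always bound to give you greediness problems.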

This one, for example, is certainly unnecessary:

title=" 
(?<title>((?!">).)+)

You could write it as

title="(?<title>.*?)">

...and there are more like it. I would change them.
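To make the comparison concrete, here is a small sketch that runs both forms of the title capture against the same input. The sample string is hypothetical (a title containing an unescaped double quote), and both patterns are cut down to the title part only.

```php
<?php
// Hypothetical input: a title attribute containing an unescaped double quote.
$html = '<a href="/p/1" title="5" heel sandal">5 inch heel sandal</a>';

// Original style: the lookahead is re-checked before every single character.
$tempered = '{title="(?<title>(?:(?!">).)+)}s';
// Suggested style: a lazy quantifier followed by the literal terminator.
$lazy     = '{title="(?<title>.*?)">}s';

preg_match($tempered, $html, $a);
preg_match($lazy, $html, $b);

echo $a['title'], "\n";
echo $b['title'], "\n";
```

Both variants stop at the first "> sequence, so the stray quote inside the title does not end the capture early in either case. The one behavioural difference is an empty title: the lazy form matches it, while the + in the original form requires at least one character.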

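In general, debugging a regex means rewriting it, and then rewriting it again and again with different constructs, until you find the right balance between functionality and maintainability.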

Another thing: I would use <a\s+ instead of <a\s, just to be slightly more flexible.
Staying slightly flexible pays off.

Also: title= could be written as title\s*=\s*
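The whitespace point can be checked against the attribute layout in the quoted HTML, where a space and a line break follow the equals sign. This is a minimal sketch; the two patterns are reduced to the title attribute and are not the full expression from the question.

```php
<?php
// Attribute layout as in the quoted markup: whitespace and a line break after "=".
$html = "<a href=\"http://www.mytheresa.com/us_en/dordogne-120-sandals.html\" title= \n"
      . "    \"DORDOGNE 120 PLATEAU SANDALEN\" class=\"product-image\">";

$strict   = '{title="(?<title>.*?)"}s';         // requires title=" with nothing in between
$flexible = '{title\s*=\s*"(?<title>.*?)"}s';   // tolerates whitespace around "="

$strictHit   = preg_match($strict, $html, $m1);
$flexibleHit = preg_match($flexible, $html, $m2);

echo 'strict: ', $strictHit, ', flexible: ', $flexibleHit, "\n";
if ($flexibleHit) {
    echo $m2['title'], "\n";
}
```

The strict pattern finds nothing in this input, while the flexible one captures the title, which is the kind of silent mismatch the extra \s* is meant to avoid.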

Source

2011-08-10 04:34:10 ZJR

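The title is an interesting case, because technically the lookahead there is redundant. The problem is that whoever writes the HTML sometimes fails to encode double quotes inside the title properly, which means I cannot trust a double quote on its own to mark the end of the title. In any case, I will start replacing the negative lookaheads with lazy stars and see what happens. You're definitely right about adding the whitespace. – jela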
