preg_match_all: get text inside quotes, except in HTML tags

I have recently been using a pattern to replace straight double quotes with pairs of opening/closing (curly) double quotes.

$string = preg_replace('/(\")([^\"]+)(\")/','“$2”',$string);

It works fine when $string is a sentence, or even a paragraph.

But...

My function can also be called on a chunk of HTML code, and then it no longer works as expected:

$string = preg_replace('/(\")([^\"]+)(\")/','“$2”','<a href="page.html">Something "with" quotes</a>');

returns

<a href=“page.html”>Something “with” quotes</a>

And that's a problem...

So I thought I could do it in two passes: extract the text within tags, then replace the quotes.

I tried this

$pattern='/<[^>]+>(.*)<\/[^>]+>/';

And it works if, for example, the string is

$string='<a href="page.html">Something "with" quotes</a>';

But it doesn't work with strings like:

$string='Something "with" quotes <a href="page.html">Something "with" quotes</a>';

Any ideas?

Bertrand

Source

2013-09-25 Bertrand Fourrier

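[The pony, HE COMES](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) –

@Kolink I knew this would come up. That's why I would suggest using simplexml and applying the replacement only to the text, not to the attributes. – Christoph

The strings I have to "clean" are, in 90% of cases, values of text fields, and in some cases they can contain "code" inside. That's why parsing isn't appropriate. –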

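The usual best answer, I guess... As has already been pointed out, you should not parse HTML with regular expressions. You could take a look at PHP Simple HTML DOM Parser to extract the text, and then apply the regex you already have, which seems to work just fine.

This tutorial should point you in the right direction.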
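To make the parser idea concrete, here is a minimal sketch using PHP's built-in DOM extension rather than the library named above (that swap is my assumption, chosen only because it ships with PHP). The function name `curl_quotes_with_dom` is hypothetical. It runs the original regex on text nodes only, so attribute values are never seen by it.

```php
<?php
// Hypothetical helper (not from the thread): curl quotes in text nodes only.
function curl_quotes_with_dom(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    // The XML declaration is a common trick to make libxml read the input as UTF-8;
    // the wrapping <div> gives the fragment a single root we can find again.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Text nodes never include attribute values, so href="..." is left alone.
    foreach ($xpath->query('//text()') as $text) {
        $text->data = preg_replace('/"([^"]+)"/u', '“$1”', $text->data);
    }

    // Serialize only the children of the wrapper, not the wrapper itself.
    $root = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo curl_quotes_with_dom('Something "with" quotes <a href="page.html">Something "with" quotes</a>'), "\n";
```

Two caveats: a quoted phrase that spans several text nodes (for example one containing a <b> element) will not be paired, and depending on the PHP/libxml version the curly quotes may come back as entities instead of raw characters.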

Source

2013-09-25 14:27:17 npinti

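Thanks, but I use a parser when I need to parse some code. In this case, parsing the code won't help me replace certain characters with others. –

I'm sure this will end in a flame war, but this works: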

echo do_replace('<a href="page.html">Something "with" quotes</a>')."\n"; 
echo do_replace('Something "with" quotes <a href="page.html">Something "with" quotes</a>')."\n"; 

function do_replace($string){ 
    preg_match_all('/<([^"]*?|"[^"]*")*>/', $string, $matches); 
    $matches = array_flip($matches[0]); 

    $uuid = md5(mt_rand()); 
    while(strpos($string, $uuid) !== false) $uuid = md5(mt_rand()); 
    // if you want better (time) guarantees you could build a prefix tree and search it for a string not in it (would be O(n))

    // the trailing ':' keeps placeholder 1 from being a prefix of placeholder 10
    foreach($matches as $key => $value) 
     $matches[$key] = $uuid.$value.':'; 

    $string = str_replace(array_keys($matches), $matches, $string); 
    $string = preg_replace('/\"([^\"<]+)\"/','&ldquo;$1&rdquo;', $string); 
    return str_replace($matches, array_keys($matches), $string); 
}

Output (I replaced &ldquo; and &rdquo; with “ and ”):

<a href="page.html">Something “with” quotes</a> 
Something “with” quotes <a href="page.html">Something “with” quotes</a>

With a custom state machine you could even do it without the first replacement and without replacing back. In any case, I'd recommend using a parser.
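For what it's worth, such a state machine could look roughly like the sketch below. The function name `curl_quotes_state_machine` is made up for illustration. It walks the string once, tracks whether it is inside a tag (and inside a quoted attribute value), and only curls quotes it meets in text. Unlike the regex, it simply alternates opening and closing quotes, so a lone unpaired quote still becomes “. That is a simplification on my part.

```php
<?php
// Hypothetical single-pass sketch: no placeholders, no second replacement.
function curl_quotes_state_machine(string $s): string
{
    $out = '';
    $inTag = false;   // true between '<' and the matching '>'
    $inAttr = false;  // true inside a "..." attribute value within a tag
    $opening = true;  // whether the next text quote is an opening quote

    // Byte-wise scanning is safe for UTF-8 here: '<', '>' and '"' are ASCII
    // and never occur inside a multi-byte sequence.
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $c = $s[$i];
        if ($inTag) {
            if ($c === '"') {
                $inAttr = !$inAttr;       // enter or leave an attribute value
            } elseif ($c === '>' && !$inAttr) {
                $inTag = false;           // tag ends only outside attribute values
            }
            $out .= $c;                   // copy tag content untouched
        } elseif ($c === '<') {
            $inTag = true;
            $out .= $c;
        } elseif ($c === '"') {
            $out .= $opening ? '“' : '”'; // curl quotes found in text
            $opening = !$opening;
        } else {
            $out .= $c;
        }
    }
    return $out;
}

echo curl_quotes_state_machine('Something "with" quotes <a href="page.html">Something "with" quotes</a>'), "\n";
```

Like the regex versions, this treats every '<' as the start of a tag, so a literal "a < b" in the text would confuse it.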

Source

2013-09-25 15:27:29 Christoph

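I tried it and it works. Thanks. The thing is, 90% of the time what I get is just a plain string (the value of a text input), and using a parser to handle a string with few or no tags is actually more work. This regex isn't meant to be used on full HTML pages. –

Feel free to upvote and/or accept if it's correct. – Christoph

I finally found a way:

Extract the text, which can be inside or outside (before, after) any tags (if there are any)
Use a callback to go through the quotes found and replace them.

Code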

$string = preg_replace_callback('/[^<>]*(?!([^<]+)?>)/sim', function ($matches) { return preg_replace('/(\")([^\"]+)(\")/', '“$2”', $matches[0]); }, $string);

Source

2013-09-26 09:35:51

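Bertrand, resurrecting this question because it has a simple solution that lets you do the replacement in one go, without a callback. (Found your question while doing some research for a general question about how to exclude patterns in regex.)

Here's our simple regex: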

<[^>]*>(*SKIP)(*F)|"([^"]*)"

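The left side of the alternation matches complete <tags>, then deliberately fails. The right side matches double-quoted strings, and we know they are the right strings because they were not matched by the expression on the left.

This code shows how to use the regex (see the results at the bottom of the online demo):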

<?php 
$regex = '~<[^>]*>(*SKIP)(*F)|"([^"]*)"~'; 
$subject = 'Something "with" quotes <a href="page.html">Something "with" quotes</a>'; 
$replaced = preg_replace($regex,"“$1”",$subject); 
echo $replaced."<br />\n"; 
?>
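Since the question title mentions preg_match_all, here is a small sketch of the same regex used to collect the quoted text instead of replacing it. The wrapper function `quoted_outside_tags` and the sample string are only illustrative. Group 1 holds the text between the quotes, and attribute values never show up because the tag branch is skipped.

```php
<?php
// Hypothetical wrapper: return the contents of "..." found outside tags.
function quoted_outside_tags(string $subject): array
{
    // Left branch: match a whole tag, then (*SKIP)(*F) discards it.
    // Right branch: a double-quoted string, contents captured in group 1.
    $regex = '~<[^>]*>(*SKIP)(*F)|"([^"]*)"~';
    preg_match_all($regex, $subject, $matches);
    return $matches[1];
}

print_r(quoted_outside_tags('Say "hello" <a href="page.html" title="x">to the "world"</a>'));
// Array ( [0] => hello [1] => world )
```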

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...

Source

2014-05-21 06:32:22 zx81
