Why does this regex not give the expected output?

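I have a string containing some values, shown below. I want to replace the HTML img tag that contains a specific customerId with some new text. I wrote a small Java program for this, but it does not give me the expected output. Here is the program.

My input string: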

String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/>" + "someText<img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";

The regex is:

String regex = "(?s)\\<img.*?customerId=3340.*?>";

The new text I want to put into the input string:

Edit start:

String newText = "<img src=\"getCustomerNew.do\">";

Edit end.

Now I do:

String outputText = inputText.replaceAll(regex, newText);

Output:

Starting here.. Replacing Text ..Ending here

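But the output I expected is:

Starting here.. <img src="getCustomers.do?custCode=2&customerId=3334&param1=123/>someTextReplacing Text ..Ending here

Note that in my expected output, only the img tag containing customerId=3340 is replaced with the replacement text. I don't understand why both img tags get replaced in the actual output.

Source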

2012-12-13 M Sach

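You are parsing HTML with a regex, which never works completely (this is a general limitation of regular expressions, not of your regex skills) –

You are using the wrong tool. Use an HTML parser – Anirudha

@Some1.Kill.The.DJ Can you help me with how to get the expected result using an HTML parser such as jsoup? –

As others have told you in the comments, HTML is not a regular language, so manipulating it with regular expressions is usually painful. Your best option is to use an HTML parser. I haven't used Jsoup before, but after a little googling it seems you need something like: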

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

public class MyJsoupExample {
    public static void main(String[] args) {
        String inputText = "<html><head></head><body><p><img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123\"/></p>"
            + "<p>someText <img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456\"/></p></body></html>";
        // Parse the HTML string into a DOM.
        Document doc = Jsoup.parse(inputText);
        // Select every img whose src attribute contains "customerId=3340".
        Elements myImgs = doc.select("img[src*=customerId=3340]");
        for (Element element : myImgs) {
            // Swap the whole img node for a plain text node
            // (older jsoup releases took a second, base URI argument here).
            element.replaceWith(new TextNode("my replaced text"));
        }
        System.out.println(doc.toString());
    }
}

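Basically, the code gets the list of img nodes whose src attribute contains the given string

Elements myImgs = doc.select("img[src*=customerId=3340]");

and then iterates over the list, replacing each of those nodes with some text.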

UPDATE

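If you don't want to replace the whole img node with text, but instead need to give its src attribute a new value, you can replace the body of the for loop with:

element.attr("src", "my new value");

or, if you want to change only part of the src value, you can do:

String srcValue = element.attr("src");
element.attr("src", srcValue.replace("getCustomers.do", "getCustomerNew.do"));

This is very similar to what I posted in this thread.

Source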

2012-12-13 19:52:33 Vicent

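Vicent, it works well, but I have one issue. If I use "<img src=\"getCustomerNew.do\"/>" instead of "my replaced text", jsoup puts the markup in as escaped text in place of the element. –

It looks like it is encoding characters such as < and ". How can I stop that? –

So you don't want to replace the whole img node, only the value of its src attribute? – Vicent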

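You have a "wildcard"/"any" pattern (.*) in there, which extends the match to the longest possible matching string, and the last fixed text in the pattern is a > character, so it matches up to the last > character in the input text, i.e. the very last one!

You should be able to fix this by changing the .* parts to something like [^>]+, so that the match cannot run past the first > character.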
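To make the suggestion concrete, here is a minimal sketch of the negated-character-class version, run against the input string from the question. The class name NegatedClassFix and the helper replaceCustomerImg are made up for illustration, and I used [^>]* rather than [^>]+ so that an empty stretch is also allowed. It assumes no attribute value contains a literal > character, which real HTML does not guarantee.

```java
import java.util.regex.Matcher;

class NegatedClassFix {
    // Input string copied from the question.
    static final String INPUT =
        "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/>"
        + "someText<img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";

    // Hypothetical helper: replaces only img tags that mention customerId=3340.
    // [^>]* cannot cross a '>', so a match can never span two tags.
    static String replaceCustomerImg(String text, String replacement) {
        String regex = "<img[^>]*customerId=3340[^>]*>";
        // quoteReplacement keeps '$' and '\' in the replacement literal.
        return text.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceCustomerImg(INPUT, "Replacing Text"));
    }
}
```

With this pattern the attempt that starts at the first <img fails, because that tag ends before customerId=3340 is seen, so only the second tag is replaced.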

Parsing HTML with regular expressions is bound to cause pain.

Source

2012-12-13 18:18:15

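You are right, but he is using '.*?' rather than '.*' – Anirudha

@Greg Can I get the expected output with the jsoup library? –

'.*?' really makes no difference compared with '.*' here. Zero or more matches of zero or more characters is zero or more characters, including any number of '>' characters. –

What happens is that your regex starts matching at the first img tag, then consumes everything (greedy or not) until it finds customerId=3340, and then keeps consuming everything until it finds a >.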
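A small sketch can show this behaviour with the regex and input string from the question. The class name LazyMatchDemo and the helper firstMatch are made up for illustration; only java.util.regex is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyMatchDemo {
    // Input string copied from the question.
    static final String INPUT =
        "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/>"
        + "someText<img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";

    // Hypothetical helper: returns the first match of regex in text, or null.
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Regex copied from the question (lazy quantifiers, DOTALL flag).
        String match = firstMatch("(?s)\\<img.*?customerId=3340.*?>", INPUT);
        // The lazy quantifier only limits how far the match extends to the right;
        // the match still begins at the leftmost <img that can lead to success.
        System.out.println(match);
        System.out.println("starts at first <img: "
            + (INPUT.indexOf(match) == INPUT.indexOf("<img")));
    }
}
```

The printed match covers both tags plus the someText between them, which is why replaceAll removes both.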

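If you want it to consume only the img with customerId=3340, think about what makes this tag different from the other tags the regex can match.

In this particular case, one possible solution is to use the lookbehind operator (which does not consume what it matches) to check what comes right before the tag. This regex will work: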

String regex = "(?<=</p>)<img src=\".*?customerId=3340.*?>";

Source

2012-12-15 15:47:35
