Replacing &amp; with & only in links in a partial HTML document

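I've tried several approaches (jsoup shown below) to convert &amp; to & only inside links. The difficulty I'm having suggests I'm going about this all wrong. I suspect I'll be facepalming once solutions are offered, but maybe good old regex is the best answer (since I only need the replacement in hrefs), unless the reader code can be modified?

The parsing libraries (I also tried NekoHTML) want to convert every & to &amp;, so I've had trouble using them even to get the real link href to pass to String's replace method.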

Input:

String toParse = "The <a href=\"http://example.com?key=val&amp;another_key=val.pdf&amp;action=edit&happy=good\">Link with an encoded ampersand (&amp;)</a> is challenging.";

Desired output:

The <a href=\"http://example.com?key=val&another_key=val.pdf&action=edit&happy=good\">Link with an encoded ampersand (&amp;)</a> is challenging.
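As a rough illustration of the regex route mentioned in the question, here is a minimal sketch that rewrites &amp; to & only inside double-quoted href values and leaves the rest of the markup alone. The class and method names (HrefAmpersandFixer, decodeAmpersandsInHrefs) are hypothetical, and the sketch assumes hrefs are always double-quoted; single-quoted or unquoted attributes would need a broader pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the original post.
final class HrefAmpersandFixer {
    // Group 1: the `href="` prefix; group 2: the attribute value up to the closing quote.
    // Assumption: href values are double-quoted.
    private static final Pattern HREF =
            Pattern.compile("(href\\s*=\\s*\")([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static String decodeAmpersandsInHrefs(String html) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Decode only within the attribute value.
            String fixed = m.group(2).replace("&amp;", "&");
            // quoteReplacement keeps any '$' or '\' in the URL from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + fixed + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String toParse = "The <a href=\"http://example.com?key=val&amp;another_key=val.pdf"
                + "&amp;action=edit&happy=good\">Link with an encoded ampersand (&amp;)</a> is challenging.";
        System.out.println(decodeAmpersandsInHrefs(toParse));
    }
}
```

Text outside the href, such as the (&amp;) in the link text, is copied through unchanged by appendReplacement and appendTail.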

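I ran into this while trying to read RSS feeds that render their <link>s with &amp; instead of &.

Update: I ended up using a regex to identify each link and then using replace to put a link with decoded &s in its place. Pattern.quote() turned out to be very handy, but I had to manually close and reopen the quoted section so that I could put a regex "or" condition around each ampersand: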

// Not shown in the post; presumably `link` is the decoded href taken from the DOM,
// `result` is a StringBuilder holding the raw markup, and `linkReplaced` is a boolean.
final String cleanLink = StringUtils.strip(link).replaceAll(" ", "%20").replaceAll("'", "%27");
String regex = Pattern.quote(link);
// end and re-start literal matching around my or condition
regex = regex.replaceAll("&", "\\\\E(&amp;|&)\\\\Q");
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(result);

while (matcher.find()) {
    int index = result.indexOf(matcher.group());
    while (index != -1) {
        // this replaces the links with &amp; with the same links with &
        // because cleanLink is from the DOM and has been properly decoded
        result.replace(index, index + matcher.group().length(), cleanLink);
        index += cleanLink.length();
        index = result.indexOf(matcher.group(), index);
        linkReplaced = true;
    }
}
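To make the Pattern.quote() trick easier to see in isolation, here is a small self-contained sketch. Pattern.quote() wraps the text in \Q...\E; at each ampersand the quote is closed with \E, an alternation is inserted, and the quote is reopened with \Q. The class and method names (QuotedLinkPattern, forLink) are hypothetical. This variant uses String.replace (a literal replacement) instead of replaceAll, which avoids the quadruple backslashes, and it assumes the link itself contains no literal \E sequence.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating the technique; not the code from the post.
final class QuotedLinkPattern {
    // Builds a pattern that matches the decoded link literally,
    // accepting either & or &amp; wherever the link has an ampersand.
    static Pattern forLink(String decodedLink) {
        // \Q...\E makes characters such as '?' and '.' match literally.
        String regex = Pattern.quote(decodedLink);
        // Close the quoted section, add the alternation, reopen the quoted section.
        regex = regex.replace("&", "\\E(?:&amp;|&)\\Q");
        return Pattern.compile(regex);
    }

    public static void main(String[] args) {
        Pattern p = forLink("http://example.com?key=val&action=edit");
        System.out.println(p.matcher("http://example.com?key=val&amp;action=edit").matches()); // true
        System.out.println(p.matcher("http://example.com?key=val&action=edit").matches());     // true
        System.out.println(p.matcher("http://exampleXcom?key=val&action=edit").matches());     // false
    }
}
```

The last call returns false because the dot stays quoted and therefore matches only a literal dot.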

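I'm not thrilled with this approach, but I would have had to handle too many conditions if I hadn't used the DOM tools to identify the links.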

Source

2015-06-24 eebbesen

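Having "&amp;" in a URL is actually the standard. Nobody writes their URLs like that, but there's nothing wrong with it as a URL, as such. – Stewart

Why do you want to replace '&amp;' only in 'href's? Why not everywhere? Also, can you show the whole document/file you are processing? – Roman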

At least on my machine, this link does not resolve correctly in Safari, Chrome, or Firefox: http://www.europarl.europa.eu/sides/getAllAnswers.do?reference=E-2015-006220&amp;language=EN, but this one is fine: http://www.europarl.europa.eu/sides/getAllAnswers.do?reference=E-2015-006220&language=EN. So for me, handling the ampersand correctly is necessary. – eebbesen
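One plausible reason the encoded form fails when used as a raw URL: if the query string is split on & in the usual way, the second parameter name comes out as amp;language rather than language. The sketch below shows that with a deliberately naive splitter; the class and method names (QueryKeys, parseQuery) are hypothetical, and how the actual server parses its query string is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, deliberately naive query-string splitter (no percent-decoding).
final class QueryKeys {
    static Map<String, String> parseQuery(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0) {
            return params; // no query string
        }
        for (String pair : url.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                params.put(pair, "");
            } else {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        String base = "http://www.europarl.europa.eu/sides/getAllAnswers.do?reference=E-2015-006220";
        System.out.println(parseQuery(base + "&amp;language=EN").keySet()); // [reference, amp;language]
        System.out.println(parseQuery(base + "&language=EN").keySet());     // [reference, language]
    }
}
```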

Take a look at StringEscapeUtils. Try using unescapeHtml() on the String.

Source

2015-06-24 02:49:52 bphilipnyc

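Thanks @bphilipnyc! Using that on 'doc.body()' converts the '(&amp;)' (not in the href) to '&'. And 'attribute.setValue(StringEscapeUtils.unescapeHtml(attribute.getValue()));' didn't do what I needed either: everything in the DOM object still gets coerced back to HTML. – eebbesen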
