How do I remove everything from HTML except a specific tag?

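I want to parse an HTML string and extract only <form> ... </form>. I don't need anything else, so it can all be removed. How do I remove everything from the HTML except that one tag?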

At the moment I have some helpers that use replaceAll to delete a specific tag together with its content, for example:

/** remove form */ 
    String newString = string.replaceAll("(?s)<form.*?</form>", "");

The pattern (?s)<form.*?</form> removes the form element. But I need the reverse: remove everything except the form.

How can I solve this?

See my Gskinner example.


2013-07-10 Maxim Shoustin

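In general, it is a good idea to parse HTML with a DOM parser. – Leri

Yes, but sometimes a page contains errors, such as a missing closing tag, and in that case this approach is a bad idea –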

In that case you can try: 'String newString = string.replaceAll(".*?(<form.*?)", "$1");' – Leri
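To make the missing-closing-tag concern from the comments concrete, here is a minimal sketch that keeps only the forms and treats end of input as the end of a form that is never closed. The class name FormExtractor and the method extractForms are made up for illustration, and the pattern is just one possible choice. It does not handle nested forms or the text "</form>" appearing inside attribute values or scripts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FormExtractor {

    // (?i) ignores case, (?s) lets '.' match line breaks.
    // A form ends at the first </form>, or at end of input (\z) if it is never closed.
    private static final Pattern FORM =
            Pattern.compile("(?is)<form\\b.*?(?:</form\\s*>|\\z)");

    /** Returns every form found in html, in document order (hypothetical helper). */
    public static List<String> extractForms(String html) {
        List<String> forms = new ArrayList<>();
        Matcher m = FORM.matcher(html);
        while (m.find()) {
            forms.add(m.group()); // group() is the whole match, opening tag included
        }
        return forms;
    }

    public static void main(String[] args) {
        // Well-formed input: the form is returned with its attributes intact.
        System.out.println(extractForms("<p>a</p><form id=\"f\">x</form><p>b</p>"));
        // Broken input: the form runs to the end of the string.
        System.out.println(extractForms("<p>a</p><form>never closed"));
    }
}
```

Collecting matches with find() instead of deleting the complement with replaceAll avoids having to describe "everything that is not a form" as a regex at all.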

Try the following code.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Client {

    // (?s) lets '.' match line breaks; [^>]* allows attributes on the opening tag.
    private static final String PATTERN = "(?s)<form\\b[^>]*>.*?</form>";
    private static final Pattern REGEX = Pattern.compile(PATTERN);

    private static final boolean ONLY_TAG = true;

    public static void main(String[] args) {
        String text = "Hello <form><span><table>Hello Rais</table></span></form> end";
        System.out.println(getValues(text, ONLY_TAG));  // keep only the form
        System.out.println(getValues(text, !ONLY_TAG)); // remove the form
    }

    /** Returns the first form if flag is true (null if none), otherwise the text with all forms removed. */
    private static String getValues(final String text, boolean flag) {
        final Matcher matcher = REGEX.matcher(text);
        if (flag) {
            // group() is the whole match, so the original opening tag is preserved.
            return matcher.find() ? matcher.group() : null;
        }
        return matcher.replaceAll("");
    }
}

You will get the following output (the second line has two spaces, one from each side of the removed form):

<form><span><table>Hello Rais</table></span></form>
Hello  end


2013-07-10 11:51:38

-1

The following code should point you in the right direction:

String str = "<html><form>test form</form></html>"; 
String newString = str.replaceAll("[^<form</form>]+|((?s)<form.*?</form>)", "$1"); 
System.out.println(newString);


2013-07-10 11:35:06 Mubin

You should read about [negated character classes](http://www.regular-expressions.info/charclass.html) '[^...]'; they don't behave the way you think they do. – sp00m
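A small sketch of what that comment is pointing at: '[^<form</form>]' is a set of single characters (anything other than <, f, o, r, m, / and >), not "any text that is not a form". Running the answer's pattern through a helper shows what is actually deleted. The class name NegatedClassDemo and the method strip are made up for this illustration.

```java
public class NegatedClassDemo {

    /** Applies the pattern from the answer above unchanged (hypothetical wrapper). */
    public static String strip(String str) {
        // First alternative: one or more characters that are NOT any of < f o r m / >.
        // Second alternative: a whole form, captured as group 1 and put back by "$1".
        return str.replaceAll("[^<form</form>]+|((?s)<form.*?</form>)", "$1");
    }

    public static void main(String[] args) {
        // Outside the form only 'h', 't' and 'l' are removed;
        // '<', 'm', '/' and '>' belong to the excluded set, so they stay.
        System.out.println(strip("<html><form>test form</form></html>"));
        // prints: <m><form>test form</form></m>
    }
}
```

So the surrounding markup is only partly stripped, which is why the pattern does not isolate the form.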
