Check whether HTML code represents visible text/images

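I have a string containing some HTML code. I want to know whether that HTML represents visible text or an image. I am solving this in Java with the regular expressions below (I know you cannot really parse HTML with regexes, but I thought they would be sufficient for my purposes).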

public static String regex_html_tags_1 = "<\\s*br\\s*[/]?>"; 
public static String regex_html_tags_2 = "<\\s*([a-zA-Z0-9]+)\\s*([^=/>]+\\s*=\\s*[^/>]+\\s*)*\\s*/>"; 
public static String regex_html_tags_3 = "<\\s*([a-zA-Z0-9]+)\\s*([^=>]+\\s*=\\s*[^>]+\\s*)*\\s*>\\s*</\\s*\\1\\s*>"; 

public static String[] HTMLWhiteSpaces = {"&nbsp;", "&#160;"};

The code that uses these regular expressions works fine for strings like

<h2></h2>

or similar. But a string like

<img src="someImage.png"></img>

is also considered empty.
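The behaviour can be reproduced with a small sketch. It assumes the regexes are used to strip "empty" elements and then check whether anything is left (the question does not show that part, so `isStripped` is a hypothetical helper). `REGEX_SKIP_IMG` is also hypothetical: one possible regex-only patch that uses a negative lookahead to leave `img` elements alone. It is not a general solution.

```java
class EmptyTagRegexDemo {
    // regex_html_tags_3 from the question, unchanged
    static final String REGEX_HTML_TAGS_3 =
        "<\\s*([a-zA-Z0-9]+)\\s*([^=>]+\\s*=\\s*[^>]+\\s*)*\\s*>\\s*</\\s*\\1\\s*>";

    // Hypothetical variant: the lookahead (?!img\b) rejects the tag name "img"
    // (lower case only; other visible-but-textless elements are not handled)
    static final String REGEX_SKIP_IMG =
        "<\\s*(?!img\\b)([a-zA-Z0-9]+)\\s*([^=>]+\\s*=\\s*[^>]+\\s*)*\\s*>\\s*</\\s*\\1\\s*>";

    // Assumed usage: remove every match and see whether only whitespace remains
    static boolean isStripped(String html, String regex) {
        return html.replaceAll(regex, "").trim().isEmpty();
    }

    public static void main(String[] args) {
        String heading = "<h2></h2>";
        String image = "<img src=\"someImage.png\"></img>";

        System.out.println(isStripped(heading, REGEX_HTML_TAGS_3)); // true
        System.out.println(isStripped(image, REGEX_HTML_TAGS_3));   // true, the problem described above
        System.out.println(isStripped(heading, REGEX_SKIP_IMG));    // true
        System.out.println(isStripped(image, REGEX_SKIP_IMG));      // false, the image survives
    }
}
```

The original pattern only looks at the shape of the markup (an opening tag, optional attributes, a matching closing tag), so it cannot tell an empty heading from an image whose content comes from its `src` attribute.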

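Does anyone have a better idea than regexes for finding out whether some HTML code actually represents human-readable text when it is interpreted by a browser? Or do you think my approach will eventually succeed?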

Thanks a lot.


2012-12-06 LaDude

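Do you consider 'display: hidden' to be invisible? – khachik

Who would create an unreadable page? I don't get it. –

The HTML I am talking about is not a (web) page. The content is part of an XML file that describes properties of "something". If this description is not readable, then the property should not appear in the document that displays the properties of "something". – LaDude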

Try JSoup. It lets you parse HTML documents with CSS selectors (jQuery style).

A very simple example that selects all non-empty elements is:

Document doc = Jsoup.connect("http://my.awesome.site.com").get(); 
Elements nonEmpties = doc.select(":not(:empty)");

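A full-blown solution will of course need some additional work, such as

iterating over the element list above
checking CSS styles (for display, visibility, size, or overlapping elements)
checking the src attribute of images
etc.

but it is definitely worth it. You will learn a new framework, discover the many ways HTML/CSS content can be hidden, and - most importantly - stop using regular expressions for HTML parsing ;-)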
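As a rough illustration of the "checking CSS styles" step, here is a minimal sketch in plain Java. The class and method names are made up for this example. It assumes you only care about inline styles, for instance the string returned by JSoup's `element.attr("style")`, and it ignores stylesheets, size and overlapping elements entirely.

```java
import java.util.Locale;

class InlineStyleVisibility {

    /**
     * Returns true if the given inline style declares display:none or
     * visibility:hidden. Naive on purpose: it splits on ';' and ':' and
     * does not resolve repeated declarations or values containing ';'.
     */
    static boolean hidesElement(String style) {
        if (style == null) {
            return false;
        }
        for (String declaration : style.toLowerCase(Locale.ROOT).split(";")) {
            String[] parts = declaration.split(":", 2);
            if (parts.length != 2) {
                continue; // empty or malformed declaration
            }
            String property = parts[0].trim();
            String value = parts[1].replace("!important", "").trim();
            if (property.equals("display") && value.equals("none")) {
                return true;
            }
            if (property.equals("visibility") && value.equals("hidden")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hidesElement("color: red; display: none"));
        System.out.println(hidesElement("display: block"));
    }
}
```

While iterating over the selected elements you would skip any element (and its children) for which this helper returns true, then apply the text and `src` checks to whatever is left.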


2012-12-06 15:12:33 npe

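Thanks for pointing me to JSoup. The library looks promising. However, when I try your code, I get the following exception: Exception in thread "main" org.jsoup.select.Selector$SelectorParseException: Could not parse query ':empty': unexpected token at ':empty' – LaDude

Well, to be honest, I did not test whether this works, and unfortunately [JSoup does not seem to support the `:empty` pseudo-selector](http://jsoup.org/apidocs/org/jsoup/select/Selector.html). – npe

I came up with the following code. I do not need to take invisible elements into account.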

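// imports needed by this snippet; StringUtil lived in org.jsoup.helper in jsoup
// releases of that time (newer releases moved it to org.jsoup.internal)
import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// HTML white spaces that might occur in between tags; this list probably needs to be extended 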
public static String[] HTML_WHITE_SPACES = {"&nbsp;", "&#160;"}; 

/** 
* check if the given HTML text contains visible text or images 
* 
* @param htmlText String the text that is checked for visibility 
* @return boolean (1) true if the htmlText contains some visible elements 
*     or (2) false in case (1) does not hold 
*/ 
public static boolean containsVisibleElements(String htmlText) { 

    // do not analyze the HTML text if it is blank already 
    if (StringUtil.isBlank(htmlText)) { 
     return false; 
    } 

    // the string from which all whitespaces are removed 
    String htmlTextRemovedWhiteSpaces = htmlText; 

    // first, remove white spaces from the string 
    for (String whiteSpace: HTML_WHITE_SPACES) { 
     // String.replace treats the entity as a literal, not as a regex
     htmlTextRemovedWhiteSpaces = htmlTextRemovedWhiteSpaces.replace(whiteSpace, ""); 
    } 

    // the HTML text is blank 
    if (StringUtil.isBlank(htmlTextRemovedWhiteSpaces)) { 
     return false; 
    } 

    // parse the HTML text from which the white space have been removed 
    Document doc = Jsoup.parse(htmlTextRemovedWhiteSpaces); 

    // find real text within the body (and its children) 
    String text = doc.body().text(); 

    // there exists visible text 
    if (!StringUtil.isBlank(text.trim())) { 
     return true; 
    } 

    // now we know that there does not exist visible text and that the string 
    // htmlTextRemovedWhiteSpaces is not blank 

    // look for images as they are visible and not a text ;-) 
    Elements images = doc.select("img"); 

    // no visible text was found, so the HTML counts as visible
    // only if it contains at least one image element
    return !images.isEmpty();
}
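Since the comment above notes that the white-space list probably needs to be extended, here is one possible way to do that with a single pattern instead of an array. The set of entities below is an assumption, not a complete list, and the class name is made up for this sketch.

```java
import java.util.regex.Pattern;

class HtmlWhitespaceEntities {

    // Assumed set of whitespace-like entities: &nbsp;, &ensp;, &emsp;, &thinsp;
    // plus the decimal (&#160;) and hexadecimal (&#xA0;) forms of the
    // non-breaking space. Extend the alternation as needed.
    private static final Pattern WHITESPACE_ENTITY = Pattern.compile(
        "&(?:nbsp|ensp|emsp|thinsp|#0*160|#x0*a0);", Pattern.CASE_INSENSITIVE);

    /** Removes every whitespace-like entity from the given HTML text. */
    static String strip(String html) {
        return WHITESPACE_ENTITY.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>&nbsp;&#160;&#xA0;</p>"));
    }
}
```

The result of `strip` could be passed on to `Jsoup.parse` in place of the loop over `HTML_WHITE_SPACES`.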


2012-12-08 12:14:36 LaDude
