2013-10-09

Me again: jsoup won't remove elements

I have this code:

this.doc = Jsoup.parse(str); 
     Elements tables = doc.getElementsByTag("table"); 
     if(tables!=null){ 
      for(Element table : tables){ 
       if(table != null){ 
        Elements tds=table.getElementsByTag("td"); 
        if(tds!=null){ 
         for(Element td : tds){ 
          String[] text=td.text().trim().split("\\s+"); 
          if(text.length<2)td.remove(); 
         } 
        } 
       } 
      } 
     } 
     Elements hs = doc.getElementsByTag("h1, h2, h3, h4"); 
     if(hs!=null)for(Element h : hs)if(h != null)h.remove(); 
     Elements blocks = doc.getElementsByTag("div, center, li, p, address, aside, audio, blockquote, canvas, dd, dl, fieldset, figcaption, figure, footer, form, header, hr, hgroup, li, ol, noscript, output, pre, section"); 
     if(blocks!=null){ 
      System.out.println(blocks.size()); 
      for(Element block : blocks){ 
       if(block != null){ 
        String[] text=block.text().trim().split("\\s+"); 
        if(text.length<2)block.remove(); 
       } 
      } 
     } 
     Elements pdp = doc.getElementsByClass("pineDeletePoint"); 
     if(pdp!=null&&pdp.size()>0)pdp.remove(); 
     str = this.doc.outerHtml(); 

But my HTML still contains block elements with fewer than two words.

Why can't I remove them?

Thank you very much for your help...


Sorry, the regular expression in my earlier answer was slightly wrong. I have updated it. – Sage

Answer


In your code:

Elements hs = doc.getElementsByTag("h1, h2, h3, h4"); 

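I see what you are trying to do, but passing several comma-separated tag names does not work with getElementsByTag(), which expects a single tag name. Comma-separated lists work with selector functions such as select(), for example doc.select("div, h1, h2"). For the word-count check itself, I can suggest a solution using the pseudo-selector :matchesOwn(regex) with the regex ^\s*\S+\s*$, which matches elements whose own text is a single word. Here is a short working example: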

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class MatchesOwnDemo {
    public static void main(String[] args) {
        String data = "<div> asd asd</div><span><p> asdd </p></span>";
        Document doc = Jsoup.parse(data);
        // select elements whose own text is a single word
        Elements elms = doc.select(":matchesOwn(^\\s*\\S+\\s*$)");
        System.out.println(elms); // print the selected elements
        elms.remove(); // remove every selected element from the document
        System.out.println("\nprinting Document:\n" + doc);
    }
}

And the output:

 <p> asdd </p> 

printing Document: 
<html> 
<head></head> 
<body> 
    <div> 
    asd asd 
    </div> 
    <span></span> 
</body> 
</html>
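
The regex in the answer and the split-based check in the question do not accept exactly the same strings, which may matter when deciding which one to use. The following is a minimal standalone sketch using only java.util.regex, with no jsoup involved. The class name WordCountCheck and both method names are made up for illustration, and it assumes the selector applies the pattern as a search over the element's own text.

```java
import java.util.regex.Pattern;

public class WordCountCheck {
    // Same pattern as in the :matchesOwn selector of the answer.
    static final Pattern SINGLE_WORD = Pattern.compile("^\\s*\\S+\\s*$");

    // True when the text is exactly one whitespace-delimited word.
    static boolean matchesSelectorRegex(String ownText) {
        return SINGLE_WORD.matcher(ownText).find();
    }

    // The question's check: split on whitespace and compare the length to 2.
    // Note that "".split("\\s+") yields one empty string, so empty text passes.
    static boolean fewerThanTwoWords(String text) {
        return text.trim().split("\\s+").length < 2;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", " asdd ", " asd asd"}) {
            System.out.println("[" + s + "] regex=" + matchesSelectorRegex(s)
                    + " split=" + fewerThanTwoWords(s));
        }
    }
}
```

Both checks accept " asdd " and reject " asd asd", but only the split-based check accepts the empty string. So the regex would skip elements with no text of their own, such as the span in the output above, while the question's check would treat an empty element as having fewer than two words.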