How to extract elements from an HTML page using HtmlUnit

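I am running into two problems (real ones) while parsing an HTML page with HtmlUnit. I went through their "Getting Started" guide and searched Google, but neither helped. Here is my first problem.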

1) I want to extract the text of the following element from a web page:

<b class="productPrice">Five Dollars</b>

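2) I want to extract the entire text of the last paragraph in the following structure (including the text of any nested span, link, or bold tag, if present):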

<div class="alertContainer"> 
<p>Hello</p> 
<p>Haven't you registeret yet?</p> 
<p>Registrations will close on 3 July 2012.<span>So don't wait</span></p> 
</div>

Could you give a one-line code snippet showing how to do this? I am new to HtmlUnit.

Edit:

HtmlUnit has getElementByName() and getElementById(), so what do we use if we want to select by class?

That would be the answer to my first question.

Source

2014-01-22 Insane Coder

Have you tried 'getElementsByAttribute()' and 'getOneHtmlElementByAttribute()' (where attributeName is "class")? – MattR

Actually, I suggest you use XPath and JTidy instead, like this:

import java.io.IOException; 
import java.net.MalformedURLException; 
import java.util.List; 

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException; 
import com.gargoylesoftware.htmlunit.WebClient; 
import com.gargoylesoftware.htmlunit.html.HtmlForm; 
import com.gargoylesoftware.htmlunit.html.HtmlItalic; 
import com.gargoylesoftware.htmlunit.html.HtmlOption; 
import com.gargoylesoftware.htmlunit.html.HtmlPage; 
import com.gargoylesoftware.htmlunit.html.HtmlRadioButtonInput; 
import com.gargoylesoftware.htmlunit.html.HtmlSelect; 
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput; 
import com.gargoylesoftware.htmlunit.html.HtmlTextArea; 
import com.gargoylesoftware.htmlunit.html.HtmlTextInput; 

public class WebScrapper { 

    private static final String TEXT = "some random text here"; 
    private static final String SWALLOW = "continental"; 
    private static final String COLOR = "indigo2"; 
    private static final String QUESTION = "why?"; 
    private static final String NAME = "Leo"; 

    /** 
    * @param args 
    * @throws IOException 
    * @throws MalformedURLException 
    * @throws FailingHttpStatusCodeException 
    */ 
    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException { 

        //to get the HTML Xpath, download and install firefox plugin Xpather from
        //http://jassage.com/xpather-1.4.5b.xpi
        //
        //then right-click on any part of the html and choose "show in xpather"
        //
        //HtmlUnit is a suite for functional web app tests (headless) with a
        //built-in "browser". Very useful for screen scraping.
        //
        //for HtmlUnit examples and usage, try
        //http://htmlunit.sourceforge.net/gettingStarted.html
        //
        //sometimes, the HTML is malformed, so you'll need to "clean it"
        //that's why I've also added JTidy to this project

        WebClient webClient = new WebClient();

        HtmlPage page = webClient.getPage("http://cgi-lib.berkeley.edu/ex/simple-form.html");

        // System.out.println(page.asXml());

        HtmlForm form = (HtmlForm) page.getByXPath("/html/body/form").get(0);

        HtmlTextInput name = form.getInputByName("name");
        name.setValueAttribute(NAME);

        HtmlTextInput quest = form.getInputByName("quest");
        quest.setValueAttribute(QUESTION);

        HtmlSelect color = form.getOneHtmlElementByAttribute("select", "name", "color");
        List<HtmlOption> options = color.getOptions();
        for (HtmlOption op : options) {
            if (op.getValueAttribute().equals(COLOR)) {
                op.setSelected(true);
            }
        }

        HtmlTextArea text = form.getOneHtmlElementByAttribute("textarea", "name", "text");
        text.setText(TEXT);

        //swallow
        HtmlRadioButtonInput swallow = form.getInputByValue(SWALLOW);
        swallow.click();

        HtmlSubmitInput submit = form.getInputByValue("here");

        //submit
        HtmlPage page2 = submit.click();

        // System.out.println(page2.asXml());

        String color2 = ((HtmlItalic) page2.getByXPath("//dd[1]/i").get(0)).getTextContent();
        String name2 = ((HtmlItalic) page2.getByXPath("//dd[2]/i").get(0)).getTextContent();
        String quest2 = ((HtmlItalic) page2.getByXPath("//dd[3]/i").get(0)).getTextContent();
        String swallow2 = ((HtmlItalic) page2.getByXPath("//dd[4]/i").get(0)).getTextContent();
        String text2 = ((HtmlItalic) page2.getByXPath("//dd[5]/i").get(0)).getTextContent();

        System.out.println(COLOR.equals(color2)
                && NAME.equals(name2)
                && QUESTION.equals(quest2)
                && SWALLOW.equals(swallow2)
                && TEXT.equals(text2));

        webClient.closeAllWindows();

    } 

}
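
Applied to the two fragments from the question, the XPath approach could look roughly like the sketch below. To keep it runnable without extra jars it uses the JDK's own javax.xml.xpath on a well-formed copy of the question's markup instead of HtmlUnit; the class name QuestionXPathSketch, the MARKUP wrapper and the textOf helper are made up for illustration. With HtmlUnit, the same expression strings would be passed to page.getByXPath(...) as in the code above, followed by getTextContent() on the result.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of HtmlUnit.
class QuestionXPathSketch {

    // Well-formed copy of the two fragments from the question (assumed wrapper tags).
    static final String MARKUP =
            "<html><body>"
            + "<b class=\"productPrice\">Five Dollars</b>"
            + "<div class=\"alertContainer\">"
            + "<p>Hello</p>"
            + "<p>Haven't you registeret yet?</p>"
            + "<p>Registrations will close on 3 July 2012.<span>So don't wait</span></p>"
            + "</div>"
            + "</body></html>";

    // 1) Select by class: matches <b> whose class attribute is exactly "productPrice".
    static final String PRICE_XPATH = "//b[@class='productPrice']";

    // 2) Last <p> directly inside the alertContainer div.
    static final String LAST_PARAGRAPH_XPATH = "//div[@class='alertContainer']/p[last()]";

    // Returns the string value of the first matching node, which includes
    // the text of nested elements such as <span>.
    static String textOf(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(MARKUP)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf(PRICE_XPATH));          // prints "Five Dollars"
        System.out.println(textOf(LAST_PARAGRAPH_XPATH)); // prints "Registrations will close on 3 July 2012.So don't wait"
    }
}
```

Note that @class='productPrice' only matches when the attribute holds that single class name; an element with several classes would need a contains(...) test instead.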

Source

2014-01-22 09:44:31

Can we use regular expressions in HtmlUnit? Say I have '<div class='fresh_article_832'>', where '832' may change; can I use 'fresh_article_[0-9]*' in my XPath? Or is there an alternative? –

I don't know whether HtmlUnit's XPath supports fn:matches(), but you can give it a try :-) – 2014-01-22 10:22:16
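
If fn:matches() turns out to be unavailable (it is an XPath 2.0 function, and as far as I know HtmlUnit evaluates XPath 1.0), the changing number can be handled with XPath 1.0 string functions. The sketch below is only an illustration: it runs the expressions with the JDK's javax.xml.xpath rather than HtmlUnit, and the class name ClassPrefixSketch, the SAMPLE markup and the countMatches helper are assumptions. The expression strings themselves are what would go into getByXPath(...).

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of HtmlUnit.
class ClassPrefixSketch {

    // Made-up sample markup: two divs share the prefix, only one has a numeric suffix.
    static final String SAMPLE =
            "<html><body>"
            + "<div class=\"fresh_article_832\">a</div>"
            + "<div class=\"fresh_article_old\">b</div>"
            + "<div class=\"sidebar\">c</div>"
            + "</body></html>";

    // Any div whose class starts with the fixed prefix.
    static final String PREFIX_XPATH =
            "//div[starts-with(@class, 'fresh_article_')]";

    // Same prefix, and nothing but digits after it: translate() deletes all
    // digits from the suffix, so an empty remainder means digits only.
    static final String DIGITS_XPATH =
            "//div[starts-with(@class, 'fresh_article_')"
            + " and translate(substring-after(@class, 'fresh_article_'), '0123456789', '') = '']";

    // Counts the nodes in SAMPLE selected by the given expression.
    static int countMatches(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(SAMPLE)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countMatches(PREFIX_XPATH)); // prints 2
        System.out.println(countMatches(DIGITS_XPATH)); // prints 1
    }
}
```

Like 'fresh_article_[0-9]*', the digits-only variant also accepts an empty suffix, so a class of exactly "fresh_article_" would match too.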

XPather is a good one. +1 for that. –
