2014-01-22 124 views
2

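How to extract elements from an HTML page using HtmlUnit

I have run into two problems while parsing an HTML page with HtmlUnit. I went through their "Getting Started" guide and searched Google, but neither helped. Here is my first question.

1) I want to extract the text of the following bold tag from the page: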

<b class="productPrice">Five Dollars</b> 

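2) I want to extract the entire text of the last paragraph in the following structure, including the text of any nested span, link, or bold tag if one is present: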

<div class="alertContainer"> 
<p>Hello</p> 
<p>Haven't you registeret yet?</p> 
<p>Registrations will close on 3 July 2012.<span>So don't wait</span></p> 
</div> 

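Could you please post a one-line code snippet showing how to do this? I am new to HtmlUnit.

Edit:

HtmlUnit has getElementByName() and getElementById(), so what do we use if we want to select by class?

That would be the answer to my first question.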

+0

Have you tried 'getElementsByAttribute()' and 'getOneHtmlElementByAttribute()'? (where attributeName is "class") – MattR

Answer

6

Actually, I suggest you use XPath and JTidy instead, like this:

import java.io.IOException; 
import java.net.MalformedURLException; 
import java.util.List; 

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException; 
import com.gargoylesoftware.htmlunit.WebClient; 
import com.gargoylesoftware.htmlunit.html.HtmlForm; 
import com.gargoylesoftware.htmlunit.html.HtmlItalic; 
import com.gargoylesoftware.htmlunit.html.HtmlOption; 
import com.gargoylesoftware.htmlunit.html.HtmlPage; 
import com.gargoylesoftware.htmlunit.html.HtmlRadioButtonInput; 
import com.gargoylesoftware.htmlunit.html.HtmlSelect; 
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput; 
import com.gargoylesoftware.htmlunit.html.HtmlTextArea; 
import com.gargoylesoftware.htmlunit.html.HtmlTextInput; 

public class WebScrapper { 

    private static final String TEXT = "some random text here"; 
    private static final String SWALLOW = "continental"; 
    private static final String COLOR = "indigo2"; 
    private static final String QUESTION = "why?"; 
    private static final String NAME = "Leo"; 

    /** 
    * @param args 
    * @throws IOException 
    * @throws MalformedURLException 
    * @throws FailingHttpStatusCodeException 
    */ 
    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {

        // to get the HTML XPath, download and install the Firefox plugin XPather from
        // http://jassage.com/xpather-1.4.5b.xpi
        //
        // then right-click on any part of the HTML and choose "show in xpather"
        //
        // HtmlUnit is a suite for functional web app tests (headless) with a
        // built-in "browser". Very useful for screen scraping.
        //
        // for HtmlUnit examples and usage, try
        // http://htmlunit.sourceforge.net/gettingStarted.html
        //
        // sometimes the HTML is malformed, so you'll need to "clean it";
        // that's why I've also added JTidy to this project

        WebClient webClient = new WebClient();

        HtmlPage page = webClient.getPage("http://cgi-lib.berkeley.edu/ex/simple-form.html");

        // System.out.println(page.asXml());

        // locate the form with an absolute XPath
        HtmlForm form = (HtmlForm) page.getByXPath("/html/body/form").get(0);

        HtmlTextInput name = form.getInputByName("name");
        name.setValueAttribute(NAME);

        HtmlTextInput quest = form.getInputByName("quest");
        quest.setValueAttribute(QUESTION);

        // pick the option whose value matches COLOR
        HtmlSelect color = form.getOneHtmlElementByAttribute("select", "name", "color");
        List<HtmlOption> options = color.getOptions();
        for (HtmlOption op : options) {
            if (op.getValueAttribute().equals(COLOR)) {
                op.setSelected(true);
            }
        }

        HtmlTextArea text = form.getOneHtmlElementByAttribute("textarea", "name", "text");
        text.setText(TEXT);

        // swallow
        HtmlRadioButtonInput swallow = form.getInputByValue(SWALLOW);
        swallow.click();

        HtmlSubmitInput submit = form.getInputByValue("here");

        // submit
        HtmlPage page2 = submit.click();

        // System.out.println(page2.asXml());

        // read the echoed values back from the result page
        String color2 = ((HtmlItalic) page2.getByXPath("//dd[1]/i").get(0)).getTextContent();
        String name2 = ((HtmlItalic) page2.getByXPath("//dd[2]/i").get(0)).getTextContent();
        String quest2 = ((HtmlItalic) page2.getByXPath("//dd[3]/i").get(0)).getTextContent();
        String swallow2 = ((HtmlItalic) page2.getByXPath("//dd[4]/i").get(0)).getTextContent();
        String text2 = ((HtmlItalic) page2.getByXPath("//dd[5]/i").get(0)).getTextContent();

        // prints true only if every submitted value came back unchanged
        System.out.println(COLOR.equals(color2)
                && NAME.equals(name2)
                && QUESTION.equals(quest2)
                && SWALLOW.equals(swallow2)
                && TEXT.equals(text2));

        webClient.closeAllWindows();

    }

} 
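
The form example above does not touch the two fragments from the question, so here is a minimal sketch of XPath expressions that select by class for them. To keep it runnable without extra jars it uses the JDK's built-in javax.xml.xpath on a well-formed copy of the markup rather than HtmlUnit itself. The class name ClassXPathSketch and its helper methods are made up for illustration. With HtmlUnit, the same expression strings would presumably be passed to page.getByXPath(...) as in the code above, followed by getTextContent() on the result.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of HtmlUnit.
final class ClassXPathSketch {

    // Well-formed copy of the two fragments from the question.
    static final String SAMPLE =
        "<html><body>"
        + "<b class=\"productPrice\">Five Dollars</b>"
        + "<div class=\"alertContainer\">"
        + "<p>Hello</p>"
        + "<p>Haven't you registeret yet?</p>"
        + "<p>Registrations will close on 3 July 2012.<span>So don't wait</span></p>"
        + "</div>"
        + "</body></html>";

    // Question 1: select the <b> element by its class attribute.
    static final String PRICE_XPATH = "//b[@class='productPrice']";

    // Question 2: last <p> inside the container; its string value
    // includes the text of nested elements such as <span>.
    static final String LAST_PARAGRAPH_XPATH = "//div[@class='alertContainer']/p[last()]";

    // Evaluates the expression and returns the string value of the first match.
    static String evaluate(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(SAMPLE)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    static String priceText() {
        return evaluate(PRICE_XPATH);
    }

    static String lastParagraphText() {
        return evaluate(LAST_PARAGRAPH_XPATH);
    }

    public static void main(String[] args) {
        System.out.println(priceText());
        System.out.println(lastParagraphText());
    }
}
```

Note that @class='productPrice' is an exact match on the whole attribute value. If the element may carry several classes, contains(concat(' ', normalize-space(@class), ' '), ' productPrice ') is a common workaround.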
+0

Can we use regular expressions in HtmlUnit? Say I have '<div class="fresh_article_832">', where '832' may change; can I use 'fresh_article_[0-9]*' in my XPath? Or what are the alternatives?

+0

I don't know whether HtmlUnit's XPath supports fn:matches(), but you can give it a try :-) – 2014-01-22 10:22:16
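
If fn:matches() turns out to be unavailable, a changing numeric suffix can be handled with plain XPath 1.0 string functions. fn:matches() is an XPath 2.0 function, and as far as I know HtmlUnit evaluates XPath 1.0, but verify that against your version. The sketch below runs on the JDK's own XPath 1.0 engine rather than HtmlUnit; the class name PrefixMatchSketch and the sample markup are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of HtmlUnit.
final class PrefixMatchSketch {

    // Invented sample markup with matching and non-matching class values.
    static final String SAMPLE =
        "<html><body>"
        + "<div class=\"fresh_article_832\">first</div>"
        + "<div class=\"stale_article_7\">second</div>"
        + "<div class=\"fresh_article_\">third</div>"
        + "<div class=\"fresh_article_12x\">fourth</div>"
        + "</body></html>";

    // Looser check: the class only has to start with the prefix.
    static final String PREFIX_ONLY = "//div[starts-with(@class, 'fresh_article_')]";

    // Stricter check, equivalent to ^fresh_article_[0-9]*$ : after deleting
    // all digits from the suffix, nothing may remain.
    static final String PREFIX_AND_DIGITS =
        "//div[starts-with(@class, 'fresh_article_')"
        + " and translate(substring-after(@class, 'fresh_article_'), '0123456789', '') = '']";

    // Returns the text content of every node the expression selects, in document order.
    static List<String> matchingTexts(String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                expression, new InputSource(new StringReader(SAMPLE)), XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(matchingTexts(PREFIX_ONLY));
        System.out.println(matchingTexts(PREFIX_AND_DIGITS));
    }
}
```

The prefix-only expression is usually enough in practice. The translate() variant is there for cases where other classes share the same prefix but end in something other than digits.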

+0

XPather is a good one. +1 for that.
