2017-07-19

I'm running into a problem parsing an HTML document with jsoup (Java). The HTML I want to parse has the following format:

..... 
<hr> 
    <a name="N1"> </a> Text 1<br> 
<hr> 
    <a name="N2"> </a> Text 2<br> 
<hr> 
    <a name="N3"> </a>Text 3<br> 
<hr> 
    <a name="N4"> </a> 
    <DIV style="margin-left: 36px"> 
    <div></div> 
    <img src=bullet.gif alt="Bullet point"> Text 
    </DIV><br> 
<hr> 
<a name="X5"> </a> 
<DIV style="margin-left: 36px"> 
    <div></div> 
    <img src=bullet.gif alt="Bullet point"> Text 
</DIV><br> 
<hr> 
    ... 

I want to isolate the HTML between two consecutive "hr" tags. I'm trying the following code:

File input = new File("C:\\Users\\page.html"); 
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/"); 
Elements body = doc.select("body"); 
Elements hrs = body.select("hr"); 
ArrayList<String> objects = new ArrayList<String>(); 
for (Element hr : hrs) { 
    String textAfterHr = hr.nextSibling().toString(); 
    objects.add(textAfterHr); 
} 

System.out.println(objects);

But the ArrayList doesn't contain what I want, and I don't know how to fix it. (Would it be possible to convert each "hr" tag into an "hr" text "/hr" pair?)

What does the ArrayList contain? What is the expected output?

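The ArrayList contains all the text between two "hr" tags. @Pshemo, I'm interested in the whole text in between, which I will then parse to get the elements or DIVs inside it – HappyDAD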

Answers

0

Here you get the result by reading the children of each hr tag. Try this for a better solution.

ArrayList<String> objects = new ArrayList<String>();
Elements hrs = body.select("hr");
for (int i = 0; i < hrs.size(); i++) {
    Element hrElm = hrs.get(i);
    Elements childrens = hrElm.children();
    for (Element child : childrens) {
        String text = child.text();
        objects.add(text);
    }
}
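
One caveat with the snippet above: under jsoup's default HTML parser, `<hr>` is a void element, so it has no children and `children()` comes back empty. The content actually sits in the sibling nodes that follow each `<hr>`. Below is a minimal sketch of walking those siblings up to the next `<hr>`. It assumes all the `<hr>` tags share the same parent, as they do in the sample. The class name `HrSections` and the method `sectionsAfterHr` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class HrSections {

    // Hypothetical helper: returns one string per <hr>, holding the outer HTML
    // of every sibling node that follows it, up to (not including) the next <hr>.
    // Assumption: the <hr> tags are siblings under the same parent.
    static List<String> sectionsAfterHr(String html) {
        Document doc = Jsoup.parse(html);
        List<String> sections = new ArrayList<>();
        for (Element hr : doc.select("hr")) {
            StringBuilder sb = new StringBuilder();
            Node node = hr.nextSibling();
            // Walk text nodes and elements alike until the next <hr> or the end.
            while (node != null && !"hr".equals(node.nodeName())) {
                sb.append(node.outerHtml());
                node = node.nextSibling();
            }
            // A trailing <hr> with nothing after it yields an empty string.
            sections.add(sb.toString().trim());
        }
        return sections;
    }

    public static void main(String[] args) {
        String html = "<hr> <a name=\"N1\"> </a> Text 1<br>"
                + "<hr> <a name=\"N2\"> </a> Text 2<br><hr>";
        for (String section : sectionsAfterHr(html)) {
            System.out.println(section);
            System.out.println("-----------------------");
        }
    }
}
```

Each returned string can be handed back to jsoup (or inspected directly) if you need the anchors or DIVs inside a section.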
0
// Needs org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
// org.jsoup.parser.Parser and org.jsoup.select.Elements
public static void main(String[] args) {
    String html = ".....\n" + 
        "<hr>\n" + 
        " <a name=\"N1\"> </a> Text 1<br>\n" + 
        "<hr>\n" + 
        " <a name=\"N2\"> </a> Text 2<br>\n" + 
        "<hr>\n" + 
        " <a name=\"N3\"> </a>Text 3<br>\n" + 
        "<hr>\n" + 
        " <a name=\"N4\"> </a>\n" + 
        " <DIV style=\"margin-left: 36px\">\n" + 
        " <div></div>\n" + 
        " <img src=bullet.gif alt=\"Bullet point\"> Text\n" + 
        " </DIV><br>\n" + 
        "<hr>\n" + 
        " <a name=\"X5\"> </a>\n" + 
        " <DIV style=\"margin-left: 36px\">\n" + 
        " <div></div>\n" + 
        " <img src=bullet.gif alt=\"Bullet point\"> Text\n" + 
        " </DIV><br>\n" + 
        "<hr>\n" + 
        " ..."; 
    // Split the HTML string in front of each <hr> tag, keeping the delimiter
    String[] parts = html.split("(?=<hr>)");
    // Join the pieces back together, inserting a closing </hr> tag between them
    html = String.join("</hr>\n", parts);
    // Parse with jsoup's XML parser so that <hr> is treated as a container element
    Document doc = Jsoup.parse(html, "", Parser.xmlParser());
    Elements eles = doc.select("hr"); 
    for(Element e : eles){ 
     System.out.println(e.html()); 
     System.out.println("-----------------------"); 
    } 
}
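
If you only need the raw HTML of each section as a string, the lookahead split used above may be enough on its own, without re-parsing. Here is a minimal sketch of that idea in plain Java. The class name `HrSplit` and the method `splitAtHr` are hypothetical, and it only recognises the literal lowercase `<hr>` (not `<HR>`, `<hr/>`, or tags with attributes).

```java
import java.util.ArrayList;
import java.util.List;

public class HrSplit {

    // Hypothetical helper: splits the markup in front of every literal "<hr>"
    // and returns the trimmed text that follows each one.
    // Anything before the first <hr> is dropped.
    static List<String> splitAtHr(String html) {
        List<String> parts = new ArrayList<>();
        // (?=<hr>) is a zero-width lookahead, so the delimiter stays
        // at the start of each piece instead of being consumed.
        for (String piece : html.split("(?=<hr>)")) {
            if (piece.startsWith("<hr>")) {
                parts.add(piece.substring("<hr>".length()).trim());
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String html = ".....\n"
                + "<hr>\n <a name=\"N1\"> </a> Text 1<br>\n"
                + "<hr>\n <a name=\"N2\"> </a> Text 2<br>\n"
                + "<hr>\n ...";
        for (String part : splitAtHr(html)) {
            System.out.println(part);
            System.out.println("-----------------------");
        }
    }
}
```

This is string processing rather than real HTML parsing, so it is only as reliable as the input is regular.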