Parsing HTML

I'm running into a problem parsing an HTML document with jsoup (Java). The HTML I want to parse looks like this:

..... 
<hr> 
    <a name="N1"> </a> Text 1<br> 
<hr> 
    <a name="N2"> </a> Text 2<br> 
<hr> 
    <a name="N3"> </a>Text 3<br> 
<hr> 
    <a name="N4"> </a> 
    <DIV style="margin-left: 36px"> 
    <div></div> 
    <img src=bullet.gif alt="Bullet point"> Text 
    </DIV><br> 
<hr> 
<a name="X5"> </a> 
<DIV style="margin-left: 36px"> 
    <div></div> 
    <img src=bullet.gif alt="Bullet point"> Text 
</DIV><br> 
<hr> 
    ...

I want to isolate the HTML between two `hr` tags. I'm trying the following code:

File input = new File("C:\\Users\\page.html"); 
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/"); 
Elements body = doc.select("body"); 
Elements hrs = body.select("hr"); 
ArrayList<String> objects = new ArrayList<String>(); 
for (Element hr : hrs) { 
    String textAfterHr = hr.nextSibling().toString(); 
    objects.add(textAfterHr); 
}

System.out.println(objects);

But the ArrayList doesn't contain what I want, and I don't know how to fix it. (Would it be possible to turn each `<hr>` tag into an `<hr>text</hr>` pair?)

Source

2017-07-19 HappyDAD

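What does the ArrayList contain? What is the expected output?

Are you interested only in the text that comes after the `<a>` element, or in the whole content placed directly between the `<hr>` tags? – Pshemo

The ArrayList contains all the text between two tags. @Pshemo I'm interested in the whole content between the `<hr>` tags, which I will then parse to get the `<a>` or `<div>` elements. – HappyDAD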

Here you get the result by reading the children of each `hr` tag. Use this for a better solution.

ArrayList<String> objects = new ArrayList<String>();
Elements hrs = body.select("hr");
for (int i = 0; i < hrs.size(); i++) {
    Element hrElm = hrs.get(i);
    // Read the text of every child element of this hr
    Elements childrens = hrElm.children();
    for (Element child : childrens) {
        String text = child.text();
        objects.add(text);
    }
}
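
One thing to keep in mind: `<hr>` is a void element in HTML, so with jsoup's default HTML parser the content the question is after sits next to each `<hr>` as sibling nodes rather than inside it as children. The following is only a rough sketch of walking those siblings; the class name `HrSections` and the method `sectionsAfterHr` are made up for illustration, and it assumes all the `<hr>` tags share the same parent.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

// Hypothetical helper, not part of jsoup
public class HrSections {

    // For each <hr>, collect the outer HTML of the sibling nodes that follow it,
    // stopping at the next <hr> or at the end of the parent.
    static List<String> sectionsAfterHr(String html) {
        Document doc = Jsoup.parse(html); // default HTML parser
        List<String> sections = new ArrayList<>();
        for (Element hr : doc.select("hr")) {
            StringBuilder section = new StringBuilder();
            Node node = hr.nextSibling();
            while (node != null && !"hr".equals(node.nodeName())) {
                section.append(node.outerHtml());
                node = node.nextSibling();
            }
            sections.add(section.toString().trim());
        }
        return sections;
    }

    public static void main(String[] args) {
        String html = "<hr>\n<a name=\"N1\"> </a> Text 1<br>\n"
                    + "<hr>\n<a name=\"N2\"> </a> Text 2<br>\n<hr>";
        for (String section : sectionsAfterHr(html)) {
            System.out.println(section);
            System.out.println("-----------------------");
        }
    }
}
```

Each list entry can then be parsed again, or the nodes can be collected instead of their HTML, if the `<a>` and `<div>` elements are needed as objects.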

Source

2017-07-20 06:02:41

public static void main(String[] args) {
    String html = ".....\n" + 
        "<hr>\n" + 
        " <a name=\"N1\"> </a> Text 1<br>\n" + 
        "<hr>\n" + 
        " <a name=\"N2\"> </a> Text 2<br>\n" + 
        "<hr>\n" + 
        " <a name=\"N3\"> </a>Text 3<br>\n" + 
        "<hr>\n" + 
        " <a name=\"N4\"> </a>\n" + 
        " <DIV style=\"margin-left: 36px\">\n" + 
        " <div></div>\n" + 
        " <img src=bullet.gif alt=\"Bullet point\"> Text\n" + 
        " </DIV><br>\n" + 
        "<hr>\n" + 
        " <a name=\"X5\"> </a>\n" + 
        " <DIV style=\"margin-left: 36px\">\n" + 
        " <div></div>\n" + 
        " <img src=bullet.gif alt=\"Bullet point\"> Text\n" + 
        " </DIV><br>\n" + 
        "<hr>\n" + 
        " ..."; 
    // Split your html string at each hr tag and keep the delimiter
    String[] splited = html.split("(?=<hr>)");
    // Join it back into a string using a closing hr tag
    html = String.join("</hr>\n", splited);
    // Use the jsoup XML parser
    Document doc = Jsoup.parse(html, "", Parser.xmlParser());
    Elements eles = doc.select("hr");
    for (Element e : eles) {
        System.out.println(e.html());
        System.out.println("-----------------------");
    }
}
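
The split-and-join step in the code above can be tried on its own, without jsoup. Here is a small sketch using only the standard library; the class `HrSplitDemo` and its method names are invented for this illustration. The regex `(?=<hr>)` is a zero-width lookahead, so the string is cut just before each `<hr>` and the tag stays at the start of its piece.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, standard library only
public class HrSplitDemo {

    // Cut the string before every "<hr>"; the lookahead consumes nothing,
    // so each piece after the first starts with "<hr>".
    static List<String> splitBeforeHr(String html) {
        return Arrays.asList(html.split("(?=<hr>)"));
    }

    // Put "</hr>\n" between the pieces, as the answer's code does.
    static String closeHrTags(String html) {
        return String.join("</hr>\n", splitBeforeHr(html));
    }

    public static void main(String[] args) {
        String html = "intro\n"
                    + "<hr>\n<a name=\"N1\"> </a> Text 1<br>\n"
                    + "<hr>\n<a name=\"N2\"> </a> Text 2<br>\n";
        System.out.println(splitBeforeHr(html).size() + " pieces");
        System.out.println(closeHrTags(html));
    }
}
```

Note that the join only places `</hr>` between pieces: any text before the first `<hr>` ends up followed by a closing tag with no opener, and the last `<hr>` gets no closing tag of its own. The result is not strictly well-formed XML, so it is worth checking how the parser handles the first and last sections of a real page.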

Source

2017-07-20 10:43:46 Eritrean
