2014-02-23 26 views

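Jsoup - parsing an HTML file with charset iso-8859-1

I'm having a problem with special characters and charset=iso-8859-1. The code I'm using here works fine with UTF-8, so I don't understand what I'm doing wrong.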

Here is the code:

File input = new File("/users/marcioapf/example.html");
Document doc = Jsoup.parse(input, "iso-8859-1", "");
Elements elements = doc.select("span.DEPUTADO");
System.out.println(elements.toString());

Here is the output:

<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Jo&atilde;ozinho Pereira</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Isnaldo Bulh&otilde;es</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Antonio Albuquerque</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Jeferson Morais</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">In&aacute;cio Loiola</span> 

This is what it should be:

<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Joãozinho Pereira</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Isnaldo Bulhões</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Antonio Albuquerque</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Jeferson Morais</span> 
<span style="margin-left: 8px; width: auto !important;" class="DEPUTADO">Inácio Loiola</span> 

How can I fix this?

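What happens if you first load the whole file into memory and then process it with the Jsoup.parse(String) method? Also, the output is technically correct: &atilde; and ã represent the same character in HTML.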
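
For reference, the in-memory approach suggested in the comment could look roughly like the sketch below. It uses only the Java standard library to read the file and decode it explicitly as ISO-8859-1. The class name Latin1Loader, its helper methods, and the temporary demo file are made up for illustration and are not part of the original question. The jsoup call is left as a comment because it needs jsoup on the classpath.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Latin1Loader {

    // Decode raw bytes as ISO-8859-1, where every byte maps to exactly one character.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Read the whole file into memory and decode it with an explicit charset.
    static String readLatin1(Path path) throws IOException {
        return decodeLatin1(Files.readAllBytes(path));
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical demo file: the bytes of "João" encoded as ISO-8859-1.
        Path tmp = Files.createTempFile("example", ".html");
        byte[] raw = {0x4A, 0x6F, (byte) 0xE3, 0x6F}; // 0xE3 is "ã" in ISO-8859-1
        Files.write(tmp, raw);

        String html = readLatin1(tmp);
        System.out.println(html.equals("Jo\u00e3o")); // prints true

        // With jsoup available, the decoded string could then be parsed with:
        // Document doc = Jsoup.parse(html);

        Files.delete(tmp);
    }
}
```

Decoding the bytes yourself takes charset detection out of the picture, which helps to check whether the file is being read correctly. It does not change how jsoup escapes characters when printing, so the entities in the output may still appear.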

Answer

Using EscapeMode.xhtml gives you output without named entities such as &atilde;. Try this code:

File input = new File("/users/marcioapf/example.html");
Document doc = Jsoup.parse(input, "iso-8859-1", "");
// EscapeMode is org.jsoup.nodes.Entities.EscapeMode; xhtml mode escapes only lt, gt, amp and quot
doc.outputSettings().escapeMode(EscapeMode.xhtml);
Elements elements = doc.select("span.DEPUTADO");
System.out.println(elements.toString());