Need help getting the HTML of a website in Java

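I took some code from the question "java httpurlconnection cutting off html", and I am using pretty much the same code to fetch HTML from websites in Java. It works, except for one particular website that I cannot get it to work with.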

I am trying to get the HTML from this website:

http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289

But I keep getting garbage characters, even though the same code works fine with any other website, such as http://www.google.com.

Here is the code I am using:

public static String PrintHTML() {
    URL url = null;
    try {
        url = new URL("http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289");
    } catch (MalformedURLException e1) {
        e1.printStackTrace();
    }
    HttpURLConnection connection = null;
    try {
        connection = (HttpURLConnection) url.openConnection();
    } catch (IOException e) {
        e.printStackTrace();
    }
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6");
    try {
        System.out.println(connection.getResponseCode());
    } catch (IOException e) {
        e.printStackTrace();
    }
    String line;
    StringBuilder builder = new StringBuilder();
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    } catch (IOException e) {
        e.printStackTrace();
    }
    try {
        while ((line = reader.readLine()) != null) {
            builder.append(line);
            builder.append("\n");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    String html = builder.toString();
    System.out.println("HTML " + html);
    return html;
}

I don't understand why it doesn't work with the URL I mentioned above.

Any help would be appreciated.

2010-08-04 bits

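That site is incorrectly gzipping the response regardless of the client's capabilities. Normally, a server should gzip the response only when the client has indicated that it supports it (with the Accept-Encoding: gzip request header). You need to decompress it using GZIPInputStream.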

reader = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream()), "UTF-8"));
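The one-liner above always unzips, which suits this particular site. As a rough sketch of a more defensive variant, the helper below unzips only when the Content-Encoding response header says "gzip". The class name GzipAwareBody and the method readBody are made up for illustration, and the main method fakes a gzipped response in memory so that it runs without a network connection.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of the original answer.
class GzipAwareBody {

    // Reads the whole body as text. The stream is wrapped in a GZIPInputStream
    // only when the Content-Encoding header value is "gzip" (case-insensitive).
    static String readBody(InputStream raw, String contentEncoding, Charset charset)
            throws IOException {
        boolean gzipped = contentEncoding != null
                && contentEncoding.trim().equalsIgnoreCase("gzip");
        InputStream in = gzipped ? new GZIPInputStream(raw) : raw;

        StringBuilder html = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        // Simulate a gzipped response body in memory (assumption: demo data only).
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write("<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        }
        InputStream fakeResponse = new ByteArrayInputStream(buffer.toByteArray());
        System.out.print(readBody(fakeResponse, "gzip", StandardCharsets.UTF_8));
    }
}
```

With the connection from the question, the call would presumably look like readBody(connection.getInputStream(), connection.getContentEncoding(), StandardCharsets.UTF_8), since URLConnection.getContentEncoding() returns the value of the Content-Encoding header, or null if the header is absent.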

Note that I also added the right charset to the InputStreamReader constructor. Normally you would want to extract it from the Content-Type header of the response.
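To illustrate pulling the charset out of the Content-Type header, here is a minimal sketch. The class ContentTypeCharset and the method charsetOf are hypothetical names, and the parsing is deliberately simple: it looks for a charset= parameter and falls back to a caller-supplied default when none is present or the name is not supported.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class ContentTypeCharset {

    // Returns the charset named in a Content-Type header value such as
    // "text/html; charset=UTF-8", or the fallback if it is missing or unsupported.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Case-insensitive check for the "charset=" prefix (8 characters).
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```

With the connection from the question, the header value would come from connection.getContentType(), and the result could be passed to the InputStreamReader constructor in place of the hard-coded "UTF-8".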

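For more hints, see also "How to use URLConnection to fire and handle HTTP requests?". If all you ultimately want is to parse or extract information from the HTML, then I strongly recommend using an HTML parser such as Jsoup.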

2010-08-04 14:06:46 BalusC

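Wow, it worked. Thanks for the explanation, and thanks for the snippet. I originally tried using HTMLCleaner as my parser, but I ran into the same problem. Now I will feed this HTML string to HTMLCleaner. – bits 2010-08-04 14:20:06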

You're welcome. – BalusC 2010-08-04 14:20:35

By the way, jsoup (1.3.1) now handles gzipped output correctly when using Jsoup.connect(url).get(). – 2010-08-23 10:20:50
