2011-12-23 118 views
9

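How to get a web page's source code from Java

I just want to get the source code of any web page from Java. I have found many solutions so far, but I could not find any code that works for all of the links below:

My main problem is that some code retrieves the source of some pages but fails on others. For example, the code below does not work for the first link.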

InputStream is = fURL.openStream(); // fURL can be one of the links above
BufferedReader buffer = new BufferedReader(new InputStreamReader(is, "iso-8859-9")); // charset is hard-coded
StringBuilder builder = new StringBuilder(); // collects the decoded page text

int charRead; // Reader.read() returns one character, or -1 at end of stream
while ((charRead = buffer.read()) != -1) {
    builder.append((char) charRead);
}
buffer.close();
System.out.println(builder.toString());
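
The snippet above hard-codes "iso-8859-9", while the answers hard-code "UTF-8". One possible way to avoid guessing is to read the charset from the response's Content-Type header and fall back to a default when none is given. The following is only a sketch under that assumption; the class name CharsetFromHeader and the method charsetOf are hypothetical, and pages that declare their encoding only in an HTML meta tag are not covered.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetFromHeader {

    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // header value such as "text/html; charset=ISO-8859-9".
    // Returns the fallback when the header is missing, has no charset
    // parameter, or names a charset this JVM does not know.
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for the "charset=" prefix
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8);
        System.out.println(cs.name());
    }
}
```

With a real connection, the header value would come from URLConnection.getContentType(), for example: new InputStreamReader(conn.getInputStream(), CharsetFromHeader.charsetOf(conn.getContentType(), StandardCharsets.UTF_8)).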
+1

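Note that you only get the source that is initially delivered when the URL is opened. Additional content may be loaded via AJAX, and you will not see it if you only read the initial stream. For example, open http://demo.vaadin.com/sampler in Firefox and then view the page source: you will not see the source for everything that is displayed. – Thomas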

+0

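@cerq: Depending on your definition of *"web page source code"*, you may or may not be able to do this. For example, one could argue that the "source code" of a page generated by a *.jsp* is the *.jsp* file itself, **not** the generated HTML... What you are after is the HTML, not the "source code". In many cases the "source code" lives on the server and, short of hacking the server, you simply cannot access it. – TacticalCoder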

+0

@Thomas I think my problem is exactly what you describe. So is there any way to get the source of all the displayed content? – brtb

Answers

22

Try the following code, which adds a request property:

import java.io.BufferedReader; 
import java.io.IOException; 
import java.io.InputStream; 
import java.io.InputStreamReader; 
import java.net.URL; 
import java.net.URLConnection; 

public class SocketConnection 
{ 
    public static String getURLSource(String url) throws IOException 
    { 
     URL urlObject = new URL(url); 
     URLConnection urlConnection = urlObject.openConnection(); 
     urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11"); 

     return toString(urlConnection.getInputStream()); 
    } 

    private static String toString(InputStream inputStream) throws IOException 
    { 
     try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8"))) 
     { 
      String inputLine; 
      StringBuilder stringBuilder = new StringBuilder(); 
      while ((inputLine = bufferedReader.readLine()) != null) 
      { 
       stringBuilder.append(inputLine).append('\n'); // readLine() strips the line terminator, so add it back
      } 

      return stringBuilder.toString(); 
     } 
    } 
} 
+0

Neither your code nor the code I wrote works for the link http://www.cumhuriyet.com.tr?hn=298710. Please test your code first. – brtb

+2

System.out.println(getURLSource("http://cumhuriyet.com.tr/?hn=298710")); works fine. –

1
URL yahoo = new URL("http://www.yahoo.com/"); 
BufferedReader in = new BufferedReader(
      new InputStreamReader(
      yahoo.openStream())); 

String inputLine; 

while ((inputLine = in.readLine()) != null) 
    System.out.println(inputLine); 

in.close(); 
+0

I don't want code that works for yahoo.com or google.com. Please read my post again. – brtb

3

I'm sure you found a solution sometime in the past 2 years, but here is a working solution for the site you asked about:

package javasandbox; 

import java.io.BufferedReader; 
import java.io.IOException; 
import java.io.InputStreamReader; 
import java.net.HttpURLConnection; 
import java.net.MalformedURLException; 
import java.net.URL; 

/** 
* 
* @author Ryan.Oglesby 
*/ 
public class JavaSandbox { 

private static String sURL; 

/** 
* @param args the command line arguments 
*/ 
public static void main(String[] args) throws MalformedURLException, IOException { 
    sURL = "http://www.cumhuriyet.com.tr/?hn=298710"; 
    System.out.println(sURL); 
    URL url = new URL(sURL); 
    HttpURLConnection httpCon = (HttpURLConnection) url.openConnection(); 
    //set http request headers 
      httpCon.addRequestProperty("Host", "www.cumhuriyet.com.tr"); 
      httpCon.addRequestProperty("Connection", "keep-alive"); 
      httpCon.addRequestProperty("Cache-Control", "max-age=0"); 
      httpCon.addRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"); 
      httpCon.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36"); 
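      // Advertising gzip means the body may arrive compressed; if Content-Encoding is gzip, wrap the stream in a GZIPInputStream before decoding
      httpCon.addRequestProperty("Accept-Encoding", "gzip,deflate,sdch"); 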
      httpCon.addRequestProperty("Accept-Language", "en-US,en;q=0.8"); 
      //httpCon.addRequestProperty("Cookie", "JSESSIONID=EC0F373FCC023CD3B8B9C1E2E2F7606C; lang=tr; __utma=169322547.1217782332.1386173665.1386173665.1386173665.1; __utmb=169322547.1.10.1386173665; __utmc=169322547; __utmz=169322547.1386173665.1.1.utmcsr=stackoverflow.com|utmccn=(referral)|utmcmd=referral|utmcct=/questions/8616781/how-to-get-a-web-pages-source-code-from-java; __gads=ID=3ab4e50d8713e391:T=1386173664:S=ALNI_Mb8N_wW0xS_wRa68vhR0gTRl8MwFA; scrElm=body"); 
      HttpURLConnection.setFollowRedirects(false); 
      httpCon.setInstanceFollowRedirects(false); 
      httpCon.setDoOutput(true); 
      httpCon.setUseCaches(true); 

      httpCon.setRequestMethod("GET"); 

      BufferedReader in = new BufferedReader(new InputStreamReader(httpCon.getInputStream(), "UTF-8")); 
      String inputLine; 
      StringBuilder a = new StringBuilder(); 
      while ((inputLine = in.readLine()) != null) 
       a.append(inputLine); 
      in.close(); 

      System.out.println(a.toString()); 

      httpCon.disconnect(); 
} 
} 
+0

It's never too late to help. But I tried your code, and it doesn't work on many web pages. –

+1

I agree that this will not work for every web page, because different pages return data in different formats, and in some cases you may need to follow redirects. In some cases the response may be gzip-compressed, which you can handle as follows: InputStream gzippedResponse = httpCon.getInputStream(); InputStream ungzippedResponse = new GZIPInputStream(gzippedResponse); InputStreamReader reader = new InputStreamReader(ungzippedResponse, "UTF-8"); StringWriter writer = new StringWriter(); – Roglesby
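
The gzip handling described in the comment above could be sketched roughly as follows. This is a minimal illustration, not the answerer's code: the class GzipAwareReader and its methods decode, readAll, and gzip are hypothetical names, UTF-8 is assumed as in the answer, and the example compresses a string in memory instead of contacting a server so that it runs offline.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipAwareReader {

    // Hypothetical helper: wraps the body in a GZIPInputStream only when the
    // Content-Encoding header value says the body is gzip-compressed.
    public static InputStream decode(InputStream body, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(body);
        }
        return body;
    }

    // Reads the whole stream as UTF-8 text (assumed charset) and closes it.
    public static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Demo-only helper: gzip-compresses a string so the example needs no network.
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body>merhaba</body></html>";
        InputStream compressed = new ByteArrayInputStream(gzip(html));
        // Simulates a response whose Content-Encoding header is "gzip"
        System.out.println(readAll(decode(compressed, "gzip")));
    }
}
```

With a real HttpURLConnection, the call would look like GzipAwareReader.readAll(GzipAwareReader.decode(httpCon.getInputStream(), httpCon.getContentEncoding())). Other encodings the request advertises, such as deflate, are not handled by this sketch.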