Downloading a web page and its resources in Java

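I want to download all the comments for a given news article (www.theguardian.com). I can fetch the article in Java and parse it with Jsoup to get the URL of the comments, but when I try to download that URL I only get a default page with a limited number of comments (50). For example, the comments URL might be http://discussion.theguardian.com/discussion/p/2nzaq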

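If I load that page in Firefox and sign in with my user ID, I get an option to show all comments, and the URL becomes .../p/2nzaq#show-all. However, giving that URL to Java still downloads only the default 50 comments, the same result as with .../p/2nzaq?orderby=newest&per_page=50&commentpage=1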

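I have now also tried wget and aria2, both at the command prompt (Windows) and by executing shell commands from Java. With any of these URLs I get the same default comment page and the same number of comments. Firefox seems to have no problem displaying and downloading all the comments. How can I automate this in Java?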

Thanks

Following the suggestion in the comments, I tried HttpClient with the following code:

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class DownloadFile {

    // Downloads the response body of url and writes it to filepath.
    // ClientProtocolException is a subclass of IOException, so one throws clause suffices.
    public static void getFile(String url, String filepath) throws IOException {
        HttpClient httpClient = new DefaultHttpClient(); // Apache HttpClient 4.x
        HttpGet httpget = new HttpGet(url);
        HttpResponse response = httpClient.execute(httpget);
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            // Open the content stream only once; try-with-resources closes both streams.
            try (InputStream in = new BufferedInputStream(entity.getContent());
                 OutputStream out = new BufferedOutputStream(new FileOutputStream(filepath))) {
                int inByte;
                while ((inByte = in.read()) != -1) {
                    out.write(inByte);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        int page = 3;
        String url = "http://discussion.theguardian.com/discussion/p/2nzaq?orderby=newest&per_page=50&commentpage=" + page;
        String filePath = "./testfile" + page + ".htm";
        getFile(url, filePath);
    }
}

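I also tried it with something like ".../p/2nzaq#show-all". Incidentally, I found that the HttpClient tutorials are wrong: you cannot instantiate HttpClient httpClient = new HttpClient(); because that produces "HttpClient is abstract; cannot be instantiated". In another post I found that HttpClient httpClient = new DefaultHttpClient(); works.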


2014-05-24 Stephen Kauffman

Have you tried it with 'HttpClient'? –

Everything after the # is ignored by everything except client-side JavaScript. – immibis

Tried HttpClient with the code shown in the question, but still no luck –
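To illustrate the comment about `#`: the fragment is kept by the client and is not part of the HTTP request, so the server sees the same address with or without it. The sketch below uses only the standard library; the class name `FragmentDemo`, its helper methods, and the `#show-all` fragment are illustrative assumptions, not something taken from the site.

```java
import java.net.URI;

final class FragmentDemo {

    // Returns the fragment (the part after '#'), or null if there is none.
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    // Returns the URL as an HTTP client would request it: everything before '#'.
    static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        // Example URL from the question; the fragment is an assumed example value.
        String url = "http://discussion.theguardian.com/discussion/p/2nzaq#show-all";
        System.out.println("fragment:  " + fragmentOf(url));
        System.out.println("requested: " + withoutFragment(url));
    }
}
```

This is consistent with what the question describes: fetching the `#...` URL with HttpClient, wget, or aria2 returns the same default page as fetching the URL without the fragment.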

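I believe you need a browser to do this. You can use Selenium to control a browser from Java. Setting it up is quite simple and takes only a few minutes; see my answer: Running a WebDriver Test without using ANT, Maven, JUnit or Eclipse.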

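After you open the URL in Selenium, you will get all the comments for the current page. Then programmatically click the "next" button and loop until you reach the last page.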
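The "read the page, click next, repeat" loop can be sketched independently of the browser driver. The example below is only an outline under assumptions: `PagedCollector` and `collectAll` are hypothetical names, and the browser is replaced by two callbacks so the loop runs with the standard library alone. With Selenium, the first callback could return `driver.getPageSource()`, and the second could look for the next-page link with `driver.findElements(By.cssSelector(...))` (the selector depends on the page and is not known here), click it if present, and return whether it was found.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

final class PagedCollector {

    // Reads the current page, then tries to advance; stops when goToNext
    // reports that there is no further page or when maxPages is reached.
    static List<String> collectAll(Supplier<String> readPage,
                                   BooleanSupplier goToNext,
                                   int maxPages) {
        List<String> pages = new ArrayList<>();
        for (int i = 0; i < maxPages; i++) {
            pages.add(readPage.get());
            if (!goToNext.getAsBoolean()) {
                break; // no "next" button: this was the last page
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Stand-in for a real browser session: three fake comment pages.
        String[] fake = {"page 1", "page 2", "page 3"};
        int[] index = {0};

        List<String> pages = collectAll(
                () -> fake[index[0]],
                () -> {
                    if (index[0] + 1 < fake.length) {
                        index[0]++;   // simulates clicking "next"
                        return true;
                    }
                    return false;     // simulates a missing "next" button
                },
                100);

        System.out.println(pages.size() + " pages collected");
    }
}
```

The `maxPages` cap is a safety measure so that a "next" button that never disappears cannot cause an endless loop.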


2014-05-24 06:10:14
