2016-07-28

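Splitting jSoup scraping results

I am scraping this link in Java with the jSoup library. My code runs fine, but how do I split up each element I get back?

Here is my source: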

package javaapplication1;

import java.io.IOException;
import java.sql.SQLException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class coba {

    public static void main(String[] args) throws SQLException {
        MasukDB db = new MasukDB();
        try {
            Document doc = null;
            for (int page = 1; page < 2; page++) {
                doc = Jsoup.connect("http://hackaday.com/page/" + page).get();
                System.out.println("title : " + doc.select(".entry-title>a").text() + "\n");
                System.out.println("link : " + doc.select(".entry-title>a").attr("href") + "\n");
                System.out.println("body : " + String.join("", doc.select(".entry-content p").text()) + "\n");
                System.out.println("date : " + doc.select(".entry-date>a").text() + "\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

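In the output, everything from a page of the site ends up on a single line. How do I split it up? Also, how do I get the link to each article? I think my CSS selector for the link is still wrong. Thanks.

Answer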

doc.select(".entry-title>a").text() 

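This searches the entire document and returns a list of every matching link. Calling text() on that list gives you the text of all the links joined into one string, and attr("href") gives you only the first link's attribute, which is why each page comes out as a single line. Instead, you probably want to select each article first and then pull the relevant data out of each article individually.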

import java.io.IOException;
import java.text.MessageFormat;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ScrapeArticles {

    public static void main(String[] args) throws IOException {
        for (int page = 1; page < 2; page++) {

            Document doc = Jsoup.connect("http://hackaday.com/page/" + page).get();

            // get a list of articles on the page
            Elements articles = doc.select("main#main article");

            // iterate over the article list
            for (Element article : articles) {

                // find the article header, which includes title and date
                Element header = article.select("header.entry-header").first();
                if (header == null) {
                    continue; // skip articles without the expected header
                }

                // find and scrape title/link from the header
                Element headerTitle = header.select("h1.entry-title > a").first();
                if (headerTitle == null) {
                    continue; // skip articles without a title link
                }
                String title = headerTitle.text();
                String link = headerTitle.attr("href");

                // find and scrape the date from the header
                String date = header.select("div.entry-meta > span.entry-date > a").text();

                // find and scrape every paragraph in the article content
                // you will probably want to refine the logic here further,
                // since there may be paragraphs you don't want to include
                String body = article.select("div.entry-content p").text();

                // view results
                System.out.println(
                    MessageFormat.format(
                        "title={0} link={1} date={2} body={3}",
                        title, link, date, body));
            }
        }
    }
}

See CSS Selectors for more examples of how to scrape this kind of data.
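
To see the difference between selecting on the whole document and selecting per article without touching the network, here is a minimal sketch that parses an inline HTML string. The markup, the class name SplitDemo and the helper perArticle are made up for illustration and are not the real site's HTML; only the jsoup calls (Jsoup.parse, select, first, text, attr) are taken from the library.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SplitDemo {

    // Hypothetical markup standing in for a listing page (assumption, not the real site).
    static final String HTML =
        "<main id=\"main\">"
        + "<article><h1 class=\"entry-title\"><a href=\"/a\">First</a></h1></article>"
        + "<article><h1 class=\"entry-title\"><a href=\"/b\">Second</a></h1></article>"
        + "</main>";

    // Builds one "title -> link" row per article by selecting inside each article.
    static List<String> perArticle(Document doc) {
        List<String> rows = new ArrayList<>();
        for (Element article : doc.select("article")) {
            Element a = article.select(".entry-title > a").first();
            if (a == null) {
                continue; // article without a title link
            }
            rows.add(a.text() + " -> " + a.attr("href"));
        }
        return rows;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);

        // Whole-document selection: text() joins every match, attr() reads only the first.
        System.out.println(doc.select(".entry-title > a").text());       // First Second
        System.out.println(doc.select(".entry-title > a").attr("href")); // /a

        // Per-article selection: one row for each article.
        for (String row : perArticle(doc)) {
            System.out.println(row);
        }
    }
}
```

The null check after first() matters on a real page, where some articles may not match the selector you expect.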


Thanks a lot, mate. Your script works great, just the way I wanted :) It is almost the same as what I do with scrapy in Python :D Thanks again – jethow