Getting the sub-links of a URL using jsoup

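Consider a URL such as www.example.com. It may contain a large number of links, some internal and others external. I want to get a list of all its sub-links: not the sub-sub-links, only the sub-links. For example, suppose there are four links as follows: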

1)www.example.com/images/main 
2)www.example.com/data 
3)www.example.com/users 
4)www.example.com/admin/data

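Of these four, only 2 and 3 are of use, because they are sub-links and not sub-sub-links or deeper. Is there a way to achieve this with jsoup? If it cannot be done with jsoup, could you point me to some other Java API? Also note that each result should be a link under the parent URL that was originally sent (i.e. www.example.com).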

Source

2017-03-27 java fan

If I understand correctly, a sub-link contains exactly one slash, so you can try counting the slashes, for example:

import java.util.ArrayList;
import java.util.List;

List<String> list = new ArrayList<>();
list.add("www.example.com/images/main");
list.add("www.example.com/data");
list.add("www.example.com/users");
list.add("www.example.com/admin/data");

for (String link : list) {
    // Slash count = total length minus the length with all slashes removed
    if ((link.length() - link.replaceAll("[/]", "").length()) == 1) {
        System.out.println(link);
    }
}

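link.length(): the total number of characters
link.replaceAll("[/]", "").length(): the number of characters once the slashes are removed

The difference between the two is the number of slashes. If it equals 1, the link is a sub-link; otherwise it is not. Note that this assumes links written without a scheme: a prefix such as http:// adds two more slashes.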
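Since links taken from a real page usually carry a scheme and sometimes a trailing slash, the slash count can be made a little more robust by parsing the link first. The following is only a sketch under assumptions: the class name SubLinkFilter and its methods are made up for illustration, and the host is compared literally (so example.com and www.example.com count as different hosts).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of jsoup or of the answer above.
class SubLinkFilter {

    // True when `link` is on `host` and its path has exactly one segment.
    static boolean isDirectChild(String host, String link) {
        // Add a scheme when it is missing so that URI can split host and path.
        String withScheme = link.contains("://") ? link : "http://" + link;
        URI uri;
        try {
            uri = URI.create(withScheme);
        } catch (IllegalArgumentException e) {
            return false; // not a parseable URL
        }
        if (uri.getHost() == null || !uri.getHost().equalsIgnoreCase(host)) {
            return false; // external link
        }
        String path = uri.getPath() == null ? "" : uri.getPath();
        // Drop leading and trailing slashes, then require a single segment.
        path = path.replaceAll("^/+|/+$", "");
        return !path.isEmpty() && !path.contains("/");
    }

    // Keeps only the direct children of `host`, preserving the input order.
    static List<String> directChildren(String host, List<String> links) {
        List<String> result = new ArrayList<>();
        for (String link : links) {
            if (isDirectChild(host, link)) {
                result.add(link);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "www.example.com/images/main",
                "www.example.com/data",
                "www.example.com/users",
                "www.example.com/admin/data");
        // Prints [www.example.com/data, www.example.com/users]
        System.out.println(directChildren("www.example.com", links));
    }
}
```

The same check also accepts absolute links such as https://www.example.com/users/, which the plain slash count would reject.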

Edit

How can I scan the sub-links of a whole website?

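The answer involves the robots.txt file (the Robots exclusion standard), which lists the paths of a website that crawlers may or may not visit, for example https://stackoverflow.com/robots.txt. The idea is to read this file and extract the sub-links from it. Keep in mind that it only contains the paths the site owner chose to list, not necessarily every sub-link. Here is a piece of code that can help you: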

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public static void main(String[] args) throws Exception {

    // Your web site
    String website = "https://stackoverflow.com";
    // We will read the URL https://stackoverflow.com/robots.txt
    URL url = new URL(website + "/robots.txt");

    // List of your sub-links
    List<String> list = new ArrayList<>();

    // Read the file with a BufferedReader
    try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
        String subLink;

        // Loop through the file line by line
        while ((subLink = in.readLine()) != null) {

            // If the line matches this regex, add the sub-link to the list
            if (subLink.matches("Disallow: \\/\\w+\\/")) {
                list.add(website + "/" + subLink.replace("Disallow: /", ""));
            } else {
                System.out.println("not match");
            }
        }
    }

    // Print the result
    System.out.println(list);
}

This will give you:

[https://stackoverflow.com/posts/, https://stackoverflow.com/posts?, https://stackoverflow.com/search/, https://stackoverflow.com/search?, https://stackoverflow.com/feeds/, https://stackoverflow.com/feeds?, https://stackoverflow.com/unanswered/, https://stackoverflow.com/unanswered?, https://stackoverflow.com/u/, https://stackoverflow.com/messages/, https://stackoverflow.com/ajax/, https://stackoverflow.com/plugins/]

Here is a demo of the regex that I use.

Hope this helps.
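To try the extraction step without network access, the matching can be separated from the download. The sketch below is an illustration under assumptions, not the answer's exact code: the class RobotsPaths, the method topLevelPaths, and the sample lines are made up, and the pattern is a slightly different one that accepts a single path segment with or without a trailing slash.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that works on robots.txt lines already read into memory.
class RobotsPaths {

    // "Disallow: /segment" or "Disallow: /segment/", with exactly one path segment.
    private static final Pattern ONE_SEGMENT =
            Pattern.compile("^Disallow:\\s*/(\\w+)/?$");

    // Returns website + "/" + segment for every matching line, without duplicates.
    static List<String> topLevelPaths(String website, List<String> robotsLines) {
        List<String> result = new ArrayList<>();
        for (String line : robotsLines) {
            Matcher m = ONE_SEGMENT.matcher(line.trim());
            if (m.matches()) {
                String url = website + "/" + m.group(1);
                if (!result.contains(url)) {
                    result.add(url);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample lines in robots.txt syntax.
        List<String> lines = Arrays.asList(
                "User-agent: *",
                "Disallow: /posts/",
                "Disallow: /posts?",
                "Disallow: /search/",
                "Disallow: /admin/data/",
                "Allow: /");
        // Prints [https://stackoverflow.com/posts, https://stackoverflow.com/search]
        System.out.println(topLevelPaths("https://stackoverflow.com", lines));
    }
}
```

Lines with a query marker or with more than one path segment are skipped, which matches the goal of keeping only first-level sub-links.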

Source

2017-03-27 11:52:58

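But how would I scan the whole website for sub-links?

Your implementation will work once I have all the internal links on the website.

Check my edit, @javafan. The idea is to read robots.txt: it contains information about the website's paths, so you can extract the sub-links from there.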

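To scan the links on a web page, you can use the jsoup library:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class read_data {

    public static void main(String[] args) {
        try {
            // Replace your_url with the page you want to scan
            Document doc = Jsoup.connect("your_url").get();
            Elements links = doc.select("a");
            List<String> list = new ArrayList<>();
            for (Element link : links) {
                // abs:href resolves relative links against the page URL
                list.add(link.attr("abs:href"));
            }
        } catch (IOException ex) {
            // The page could not be fetched
            ex.printStackTrace();
        }
    }
}

The approach shown in the previous answer can then be applied to the resulting list.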

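The code to read all the links on a website is shown below. I have used http://stackoverflow.com/ for illustration. I suggest you go through the company's terms of use before scraping a website.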

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class readAllLinks {

    // Every link seen so far, internal or external
    public static Set<String> uniqueURL = new HashSet<String>();
    public static String my_site;

    public static void main(String[] args) {

        readAllLinks obj = new readAllLinks();
        my_site = "stackoverflow.com";
        obj.get_links("http://stackoverflow.com/");
    }

    // Fetches one page and recurses into every new link that belongs to my_site.
    // Note: the recursion is unbounded, so a large site can take very long
    // or overflow the call stack.
    private void get_links(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            Elements links = doc.select("a");
            links.stream().map((link) -> link.attr("abs:href")).forEachOrdered((this_url) -> {
                boolean add = uniqueURL.add(this_url);
                if (add && this_url.contains(my_site)) {
                    System.out.println(this_url);
                    get_links(this_url);
                }
            });

        } catch (IOException ex) {
            // Skip pages that cannot be fetched and carry on with the rest
            System.err.println("Could not read " + url + ": " + ex.getMessage());
        }

    }
}

You will get the list of all links in the uniqueURL field.
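Because a recursive crawl of a whole site has no natural stopping point, one possible variation is a breadth-first crawl with a depth limit. The following is a sketch under assumptions rather than the answer's method: the class BoundedCrawler, the maxDepth parameter, and the fetchLinks function are made up for illustration, and, unlike the code above, only links containing the site name are recorded. The page fetcher is passed in as a function so that the traversal can be tried on an in-memory map of made-up pages.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

// Hypothetical breadth-first crawler with a depth limit.
class BoundedCrawler {

    // Visits pages level by level; pages at depth maxDepth are recorded but not fetched.
    static Set<String> crawl(String start, String site, int maxDepth,
                             Function<String, List<String>> fetchLinks) {
        Set<String> visited = new LinkedHashSet<>();
        Map<String, Integer> depth = new HashMap<>();
        Queue<String> frontier = new ArrayDeque<>();
        visited.add(start);
        depth.put(start, 0);
        frontier.add(start);
        while (!frontier.isEmpty()) {
            String url = frontier.remove();
            int d = depth.get(url);
            if (d >= maxDepth) {
                continue; // deep enough, do not fetch this page
            }
            for (String link : fetchLinks.apply(url)) {
                // Keep internal links only, and each link only once.
                if (link.contains(site) && visited.add(link)) {
                    depth.put(link, d + 1);
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    // Made-up site structure standing in for real pages.
    static Map<String, List<String>> demoSite() {
        Map<String, List<String>> pages = new HashMap<>();
        pages.put("http://example.com/", Arrays.asList(
                "http://example.com/a", "http://example.com/b", "http://other.org/x"));
        pages.put("http://example.com/a", Arrays.asList(
                "http://example.com/", "http://example.com/a/deep"));
        pages.put("http://example.com/a/deep", Arrays.asList(
                "http://example.com/a/deep/er"));
        return pages;
    }

    public static void main(String[] args) {
        Map<String, List<String>> pages = demoSite();
        Set<String> found = crawl("http://example.com/", "example.com", 2,
                url -> pages.getOrDefault(url, Collections.<String>emptyList()));
        System.out.println(found);
    }
}
```

With jsoup, the fetchLinks function could wrap the Jsoup.connect(url).get() and attr("abs:href") calls used in the answer; that wiring is left out here so the sketch runs without the library or a network connection.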

Source

2017-03-28 11:03:39

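Thanks for your help, but let me tell you that I do not simply want the links on one web page; I want the links of the whole website.

You can see [this](http://stackoverflow.com/questions/32299871/java-get-every-webpage-associated-with-domain-name-programmatically). Let me know if this does not work for you.

The answer I accepted is the same.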
