
Below is the code from my MyCrawler.java. It crawls every link under the prefixes I pass to href.startsWith, but suppose I do not want to crawl one particular page, say http://inv.somehost.com/people/index.html. How can I do that in my code, i.e. skip a specific page on a host that is otherwise crawled (exclude certain URLs from being crawled)?

public MyCrawler() {
}

public boolean shouldVisit(WebURL url) {
    String href = url.getURL().toLowerCase();

    if (href.startsWith("http://www.somehost.com/")
            || href.startsWith("http://inv.somehost.com/")
            || href.startsWith("http://jo.somehost.com/")) {
        // And if I do not want to crawl this page http://inv.somehost.com/data/index.html,
        // how can that be done?
        return true;
    }
    return false;
}


public void visit(Page page) {
    int docid = page.getWebURL().getDocid();
    String url = page.getWebURL().getURL();
    String text = page.getText();
    List<WebURL> links = page.getURLs();
    int parentDocid = page.getWebURL().getParentDocid();

    try {
        URL url1 = new URL(url);
        System.out.println("URL:- " + url1);
        URLConnection connection = url1.openConnection();

        Map<String, List<String>> responseMap = connection.getHeaderFields();
        for (Map.Entry<String, List<String>> entry : responseMap.entrySet()) {
            // e.g. Content-Type=[text/html; charset=ISO-8859-1]
            String key = entry.toString();

            if (key.contains("text/html") || key.contains("text/xhtml")) {
                System.out.println(key);
                // Note: matcher() never returns null, so this check always passes as written
                if (filters.matcher(key) != null) {
                    System.out.println(url1);
                    try {
                        // Write the extracted text to crawl_html/<md5-of-url>.txt
                        final File parentDir = new File("crawl_html");
                        parentDir.mkdir();
                        final String hash = MD5Util.md5Hex(url1.toString());
                        final File file = new File(parentDir, hash + ".txt");
                        file.createNewFile();

                        System.out.println("hash:-" + hash);
                        System.out.println(file);

                        FileOutputStream fos = new FileOutputStream(file, true);
                        PrintWriter out = new PrintWriter(fos);

                        // Extract the page text with Tika and append it to the file
                        Tika t = new Tika();
                        String content = t.parseToString(new URL(url1.toString()));

                        out.println("===============================================================");
                        out.println(url1);
                        out.println(key);
                        out.println(content);
                        out.println("===============================================================");
                        out.close();
                        fos.close();
                    } catch (FileNotFoundException e) {
                        e.printStackTrace();
                    } catch (IOException e) {
                        e.printStackTrace();
                    } catch (TikaException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

    System.out.println("=============");
}

And here is my Controller.java, from which MyCrawler gets invoked:

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlController controller = new CrawlController("/data/crawl/root");
        controller.addSeed("http://www.somehost.com/");
        controller.addSeed("http://inv.somehost.com/");
        controller.addSeed("http://jo.somehost.com/");
        controller.start(MyCrawler.class, 20);
        controller.setPolitenessDelay(200);
        controller.setMaximumCrawlDepth(2);
    }
}

Any suggestions would be appreciated.

Answer


How about adding a field that tells the crawler which URLs you want to exclude?

Add every page you do not want crawled to that exclusion list.

Here is an example:

public class MyCrawler extends WebCrawler {

    List<Pattern> exclusionsPatterns;

    public MyCrawler() {
        exclusionsPatterns = new ArrayList<Pattern>();
        // Add all your exclusions here, as regular expressions
        exclusionsPatterns.add(Pattern.compile("http://investor\\.somehost\\.com.*"));
    }

    /*
     * You should implement this function to specify
     * whether the given URL should be visited or not.
     */
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();

        // Iterate over the patterns to find out whether the URL is excluded.
        for (Pattern exclusionPattern : exclusionsPatterns) {
            Matcher matcher = exclusionPattern.matcher(href);
            if (matcher.matches()) {
                return false;
            }
        }

        if (href.startsWith("http://www.ics.uci.edu/")) {
            return true;
        }
        return false;
    }
}

In this example we are telling the crawler that any URL starting with http://investor.somehost.com should not be crawled.

So these will not be crawled:

http://investor.somehost.com/index.html
http://investor.somehost.com/something/else

I suggest you read up on regular expressions.
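
To see what a pattern actually excludes, you can test it on its own before wiring it into the crawler. A standalone sketch (the class name and the third test URL are made up for illustration; the pattern is the one from the example above):

import java.util.regex.Pattern;

public class ExclusionPatternTest {
    public static void main(String[] args) {
        // The same pattern used in the example constructor above
        Pattern exclusion = Pattern.compile("http://investor\\.somehost\\.com.*");

        String[] urls = {
            "http://investor.somehost.com/index.html",      // expected: excluded
            "http://investor.somehost.com/something/else",  // expected: excluded
            "http://www.somehost.com/index.html"            // expected: not excluded
        };

        for (String url : urls) {
            System.out.println(url + " excluded? " + exclusion.matcher(url).matches());
        }
    }
}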

So how do we do that? That is exactly what I am asking. Any ideas? – ferhan

You can do it by keeping a list of exclusions and adding the URLs you want excluded to it; you then check that list to decide whether a page should be processed. –

An example using my code would be much appreciated, and where exactly should it go in my MyCrawler.java file? – ferhan
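
Following up on that suggestion, here is a minimal sketch of how an exact-URL exclusion list could be plugged into the shouldVisit from the question. The field name excludedUrls and its two entries are only placeholders taken from the URLs mentioned in the question, not part of the answer above:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MyCrawler extends WebCrawler {

    // Hypothetical set of exact page URLs that should never be crawled
    private final Set<String> excludedUrls = new HashSet<String>(Arrays.asList(
            "http://inv.somehost.com/people/index.html",
            "http://inv.somehost.com/data/index.html"));

    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();

        // Skip any page that appears on the exclusion list
        if (excludedUrls.contains(href)) {
            return false;
        }

        // Otherwise keep the original host checks from the question
        return href.startsWith("http://www.somehost.com/")
                || href.startsWith("http://inv.somehost.com/")
                || href.startsWith("http://jo.somehost.com/");
    }
}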