2011-07-15

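Excluding certain URLs from being crawled

I am writing a crawler, and I want it to skip certain pages (that is, exclude some links so they are never crawled). I wrote an exclusion for such a page, but it does not work. What is wrong with this code? The URL http://www.host.com/technology/ still gets visited even though I added it as an exclusion. I do not want any URL that starts with http://www.host.com/technology/ to be crawled.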

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// WebCrawler, WebURL and Page come from the crawler library;
// its imports are not shown in the original post.
public class MyCrawler extends WebCrawler {

    // Skip static resources and binary files by extension.
    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4" + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    List<String> exclusions;

    public MyCrawler() {
        exclusions = new ArrayList<String>();
        // Add all your exclusions here.
        // I do not want this URL to get crawled.
        exclusions.add("http://www.host.com/technology/");
    }

    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        System.out.println(href);
        if (filters.matcher(href).matches()) {
            System.out.println("noooo");
            return false;
        }

        if (exclusions.contains(href)) { // why is this loop not working??
            System.out.println("Yes2");
            return false;
        }

        if (href.startsWith("http://www.host.com/")) {
            System.out.println("Yes1");
            return true;
        }

        System.out.println("No");
        return false;
    }

    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        String text = page.getText();
        List<WebURL> links = page.getURLs();
        int parentDocid = page.getWebURL().getParentDocid();

        System.out.println("=============");
        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Text length: " + text.length());
        System.out.println("Number of links: " + links.size());
        System.out.println("Docid of parent page: " + parentDocid);
        System.out.println("=============");
    }
}

Answer

If you do not want to crawl any URL that starts with one of your exclusions, you have to do something like this:

for (String exclusion : exclusions) {
    if (href.startsWith(exclusion)) {
        return false;
    }
}

Also, an if statement is not a loop.
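
To try the prefix check outside the crawler, here is a minimal standalone sketch. The class name UrlExclusionFilter, the method isExcluded, and the sample URLs under /technology/ and /sports/ are hypothetical and only serve as illustration. Because the question's shouldVisit lowercases href before comparing, the sketch also lowercases the exclusions, on the assumption that matching should be case-insensitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of any crawler library.
public class UrlExclusionFilter {

    // Exclusion prefixes, stored in lower case.
    private final List<String> prefixes = new ArrayList<String>();

    public UrlExclusionFilter(List<String> exclusions) {
        for (String exclusion : exclusions) {
            prefixes.add(exclusion.toLowerCase(Locale.ROOT));
        }
    }

    // Returns true if the URL starts with any exclusion prefix.
    public boolean isExcluded(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        for (String prefix : prefixes) {
            if (href.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlExclusionFilter filter = new UrlExclusionFilter(
                Arrays.asList("http://www.host.com/technology/"));

        // The excluded page itself.
        System.out.println(filter.isExcluded("http://www.host.com/technology/"));
        // A page below the excluded path, with different casing.
        System.out.println(filter.isExcluded("http://www.host.com/Technology/gadgets.html"));
        // A page elsewhere on the same host.
        System.out.println(filter.isExcluded("http://www.host.com/sports/"));
    }
}
```

Inside shouldVisit, a call such as filter.isExcluded(href) could then stand in for the exclusions.contains(href) test.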

Thanks for the reply. What am I doing wrong? Could you let me know? – ferhan

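You are checking whether the whole URL is in the exclusion list (exclusions.contains(href)), rather than checking whether the URL starts with any of the exclusions (my example). – Jeffrey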
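
The difference between the two checks can be seen side by side in a small sketch. The class ContainsVsStartsWith, its two helper methods, and the URL ending in article1.html are made up for illustration. The prefix version uses Stream.anyMatch (Java 8 or later) as one possible alternative to the explicit loop.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class comparing the two checks.
public class ContainsVsStartsWith {

    // Exact match: true only if href equals an entry in the list.
    static boolean exactMatch(List<String> exclusions, String href) {
        return exclusions.contains(href);
    }

    // Prefix match: true if href starts with any entry in the list.
    static boolean prefixMatch(List<String> exclusions, String href) {
        return exclusions.stream().anyMatch(href::startsWith);
    }

    public static void main(String[] args) {
        List<String> exclusions = Arrays.asList("http://www.host.com/technology/");
        String subPage = "http://www.host.com/technology/article1.html";

        System.out.println(exactMatch(exclusions, subPage));  // false
        System.out.println(prefixMatch(exclusions, subPage)); // true
    }
}
```

With the exact match, every page below /technology/ slips past the exclusion, because none of those URLs is literally equal to the list entry.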

Thanks for the answer and the explanation. – ferhan