Getting URLs whose content type is not text/html while crawling

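I can get all the URLs whose content type is text/html, but what if I want the URLs whose content type is not text/html? How can I check for that? For strings we can use the contains method, but there is nothing like a notContains. Any suggestions would be appreciated.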

The key variable contains: 

Content-Type=[text/html; charset=ISO-8859-1]

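Below is the code that checks for text/html. I also tried to handle content types other than text/html, but that branch also prints URLs whose content type is text/html.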

try { 
      URL url1 = new URL(url); 
      System.out.println("URL:- " +url1); 
      URLConnection connection = url1.openConnection(); 

      Map responseMap = connection.getHeaderFields(); 
      Iterator iterator = responseMap.entrySet().iterator(); 
      while (iterator.hasNext()) 
      { 
       String key = iterator.next().toString(); 

       if (key.contains("text/html") || key.contains("text/xhtml")) 
       { 
        System.out.println(key); 
        // Content-Type=[text/html; charset=ISO-8859-1] 
        if (filters.matcher(key) != null){ 
         System.out.println(url1); 
         try { 
          final File parentDir = new File("crawl_html"); 
          parentDir.mkdir(); 
          final String hash = MD5Util.md5Hex(url1.toString()); 
          final String fileName = hash + ".txt"; 
          final File file = new File(parentDir, fileName); 
          boolean success =file.createNewFile(); // Creates file crawl_html/abc.txt 


          System.out.println("hash:-" + hash); 

            System.out.println(file); 
          // Create file if it does not exist 



           // File did not exist and was created 
           FileOutputStream fos = new FileOutputStream(file, true); 

           PrintWriter out = new PrintWriter(fos); 

           // Also could be written as follows on one line 
           // Printwriter out = new PrintWriter(new FileWriter(args[0])); 

              // Write text to file 
           Tika t = new Tika(); 
           String content= t.parseToString(new URL(url1.toString())); 


           out.println("==============================================================="); 
           out.println(url1); 
           out.println(key); 
           out.println(success); 
           out.println(content); 

           out.println("==============================================================="); 
           out.close(); 
           fos.flush(); 
           fos.close(); 



         } catch (FileNotFoundException e) { 
          // TODO Auto-generated catch block 
          e.printStackTrace(); 
         } catch (IOException e) { 
          // TODO Auto-generated catch block 

          e.printStackTrace(); 
         } catch (TikaException e) { 
          // TODO Auto-generated catch block 
          e.printStackTrace(); 
         } 


         // http://google.com 
        } 
       } 
    else if (!connection.getContentType().startsWith("text/html"))//print duplicate records of each url 
       //else if (!key.contains("text/html")) 
       { 
        if (filters.matcher(key) != null){ 
        try { 
         final File parentDir = new File("crawl_media"); 
         parentDir.mkdir(); 
         final String hash = MD5Util.md5Hex(url1.toString()); 
         final String fileName = hash + ".txt"; 
         final File file = new File(parentDir, fileName); 
        // Create file if it does not exist 
         boolean success = file.createNewFile(); // Creates file crawl_media/<hash>.txt 


         System.out.println("hash:-" + hash); 

         Tika t = new Tika(); 
         String content_media= t.parseToString(new URL(url1.toString())); 



          // File did not exist and was created 
          FileOutputStream fos = new FileOutputStream(file, true); 

          PrintWriter out = new PrintWriter(fos); 

          // Also could be written as follows on one line 
          // Printwriter out = new PrintWriter(new FileWriter(args[0])); 

             // Write text to file 
          out.println("==============================================================="); 
          out.println(url1); 
          out.println(key); 
          out.println(success); 
          out.println(content_media); 
          //out.println("==============================================================="); 
          out.close(); 
          fos.flush(); 
          fos.close(); 




        } catch (FileNotFoundException e) { 
         // TODO Auto-generated catch block 
         e.printStackTrace(); 
        } catch (IOException e) { 
         // TODO Auto-generated catch block 

         e.printStackTrace(); 
        } catch (TikaException e) { 
         // TODO Auto-generated catch block 
         e.printStackTrace(); 
        } 
        } 

       } 



      } 
     } catch (MalformedURLException e) { 
      e.printStackTrace(); 
     } catch (IOException e) { 
      e.printStackTrace(); 
     } 



     System.out.println("============="); 
    } 
}

One approach is to check each content type individually. For PDF, for example, it is application/pdf:

if (key.contains("application/pdf"))

and the same for XML, and so on. But is there any other way than this?


2011-07-11 ferhan

Would this help?

if (!connection.getContentType().startsWith("text/html"))


2011-07-11 18:25:43 emboss

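This doesn't work; it also picks up links whose content type is text/html. Any other ideas? I have also updated the question with the code I use for both text/html and non-text/html. – ferhan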

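Like this? If getContentType() also returns the "[", then strip it by using getContentType().substring(1). – emboss

It is working now, but it prints duplicate records. For a given URL the response contains many headers, so the check runs once for every header, and each time a header does not start with "text/html" the URL is printed. So if a particular non-text/html URL has 8 headers in its response, that URL is printed 8 times. I hope that makes clear what I mean. – ferhan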
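
The duplicates described above come from running the content-type check inside the loop over all response headers. A minimal sketch of deciding once per URL instead is shown here. The class name `ContentTypes` and the helper `isHtml` are made up for illustration; the sketch assumes the string comes from a single call to `URLConnection.getContentType()`, which can return null when the server sends no Content-Type header.

```java
import java.util.Locale;

// Hypothetical helper, not part of the original crawler code.
final class ContentTypes {

    // Returns true when the header value names an HTML or XHTML media type.
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // no Content-Type header: treat as non-HTML
        }
        // Drop parameters such as "; charset=ISO-8859-1" and normalize case.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html")
                || mediaType.equals("text/xhtml")              // as checked in the question
                || mediaType.equals("application/xhtml+xml");  // registered XHTML type
    }

    public static void main(String[] args) {
        // In the crawler this value would be read once per URL, outside any header loop:
        //   String type = url1.openConnection().getContentType();
        System.out.println(isHtml("text/html; charset=ISO-8859-1")); // true
        System.out.println(isHtml("application/pdf"));               // false
    }
}
```

With a check like this, the non-HTML branch is simply `if (!ContentTypes.isHtml(type))`, evaluated once for the URL rather than once per header.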

What is wrong with using:

if (key.contains("text/html") || key.contains("text/xhtml")) { 
    //Do stuff 
} else if (key.contains("application/pdf")) { 
    //Do other stuff 
} else { 
    //All other cases 
}

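Since the handling of other formats can differ from type to type, you may need an explicit case for each content type. If you run into a generic content type, shouldn't a generic approach (the else) be enough? The Strategy Pattern may be useful to you.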
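
As a rough illustration of the Strategy Pattern idea, the if/else chain can be replaced by a lookup table of handlers with a fallback for everything else. This is only a sketch: `ContentTypeDispatcher`, `register`, and `handle` are hypothetical names, and the handler bodies are placeholders that return a label instead of saving a file.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher; not taken from the question's code.
final class ContentTypeDispatcher {

    // One handler (strategy) per known media type.
    private final Map<String, Function<String, String>> handlers = new HashMap<>();
    // Used for every media type that has no registered handler.
    private final Function<String, String> fallback;

    ContentTypeDispatcher(Function<String, String> fallback) {
        this.fallback = fallback;
    }

    void register(String mediaType, Function<String, String> handler) {
        handlers.put(mediaType, handler);
    }

    // Picks the handler registered for the media type, or the fallback.
    String handle(String mediaType, String url) {
        return handlers.getOrDefault(mediaType, fallback).apply(url);
    }

    public static void main(String[] args) {
        ContentTypeDispatcher dispatcher = new ContentTypeDispatcher(url -> "other: " + url);
        dispatcher.register("text/html", url -> "html: " + url);
        dispatcher.register("application/pdf", url -> "pdf: " + url);

        System.out.println(dispatcher.handle("application/pdf", "http://example.com/a.pdf"));
        System.out.println(dispatcher.handle("image/png", "http://example.com/b.png"));
    }
}
```

New content types can then be supported by registering another handler, while unknown types still reach the fallback.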

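I apologise if I have misunderstood your question. Could you provide a sample printout of the different values that key takes during a test run? (Line 10 of your code.)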


2011-07-11 18:54:45 Grambot

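Thanks for replying. The problem is that I don't know how many content types there are. In my case I need two things: first, all the URLs whose content type is text/html or text/xhtml, and second, all the URLs whose content type is not text/html or text/xhtml. One way is to print every URL, look at its content type, and then add an else-if branch for that content type. But if someone later adds pages with some other content type, I might miss it. I hope that makes it clearer. – ferhan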
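
The two groups described here do not require knowing every content type, because the second group is just the negation of the first. A small sketch under that assumption follows; `UrlPartition` and its methods are hypothetical names, and the input map from URL to Content-Type value is sample data standing in for what a crawler would collect.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical example; the URLs below are sample data only.
final class UrlPartition {

    // True for text/html and text/xhtml, with or without parameters.
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        String type = contentType.toLowerCase(Locale.ROOT);
        return type.startsWith("text/html") || type.startsWith("text/xhtml");
    }

    // Splits URLs into HTML (key true) and everything else (key false).
    static Map<Boolean, List<String>> partition(Map<String, String> typeByUrl) {
        return typeByUrl.entrySet().stream()
                .collect(Collectors.partitioningBy(
                        entry -> isHtml(entry.getValue()),
                        Collectors.mapping(Map.Entry::getKey, Collectors.toList())));
    }

    public static void main(String[] args) {
        Map<String, String> typeByUrl = new LinkedHashMap<>();
        typeByUrl.put("http://example.com/", "text/html; charset=ISO-8859-1");
        typeByUrl.put("http://example.com/report.pdf", "application/pdf");
        typeByUrl.put("http://example.com/feed.xml", "application/xml");

        Map<Boolean, List<String>> groups = partition(typeByUrl);
        System.out.println("html:     " + groups.get(true));
        System.out.println("non-html: " + groups.get(false));
    }
}
```

Any content type added later falls into the non-HTML group automatically, without a new branch.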

'key' holds the value of a response header for a particular URL. Every URL has a content type, which is why I check for text/html. – ferhan

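In your situation it is best to implement a solution for all known content types and warn the user when an unknown one is encountered. Writing a system that works in 100% of cases is not possible, so your goal is to keep the system from failing badly when it meets an unknown content type. Use the else case to catch unknown or unhandled content types, and either print a warning or give them a "best effort" treatment (possibly using regular expressions), but be prepared for unexpected behavior. – Grambot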
