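I can get all the URLs whose content type is text/html, but what if I want the URLs whose content type is not text/html? How can I check for that? For strings we can use the contains method, but there is nothing like notcontains. Any suggestions would be appreciated. I also want to crawl the URLs whose content type is not text/html.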
The key variable contains:
Content-Type=[text/html; charset=ISO-8859-1]
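The value above has the shape that `Map.Entry.toString()` produces for one entry of a header map: the header name, `=`, then the list of values in brackets. The following is a minimal sketch, using a hand-built map with made-up contents rather than a real response, of reading the value from the typed map instead. The class and method names (`HeaderEntryDemo`, `firstValue`, `sampleHeaders`) are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HeaderEntryDemo {

    // Returns the first value of the named header, ignoring case, or null if absent.
    static String firstValue(Map<String, List<String>> headers, String name) {
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            String key = entry.getKey(); // can be null for the HTTP status line
            if (key != null && key.equalsIgnoreCase(name) && !entry.getValue().isEmpty()) {
                return entry.getValue().get(0);
            }
        }
        return null;
    }

    // Stand-in for connection.getHeaderFields(); the contents are invented.
    static Map<String, List<String>> sampleHeaders() {
        Map<String, List<String>> headers = new LinkedHashMap<>();
        headers.put("Content-Type", Arrays.asList("text/html; charset=ISO-8859-1"));
        headers.put("Server", Arrays.asList("example"));
        return headers;
    }

    public static void main(String[] args) {
        Map<String, List<String>> headers = sampleHeaders();
        // The whole entry as a string, brackets included.
        System.out.println(headers.entrySet().iterator().next());
        // Only the header value, without the name or brackets.
        System.out.println(firstValue(headers, "content-type"));
    }
}
```

Working with the value alone avoids matching against the header name and the brackets by accident.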
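Below is the code that checks for text/html. I also tried a check for content types that are not text/html, but that branch also prints URLs whose content type is text/html.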
        try {
            URL url1 = new URL(url);
            System.out.println("URL:- " + url1);
            URLConnection connection = url1.openConnection();
            Map responseMap = connection.getHeaderFields();
            Iterator iterator = responseMap.entrySet().iterator();
            while (iterator.hasNext())
            {
                String key = iterator.next().toString();
                if (key.contains("text/html") || key.contains("text/xhtml"))
                {
                    System.out.println(key);
                    // Content-Type=[text/html; charset=ISO-8859-1]
                    if (filters.matcher(key) != null) {
                        System.out.println(url1);
                        try {
                            final File parentDir = new File("crawl_html");
                            parentDir.mkdir();
                            final String hash = MD5Util.md5Hex(url1.toString());
                            final String fileName = hash + ".txt";
                            final File file = new File(parentDir, fileName);
                            boolean success = file.createNewFile(); // Creates crawl_html/<hash>.txt if it does not exist
                            System.out.println("hash:-" + hash);
                            System.out.println(file);
                            // Create file if it does not exist
                            // File did not exist and was created
                            FileOutputStream fos = new FileOutputStream(file, true);
                            PrintWriter out = new PrintWriter(fos);
                            // Also could be written as follows on one line
                            // PrintWriter out = new PrintWriter(new FileWriter(args[0]));
                            // Write text to file
                            Tika t = new Tika();
                            String content = t.parseToString(new URL(url1.toString()));
                            out.println("===============================================================");
                            out.println(url1);
                            out.println(key);
                            out.println(success);
                            out.println(content);
                            out.println("===============================================================");
                            out.close();
                            fos.flush();
                            fos.close();
                        } catch (FileNotFoundException e) {
                            // TODO Auto-generated catch block
                            e.printStackTrace();
                        } catch (IOException e) {
                            // TODO Auto-generated catch block
                            e.printStackTrace();
                        } catch (TikaException e) {
                            // TODO Auto-generated catch block
                            e.printStackTrace();
                        }
                        // http://google.com
                    }
                }
                else if (!connection.getContentType().startsWith("text/html")) // prints duplicate records for each URL
                //else if (!key.contains("text/html"))
                {
                    if (filters.matcher(key) != null) {
                        try {
                            final File parentDir = new File("crawl_media");
                            parentDir.mkdir();
                            final String hash = MD5Util.md5Hex(url1.toString());
                            final String fileName = hash + ".txt";
                            final File file = new File(parentDir, fileName);
                            // Create file if it does not exist
                            boolean success = file.createNewFile(); // Creates crawl_media/<hash>.txt if it does not exist
                            System.out.println("hash:-" + hash);
                            Tika t = new Tika();
                            String content_media = t.parseToString(new URL(url1.toString()));
                            // File did not exist and was created
                            FileOutputStream fos = new FileOutputStream(file, true);
                            PrintWriter out = new PrintWriter(fos);
                            // Also could be written as follows on one line
                            // PrintWriter out = new PrintWriter(new FileWriter(args[0]));
                            // Write text to file
                            out.println("===============================================================");
                            out.println(url1);
                            out.println(key);
                            out.println(success);
                            out.println(content_media);
                            //out.println("===============================================================");
                            out.close();
                            fos.flush();
                            fos.close();
                        } catch (FileNotFoundException e) {
                            // TODO Auto-generated catch block
                            e.printStackTrace();
                        } catch (IOException e) {
                            // TODO Auto-generated catch block
                            e.printStackTrace();
                        } catch (TikaException e) {
                            // TODO Auto-generated catch block
                            e.printStackTrace();
                        }
                    }
                }
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("=============");
    }
}
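One way is to check each content type one by one. For a PDF, for example, it is application/pdf:
if (key.contains("application/pdf"))
and the same way for XML... but is there any other approach besides this?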
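There is no `notcontains` method on `String`, but a check can be negated with `!`, so one test for "is HTML" also covers every other type without listing them. The following is a minimal sketch under some assumptions: the names `ContentTypeCheck`, `mediaType` and `isHtml` are hypothetical, a missing content type is treated as "not HTML", and `application/xhtml+xml` is included alongside the `text/xhtml` string used in the question.

```java
import java.util.Locale;

class ContentTypeCheck {

    // Returns the bare media type, e.g. "text/html" for "text/html; charset=ISO-8859-1".
    static String mediaType(String contentType) {
        if (contentType == null) {
            return ""; // assumption: no Content-Type means "unknown"
        }
        int semicolon = contentType.indexOf(';');
        String type = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return type.trim().toLowerCase(Locale.ROOT);
    }

    // True only for HTML-like media types; everything else is "not HTML".
    static boolean isHtml(String contentType) {
        String type = mediaType(contentType);
        return type.equals("text/html")
                || type.equals("text/xhtml")
                || type.equals("application/xhtml+xml");
    }

    public static void main(String[] args) {
        String[] samples = {"text/html; charset=ISO-8859-1", "application/pdf", "text/xml", null};
        for (String sample : samples) {
            // The negation replaces the missing "notcontains".
            System.out.println(sample + " -> " + (!isHtml(sample) ? "not HTML" : "HTML"));
        }
    }
}
```

Comparing the media type exactly, rather than using `contains` on the whole header line, also avoids matching a type that merely includes the text somewhere in its parameters.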
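This doesn't work.. it also picks up the links whose content type is text/html.. any other ideas? I have also updated the code so that it handles both text/html and non text/html.. – ferhan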
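Like this? If getContentType also returns the "[", then strip it by using getContentType.substring(1). – emboss
It works, but it is printing duplicate records. For a particular URL there are many headers in the response, so it checks each header, and if that header does not start with "text/html" it prints the URL. So if a particular URL that is not text/html has 8 headers in the response, it prints that URL 8 times.. hope you understand what I am saying.. – ferhan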
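The duplicate output described in the last comment comes from making the decision once per header. One possible way around it, sketched here under assumptions, is to ask the connection for its content type a single time per URL and not loop over the headers at all. The names `CrawlRouter`, `directoryForType` and `directoryForUrl` are hypothetical, the directory names are taken from the question's code, and a missing content type is treated as non-HTML.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Locale;

class CrawlRouter {

    // Picks the output directory from a single Content-Type value.
    // A null value (no Content-Type header) is routed to crawl_media; that is an assumption.
    static String directoryForType(String contentType) {
        boolean html = contentType != null
                && contentType.trim().toLowerCase(Locale.ROOT).startsWith("text/html");
        return html ? "crawl_html" : "crawl_media";
    }

    // One decision per URL: getContentType() is called once, with no loop over getHeaderFields().
    static String directoryForUrl(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        return directoryForType(connection.getContentType());
    }
}
```

With this shape the file-writing code would run once for each URL, however many headers the response carries.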