Getting the content, keywords, and page title with Apache Tika

Is there anything wrong with this code? If I add the line String ct = t.parseToString(content); below Tika t = new Tika();, I get the actual content of the URL, but after that I get null for the keywords, title, and authors. If I remove that line, I get the actual values for the title, authors, and keywords. Why is that?

HttpGet request = new HttpGet("http://xyz.com/d/index.html");

HttpResponse response = client.execute(request);
HttpEntity entity = response.getEntity();
InputStream content = entity.getContent();
System.out.println(content);

Tika t = new Tika();
String ct = t.parseToString(content);
System.out.println(ct);

Metadata md = new Metadata();

Reader r = t.parse(content, md);
System.out.println(md);

System.out.println("Keywords: " + md.get("keywords"));
System.out.println("Title: " + md.get("title"));
System.out.println("Authors: " + md.get("authors"));


2011-07-16 ferhan

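You are reading the same stream more than once. Once a stream has been read to the end, it cannot be read again. Try something like this: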

HttpResponse response = client.execute(request); 
HttpEntity entity = response.getEntity(); 

// streamToByteArray is not a JDK method; you have to write it yourself, for example as in
// http://stackoverflow.com/questions/1264709/convert-inputstream-to-byte-in-java
byte[] content = streamToByteArray(entity.getContent()); 

String ct = t.parseToString(new ByteArrayInputStream(content)); 
System.out.println(ct); 

Metadata md = new Metadata(); 
Reader r = t.parse(new ByteArrayInputStream(content), md); 
System.out.println(md);
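
The snippet above calls streamToByteArray without defining it. A minimal sketch of such a helper, using only java.io, might look like the following. The class name StreamBuffering and the sample HTML string are made up for illustration, and a ByteArrayInputStream stands in for entity.getContent() so the example runs without HttpClient or Tika on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamBuffering {

    // Hypothetical helper: drains the stream into memory and returns its bytes.
    // Only suitable for content small enough to fit in the heap.
    public static byte[] streamToByteArray(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for entity.getContent(); the HTML is illustrative only.
        InputStream source = new ByteArrayInputStream(
                "<html><head><title>Demo</title></head></html>"
                        .getBytes(StandardCharsets.UTF_8));

        byte[] content = streamToByteArray(source);

        // The original stream is exhausted, so a second read yields end-of-stream.
        System.out.println(source.read());

        // Each consumer gets its own fresh stream over the buffered bytes.
        String first = new String(
                streamToByteArray(new ByteArrayInputStream(content)),
                StandardCharsets.UTF_8);
        String second = new String(
                streamToByteArray(new ByteArrayInputStream(content)),
                StandardCharsets.UTF_8);
        System.out.println(first.equals(second));
    }
}
```

On Java 9 or later, InputStream.readAllBytes() should do the same job as the hand-written loop.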


2011-07-16 03:14:41 sbridges

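Your code confuses me. What is content? Where do we use content..? – ferhan

Updated; content is a byte[] – sbridges

Do I have to create the method 'streamToByteArray' myself, or is there something I have to include? I'm getting an error... – ferhan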
