2017-05-17 44 views
0

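Parsing HTML (a web page) with Java SE

I need to build a web scraper that fetches a web page by URL and then counts the number of words and the number of characters on that page.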

URL url = new URL(urlStr); 
URLConnection connection = url.openConnection(); 
InputStream inputStream = connection.getInputStream(); 
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream,"UTF-8")); 
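A minimal sketch of how the snippet above could be completed so that the whole page ends up in one String. The class name `PageFetcher` and the method names `readAll` and `fetch` are made up for illustration, and the UTF-8 assumption is carried over from the snippet; a real page may declare a different charset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original question.
class PageFetcher {

    // Drains the reader into a single String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            sb.append(buffer, 0, n);
        }
        return sb.toString();
    }

    // Opens the URL and returns the raw page source (tags included).
    // Assumption: the page is encoded as UTF-8, as in the snippet above.
    static String fetch(String urlStr) throws IOException {
        URL url = new URL(urlStr);
        URLConnection connection = url.openConnection();
        try (InputStream in = connection.getInputStream();
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return readAll(reader);
        }
    }
}
```

The try-with-resources block closes the stream even when reading fails, which the original four lines do not do.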

With this I can read all of the text on the page (including the HTML tags). What should I do next?

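Can anyone help me? Some documentation or something else to read would be welcome. I have to use Java SE only; third-party libraries are not allowed.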
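For the counting part, one possible sketch once the tags have been stripped and only plain text is left. `TextStats` and its method names are hypothetical. The definition of a "word" is an assumption here: a run of Unicode letters or digits, so punctuation such as ":" is not counted and "don't" counts as two words.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; works on plain text, not on raw HTML.
class TextStats {

    // Assumption: a word is a maximal run of Unicode letters or digits.
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{N}]+");

    // Counts the words found in the text.
    static int countWords(String text) {
        Matcher m = WORD.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Counts characters as Unicode code points, whitespace included,
    // so a surrogate pair is counted once.
    static int countChars(String text) {
        return text.codePointCount(0, text.length());
    }

    // Counts characters while ignoring whitespace.
    static int countCharsNoWhitespace(String text) {
        return (int) text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .count();
    }
}
```

Whether whitespace should count as a character depends on how the task defines it, which is why both variants are shown.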

+1

Why on earth? There are so many libraries, and *reinventing the wheel* is usually a bad choice. –

+0

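@Shashwat I understand, and I know about jsoup and the others. But this is a test assignment. They said "Hint: do not use third-party libraries". I agree with you. After 5 hours I still have not found a good answer for this task. –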

+0

I am trying to do it with HTMLEditorKit, but is that the right approach? –
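A hedged sketch of what the HTMLEditorKit route might look like using only Java SE: the Swing HTML parser (`ParserDelegator`) reports text through `HTMLEditorKit.ParserCallback.handleText`, so collecting those callbacks yields the page text without tags. `HtmlTextExtractor` and `extractText` are names invented for this example. The Swing parser targets old HTML (3.2), so results on modern pages may be imperfect.

```java
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper built on the Swing HTML parser shipped with Java SE.
class HtmlTextExtractor {

    // Returns the text content of the HTML, with tags removed.
    static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int skipDepth = 0; // > 0 while inside <style> or <script>

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.STYLE || t == HTML.Tag.SCRIPT) {
                    skipDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if ((t == HTML.Tag.STYLE || t == HTML.Tag.SCRIPT) && skipDepth > 0) {
                    skipDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Defensive: ignore anything reported inside style or script.
                if (skipDepth == 0) {
                    text.append(data).append(' ');
                }
            }
        };
        // true = ignore the charset declared in the document,
        // because the Reader has already decoded the bytes.
        new ParserDelegator().parse(html, callback, true);
        // Collapse the separators added above into single spaces.
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

The returned string can then be handed to whatever word and character counting is required. Text inside `<title>` may also be reported through `handleText`; it can be filtered the same way as `STYLE` if it should not count.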

Answers

0

For example, suppose you have the following page, saved as login.html:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"> 
<html> 
    <head> 
     <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> 
     <title>Login Page</title> 
    </head> 
    <body> 
     <div id="login" class="simple" > 
      <form action="login.do"> 
       Username : <input id="username" type="text" /> 
       Password : <input id="password" type="password" /> 
       <input id="submit" type="submit" /> 
       <input id="reset" type="reset" /> 
      </form> 
     </div> 
    </body> 
</html> 

To parse it with jsoup, you can do this:

import java.io.File; 
import java.io.IOException; 
import org.jsoup.Jsoup; 
import org.jsoup.nodes.Document; 
import org.jsoup.nodes.Element; 

/** 
* Java Program to parse/read HTML documents from File using Jsoup library. 
*/ 
public class HTMLParser{ 

    public static void main(String args[]) { 

     // Parse HTML String using JSoup library 
     String HTMLSTring = "<!DOCTYPE html>" 
       + "<html>" 
       + "<head>" 
       + "<title>JSoup Example</title>" 
       + "</head>" 
       + "<body>" 
       + "<table><tr><td><h1>HelloWorld</h1></tr>" 
       + "</table>" 
       + "</body>" 
       + "</html>"; 

     Document html = Jsoup.parse(HTMLSTring); 
     String title = html.title(); 
     String h1 = html.body().getElementsByTag("h1").text(); 

     System.out.println("Input HTML String to JSoup :" + HTMLSTring); 
     System.out.println("After parsing, Title : " + title); 
     System.out.println("Afte parsing, Heading : " + h1); 

     // JSoup Example 2 - Reading HTML page from URL 
     Document doc; 
     try { 
      doc = Jsoup.connect("http://google.com/").get(); 
      title = doc.title(); 
     } catch (IOException e) { 
      e.printStackTrace(); 
     } 

     System.out.println("Jsoup Can read HTML page from URL, title : " + title); 

     // JSoup Example 3 - Parsing an HTML file in Java
     // Wrong: Jsoup.parse("login.html", "ISO-8859-1") treats the first argument
     // as HTML text and the second as a base URI, not as a file name and charset.
     try {
      // Right: pass a File together with the charset name
      Document htmlFile = Jsoup.parse(new File("login.html"), "ISO-8859-1");
      title = htmlFile.title();
      Element div = htmlFile.getElementById("login");
      String cssClass = div.className(); // getting class from HTML element

      System.out.println("Jsoup can also parse HTML file directly");
      System.out.println("title : " + title);
      System.out.println("class of div tag : " + cssClass);
     } catch (IOException e) {
      // If the file cannot be read there is no document to inspect
      e.printStackTrace();
     }
    } 
} 

Output:

Input HTML String to JSoup :<!DOCTYPE html><html><head><title>JSoup Example</title></head><body><table><tr><td><h1>HelloWorld</h1></tr></table></body></html> 
After parsing, Title : JSoup Example 
Afte parsing, Heading : HelloWorld 
Jsoup Can read HTML page from URL, title : Google 
Jsoup can also parse HTML file directly 
title : Login Page 
class of div tag : simple 
+0

The OP specifically said they *cannot use third-party libraries*. –

+0

OK, got it. I only read the question through once. –