2014-06-16 78 views
0

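How do I get the website name from any URL string? I am given a string that contains a valid URL, and I have to extract the unique website name from it, ignoring subdomains.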

http://www.yahoo.com => yahoo
www.google.co.in => google
http://in.com => in
http://india.gov.in/ => india
https://in.yahoo.com/ => yahoo
http://philotheoristic.tumblr.com/ => tumblr
https://in.movies.yahoo.com/ => yahoo

How can I do this?

+1

Don't you know anything about string parsing or regular expressions?

Answers

2

A regular expression can help you:

String str = "www.google.co.in"; 
String [] res = str.split("(\\.|//)+(?=\\w)"); 
System.out.println(res[1]); 

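A regular expression is a way of describing a set of strings: the set consists of every string that matches the expression. In the code above, the string passed to split is a regular expression that matches one or more occurrences of "." or "//" that are followed by a word character (a letter, digit, or underscore). These "." and "//" substrings are therefore the delimiters used to split the string, and the website name is one of the resulting pieces.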

For "www.google.co.in", the string is split into: www, google, co, in. Since the solution prints the element at index 1 of the split array, the result is: google
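To see what the split actually produces, here is a small sketch that runs the same expression over a few of the inputs from the question. The class name SplitDemo and the helper parts are made up for illustration. It suggests that index 1 holds the name only for inputs without a scheme that start with "www."; with a scheme in front, the name sits at a different index.

```java
import java.util.Arrays;

public class SplitDemo {
    // Same expression as in the answer: one or more "." or "//"
    // that are followed by a word character.
    static String[] parts(String url) {
        return url.split("(\\.|//)+(?=\\w)");
    }

    public static void main(String[] args) {
        // prints [www, google, co, in]
        System.out.println(Arrays.toString(parts("www.google.co.in")));
        // prints [http:, www, yahoo, com]
        System.out.println(Arrays.toString(parts("http://www.yahoo.com")));
        // prints [http:, in, com]
        System.out.println(Arrays.toString(parts("http://in.com")));
    }
}
```

So res[1] yields "google" for the first input, but "www" for the second, which is why the index has to be chosen per input shape if this approach is used.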

+0

I wish I understood regular expressions as well as you do. Could you explain how your regular expression works, so that I can learn something?

+1

@KickButtowski I edited my answer to include an explanation.

+0

Thanks. Do you know of any regular expression tutorial that is easy to understand for a non-native speaker?

2

You can use the URL class.

From the documentation: http://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html

import java.net.*; 
import java.io.*; 

public class ParseURL { 
    public static void main(String[] args) throws MalformedURLException { 

     URL aURL = new URL("http://example.com:80/docs/books/tutorial" 
          + "/index.html?name=networking#DOWNLOADING"); 

     System.out.println("protocol = " + aURL.getProtocol()); 
     System.out.println("authority = " + aURL.getAuthority()); 
     System.out.println("host = " + aURL.getHost()); 
     System.out.println("port = " + aURL.getPort()); 
     System.out.println("path = " + aURL.getPath()); 
     System.out.println("query = " + aURL.getQuery()); 
     System.out.println("filename = " + aURL.getFile()); 
     System.out.println("ref = " + aURL.getRef()); 
    } 
} 

Here is the output displayed by the program:

protocol = http 
authority = example.com:80 
host = example.com      // name of website 
port = 80 
path = /docs/books/tutorial/index.html 
query = name=networking 
filename = /docs/books/tutorial/index.html?name=networking 
ref = DOWNLOADING 

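So you can get the website's host name with aURL.getHost(). To ignore subdomains, you can split the host on ".". Because String.split takes a regular expression, the dot must be escaped: aURL.getHost().split("\\.")[0]. Note that index 0 is the leftmost label, so it is the name only when the host has no subdomain (for example, "example" for example.com); for a host such as www.yahoo.com it returns "www".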
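As a rough sketch of how the host could be reduced to the name in the question's examples, the code below parses the host with java.net.URI and then picks the label just left of the suffix. The class name SiteName, the method siteName, and the tiny TWO_LABEL_SUFFIXES set are all assumptions made for illustration; a real solution would need the full Public Suffix List rather than three hard-coded entries.

```java
import java.net.URI;
import java.util.Set;

public class SiteName {
    // Hypothetical, deliberately tiny list of two-label public suffixes.
    // A real solution would consult the full Public Suffix List instead.
    private static final Set<String> TWO_LABEL_SUFFIXES = Set.of("co.in", "gov.in", "co.uk");

    static String siteName(String url) {
        // URI only recognises a host when a scheme is present,
        // so add one for inputs like "www.google.co.in".
        String withScheme = url.contains("://") ? url : "http://" + url;
        String host = URI.create(withScheme).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in: " + url);
        }
        String[] labels = host.split("\\.");   // escape the dot: split takes a regex
        int n = labels.length;
        // The suffix is two labels long if it is in the list, otherwise assume one label.
        boolean twoLabelSuffix = n >= 3
                && TWO_LABEL_SUFFIXES.contains(labels[n - 2] + "." + labels[n - 1]);
        int suffixLabels = twoLabelSuffix ? 2 : 1;
        // The name is the label immediately left of the suffix
        // (or the only label, e.g. "localhost").
        return labels[Math.max(0, n - 1 - suffixLabels)];
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.yahoo.com", "www.google.co.in", "http://in.com",
            "http://india.gov.in/", "https://in.yahoo.com/",
            "http://philotheoristic.tumblr.com/", "https://in.movies.yahoo.com/"
        };
        for (String s : samples) {
            System.out.println(s + " => " + siteName(s));
        }
    }
}
```

Under these assumptions the seven sample inputs map to yahoo, google, in, india, yahoo, tumblr, and yahoo, matching the expected results in the question. Any suffix missing from the set (for example "com.au") would be handled incorrectly.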

+0

Nice answer, but how would you end up with just "example"?

0

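There is no general way to work out a valid website name from a URL alone. However, if you are trying to cut out a specific part of the URL string, you can do it with string manipulation, as follows: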

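// Assumes url and website are String variables declared elsewhere.
String suffix = ".co.in";
if (url.endsWith(suffix)) {
    int end = url.length() - suffix.length();   // index where ".co.in" starts
    // the name starts just after the last "." or "/" before the suffix
    int start = Math.max(url.lastIndexOf('.', end - 1), url.lastIndexOf('/', end - 1)) + 1;
    website = url.substring(start, end);
}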
0

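I found something similar that does this, although it works a little differently: it returns the page title rather than a name derived from the URL.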

http://www.yahoo.com => Yahoo 
http://www.google.co.in =>  Google 
http://in.com => In.com Offers Videos, News, Photos, Celebs, Live TV Channels..... 
http://india.gov.in/ => National Portal of India 
https://in.yahoo.com/ => Yahoo India 
http://philotheoristic.tumblr.com/ => Philotheoristic 
https://in.movies.yahoo.com/ => Yahoo India Movies - Bollywood News, Movie Reviews & Hindi Movie Videos 

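Here is the code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor { 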
/* the CASE_INSENSITIVE flag accounts for 
* sites that use uppercase title tags. 
* the DOTALL flag accounts for sites that have 
* line feeds in the title text */ 
private static final Pattern TITLE_TAG = 
    // reluctant (.*?) stops at the first closing title tag instead of the last one
    Pattern.compile("\\<title>(.*?)\\</title>", Pattern.CASE_INSENSITIVE|Pattern.DOTALL); 

/** 
* @param url the HTML page 
* @return title text (null if document isn't HTML or lacks a title tag) 
* @throws IOException 
*/ 
public static String getPageTitle(String url) throws IOException { 
    URL u = new URL(url); 
    URLConnection conn = u.openConnection(); 

    // ContentType is an inner class defined below 
    ContentType contentType = getContentTypeHeader(conn); 
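    // getContentTypeHeader returns null when the response has no Content-Type header
    if (contentType == null || !contentType.contentType.equals("text/html")) 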
     return null; // don't continue if not HTML 
    else { 
     // determine the charset, or use the default 
     Charset charset = getCharset(contentType); 
     if (charset == null) 
      charset = Charset.defaultCharset(); 

     // read the response body, using BufferedReader for performance 
     InputStream in = conn.getInputStream(); 
     BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset)); 
     int n = 0, totalRead = 0; 
     char[] buf = new char[1024]; 
     StringBuilder content = new StringBuilder(); 

     // read until EOF or first 8192 characters 
     while (totalRead < 8192 && (n = reader.read(buf, 0, buf.length)) != -1) { 
      content.append(buf, 0, n); 
      totalRead += n; 
     } 
     reader.close(); 

     // extract the title 
     Matcher matcher = TITLE_TAG.matcher(content); 
     if (matcher.find()) { 
      /* replace any occurrences of whitespace (which may 
      * include line feeds and other uglies) as well 
      * as HTML brackets with a space */ 
      return matcher.group(1).replaceAll("[\\s\\<>]+", " ").trim(); 
     } 
     else 
      return null; 
    } 
} 

/** 
* Loops through response headers until Content-Type is found. 
* @param conn 
* @return ContentType object representing the value of 
* the Content-Type header 
*/ 
private static ContentType getContentTypeHeader(URLConnection conn) { 
    int i = 0; 
    boolean moreHeaders = true; 
    do { 
     String headerName = conn.getHeaderFieldKey(i); 
     String headerValue = conn.getHeaderField(i); 
     // HTTP header names are case-insensitive
     if (headerName != null && headerName.equalsIgnoreCase("Content-Type")) 
      return new ContentType(headerValue); 

     i++; 
     moreHeaders = headerName != null || headerValue != null; 
    } 
    while (moreHeaders); 

    return null; 
} 

private static Charset getCharset(ContentType contentType) { 
    if (contentType != null && contentType.charsetName != null && Charset.isSupported(contentType.charsetName)) 
     return Charset.forName(contentType.charsetName); 
    else 
     return null; 
} 

/** 
* Class holds the content type and charset (if present) 
*/ 
private static final class ContentType { 
    private static final Pattern CHARSET_HEADER = Pattern.compile("charset=([-_a-zA-Z0-9]+)", Pattern.CASE_INSENSITIVE|Pattern.DOTALL); 

    private String contentType; 
    private String charsetName; 
    private ContentType(String headerValue) { 
     if (headerValue == null) 
      throw new IllegalArgumentException("ContentType must be constructed with a not-null headerValue"); 
     int n = headerValue.indexOf(";"); 
     if (n != -1) { 
      contentType = headerValue.substring(0, n); 
      Matcher matcher = CHARSET_HEADER.matcher(headerValue); 
      if (matcher.find()) 
       charsetName = matcher.group(1); 
     } 
     else 
      contentType = headerValue; 
    } 
} 
} 

Using this class is simple:

String title = TitleExtractor.getPageTitle("http://en.wikipedia.org/"); 
System.out.println(title); 

Here is the link:

http://www.gotoquiz.com/web-coding/programming/java-programming/how-to-extract-titles-from-web-pages-in-java/

I hope it helps you.