2016-09-29 79 views
3

Extracting the video ID from a YouTube URL in .NET

I am struggling to extract the video ID from a YouTube URL using a regular expression.

"(?:.+?)?(?:\\/v\\/|watch\\/|\\?v=|\\&v=|youtu\\.be\\/|\\/v=|^youtu\\.be\\/)([a-zA-Z0-9_-]{11})+";

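It works, in that it matches the video ID, but I want to restrict it to the YouTube domains: it should not match an ID if the domain is anything other than youtube.com or youtu.be. Unfortunately, I don't understand this regex well enough to apply that restriction.

I want to match the ID only when the domain is one of: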

  • www.youtube.com
  • youtube.com
  • youtu.be
  • www.youtu.be

with or without http or https in front.

The regex above successfully matches the YouTube ID in the following examples:

"http://youtu.be/AAAAAAAAA01" 
"http://www.youtube.com/embed/watch?feature=player_embedded&v=AAAAAAAAA02" 
"http://www.youtube.com/embed/watch?v=AAAAAAAAA03" 
"http://www.youtube.com/embed/v=AAAAAAAAA04" 
"http://www.youtube.com/watch?feature=player_embedded&v=AAAAAAAAA05" 
"http://www.youtube.com/watch?v=AAAAAAAAA06" 
"http://www.youtube.com/v/AAAAAAAAA07" 
"www.youtu.be/AAAAAAAAA08" 
"youtu.be/AAAAAAAAA09" 
"http://www.youtube.com/watch?v=i-AAAAAAA14&feature=related" 
"http://www.youtube.com/attribution_link?u=/watch?v=AAAAAAAAA15&feature=share&a=9QlmP1yvjcllp0h3l0NwuA" 
"http://www.youtube.com/attribution_link?a=fF1CWYwxCQ4&u=/watch?v=AAAAAAAAA16&feature=em-uploademail" 
"http://www.youtube.com/attribution_link?a=fF1CWYwxCQ4&feature=em-uploademail&u=/watch?v=AAAAAAAAA17" 
"http://www.youtube.com/v/A-AAAAAAA18?fs=1&rel=0" 
"http://www.youtube.com/watch/AAAAAAAAA11" 

This is the current code that checks the URL:

private const string YoutubeLinkRegex = "(?:.+?)?(?:\\/v\\/|watch\\/|\\?v=|\\&v=|youtu\\.be\\/|\\/v=|^youtu\\.be\\/)([a-zA-Z0-9_-]{11})+"; 
private static Regex regexExtractId = new Regex(YoutubeLinkRegex, RegexOptions.Compiled); 

public string ExtractVideoIdFromUrl(string url) 
{ 
    //extract the id 
    var regRes = regexExtractId.Match(url); 
    if (regRes.Success) 
    { 
        return regRes.Groups[1].Value; 
    } 
    return null; 
} 
+0

Check this regex (http://stackoverflow.com/a/27796139/6290553) –

Answers

2

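The problem is that the regex cannot require a string in front of the extraction pattern and, at the same time, use that same string as part of the extraction pattern itself.

For example, take "http://www.youtu.be/v/AAAAAAAAA07". Here youtu.be is mandatory at the start of the URL, but the extraction pattern is "/v/(11 chars)".

For "http://www.youtu.be/AAAAAAAAA07", the extraction pattern is "youtu.be/(11 chars)".

Both cannot be handled by the same regex, which is why we can't check the domain and extract the ID in one expression.

I decided to check the domain authority against a list of valid domains first, and then extract the ID from the URL.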

private const string YoutubeLinkRegex = "(?:.+?)?(?:\\/v\\/|watch\\/|\\?v=|\\&v=|youtu\\.be\\/|\\/v=|^youtu\\.be\\/)([a-zA-Z0-9_-]{11})+"; 
private static Regex regexExtractId = new Regex(YoutubeLinkRegex, RegexOptions.Compiled); 
private static string[] validAuthorities = { "youtube.com", "www.youtube.com", "youtu.be", "www.youtu.be" }; 

public string ExtractVideoIdFromUri(Uri uri) 
{ 
    try 
    { 
     string authority = new UriBuilder(uri).Uri.Authority.ToLower(); 

     //check if the url is a youtube url 
     if (validAuthorities.Contains(authority)) 
     { 
      //and extract the id 
      var regRes = regexExtractId.Match(uri.ToString()); 
      if (regRes.Success) 
      { 
       return regRes.Groups[1].Value; 
      } 
     } 
    }catch{} 


    return null; 
} 

UriBuilder is preferred because it understands a wider range of URLs than the Uri class. It can create a Uri from a URL that has no scheme (such as "youtube.com").
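
To illustrate the point about UriBuilder, here is a minimal sketch. The class and method names (UriBuilderSketch, HostOf) are made up for this example and are not part of the code above; it relies only on the documented behaviour that the UriBuilder(string) constructor falls back to the http scheme when the input has none.

```csharp
using System;

public static class UriBuilderSketch
{
    // Returns the host of a URL that may or may not carry a scheme,
    // or null when the input cannot be parsed as a URL at all.
    public static string HostOf(string url)
    {
        try
        {
            // UriBuilder(string) defaults the scheme to http if none is given.
            return new UriBuilder(url).Uri.Host.ToLowerInvariant();
        }
        catch (UriFormatException)
        {
            return null;
        }
    }

    public static void Main()
    {
        // No scheme in the first input, explicit scheme in the second.
        Console.WriteLine(HostOf("youtu.be/AAAAAAAAA09"));
        Console.WriteLine(HostOf("http://www.youtube.com/v/AAAAAAAAA07"));
    }
}
```

The returned host can then be compared against a list such as validAuthorities before any ID extraction is attempted.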

ExtractVideoIdFromUri returns null (correctly) for the following test URLs:

"ww.youtube.com/v/AAAAAAAAA13" 
"http:/www.youtube.com/v/AAAAAAAAA13" 
"http://www.youtub1e.com/v/AAAAAAAAA13" 
"http://www.vimeo.com/v/AAAAAAAAA13" 
"www.youtube.com/b/AAAAAAAAA13" 
"www.youtube.com/v/AAAAAAAAA1" 
"www.youtube.com/v/AAAAAAAAA1&" 
"www.youtube.com/v/AAAAAAAAA1/" 
".youtube.com/v/AAAAAAAAA13" 
1

septihhere


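I had a play with the examples and came up with this:

YouTube: youtu(?:\.be|be\.com)/(?:.*v(?:/|=)|(?:.*/)?)([a-zA-Z0-9-_]+)

It should match all of the examples given. (?:...) means that whatever is inside the parentheses is not captured, so only the ID is captured.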

8

You don't need a regex here.

var url = @"https://www.youtube.com/watch?v=6QlW4m9xVZY"; 
var uri = new Uri(url); 

// you can check host here => uri.Host <= "www.youtube.com" 

var query = HttpUtility.ParseQueryString(uri.Query); 
var videoId = query["v"]; 

// videoId = 6QlW4m9xVZY 

OK, the example above works when the video ID is passed as the v query parameter. If the video ID is a path segment instead, you can use this:

var url = "http://youtu.be/AAAAAAAAA09"; 
var uri = new Uri(url); 

var videoid = uri.Segments.Last(); // AAAAAAAAA09 

Putting it all together, we get:

var url = @"https://www.youtube.com/watch?v=Lvcyj1GfpGY&list=PLolZLFndMkSIYef2O64OLgT-njaPYDXqy"; 
var uri = new Uri(url); 

// you can check host here => uri.Host <= "www.youtube.com" 

var query = HttpUtility.ParseQueryString(uri.Query); 

var videoId = string.Empty; 

if (query.AllKeys.Contains("v")) 
{ 
    videoId = query["v"]; 
} 
else 
{ 
    videoId = uri.Segments.Last(); 
} 
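
The "you can check host here" comment could be filled in roughly as follows. This is only a sketch: the class name, the method name and the allow-list are assumptions made for illustration (the hosts are the four domains listed in the question), and it accepts absolute http/https URLs only.

```csharp
using System;
using System.Collections.Generic;

public static class YouTubeHostCheck
{
    // Hypothetical allow-list: the four domains named in the question.
    private static readonly HashSet<string> AllowedHosts =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "youtube.com", "www.youtube.com", "youtu.be", "www.youtu.be"
        };

    // True only for absolute http/https URLs whose host is on the allow-list.
    public static bool IsYouTubeUrl(string url)
    {
        Uri uri;
        // Uri.TryCreate returns false instead of throwing on unparsable input.
        return Uri.TryCreate(url, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps)
            && AllowedHosts.Contains(uri.Host);
    }

    public static void Main()
    {
        Console.WriteLine(IsYouTubeUrl("https://www.youtube.com/watch?v=6QlW4m9xVZY"));
        Console.WriteLine(IsYouTubeUrl("http://www.vimeo.com/v/AAAAAAAAA13"));
    }
}
```

Calling IsYouTubeUrl before reading query["v"] or uri.Segments keeps IDs from other sites out. URLs without a scheme are rejected by this sketch.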

Of course, I don't know exactly what you need, but I hope it helps.

+1

Personally, I prefer not to use regex when other, more readable options exist. I like this answer better than my own :) – confusedandamused

+0

Oh! I like this answer! Note that, if you haven't already done so, you need to add a reference to System.Web for 'HttpUtility'. –

+0

Unfortunately, it doesn't work for: youtu.be/AAAAAAAAA09, www.youtube.com/watch/aaaaaaaaa, www.youtube.com/v/aaaaaaaa –

0

tym32167's answer throws an exception at var uri = new Uri(url); when the URL has no scheme, like "www.youtu.be/AAAAAAAAA08".

Also, the wrong videoId is returned for some URLs.

So here is my code, based on tym32167's.
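
As a small illustration of the second problem, the sketch below shows what uri.Segments.Last() yields for three of the question's test URLs. The class and method names are made up for this example; Uri.Segments is built from the path only, so anything in the query string is ignored.

```csharp
using System;
using System.Linq;

public static class LastSegmentDemo
{
    // Returns the last path segment of an absolute URL, which is what
    // the fallback branch of the previous answer treats as the video ID.
    public static string LastSegment(string url)
    {
        return new Uri(url).Segments.Last();
    }

    public static void Main()
    {
        string[] urls =
        {
            "http://youtu.be/AAAAAAAAA09",
            "http://www.youtube.com/embed/v=AAAAAAAAA04",
            "http://www.youtube.com/attribution_link?u=/watch?v=AAAAAAAAA15&feature=share"
        };

        foreach (string url in urls)
        {
            Console.WriteLine("{0}\n-> {1}", url, LastSegment(url));
        }
    }
}
```

Only the first URL gives a usable ID. The second gives "v=AAAAAAAAA04" and the third gives "attribution_link", so the last segment has to be validated before it is returned.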

static private string GetYouTubeVideoIdFromUrl(string url) 
    { 
     Uri uri = null; 
     if (!Uri.TryCreate(url, UriKind.Absolute, out uri)) 
     { 
      try 
      { 
       uri = new UriBuilder("http", url).Uri; 
      } 
      catch 
      { 
       // invalid url 
       return ""; 
      } 
     } 

     string host = uri.Host; 
     string[] youTubeHosts = { "www.youtube.com", "youtube.com", "youtu.be", "www.youtu.be" }; 
     if (!youTubeHosts.Contains(host)) 
      return ""; 

     var query = HttpUtility.ParseQueryString(uri.Query); 

     if (query.AllKeys.Contains("v")) 
     { 
      return Regex.Match(query["v"], @"^[a-zA-Z0-9_-]{11}$").Value; 
     } 
     else if (query.AllKeys.Contains("u")) 
     { 
      // some urls have something like "u=/watch?v=AAAAAAAAA16" 
      return Regex.Match(query["u"], @"/watch\?v=([a-zA-Z0-9_-]{11})").Groups[1].Value; 
     } 
     else 
     { 
      // remove a trailing forward slash 
      var last = uri.Segments.Last().Replace("/", ""); 
      if (Regex.IsMatch(last, @"^v=[a-zA-Z0-9_-]{11}$")) 
       return last.Replace("v=", ""); 

      string[] segments = uri.Segments; 
      if (segments.Length > 2 && segments[segments.Length - 2] != "v/" && segments[segments.Length - 2] != "watch/") 
       return ""; 

      return Regex.Match(last, @"^[a-zA-Z0-9_-]{11}$").Value; 
     } 
    } 

Let's test it.

 string[] urls = {"http://youtu.be/AAAAAAAAA01", 
      "http://www.youtube.com/embed/watch?feature=player_embedded&v=AAAAAAAAA02", 
      "http://www.youtube.com/embed/watch?v=AAAAAAAAA03", 
      "http://www.youtube.com/embed/v=AAAAAAAAA04", 
      "http://www.youtube.com/watch?feature=player_embedded&v=AAAAAAAAA05", 
      "http://www.youtube.com/watch?v=AAAAAAAAA06", 
      "http://www.youtube.com/v/AAAAAAAAA07", 
      "www.youtu.be/AAAAAAAAA08", 
      "youtu.be/AAAAAAAAA09", 
      "http://www.youtube.com/watch?v=i-AAAAAAA14&feature=related", 
      "http://www.youtube.com/attribution_link?u=/watch?v=AAAAAAAAA15&feature=share&a=9QlmP1yvjcllp0h3l0NwuA", 
      "http://www.youtube.com/attribution_link?a=fF1CWYwxCQ4&u=/watch?v=AAAAAAAAA16&feature=em-uploademail", 
      "http://www.youtube.com/attribution_link?a=fF1CWYwxCQ4&feature=em-uploademail&u=/watch?v=AAAAAAAAA17", 
      "http://www.youtube.com/v/A-AAAAAAA18?fs=1&rel=0", 
      "http://www.youtube.com/watch/AAAAAAAAA11",}; 

     Console.WriteLine("***Youtube urls***"); 
     foreach (string url in urls) 
     { 
      Console.WriteLine("{0}\n-> {1}", url, GetYouTubeVideoIdFromUrl(url)); 
     } 

     string[] invalidUrls = { 
      "ww.youtube.com/v/AAAAAAAAA13", 
      "http:/www.youtube.com/v/AAAAAAAAA13", 
      "http://www.youtub1e.com/v/AAAAAAAAA13", 
      "http://www.vimeo.com/v/AAAAAAAAA13", 
      "www.youtube.com/b/AAAAAAAAA13", 
      "www.youtube.com/v/AAAAAAAAA1", 
      "www.youtube.com/v/AAAAAAAAA1&", 
      "www.youtube.com/v/AAAAAAAAA1/", 
      ".youtube.com/v/AAAAAAAAA13"}; 

     Console.WriteLine("***Invalid youtube urls***"); 
     foreach (string url in invalidUrls) 
     { 
      Console.WriteLine("{0}\n-> {1}", url, GetYouTubeVideoIdFromUrl(url)); 
     } 

Result (everything is as expected):

***Youtube urls*** 
http://youtu.be/AAAAAAAAA01 
-> AAAAAAAAA01 
http://www.youtube.com/embed/watch?feature=player_embedded&v=AAAAAAAAA02 
-> AAAAAAAAA02 
http://www.youtube.com/embed/watch?v=AAAAAAAAA03 
-> AAAAAAAAA03 
http://www.youtube.com/embed/v=AAAAAAAAA04 
-> AAAAAAAAA04 
http://www.youtube.com/watch?feature=player_embedded&v=AAAAAAAAA05 
-> AAAAAAAAA05 
http://www.youtube.com/watch?v=AAAAAAAAA06 
-> AAAAAAAAA06 
http://www.youtube.com/v/AAAAAAAAA07 
-> AAAAAAAAA07 
www.youtu.be/AAAAAAAAA08 
-> AAAAAAAAA08 
youtu.be/AAAAAAAAA09 
-> AAAAAAAAA09 
http://www.youtube.com/watch?v=i-AAAAAAA14&feature=related 
-> i-AAAAAAA14 
http://www.youtube.com/attribution_link?u=/watch?v=AAAAAAAAA15&feature=share&a=9QlmP1yvjcllp0h3l0NwuA 
-> AAAAAAAAA15 
http://www.youtube.com/attribution_link?a=fF1CWYwxCQ4&u=/watch?v=AAAAAAAAA16&feature=em-uploademail 
-> AAAAAAAAA16 
http://www.youtube.com/attribution_link?a=fF1CWYwxCQ4&feature=em-uploademail&u=/watch?v=AAAAAAAAA17 
-> AAAAAAAAA17 
http://www.youtube.com/v/A-AAAAAAA18?fs=1&rel=0 
-> A-AAAAAAA18 
http://www.youtube.com/watch/AAAAAAAAA11 
-> AAAAAAAAA11 



***Invalid youtube urls*** 
ww.youtube.com/v/AAAAAAAAA13 
-> 
http:/www.youtube.com/v/AAAAAAAAA13 
-> 
http://www.youtub1e.com/v/AAAAAAAAA13 
-> 
http://www.vimeo.com/v/AAAAAAAAA13 
-> 
www.youtube.com/b/AAAAAAAAA13 
-> 
www.youtube.com/v/AAAAAAAAA1 
-> 
www.youtube.com/v/AAAAAAAAA1& 
-> 
www.youtube.com/v/AAAAAAAAA1/ 
-> 
.youtube.com/v/AAAAAAAAA13 
-> 
0

This should do it:

public static string GetYouTubeId(string url) { 
    // The ID is captured only when it follows a youtube.com or youtu.be domain. 
    var regex = @"(?:youtube\.com\/(?:[^\/]+\/.+\/|(?:v|e(?:mbed)?|watch)\/|.*[?&]v=)|youtu\.be\/)([^""&?\/ ]{11})"; 

    var match = Regex.Match(url, regex); 

    if (match.Success) 
    { 
        return match.Groups[1].Value; 
    } 

    // No match: the input is returned unchanged. 
    return url; 
} 
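
A quick way to try the pattern is sketched below. The wrapper class and the TryGetId name are assumptions for illustration, and unlike GetYouTubeId this variant returns null when nothing matches, so that a rejected URL is easy to tell apart from an ID. The pattern is written without HTML entities, i.e. with a plain & in both character classes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DomainBoundRegexDemo
{
    // Same pattern as in the answer above, with plain '&' characters.
    private static readonly Regex IdRegex = new Regex(
        @"(?:youtube\.com\/(?:[^\/]+\/.+\/|(?:v|e(?:mbed)?|watch)\/|.*[?&]v=)|youtu\.be\/)([^""&?\/ ]{11})");

    // Returns the captured 11-character ID, or null when the URL does not match.
    public static string TryGetId(string url)
    {
        Match match = IdRegex.Match(url);
        return match.Success ? match.Groups[1].Value : null;
    }

    public static void Main()
    {
        string[] urls =
        {
            "http://www.youtube.com/watch?v=AAAAAAAAA06",
            "http://youtu.be/AAAAAAAAA01",
            "http://www.youtube.com/v/A-AAAAAAA18?fs=1&rel=0",
            "http://www.vimeo.com/v/AAAAAAAAA13"
        };

        foreach (string url in urls)
        {
            Console.WriteLine("{0}\n-> {1}", url, TryGetId(url) ?? "(no match)");
        }
    }
}
```

Note that the pattern is not anchored to the start of the input, so it only requires that youtube.com/ or youtu.be/ appears somewhere before the ID.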