2011-08-06 107 views

Answers

3

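This is more involved than just pasting some code here, but I can point you in the right direction.

  1. First, fetch the web page.
  2. Parse the string you get back and look for the RSS Autodiscovery <link> tag. You could load the whole document as XML and use DOM traversal, but I would just use a regular expression.
  3. Extract the href attribute of that tag, and you have the URL of the RSS feed.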
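
For the DOM-traversal route mentioned in step 2, a minimal sketch might look like the following. It assumes PHP's DOM extension is available, and `findFeedLinks` is a hypothetical name chosen for illustration, not something from the answers here. It takes an HTML string that has already been fetched and returns the href of every feed autodiscovery tag, still unresolved (relative hrefs stay relative).

```php
<?php
// Hypothetical helper (illustration only): collect the href of every
// RSS/Atom autodiscovery <link> found in an HTML string.
function findFeedLinks(string $html): array
{
    $feeds = [];
    if (trim($html) === '') {
        return $feeds; // DOMDocument::loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid: buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $feedTypes = ['application/rss+xml', 'application/atom+xml'];
    foreach ($doc->getElementsByTagName('link') as $link) {
        // rel may carry several space-separated tokens, e.g. "alternate home"
        $relTokens = preg_split('/\s+/', strtolower($link->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
        $type = strtolower(trim($link->getAttribute('type')));
        $href = $link->getAttribute('href');
        if (in_array('alternate', $relTokens, true) && in_array($type, $feedTypes, true) && $href !== '') {
            $feeds[] = $href;
        }
    }
    return $feeds;
}
```

Because the parser reads attributes itself, this does not care about attribute order or quoting style, which is where regex-based approaches tend to trip.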
+0

Hi, are you referring to scraping the HTML source to determine the RSS feed URL? – Jeyaganesh

13

The general process has already been answered (Quentin, DOOManiac), so here is some code (Demo):

<?php 

$location = 'http://hakre.wordpress.com/'; 
$html = file_get_contents($location); 
echo getRSSLocation($html, $location); # http://hakre.wordpress.com/feed/ 

/** 
* @link http://keithdevens.com/weblog/archive/2002/Jun/03/RSSAuto-DiscoveryPHP 
*/ 
function getRSSLocation($html, $location){ 
    if(!$html or !$location){ 
     return false; 
    }else{ 
     #search through the HTML, save all <link> tags 
     # and store each link's attributes in an associative array 
     preg_match_all('/<link\s+(.*?)\s*\/?>/si', $html, $matches); 
     $links = $matches[1]; 
     $final_links = array(); 
     $link_count = count($links); 
     for($n=0; $n<$link_count; $n++){ 
      $final_link = array(); #reset so attributes don't leak over from the previous <link> 
      $attributes = preg_split('/\s+/s', $links[$n]); 
      foreach($attributes as $attribute){ 
       $att = preg_split('/\s*=\s*/s', $attribute, 2); 
       if(isset($att[1])){ 
        $att[1] = preg_replace('/([\'"]?)(.*)\1/', '$2', $att[1]); 
        $final_link[strtolower($att[0])] = $att[1]; 
       } 
      } 
      $final_links[$n] = $final_link; 
     } 
     #now figure out which one points to the RSS file 
     for($n=0; $n<$link_count; $n++){ 
      $href = false; #reset for every <link> so the checks below never read an undefined variable 
      $link = $final_links[$n]; 
      $rel = isset($link['rel']) ? strtolower($link['rel']) : ''; 
      $type = isset($link['type']) ? strtolower($link['type']) : ''; 
      if($rel == 'alternate' and isset($link['href'])){ 
       if($type == 'application/rss+xml'){ 
        $href = $link['href']; 
       } 
       if(!$href and $type == 'text/xml'){ 
        #kludge to make the first version of this still work 
        $href = $link['href']; 
       } 
       if($href){ 
        if(preg_match('#^https?://#i', $href)){ #if it's absolute 
         $full_url = $href; 
        }else{ #otherwise, 'absolutize' it 
         $url_parts = parse_url($location); 
         #reuse the page's own scheme so https:// pages work too 
         $scheme = isset($url_parts['scheme']) ? $url_parts['scheme'] : 'http'; 
         $full_url = "$scheme://$url_parts[host]"; 
         if(isset($url_parts['port'])){ 
          $full_url .= ":$url_parts[port]"; 
         } 
         if($href[0] != '/'){ #it's a relative link on the domain 
          #$href{0} was removed in PHP 8; the path may be missing for bare hosts 
          $path = isset($url_parts['path']) ? $url_parts['path'] : '/'; 
          $full_url .= dirname($path); 
          if(substr($full_url, -1) != '/'){ 
           #if the last character isn't a '/', add it 
           $full_url .= '/'; 
          } 
         } 
         $full_url .= $href; 
        } 
        return $full_url; 
       } 
      } 
     } 
     return false; 
    } 
} 

See also: RSS auto-discovery with PHP (archived copy)
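
Turning the href into a full URL is the fiddliest step in the function above, so here is that step on its own as a rough sketch. `absolutizeUrl` is a hypothetical helper name, and the logic is deliberately simplified: it does not normalise "../" segments or handle query-only references, so treat it as an illustration and not a complete URL resolver.

```php
<?php
// Hypothetical helper (illustration only): resolve $href against the page URL $base.
// Simplified on purpose: no "../" normalisation, no query-only references.
function absolutizeUrl(string $href, string $base): string
{
    // Already absolute (http or https): nothing to do.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    // Protocol-relative reference such as //example.com/feed
    if (strpos($href, '//') === 0) {
        return $scheme . ':' . $href;
    }
    // Root-relative reference such as /feed/
    if (strpos($href, '/') === 0) {
        return $origin . $href;
    }
    // Path-relative: keep everything up to the last "/" of the page path.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}
```

For example, `absolutizeUrl('feed.xml', 'https://example.com/blog/post.html')` resolves relative to the `/blog/` directory, while `absolutizeUrl('/feed/', ...)` resolves from the site root.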

+0

Excellent! It worked very well for me. – fortytwo

1

A slightly smaller function that grabs the first available feed, regardless of whether it is RSS or Atom (most blogs offer both; this takes the first one).

function getFeedUrl($url){ 
     #fetch the page once; @ suppresses the warning when the request fails 
     $html = @file_get_contents($url); 
     if($html){ 
      #non-greedy so only the first matching <link> is captured 
      if(preg_match('/<link\srel\=\"alternate\"\stype\=\"application\/(?:rss|atom)\+xml\"\stitle\=\".*?href\=\"(.*?)\"\s\/\>/', $html, $matches)){ 
       return $matches[1]; 
      } 
     } 
     return false; 
    }
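
One caveat with the pattern above is that it expects rel, type, title and href in exactly that order, with double quotes. A sketch of an order-independent variant is below, still regex-based. `firstFeedHref` is a hypothetical name, and it assumes attribute values are quoted and contain no ">" characters.

```php
<?php
// Hypothetical helper (illustration only): href of the first RSS or Atom
// autodiscovery <link> in $html, or false when there is none.
function firstFeedHref(string $html)
{
    // Grab every <link ...> tag, whatever its attribute order.
    if (!preg_match_all('/<link\b[^>]*>/i', $html, $tags)) {
        return false;
    }
    foreach ($tags[0] as $tag) {
        // Collect name="value" / name='value' pairs for this tag.
        preg_match_all('/([a-z-]+)\s*=\s*(["\'])(.*?)\2/is', $tag, $pairs, PREG_SET_ORDER);
        $attr = [];
        foreach ($pairs as $pair) {
            $attr[strtolower($pair[1])] = $pair[3];
        }
        $isAlternate = strtolower($attr['rel'] ?? '') === 'alternate';
        $isFeed = preg_match('#^application/(rss|atom)\+xml$#i', $attr['type'] ?? '') === 1;
        if ($isAlternate && $isFeed && isset($attr['href'])) {
            return $attr['href'];
        }
    }
    return false;
}
```

The returned href may still be relative, so it would need resolving against the page URL before being fetched.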