Automatically extracting feed links (Atom, RSS, etc.) from web pages

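I have a huge list of URLs, and my task is to feed them to a Python script that should spit out the feed URLs, if there are any. Is there an API, library, or piece of code out there that can help?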

2011-10-25 Max

I second waffle paradox in recommending Beautiful Soup to parse the HTML and then get the <link rel="alternate"> tags, where the feeds are referenced. The code I usually use:

from bs4 import BeautifulSoup as parser  # Beautiful Soup 4; the old ``BeautifulSoup`` package is Python 2 only

def detect_feeds_in_HTML(input_stream):
    """ examines an open text stream with HTML for referenced feeds.

    This is achieved by detecting all ``link`` tags that reference a feed in HTML.

    :param input_stream: an arbitrary opened input stream that has a :func:`read` method.
    :type input_stream: an input stream (e.g. open file or URL)
    :return: a list of feed URLs (the ``href`` values of the matching ``link`` tags)
    :rtype: ``list(str)``
    """
    # check if really an input stream
    if not hasattr(input_stream, "read"):
        raise TypeError("An opened input *stream* should be given, was %s instead!" % type(input_stream))
    result = []
    # parse the textual data (the HTML) from the input stream
    html = parser(input_stream.read(), "html.parser")
    # find all link tags whose "rel" attribute is "alternate"
    feed_urls = html.find_all("link", rel="alternate")
    # extract the URL
    for feed_link in feed_urls:
        url = feed_link.get("href", None)
        # if a valid URL is there
        if url:
            result.append(url)
    return result
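
Since the question is about a large list of URLs, a driver loop along the following lines could open each page and hand the stream to a detector like the one above. This is only a sketch: the names `feeds_for_urls`, `detect`, and `opener` are made up for illustration, and skipping unreachable pages silently is an assumption about what is wanted.

```python
from urllib.request import urlopen


def feeds_for_urls(urls, detect, opener=urlopen):
    """Map each page URL to the feed links found on it.

    ``detect`` is any callable that takes an open stream and returns a
    list of feed URLs; ``opener`` defaults to ``urllib.request.urlopen``.
    Both parameter names are hypothetical, chosen for this sketch.
    """
    found = {}
    for url in urls:
        try:
            stream = opener(url)
        except (OSError, ValueError):
            # unreachable host, HTTP error, or malformed URL: skip this page
            continue
        try:
            found[url] = detect(stream)
        finally:
            # always release the connection or file handle
            stream.close()
    return found
```

Called as `feeds_for_urls(my_urls, detect_feeds_in_HTML)`, it would return a dict keyed by page URL. Pages that could not be opened are simply absent from the result.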

Source

2011-10-25 07:20:14 PhilS

I don't know of any existing library, but Atom or RSS feeds are usually indicated with <link> tags in the <head> section, like this:

<link rel="alternate" type="application/rss+xml" href="http://link.to/feed">
<link rel="alternate" type="application/atom+xml" href="http://link.to/feed">

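The straightforward way would be to download each of these URLs, parse it with an HTML parser such as lxml.html, and get the href attribute of the relevant <link> tags.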
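
A minimal sketch of that approach follows. To stay dependency-free it uses the standard library's `html.parser` rather than lxml.html. The names `FeedLinkParser`, `find_feed_links`, and `FEED_TYPES` are invented for this example, and the set of MIME types is an assumption, not an exhaustive list.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used for feed autodiscovery (assumed, not exhaustive)
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects href values of <link rel="alternate"> tags with a feed type."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel may hold several space-separated values; attributes may be None
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower().strip()
        href = attr.get("href")
        if "alternate" in rel and mime in FEED_TYPES and href:
            # resolve relative hrefs against the page URL
            self.feeds.append(urljoin(self.base_url, href))


def find_feed_links(html_text, base_url=""):
    """Return the feed URLs referenced by <link> tags in html_text."""
    parser = FeedLinkParser(base_url)
    parser.feed(html_text)
    return parser.feeds
```

Filtering on the type attribute as well as rel keeps out other kinds of alternate links, such as translated versions of the page.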

Source

2011-10-25 03:23:49 Avaris

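Depending on how well-formed the information in these feeds is (e.g., are all the links in the form http://.../? Do you know whether they will all be in href or link tags? Do all the links in the feeds go to other feeds? etc.), I'd recommend anything from a simple regex to a full parsing module to extract the links from the feeds. As far as parsing modules go, I can only recommend Beautiful Soup. Even the best parser will only get you so far, though, especially in the case mentioned above: if you can't guarantee that all the links in the data point to other feeds, you will have to do some additional crawling and probing on your own.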
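
At the "simple regex" end of that range, something like the sketch below might be enough when every link is an absolute http:// or https:// URL. The pattern and the name `extract_urls` are illustrative assumptions. It will miss relative links and can pick up trailing punctuation, which is where a real parser starts to pay off.

```python
import re

# Matches absolute http(s) URLs up to the next whitespace, quote, or angle bracket
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(text):
    """Return the distinct http(s) URLs in text, in order of first appearance."""
    # dict.fromkeys removes duplicates while preserving order
    return list(dict.fromkeys(URL_PATTERN.findall(text)))
```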

Source

2011-10-25 03:27:53

There's feedfinder:

>>> import feedfinder 
>>> 
>>> feedfinder.feed('scripting.com') 
'http://scripting.com/rss.xml' 
>>> 
>>> feedfinder.feeds('scripting.com') 
['http://delong.typepad.com/sdj/atom.xml', 
'http://delong.typepad.com/sdj/index.rdf', 
'http://delong.typepad.com/sdj/rss.xml'] 
>>>

Source

2013-03-22 08:46:08

feedfinder is no longer maintained, but there is now [feedfinder2](https://pypi.python.org/pypi/feedfinder2). – Scarabee
