How to read website content in Python

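I am trying to write a program that reads the content of any website, anything from a Blogspot or WordPress blog to any other site with articles (posts). To write code that is compatible with almost every site, whether it is written in HTML5 or XHTML, I thought of using RSS/Atom feeds as the basis for extracting content.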

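However, since RSS/Atom feeds usually do not contain the full articles of a site, I plan to collect all the "post" links from the feed using feedparser and then extract the article content from each corresponding URL.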

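I can get the URLs of all the articles on a site (along with the summary, i.e. the article content shown in the feed), but to access the full article I have to use the corresponding URL. I really don't know how to get the "exact" content of an article (by "exact" I mean the data with all the hyperlinks, iframes, slideshows and so on still in place; I don't want the CSS part).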

So, can anyone help me?


2012-05-15 Surya

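What have you tried so far? Do you want the HTML, the images and all of the site's files, or do you only want to scrape part of the HTML? Please be more specific. – serk

@serk Consider a blog post: I want the information exactly as it was written (except for the CSS). – Surya

Then why not try 'wget'? – serk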

Getting the HTML code of all the linked pages is quite simple.

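The hard part is extracting exactly the content you are looking for. If you just need all the code inside the <body> tag, that is not a big problem either; extracting all the text is equally simple. But if you want a more specific subset, you have more work to do.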

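I suggest you download the requests and BeautifulSoup modules (both can be installed with easy_install or, better, with pip install requests beautifulsoup4; the latter package provides the bs4 module). The requests module makes fetching pages really easy.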

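The following example fetches an RSS feed and builds three lists:

linksoups is a list of BeautifulSoup instances, one for each page linked from the feed
linktexts is a list of the visible text of each page linked from the feed
linkimageurls is a list of lists holding the src URLs of all images embedded in each page linked from the feed
- e.g. [['/pageone/img1.jpg', '/pageone/img2.png'], ['/pagetwo/img1.gif', 'logo.bmp']]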

import requests, bs4

# Request the feed and parse it as XML (the 'xml' parser needs lxml installed).
# An HTML parser would treat <link> as an empty tag and lose the URL inside it.
response = requests.get('http://rss.slashdot.org/Slashdot/slashdot')
responsesoup = bs4.BeautifulSoup(response.content, 'xml')

linksoups = []
linktexts = []
linkimageurls = []

# Iterate over all <link>…</link> tags and fill three lists: one with the soups of the
# linked pages, one with all their visible text and one with the urls of all embedded
# images
for link in responsesoup.find_all('link'):
    url = link.text.strip()
    if not url:
        continue  # skip <link> elements that carry no URL as text
    linkresponse = requests.get(url)  # relative urls would still need urljoin
    soup = bs4.BeautifulSoup(linkresponse.text, 'html.parser')
    linksoups.append(soup)

    # Append all text between tags inside of the body tag to the second list
    body = soup.find('body')
    linktexts.append(body.get_text() if body else '')

    # Get the src attribute of each <img> tag that has one
    imageurls = [image['src'] for image in soup.find_all('img') if image.has_attr('src')]
    linkimageurls.append(imageurls)

# now somehow merge the retrieved information.
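
The comment in the loop about relative URLs, and the relative paths in the linkimageurls example, could be handled roughly as follows. This is only a sketch for Python 3, where urljoin lives in urllib.parse; the helper name absolutize and the example.com addresses are made up for illustration.

```python
from urllib.parse import urljoin

def absolutize(page_url, srcs):
    """Resolve each possibly relative src against the URL of the page it came from."""
    # absolutize is a hypothetical helper name, not part of any library
    return [urljoin(page_url, src) for src in srcs]

# Placeholder URLs, chosen only to show the three common cases
print(absolutize('http://example.com/pageone/index.html',
                 ['/pageone/img1.jpg', 'img2.png', 'http://cdn.example.com/logo.bmp']))
```

URLs that are already absolute pass through unchanged, so the same call can be applied to every src without checking first.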

This might be a rough starting point for your project.
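
As for keeping the hyperlinks, iframes and so on while leaving out the CSS, one possible approach is to take the markup inside <body> and strip the <style> blocks and inline style attributes. The sketch below assumes bs4 is installed; the function name body_without_css is made up, and it does not try to isolate the article itself from navigation or sidebars, which depends on each site's markup.

```python
import bs4

def body_without_css(html):
    """Return the markup inside <body> with <style> blocks and style attributes removed."""
    # body_without_css is a hypothetical helper name, not part of bs4
    soup = bs4.BeautifulSoup(html, 'html.parser')
    body = soup.find('body') or soup  # fall back to the whole document
    # Drop embedded stylesheets
    for tag in body.find_all('style'):
        tag.decompose()
    # Drop inline styles but keep every other attribute (href, src, ...)
    for tag in body.find_all(style=True):
        del tag['style']
    return body.decode_contents()

sample = ('<html><head><style>p{}</style></head><body>'
          '<p style="color:red"><a href="/x">link</a></p>'
          '<iframe src="http://example.com/v"></iframe></body></html>')
print(body_without_css(sample))
```

Stylesheets referenced from <head> are left behind automatically because only the body is returned.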


2012-06-02 16:06:28 camelNeck

Why do you use 'requests'? – Surya

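I just find it more convenient and practical than urllib, urllib2 or urllib3. You might want to take a look at the [documentation](http://docs.python-requests.org/en/latest/). I assure you it is very nice and pythonic :) This can also be done with one of the modules in the standard library; it is more of a personal preference. – camelNeck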
