Extracting HTML information from a URL

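I am trying to write a Python program that reads all the data from a web page and appends the contents of any heading tag, <h1> through <h6>, to a list. So far I am only trying to fetch the page itself, and even that is proving difficult.

Edit: This is for a class. Unfortunately, we are not allowed to use any library that does not ship with Python.

Edit 2: Thanks for all the tips. The program now successfully reads the HTML of a given website. Does anyone have suggestions for searching the page for specific strings (i.e. the <h> tags)?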

import urllib.request

# Example URL that includes an <h> tag: http://www.hobo-web.co.uk/headers/
userAddress = input("Enter a website URL: ")

webPage = urllib.request.urlopen(userAddress)

print(webPage.read())

webPage.close()

Source

2015-12-13 Cameron

http://docs.python-requests.org/en/latest/ and http://www.crummy.com/software/BeautifulSoup/bs4/doc/ – pvg

I assume you are using Python 3 to fetch the page. You can fetch it with the code below:

import urllib.request

address = "http://www.hobo-web.co.uk/headers/"
webPage = urllib.request.urlopen(address)

print(webPage.read())

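To pull information out of the page, you can use BeautifulSoup, a very capable tool for this job. You can use it to extract tables, lists, and paragraphs, and you can also apply filters to select specific elements.

Install it from here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup

Source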

2015-12-13 22:22:24 perfectus

I would suggest using the requests library.

import requests 

r = requests.get('http://www.hobo-web.co.uk/') 
print(r.text)

See the quickstart at http://docs.python-requests.org/en/latest/user/quickstart/

Source

2015-12-13 22:10:46 zsoobhan

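Check out the documentation for the BeautifulSoup library. It provides an API for parsing the DOM tree. You can do things like soup.find_all('h1'), which returns a list of all h1 elements.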
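Since the question rules out third-party libraries, the same idea can be sketched with the standard library's html.parser module instead of BeautifulSoup. This is only a minimal sketch: the names HeadingCollector and extract_headings are made up for illustration, and it assumes reasonably well-formed HTML in which headings are not nested inside each other.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collect the text content of every <h1>..<h6> element (illustrative helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.headings = []     # finished heading texts, in document order
        self._current = None   # name of the heading tag being read, if any
        self._parts = []       # text fragments of the heading being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> arrives here as "h1".
        if tag in HEADING_TAGS and self._current is None:
            self._current = tag
            self._parts = []

    def handle_data(self, data):
        # Keep text only while inside a heading, including nested tags like <em>.
        if self._current is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.headings.append("".join(self._parts).strip())
            self._current = None


def extract_headings(html):
    """Return a list with the text of each heading found in the HTML string."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


sample = "<html><body><H1>Main <em>title</em></H1><p>text</p><h3>Sub</h3></body></html>"
print(extract_headings(sample))  # prints ['Main title', 'Sub']
```

extract_headings expects a str, so the bytes returned by webPage.read() need to be decoded first, for example with webPage.read().decode('utf-8') if the page is UTF-8.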

Source

2015-12-13 22:13:45

It is better to use a with statement, because it closes the connection automatically. Here is an example:

import urllib.request 
address = "http://www.hobo-web.co.uk/headers/" 
with urllib.request.urlopen(address) as response: 
    html = response.read() 
    print(html)
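Note that response.read() returns bytes rather than text. A possible way to get a str, sketched here with a hypothetical helper named fetch_html, is to decode using the charset from the Content-Type header and fall back to UTF-8 when the server does not declare one. The UTF-8 fallback is an assumption, not something the page guarantees.

```python
import urllib.request


def fetch_html(url, default_charset="utf-8"):
    """Fetch a URL and return its body decoded to str (illustrative helper)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes, not str
        # Charset declared in the Content-Type header, or None if absent.
        charset = response.headers.get_content_charset() or default_charset
    # errors="replace" avoids an exception if a few bytes do not match the charset.
    return raw.decode(charset, errors="replace")
```

Calling fetch_html("http://www.hobo-web.co.uk/headers/") should then return the page source as text, ready to be searched or parsed.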

Source

2015-12-13 22:25:43 heinst

Your webPage variable is a network response object. To get the actual HTML content, use

content = webPage.read()

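To get the contents of the heading tags, you can use the BeautifulSoup library:

from bs4 import BeautifulSoup

# Call read() only once: a second call on the same response returns empty bytes.
htmlContent = webPage.read()
# get_content_charset() returns None if the server sent no charset;
# BeautifulSoup then detects the encoding on its own.
charset = webPage.info().get_content_charset()
soup = BeautifulSoup(htmlContent, 'html.parser', from_encoding=charset)
# find_all() returns a list of tags, so take the text of each one.
heads = [tag.get_text() for tag in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])]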

Now heads is a list of the contents of every heading tag that occurs in the page.

To read more about the BeautifulSoup library, see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Source

2015-12-13 22:25:43 tffu
