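How do I omit the <h> tags from this code?

This code takes a website and adds all of the header information to a list. How can I modify it so that, when the program prints, each list item is shown on a separate line and the header tags are removed?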

from urllib.request import urlopen 
address = "http://www.w3schools.com/html/html_head.asp" 
webPage = urlopen (address) 

encoding = "utf-8" 

list = [] 

for line in webPage: 
    findHeader = ('<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>') 
    line = str(line, encoding) 
    for startHeader in findHeader: 
        endHeader = '</'+startHeader[1:] 
        if (startHeader in line) and (endHeader in line): 
            content = line.split(startHeader)[1].split(endHeader)[0] 
            list.append(line) 
            print (list) 

webPage.close()


2015-12-15 Cameron

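One problem with what you have written is that the opening and closing header tags may be on different lines. Are we assuming the HTML is always valid?

As far as I'm concerned, it doesn't matter whether the HTML is valid. – Cameron

If you don't mind using a third-party package, try BeautifulSoup to convert the HTML to plain text. Once you have your list, you can remove print (list) from the loop and do this: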

for e in list: 
    # .rstrip() to remove trailing '\r\n' 
    print(BeautifulSoup(e.rstrip(), "html.parser").text)

But don't forget to import BeautifulSoup first:

from bs4 import BeautifulSoup

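I'm assuming you have bs4 installed before running this example (pip3 install beautifulsoup4).

Alternatively, you can use regular expressions to strip the HTML tags. But that is likely to be more verbose and error-prone than using an HTML parser such as bs4.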
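
For comparison, here is a minimal sketch of the regular-expression route, using only the standard library. The helper name strip_tags and the sample string are made up for illustration. The pattern is deliberately naive and will misbehave on input such as a > inside an attribute value, which is part of why a parser is the safer choice.

```python
import html
import re

# Naive pattern: anything between '<' and the next '>'.
TAG_RE = re.compile(r'<[^>]*>')

def strip_tags(fragment):
    """Remove tags from an HTML fragment and decode entities such as &lt;."""
    text = TAG_RE.sub('', fragment)
    return html.unescape(text).strip()

# Sample line, similar to what the question's loop collects.
sample = '<h1>HTML <span class="color_h1">Head</span></h1>\r\n'
print(strip_tags(sample))
```

Stripping the tags before decoding the entities matters here: a header whose text contains &lt;head&gt; keeps its literal <head> instead of having it removed as if it were a tag.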


2015-12-15 17:25:36 vrs

Sorry, I don't understand what you're trying to do.

But, for example, you can easily collect all the unique headers in a dictionary:

from urllib.request import urlopen 
import re 

address = "http://www.w3schools.com/html/html_head.asp" 
webPage = urlopen(address) 

# get page content 
response = str(webPage.read(), encoding='utf-8') 

# leave only <h*> tags content 
p = re.compile(r'<(h[0-9])>(.+?)</\1>', re.IGNORECASE | re.DOTALL) 
headers = re.findall(p, response) 

# headers dict 
my_headers = {} 

for (tag, value) in headers: 
    if tag not in my_headers: 
        my_headers[tag] = [] 

    # remove all tags inside (re.sub returns a new string, so keep the result) 
    value = re.sub('<[^>]*>', '', value) 

    # replace a few special chars 
    value = value.replace('&lt;', '<') 
    value = value.replace('&gt;', '>') 

    if value not in my_headers[tag]: 
        my_headers[tag].append(value) 

# output 
print(my_headers)

Output:

{'h2': ['The HTML <head> Element', 'Omitting <html> and <body>?', 'Omitting <head>', 'The HTML <title> Element', 'The HTML <style> Element', 'The HTML <link> Element', 'The HTML <meta> Element', 'The HTML <script> Element', 'The HTML <base> Element', 'HTML head Elements', 'Your Suggestion:', 'Thank You For Helping Us!'], 'h4': ['Top 10 Tutorials', 'Top 10 References', 'Top 10 Examples', 'Web Certificates'], 'h1': ['HTML Head'], 'h3': ['Example', 'W3SCHOOLS EXAMS', 'COLOR PICKER', 'SHARE THIS PAGE', 'LEARN MORE:', 'HTML/CSS', 'JavaScript', 'HTML Graphics', 'Server Side', 'Web Building', 'XML Tutorials', 'HTML', 'CSS', 'XML', 'Charsets']}


2015-12-15 17:42:40 mrDinkelman

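You asked for results without the header tags. You already have those values in the content variable, but you are not adding content to the result list. Instead you are adding line, which is the entire original line.

Next, you asked for each item to be printed on a new line. To do that, first remove the print statement inside the loop, which prints the entire list every time a result is added. Then, at the bottom of the program, add new code outside all of the loops: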

for item in list: 
    print(item)
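
Putting the two changes together, a sketch of the corrected loop might look like this. To keep it runnable offline, it reads from a small hard-coded list of byte strings instead of urlopen(address). The function name extract_headers and the sample lines are assumptions for illustration, and the list is renamed to results so that it no longer shadows the built-in list.

```python
def extract_headers(lines, encoding="utf-8"):
    """Return the text between <hN> and </hN> for each line that has both."""
    find_header = ('<h1>', '<h2>', '<h3>', '<h4>', '<h5>', '<h6>')
    results = []
    for raw in lines:
        line = str(raw, encoding)
        for start_header in find_header:
            end_header = '</' + start_header[1:]
            if start_header in line and end_header in line:
                content = line.split(start_header)[1].split(end_header)[0]
                results.append(content)  # content, not the whole line
    return results

# Stand-in for the lines of webPage (bytes, as urlopen yields them).
sample_lines = [
    b'<h2>Omitting &lt;head&gt;</h2>\r\n',
    b'<p>Not a header</p>\r\n',
    b'<h3>Example</h3>\r\n',
]

# One item per line, printed after all the loops have finished.
for item in extract_headers(sample_lines):
    print(item)
```

Note that entities such as &lt; are left undecoded by this approach, and any tags nested inside a header would still appear in the output.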

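However, your technique for identifying headers in the HTML is not very robust. It expects each pair of opening and closing tags to be on a single line. It also expects no more than one header of any given type on a line. And it expects every opening tag to have a matching closing tag. You can't rely on any of these things, even in valid HTML.

Vrs's answer is on the right track in suggesting Beautiful Soup, but rather than using it only to strip tags from the results, you can actually use it to find the results as well. Take a look at the following code: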
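
To make the first limitation concrete, here is a small sketch with a made-up header that spans two lines. The per-line check from the question never sees both tags on the same line, so that header is silently skipped. The sample markup is invented for illustration.

```python
# A header whose opening and closing tags sit on different lines
# (made-up sample), followed by one that fits on a single line.
page_lines = [
    '<h2>The HTML',
    'head Element</h2>',
    '<h2>Found</h2>',
]

found = []
for line in page_lines:
    # Same test as in the question: both tags must be on this one line.
    if '<h2>' in line and '</h2>' in line:
        found.append(line.split('<h2>')[1].split('</h2>')[0])

print(found)  # the two-line header is missing from the result
```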

from bs4 import BeautifulSoup 
from urllib.request import urlopen 

address = "http://www.w3schools.com/html/html_head.asp" 
webPage = urlopen(address) 

# The list of tag names we want to find 
# Just the names, not the angle brackets  
findHeader = ('h1', 'h2', 'h3', 'h4', 'h5', 'h6') 

soup = BeautifulSoup(webPage, 'html.parser') 
headers = soup.find_all(findHeader) 
for header in headers: 
    print(header.get_text())

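The find_all method accepts a list of tag names and returns a Tag object for each result, in document order. We store the list in headers and print the text of each one. The get_text method returns only the text portion of a tag, omitting not only the surrounding header tags but also any embedded tags. (There are some embedded span tags in the page you are scraping, for example.)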
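
If installing Beautiful Soup is not an option, a rough equivalent can be sketched with the standard library's html.parser module. This is a different technique from find_all, not a drop-in replacement. The class name HeaderCollector and the sample markup are assumptions for illustration, and the sketch does not try to handle headers nested inside other headers.

```python
from html.parser import HTMLParser

HEADER_TAGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

class HeaderCollector(HTMLParser):
    """Collect the text of each h1-h6 element, ignoring tags nested inside."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True decodes &lt; and friends
        self.headers = []
        self._parts = None  # text pieces of the header being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in HEADER_TAGS and self._parts is not None:
            self.headers.append(''.join(self._parts))
            self._parts = None

# Made-up sample: a nested span, and a header split across two lines.
sample = '''<h1>HTML <span class="color_h1">Head</span></h1>
<p>Some text</p>
<h2>The HTML &lt;head&gt;
Element</h2>'''

collector = HeaderCollector()
collector.feed(sample)
collector.close()
for header in collector.headers:
    print(header)
```

Because the parser works on the whole document rather than line by line, the header that spans two lines is still found, with the line break preserved inside its text.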


2015-12-15 18:24:34
