Python loop to extract HTML tags returns an empty list instead of the tags

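So I'm trying to write a function that goes through a list of HTML tags stored as individual characters and returns the tags. For example, it would go through a list like this:

['<', 'h', 't', 'm', 'l', '>', '<', 'h', 'e', 'a', 'd', '>', '<', 'm', 'e', 't', 'a', '>']

and return a list like this:

['html', 'head', 'meta']

However, when I run the following function, it returns an empty list []: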

def getTag(htmlList):
    tagList = []
    for iterate, character in enumerate(htmlList):
        tagAppend = ''
        if character == '<':
            for index, word in enumerate(htmlList):
                if index > iterate:
                    if character == '>':
                        tagList.append(tagAppend)
                        break
                    tagAppend += character

    return tagList

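The program seems to make sense to me. It creates an empty list (tagList), then iterates over a list (htmlList) like the first one I posted.

While iterating, if it encounters '<', it adds all the characters after the index where the '<' was found to a string called tagAppend. It stops when it reaches the '>' that closes the tag. tagAppend is then appended to tagList. Then tagAppend is cleared and the loop starts over.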

Source

2016-10-07 pythonHelp

I'm going to assume this is just an exercise for the sake of learning. In general, Python has better tools for parsing HTML (https://www.crummy.com/software/BeautifulSoup/) or strings (https://docs.python.org/2/library/re.html).
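As a rough illustration of what "better tools" can look like, here is a minimal sketch that uses the standard library's html.parser module instead of BeautifulSoup, only so that nothing has to be installed. It assumes Python 3 (in Python 2 the module is named HTMLParser), and the names TagCollector and get_tags_with_parser are made up for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: remembers the name of every opening tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; the parser passes the name lowercased.
        self.tags.append(tag)


def get_tags_with_parser(html_list):
    # Join the list of characters back into one string before parsing.
    parser = TagCollector()
    parser.feed(''.join(html_list))
    parser.close()
    return parser.tags
```

Calling get_tags_with_parser on the list from the question should give the three tag names, and attributes or closing tags are handled by the parser rather than by hand.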

def getTag(htmlList):
    tagList = []
    for iterate, character in enumerate(htmlList):
        tagAppend = ''
        if character == '<':
            for index, word in enumerate(htmlList):
                if index > iterate:
                    # use word here, otherwise this will never be True
                    if word == '>':
                        tagList.append(tagAppend)
                        break
                    # and here
                    tagAppend += word

    return tagList

The key mistake was using character instead of word in the inner loop. I think it would work otherwise, although it's not efficient.
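To see why the original comparison can never succeed, here is a small sketch that records what the two loop variables hold inside the inner loop. The names html_list and seen are mine, not from the question, and the input is shortened to a single made-up tag.

```python
html_list = list('<ab>')  # ['<', 'a', 'b', '>']
seen = []

for iterate, character in enumerate(html_list):
    if character == '<':
        for index, word in enumerate(html_list):
            if index > iterate:
                # `character` belongs to the outer loop and does not change
                # here; only `word` advances through the list.
                seen.append((character, word))

print(seen)  # [('<', 'a'), ('<', 'b'), ('<', '>')]
```

Every recorded pair has '<' on the left, so a test like character == '>' is always False inside that loop and nothing is ever appended to the result.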

We can also simplify. There's no need for nested for loops.

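def getTag(htmlList):
    tagList = []
    tag = ""
    for character in htmlList:
        if character == "<":
            # a new tag starts: reset the buffer
            tag = ""
        elif character == ">":
            # the tag is complete: store it
            tagList.append(tag)
        else:
            # strings have no append(); build the tag by concatenation
            tag += character

    return tagList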

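The above has some serious problems, depending on the constraints on the input data. It might be instructive to think it over and see if you can find them.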
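One such case, as far as I can tell, is a stray '>' in text outside any tag: the single-loop version would record whatever had accumulated since the last tag as if it were a tag. The sketch below is one possible guard, not the only one; the name get_tags_guarded and the inside flag are assumptions for illustration.

```python
def get_tags_guarded(html_list):
    tag_list = []
    tag = ''
    inside = False  # True only between a '<' and the next '>'
    for character in html_list:
        if character == '<':
            tag = ''
            inside = True
        elif character == '>':
            # Ignore a '>' that was not preceded by an opening '<'.
            if inside:
                tag_list.append(tag)
            inside = False
        elif inside:
            # Collect characters only while inside a tag.
            tag += character
    return tag_list
```

This still keeps everything between the brackets, so a closing tag comes back as '/p' and attributes stay attached to the name. Whether that matters depends on the input you expect.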

We could also, as mentioned in other answers, use built-ins like split and join to great effect.

Source

2016-10-07 20:28:29 intrepidhero

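That looks too complicated. Instead, join the list into a string, remove the opening angle brackets, and split on the closing angle brackets, remembering to discard empty strings: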

def get_tag(l): 
    return [item for item in ''.join(l).replace('<','').split('>') if item]

Result:

>>> l = ['<', 'h', 't', 'm', 'l', '>', '<', 'h', 'e', 'a', 'd', '>', '<', 'm', 'e', 't', 'a', '>'] 
>>> get_tag(l) 
['html', 'head', 'meta']

Source

2016-10-07 20:28:43 TigerhawkT3

I think re would be a good choice here.

import re

def get_tag(l):
    return re.findall(r'<([a-z]+)>', ''.join(l))

>>> get_tag(l)
['html', 'head', 'meta']
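Note that this pattern only matches tags made of lowercase letters with nothing else inside the brackets. If the input might contain uppercase names, digits (as in h1), or attributes, a somewhat broader pattern could look like the sketch below. TAG_PATTERN and get_tag_names are made-up names, and this is still a regex over text, not a real HTML parser.

```python
import re

# Hypothetical broader pattern: a name starting with a letter, followed by
# anything except angle brackets, up to the closing '>'.
TAG_PATTERN = re.compile(r'<([A-Za-z][A-Za-z0-9]*)[^<>]*>')


def get_tag_names(chars):
    # Join the characters, find every opening tag, and normalise the case.
    return [name.lower() for name in TAG_PATTERN.findall(''.join(chars))]
```

Closing tags such as </h1> are skipped because the character after '<' has to be a letter.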

Source

2016-10-07 20:37:41 Nf4r

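Your code is nearly correct; you just need to replace every occurrence of character with word in the inner loop. As written, word is never used in the inner loop: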

    ...
    for index, word in enumerate(htmlList):
        if index > iterate:
            if word == '>':  # here
                tagList.append(tagAppend)
                break
            tagAppend += word  # here
    ...

You can also do it without enumerate and without a nested loop, as follows:

def get_tag(htmlList):
    tag_list = []
    for x in htmlList:
        if x == '<':
            tag = ''
            continue
        elif x == '>':
            tag_list.append(tag)
            continue
        tag += x
    return tag_list

Source

2016-10-07 20:38:37
