2015-04-29 82 views

BeautifulSoup (bs4) parsing error

Parsing this sample file with bs4, from Python 2.7.6:

<html> 
<body> 
<p>HTML allows omitting P end-tags. 

<p>Like that and this. 

<p>And this, too. 

<p>What happened?</p> 

<p>And can we <p>nest a paragraph, too?</p></p> 

</body> 
</html> 

Using:

from bs4 import BeautifulSoup as BS 
... 
tree = BS(fh) 

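HTML has, for ages, allowed end-tags to be omitted for various element types, including P (check the schema or a parser). However, bs4's prettify() on this document shows that it does not end any of those paragraphs until it sees </body>: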

<html> 
<body> 
    <p> 
    HTML allows omitting P end-tags. 
    <p> 
    Like that and this. 
    <p> 
    And this, too. 
    <p> 
     What happened? 
    </p> 
    <p> 
     And can we 
     <p> 
     nest a paragraph, too? 
     </p> 
    </p> 
    </p> 
    </p> 
    </p> 
</body> 

This is not prettify()'s fault, because when I traverse the tree manually I get the same structure:

<[document]> 
    <html> 
     ␊ 
     <body> 
      ␊ 
      <p> 
       HTML allows omitting P end-tags.␊␊ 
       <p> 
        Like that and this.␊␊ 
        <p> 
         And this, too.␊␊ 
         <p> 
          What happened? 
         </p> 
         ␊ 
         <p> 
          And can we 
          <p> 
           nest a paragraph, too? 
          </p> 
         </p> 
         ␊ 
        </p> 
       </p> 
      </p> 
     </body> 
     ␊ 
    </html> 
    ␊ 
</[document]> 

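Now, this would be the correct result for XML (at least up to </body>, at which point it should report a well-formedness error). But this is not XML. What gives?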
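For reference, a manual traversal like the one described above can be sketched roughly as follows. This is not the exact code used for the dump in the question: the `dump` helper and the `SAMPLE` string are hypothetical names introduced here, and the parser is pinned to "html.parser" as an assumption so that the result does not depend on which parsers happen to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# The sample document from the question, inlined so the sketch is self-contained.
SAMPLE = """<html>
<body>
<p>HTML allows omitting P end-tags.

<p>Like that and this.

<p>And this, too.

<p>What happened?</p>

<p>And can we <p>nest a paragraph, too?</p></p>

</body>
</html>
"""


def dump(node, depth=0, out=None):
    """Hypothetical helper: return an indented outline of the tree under `node`."""
    if out is None:
        out = []
    indent = "    " * depth
    if isinstance(node, Tag):
        # The BeautifulSoup object itself is a Tag named "[document]".
        out.append("%s<%s>" % (indent, node.name))
        for child in node.children:
            dump(child, depth + 1, out)
        out.append("%s</%s>" % (indent, node.name))
    elif isinstance(node, NavigableString):
        # repr() makes whitespace-only text nodes visible.
        out.append(indent + repr(str(node)))
    return out


# Assumption: the parser is named explicitly instead of letting bs4 choose one.
soup = BeautifulSoup(SAMPLE, "html.parser")
lines = dump(soup)
print("\n".join(lines))
```

Because the outline is built directly from `node.children`, whatever nesting it shows comes from the parsed tree itself rather than from prettify().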

Answers