Beautiful Soup 4: Segmentation fault (core dumped)

http://www.nasa.gov/topics/earth/features/plains-tornadoes-20120417.html

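But I get a segmentation fault (core dumped) when calling BeautifulSoup(page_html), where page_html is the content returned by the requests library. Is this a bug in BeautifulSoup? Is there any way to work around it? Even something like try ... except would help me keep my code running. Thanks in advance.

The code is as follows: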

import requests
from bs4 import BeautifulSoup

toy_url = 'http://www.nasa.gov/topics/earth/features/plains-tornadoes-20120417.html'
res = requests.get(toy_url, headers={"User-Agent": "Firefox/12.0"})
page = res.content
soup = BeautifulSoup(page)  # the segmentation fault happens on this call


2012-11-10 Taosof

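Please show the code you are using so that the problem can be reproduced (I can't reproduce this with urllib2 and BeautifulSoup). –

@DavidRobinson The code has been added now. Thanks for asking. – Taosof

Install 'lxml'. The default HTML parser in py2.7 won't parse this page because of bad tags... BTW, py3.2 works fine. (I couldn't produce the segfault.) – JBernardo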

This problem is caused by a bug in lxml, which was fixed in lxml 2.3.5. You can upgrade lxml, or use Beautiful Soup with the html5lib or HTMLParser parser.
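As a rough illustration of that advice, the sketch below checks which lxml release is installed and falls back to the standard-library parser when lxml is missing or older than 2.3.5. The helper names (`version_tuple`, `has_lxml_fix`, `choose_parser`) are made up for this example, and it assumes Python 3.8+ for `importlib.metadata`, which is newer than the Python 2.7 setup discussed in this thread.

```python
import re
from importlib import metadata  # standard library since Python 3.8

# lxml release that the answer above names as containing the fix.
FIXED_IN = (2, 3, 5)


def version_tuple(version):
    """Turn a version string such as '2.3.5' into a tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def has_lxml_fix(version):
    """Return True if the given lxml version is 2.3.5 or newer."""
    return version_tuple(version) >= FIXED_IN


def choose_parser():
    """Pick a Beautiful Soup parser name (hypothetical helper).

    Returns "lxml" only when a fixed lxml release is installed,
    otherwise the pure-Python "html.parser".
    """
    try:
        installed = metadata.version("lxml")
    except metadata.PackageNotFoundError:
        return "html.parser"
    return "lxml" if has_lxml_fix(installed) else "html.parser"
```

The result can then be passed as the second argument, for example `BeautifulSoup(page, choose_parser())`. Passing `"html5lib"` instead is another option if that package is installed.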


2012-11-11 00:05:49

I ran into a similar error, "Segmentation fault: 11". Upgrading lxml from 3.4.1-py27_0 to 3.4.3-py27_0 fixed it. – Andrew

Definitely a bug. It shouldn't segfault like this. I can reproduce it (4.0.1):

>>> import bs4, urllib2 
>>> url = "http://www.nasa.gov/topics/earth/features/plains-tornadoes-20120417.html" 
>>> page = urllib2.urlopen(url).read() 
>>> soup = bs4.BeautifulSoup(page) 
Segmentation fault

After some bisection, it looks like the DOCTYPE is what causes it:

>>> page[:page.find(">")+1] 
'<!DOCTYPE "xmlns:xsl=\'http://www.w3.org/1999/XSL/Transform\'">'

And a crude hack lets BS4 parse it:

>>> soup = bs4.BeautifulSoup(page[page.find(">")+1:]) 
>>> soup.find_all("a")[:3] 
[<a href="/home/How_to_enable_Javascript.html" target="_blank">› Learn How</a>, <a href="#maincontent">Follow this link to skip to the main content</a>, <a class="nasa_logo" href="/home/index.html"><span class="hide">NASA - National Aeronautics and Space Administration</span></a>]

Someone who knows more may be able to see what is really going on, but in any case this might help you get started.
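The slicing hack above cuts at the first ">" whatever comes before it. A slightly more targeted variant, sketched here as an assumption rather than a tested fix for the lxml bug, removes a DOCTYPE declaration only when it sits at the very start of the document. The function name `strip_leading_doctype` is hypothetical, and the code is written for Python 3 bytes.

```python
import re

# Matches optional leading whitespace followed by a single <!DOCTYPE ...>
# declaration at the very beginning of the document.
_LEADING_DOCTYPE = re.compile(rb"\A\s*<!DOCTYPE[^>]*>", re.IGNORECASE)


def strip_leading_doctype(page):
    """Return the page bytes without a leading DOCTYPE declaration.

    Pages that do not start with a DOCTYPE are returned unchanged.
    """
    return _LEADING_DOCTYPE.sub(b"", page, count=1)
```

The cleaned bytes can be passed to `bs4.BeautifulSoup` the same way as the sliced page in the session above.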


2012-11-10 16:00:20 DSM

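It depends on which parser bs4 is using: 'HTMLParseError: bad end tag: u"", at line 138, column 93'... 'lxml' and the Python 3.2 default HTML parser work fine – JBernardo

Thanks, DSM. But I need a more general solution, since my crawler isn't only for the nasa.gov site. – Taosof

I ran into the same problem when parsing HTML from the Wikipedia MW API. This worked great, quick and dirty. Thanks. – Nilesh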
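On the try ... except idea from the question: a segmentation fault kills the whole interpreter process, so no Python exception handler ever runs. One general way for a crawler to survive such a page is to attempt the parse in a child process first and inspect its exit status. The sketch below is only an illustration under that assumption. `run_isolated`, `parses_safely`, and `PARSE_SNIPPET` are hypothetical names, the code targets Python 3, and the child snippet needs bs4 plus the named parser to be installed.

```python
import subprocess
import sys

# Hypothetical child script: reads HTML from stdin and parses it with the
# parser named in argv[1]. It needs bs4 and that parser to be installed.
PARSE_SNIPPET = (
    "import sys\n"
    "from bs4 import BeautifulSoup\n"
    "BeautifulSoup(sys.stdin.buffer.read(), sys.argv[1])\n"
)


def run_isolated(snippet, data=b"", args=(), timeout=60.0):
    """Run a Python snippet in a child interpreter and return its exit status.

    A crash in the child (for example a segfault) shows up as a nonzero
    status instead of taking down the calling process.
    """
    proc = subprocess.run(
        [sys.executable, "-c", snippet, *args],
        input=data,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return proc.returncode


def parses_safely(page, parser="lxml"):
    """Return True if the child process parsed the page and exited cleanly."""
    return run_isolated(PARSE_SNIPPET, page, [parser]) == 0
```

A crawler could call `parses_safely(page)` and switch to another parser, or skip the page, when it returns False. This costs one extra interpreter start per page, so it may be worth reserving for pages that have already caused trouble.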
