BeautifulSoup cannot extract the src attribute from an IMG tag

Here is my code:

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

html = '''<img onload='javascript:if(this.width>950) this.width=950'
src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">'''
soup = BeautifulSoup(html)
imgs = soup.findAll('img')

print imgs[0].attrs

It prints [(u'onload', u'javascript:if(this.width>950) this.width=950')]

So where is the src attribute?

If I replace the HTML with something like html = '''<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />'''

I get the correct result: [(u'src', u'/image/fluffybunny.jpg'), (u'title', u'Harvey the bunny'), (u'alt', u'a cute little fluffy bunny')]

I am very new to HTML and BeautifulSoup. Am I missing some knowledge? Thanks for any ideas.

2013-04-14 foresightyj

I tested this with both BeautifulSoup 3 and BeautifulSoup 4, and bs4 (version 4) seems to parse your HTML better than version 3.

With BeautifulSoup 3:

>>> html = """<img onload='javascript:if(this.width>950) this.width=950' src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">""" 
>>> soup = BeautifulSoup(html) # Version 3 of BeautifulSoup 
>>> print soup 
<img onload="javascript:if(this.width&gt;950) this.width=950" />950) this.width=950' src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg"&gt;

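Notice that the > is now &gt; and that some bits are out of place: everything after the first > has been pushed outside the tag.

Also, when you call BeautifulSoup(), it splits the tag apart. If you print soup.img, you get: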

<img onload="javascript:if(this.width&gt;950) this.width=950" />

So you lose details, including the src attribute.
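If you want to confirm that the markup itself is fine and that the > inside the quoted onload value is what trips up the older parser, here is a minimal sketch using only the Python 3 standard library (html.parser, not BeautifulSoup). The class name ImgAttrCollector is just made up for this example.

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Collects the attributes of every <img> start tag as a dict."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        if tag == "img":
            self.images.append(dict(attrs))


html = '''<img onload='javascript:if(this.width>950) this.width=950'
src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">'''

collector = ImgAttrCollector()
collector.feed(html)
collector.close()

# The quoted ">" stays inside the onload value, so src is still seen.
print(collector.images[0].get("src"))
```

A tokenizer that respects quotes keeps the whole tag together, which is the behaviour you want from the parser.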

With bs4 (BeautifulSoup 4, the current version):

>>> html = '''<img onload='javascript:if(this.width>950) this.width=950' src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">''' 
>>> soup = BeautifulSoup(html) 
>>> print soup 
<html><body><img onload="javascript:if(this.width&gt;950) this.width=950" src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg"/></body></html>

Now with .attrs: in BeautifulSoup 3 it returns a list of tuples, as you have discovered. In BeautifulSoup 4 it returns a dictionary:

>>> print soup.findAll('img')[0].attrs # Version 3 
[(u'onload', u'javascript:if(this.width>950) this.width=950')] 

>>> print soup.findAll('img')[0].attrs # Version 4 
{'onload': 'javascript:if(this.width>950) this.width=950', 'src': 'http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg'}

So what should you do? Get BeautifulSoup 4. It parses this HTML much better.

By the way, if all you want is the src, you do not need to call .attrs:
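For reference, here is roughly what the same check looks like as a script on Python 3 with bs4 installed. This is a sketch, not the only way to do it: I am passing "html.parser" explicitly (an assumption on my part, any installed parser should do) because bs4 warns when no parser is named.

```python
from bs4 import BeautifulSoup

html = '''<img onload='javascript:if(this.width>950) this.width=950'
src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">'''

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")
img = soup.find("img")

# In bs4, attrs behaves like a dictionary rather than a list of tuples.
print(isinstance(img.attrs, dict))
print(sorted(img.attrs))

# Attributes can also be read directly from the tag.
print(img["src"])
```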

>>> print soup.findAll('img')[0].get('src') 
http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg
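One thing worth knowing about .get(): it returns None when the attribute is missing, whereas tag['src'] raises a KeyError. A small sketch for bs4 on Python 3 follows; the second img tag without a src is invented for the example, and "html.parser" is my choice of parser.

```python
from bs4 import BeautifulSoup

html = '''
<img onload='javascript:if(this.width>950) this.width=950'
     src="http://ww4.sinaimg.cn/mw600/c3107d40jw1e3rt4509j.jpg">
<img alt="made-up tag with no src">
'''

soup = BeautifulSoup(html, "html.parser")
imgs = soup.find_all("img")

# .get() is safe on tags that lack the attribute: it returns None.
all_srcs = [img.get("src") for img in imgs]

# src=True keeps only the tags that actually have a src attribute,
# so indexing with ["src"] cannot fail here.
real_srcs = [img["src"] for img in soup.find_all("img", src=True)]

print(all_srcs)
print(real_srcs)
```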

2013-04-16 10:59:18 TerryA

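Thanks for the excellent answer with all the details. I had not set up SO to email me replies automatically, so I only read this now. I installed bs4 and it works fine! – foresightyj

@foresightyj Haha, no problem :) – TerryA

This approach may also be useful: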

image = container.find("div", {"class": "ika-picture-flex-box"})
image = image.find_all("source")
image[1].get('srcset')
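The snippet above assumes a container that is already parsed and has at least two source tags. Here is a self-contained sketch of the same idea for bs4 on Python 3. The markup is hypothetical (only the class name ika-picture-flex-box comes from the snippet), and it guards against the div not being found.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for illustration.
html = '''
<div class="ika-picture-flex-box">
  <picture>
    <source media="(min-width: 800px)" srcset="/img/large.jpg 1x, /img/large@2x.jpg 2x">
    <source srcset="/img/small.jpg">
    <img src="/img/fallback.jpg" alt="example">
  </picture>
</div>
'''

container = BeautifulSoup(html, "html.parser")

# find() returns None when nothing matches, so check before going further.
box = container.find("div", {"class": "ika-picture-flex-box"})
sources = box.find_all("source") if box is not None else []

# Collect srcset from every <source> instead of relying on a fixed index.
srcsets = [source.get("srcset") for source in sources]
fallback = box.find("img").get("src") if box is not None else None

print(srcsets)
print(fallback)
```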

2018-02-14 09:31:10 Debdeep
