2013-10-30 133 views

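Extracting an img inside the description tag of an XML file using BeautifulSoup

I am doing some parsing and want to get the image inside the description tag. I am using urllib and BeautifulSoup. I can get an image that sits in its own tag, but I cannot get an image that is stored inside the description tag in encoded (HTML-escaped) form.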

XML code

<item> 
     <title>Kidnapped NDC member and political activist tells his story</title> 
     <link>http://www.yementimes.com/en/1724/news/3065</link> 
     <description>&lt;img src="http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg" border="0" align="left" hspace="5" /&gt; 
‘I kept telling them that they would never break me and that the change we demanded in 2011 would come whether they wanted it or not’ 
&lt;br clear="all"&gt;</description> 

views.py

for q in b.findAll('item'):
    d = {}
    d['desc'] = strip_tags(q.description.string).strip('&nbsp')
    if q.guid:
        d['link'] = q.guid.string
    else:
        d['link'] = strip_tags(q.comments)
    d['title'] = q.title.string
    for r in q.findAll('enclosure'):
        d['image'] = r['url']
    arr.append(d)

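Can anyone please give me an idea of how to do this?
The code above is what I wrote to parse images that sit in their own tag. I tried to get the image when it is inside the description, but I couldn't.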

Answer


You can try extracting all the content of <description>, creating a new BeautifulSoup object from it, and reading the src attribute of the first <img> element:

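from bs4 import BeautifulSoup
import html
import sys

# Parse the feed with Python's built-in parser, named explicitly.
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

for i in soup.find_all('item'):
    # The description holds escaped HTML: unescape it and parse it again.
    # html.unescape() replaces HTMLParser.unescape(), removed in Python 3.9.
    d = BeautifulSoup(html.unescape(i.description.string), 'html.parser')
    print(d.img['src'])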

Run it like:

python3 script.py xmlfile 

That yields:

http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg 
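
If some items have no image in their description, `d.img['src']` fails because `d.img` is `None`. Here is a minimal sketch of a more defensive variant that could slot into a loop like the one in the question. The names `first_image_src`, `collect_items` and `FEED` are hypothetical and only for illustration. The sample feed is the item from the question with a shortened text body and a closing `</item>` added, plus a made-up second item without an image.

```python
import html
from bs4 import BeautifulSoup

# Sample data: the item from the question (text shortened, </item> added)
# plus an invented second item that has no image.
FEED = """<item>
<title>Kidnapped NDC member and political activist tells his story</title>
<description>&lt;img src="http://www.yementimes.com/images/thumbnails/cms-thumb-000003081.jpg" border="0" align="left" hspace="5" /&gt;
Some text
&lt;br clear="all"&gt;</description>
</item>
<item>
<title>No image here</title>
<description>Plain text only</description>
</item>"""


def first_image_src(description_text):
    """Hypothetical helper: src of the first <img> in an escaped HTML
    fragment, or None when there is no text or no image."""
    if not description_text:
        return None
    fragment = BeautifulSoup(html.unescape(description_text), 'html.parser')
    img = fragment.find('img')
    return img.get('src') if img else None


def collect_items(feed_text):
    """Hypothetical loop modelled on the question's views.py."""
    soup = BeautifulSoup(feed_text, 'html.parser')
    items = []
    for item in soup.find_all('item'):
        d = {'title': item.title.string}
        src = first_image_src(item.description.get_text())
        if src:
            d['image'] = src  # only set when the description has an image
        items.append(d)
    return items


if __name__ == '__main__':
    for entry in collect_items(FEED):
        print(entry)
```

In the question's loop this would amount to something like `d['image'] = first_image_src(q.description.string)` as a fallback when no `enclosure` tag is present.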