How do I extract the src from an img tag using a regular expression?

I am trying to extract the image source URL from an HTML img tag.

If the HTML data looks like this:

<div> My profile <img width='300' height='300' src='http://domain.com/profile.jpg'> </div>

or

<div> My profile <img width="300" height="300" src="http://domain.com/profile.jpg"> </div>

How can I do this with a regular expression in Python?

I tried the following:

i = re.compile('(?P<src>src=[["[^"]+"][\'[^\']+\']])') 
i.search(htmldata)

but I get an error:

Traceback (most recent call last): 
File "<input>", line 1, in <module> 
AttributeError: 'NoneType' object has no attribute 'group'


2015-11-21 eachone

Have you tried creating your own regex? That would help. – Evert

The two lines of code above will not give you that error. – Evert
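The AttributeError in the question is what you get when .group() is called on the result of search() and nothing matched, because search() then returns None. Below is a minimal sketch of guarding against that. The pattern is a simplified, double-quote-only one chosen for illustration (it is not the asker's pattern), and the helper name first_src is made up.

```python
import re

# Simplified pattern for illustration only: handles double-quoted src values.
SRC_DOUBLE_QUOTED = re.compile(r'src="([^"]+)"')

def first_src(html):
    """Return the first double-quoted src value, or None if nothing matches."""
    match = SRC_DOUBLE_QUOTED.search(html)
    # search() returns None on failure, so check before calling .group()
    if match is None:
        return None
    return match.group(1)

double_quoted = '<img width="300" height="300" src="http://domain.com/profile.jpg">'
single_quoted = "<img width='300' height='300' src='http://domain.com/profile.jpg'>"

print(first_src(double_quoted))  # the URL from the double-quoted tag
print(first_src(single_quoted))  # None: this pattern ignores single quotes
```

Checking for None first turns a confusing traceback into an explicit "no match" case that the caller can handle.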

Possible duplicate of [Python regex string extraction](http://stackoverflow.com/questions/7384275/python-regex-string-extraction) – ozy

A BeautifulSoup parser is the way to go.

>>> from bs4 import BeautifulSoup 
>>> s = '''<div> My profile <img width='300' height='300' src='http://domain.com/profile.jpg'> </div>''' 
>>> soup = BeautifulSoup(s, 'html.parser') 
>>> img = soup.select('img') 
>>> [i['src'] for i in img if i['src']] 
[u'http://domain.com/profile.jpg'] 
>>>


2015-11-21 09:13:22

Or 'img.get('src')' –
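If installing BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library's html.parser module. Note that this swaps in a different parser from the one used in the answer, and the class name ImgSrcCollector is made up for illustration. Like a .get('src') lookup, it skips img tags that have no src instead of raising an error.

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names are lowercased
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags without a usable src
                self.sources.append(src)

html = """<div> My profile
<img width='300' height='300' src='http://domain.com/profile.jpg'>
<img alt="no source here">
</div>"""

collector = ImgSrcCollector()
collector.feed(html)
print(collector.sources)
```

Because the parser handles quoting itself, the same code works for single-quoted and double-quoted attributes.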

I adapted your code a little. Please take a look:

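import re

# The same markup, once with double-quoted and once with single-quoted attributes
url = """<div> My profile <img width="300" height="300" src="http://domain.com/profile.jpg"> </div>"""
url1 = """<div> My profile <img width='300' height='300' src='http://domain.com/profile.jpg'> </div>"""

# Non-greedy (.+?) stops at the first closing quote rather than the last one
link = re.compile(r"""src=["'](.+?)["']""")

links = link.finditer(url)
for l in links:
    print(l.group())   # the whole match, including src= and the quotes
    print(l.groups())  # a tuple of the captured groups (just the URL)

links1 = link.finditer(url1)
for l in links1:
    print(l.groups())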

The link is in the tuple returned by groups().

The output is:

src="http://domain.com/profile.jpg" 
('http://domain.com/profile.jpg',) 
('http://domain.com/profile.jpg',)

finditer() returns an iterator over the match objects, so it can be used in a for ... in loop.
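A slightly stricter variant of the finditer() approach can be sketched as follows. It only looks inside img tags and uses a named backreference so that the closing quote must be the same character as the opening one. The pattern and the helper name extract_img_srcs are illustrative assumptions, not part of the original answer, and a regex like this can still be fooled by unusual markup (for example a > inside an attribute value).

```python
import re

# Illustrative pattern: match src only inside <img ...>, and require the
# closing quote to equal the opening quote via the (?P=quote) backreference.
IMG_SRC = re.compile(
    r"""<img\b[^>]*?\ssrc\s*=\s*(?P<quote>["'])(?P<src>.*?)(?P=quote)""",
    re.IGNORECASE,
)

def extract_img_srcs(html):
    """Return the src value of every <img> tag the pattern finds, in order."""
    return [match.group("src") for match in IMG_SRC.finditer(html)]

html = """
<div> My profile <img width='300' height='300' src='http://domain.com/profile.jpg'> </div>
<div> My profile <img width="300" height="300" src="http://domain.com/profile.jpg"> </div>
<div> <img src="a.png" alt="attribute after src"> </div>
"""

for src in extract_img_srcs(html):
    print(src)
```

The third tag has an attribute after src, which is the case where a greedy (.+) would run on to the last quote in the tag.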

Sources:

http://www.tutorialspoint.com/python/python_reg_expressions.htm

https://docs.python.org/2/howto/regex.html


2015-11-21 10:44:57 rocksteady
