Python: easy substring/parsing

I have a string like this:

<img src="http://www.askgamblers.com/cache/97299a130feb2e59a08a08817daf2c0e6825991f_begado-casino-logo-review1.jpg" /><br/> 
Begado is the newest online casino in our listings. As the newest 
member of the Affactive group, Begado features NuWorks slots and games 
for both US and international players. 
<img src="http://feeds.feedburner.com/~r/AskgamblesCasinoNews/~4/SXhvCskjiYo" height="1" width="1"/>

I need to get the src from the first img tag.

Is there an easy way to do this?

2012-10-31 yital9

Do you mean 'src'? –

Do you mean the 'src' attribute? – Vikas

Any time I see HTML, my brain immediately goes to BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html). Check here (http://stackoverflow.com/questions/5815747/beautifulsoup-getting-href) for a similar question. – RocketDonkey

For HTML screen scraping in Python, I recommend the Beautiful Soup library.

from bs4 import BeautifulSoup

# html_doc is the HTML string from the question
soup = BeautifulSoup(html_doc, 'html.parser')
images = soup.find_all('img')
print(images[0]['src'])

2012-10-31 21:21:15

Thanks! That's a nice solution – yital9

One way to do this is with a regex.

Another way is to split the string on the double-quote character and take the second element returned.

splits = your_string.split('"')
print(splits[1])
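
As a quick check of the split approach, here is a minimal sketch run on a shortened stand-in for the string from the question. The helper name `first_quoted` and the `html` sample are illustrative, not part of the original answer, and the sketch assumes the first double-quoted value in the string is the src you want.

```python
def first_quoted(text):
    """Return the first double-quoted value in text, or None if there is none."""
    parts = text.split('"')
    # parts[1] is a complete quoted value only if both an opening and a
    # closing quote were found, which means at least 3 pieces
    return parts[1] if len(parts) >= 3 else None

# Shortened sample, assumed to have the same shape as the question's string
html = ('<img src="http://example.com/logo.jpg" /><br/> Some text '
        '<img src="http://example.com/pixel" height="1" width="1"/>')

print(first_quoted(html))
print(first_quoted('no quotes here'))
```

Note that this breaks as soon as another quoted attribute, such as alt, appears before src.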

2012-10-31 21:16:54

Obligatory warning, "Don't parse HTML with regex": https://stackoverflow.com/a/1732454/505154

A hacky regex solution:

import re 
re.findall(r'<img\s*src="([^"]*)"\s*/>', text)

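This returns a list with the src value of every <img> tag that contains only a src attribute (which works here because you said you only want to match the first one).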
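
To see that behaviour, the sketch below runs the pattern on a shortened stand-in for the question's string, where the second tag carries extra attributes. The `text` sample and the looser second pattern are assumptions added for comparison, not part of the answer.

```python
import re

# Shortened sample with the same two-tag shape as the question's string
text = ('<img src="http://example.com/logo.jpg" /><br/> Some text '
        '<img src="http://example.com/pixel" height="1" width="1"/>')

# Pattern from the answer: only matches tags whose sole attribute is src
strict = re.findall(r'<img\s*src="([^"]*)"\s*/>', text)

# Looser variant (an assumption, not from the answer): allows other
# attributes before src and ignores whatever follows it
loose = re.findall(r'<img\b[^>]*?\bsrc="([^"]*)"', text)

print(strict)
print(loose)
```

The strict pattern skips the second tag because of its height and width attributes, while the looser one picks up both URLs. Neither copes with single-quoted or unquoted attribute values.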

2012-10-31 21:17:09

Here is a quick and ugly way to do it without any libraries:

""" 
    >>> get_src(data) 
    ['http://www.askgamblers.com/cache/97299a130feb2e59a08a08817daf2c0e6825991f_begado-casino-logo-review1.jpg', 'http://feeds.feedburner.com/~r/AskgamblesCasinoNews/~4/SXhvCskjiYo'] 
""" 

data = """<img src="http://www.askgamblers.com/cache/97299a130feb2e59a08a08817daf2c0e6825991f_begado-casino-logo-review1.jpg" /><br/> 
Begado is the newest online casino in our listings. As the newest 
member of the Affactive group, Begado features NuWorks slots and games 
for both US and international players. 
<img src="http://feeds.feedburner.com/~r/AskgamblesCasinoNews/~4/SXhvCskjiYo" height="1" width="1"/>""" 

def get_src(text):
    srcs = []
    for line in text.splitlines():
        start = line.find('src="')
        if start == -1:
            continue  # no src attribute on this line
        start += len('src="')
        end = line.find('"', start)
        if end != -1:
            srcs.append(line[start:end])
    return srcs

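However, I would recommend using Beautiful Soup, a very nice library built specifically to deal with the real web (broken HTML and all), or, if your data is valid XML, you can use ElementTree from the Python standard library.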
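
For the ElementTree route, a minimal sketch might look like the following. It assumes the fragment is well-formed XML once wrapped in a single root element; the `<root>` wrapper and the shortened `fragment` sample are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Shortened sample, assumed to be well-formed XML apart from lacking a root
fragment = ('<img src="http://example.com/logo.jpg" /><br/> Some text '
            '<img src="http://example.com/pixel" height="1" width="1"/>')

# XML needs a single root element, so wrap the fragment in a
# hypothetical <root> tag before parsing
root = ET.fromstring('<root>' + fragment + '</root>')

first_img = root.find('img')  # first <img> child of root, or None
first_src = first_img.get('src') if first_img is not None else None
print(first_src)
```

This raises `xml.etree.ElementTree.ParseError` on input that is not well-formed, for example an unclosed `<br>` or an unescaped `&` in a URL, which is where Beautiful Soup is the safer choice.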

2012-10-31 21:34:05
