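Extract the path and file name from an <img> tag

I have the source code of some web pages, and I need to find every occurrence of the <img> tag and extract each image's name and location. For example, for <img src="../images/test.jpg" /> I need path="../images/" and file="test.jpg". How can I do this with a regular expression?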
Answers
4
You should use lxml.html:
>>> from urllib.request import urlopen  # on Python 2: from urllib2 import urlopen
>>> from lxml import html
>>> page = urlopen('http://www.amazon.co.uk/')
>>> page_source = html.parse(page)
>>> from pprint import pprint
>>> pprint(page_source.xpath('//img/@src'))
['http://g-ecx.images-amazon.com/images/G/02/gno/images/orangeBlue/navPackedSprites-UK-15._V202471918_.png',
'http://g-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192199769_.gif',
'http://g-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192199769_.gif',
'http://g-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192199769_.gif',
'http://g-ecx.images-amazon.com/images/G/02/uk-marketing/xmas10/janbargains/uk-january-bargains-loz75._V175451391_.gif',
'http://g-ecx.images-amazon.com/images/G/02/UK-Shoe/email/7_jan_11-amzn-sale-loz-1._V173375114_.png',
'http://g-ecx.images-amazon.com/images/G/02/uk-jw/homepage/uk-wtch-police-roto._V185455265_.png',
'http://g-ecx.images-amazon.com/images/G/02/kindle/shasta/merch/gw/shasta-gw-bestselling-01a-470x265._V173993687_.jpg',
'http://ecx.images-amazon.com/images/I/412wF8LJ-uL._SL135_.jpg',
'http://ecx.images-amazon.com/images/I/51YC5H64AuL._SL135_.jpg',
'http://ecx.images-amazon.com/images/I/41%2BdpTvM1FL._SL135_.jpg',
'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
'http://g-ecx.images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V42752373_.gif',
'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
'http://g-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192199769_.gif',
'http://ecx.images-amazon.com/images/I/51-kiOR0NwL._SL135_.jpg',
'http://ecx.images-amazon.com/images/I/51DRc-7HuxL._SL135_.jpg',
'http://ecx.images-amazon.com/images/I/51SK5htD22L._SL135_.jpg',
'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
'http://ecx.images-amazon.com/images/I/31POT%2BzL1tL._SS120_RO10,1,201,225,243,255,255,255,15_.jpg',
'http://ecx.images-amazon.com/images/I/41hkDkhjrTL._SS120_RO10,1,201,225,243,255,255,255,15_.jpg',
'http://ecx.images-amazon.com/images/I/41zDYiAWasL._SS120_RO10,1,201,225,243,255,255,255,15_.jpg',
'http://ecx.images-amazon.com/images/I/31HqB5H8j%2BL._SS120_RO10,1,201,225,243,255,255,255,15_.jpg',
'http://g-ecx.images-amazon.com/images/G/02/uk-clothing/Lingerie/UK_APP_LingerieStore_50._V171062881_.png',
'http://g-ecx.images-amazon.com/images/G/02/uk-pets/graphics/B000FVC1HE_50._V198692831_.jpg',
'http://g-ecx.images-amazon.com/images/G/02/uk-grocery/images/illy_50._V198779066_.gif',
'http://g-ecx.images-amazon.com/images/G/02/uk-electronics/MI_Store/UK_MIN_MILaunch_50._V191178779_.png',
'http://g-ecx.images-amazon.com/images/G/02/uk-lighting/graphics/NoveltyLighting_50._V192237013_.jpg',
'http://g-ecx.images-amazon.com/images/G/02/UK-Shoe/email/7_jan_11-amzn-sale-TCG-1._V173375108_.png',
'http://g-ecx.images-amazon.com/images/G/02/gno/images/general/navAmazonLogoFooter._V192252709_.gif']
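Each src value in a list like the one above still has to be split into the directory and the file name the question asks for. A minimal sketch using only the standard library follows; `split_src` is a hypothetical helper name introduced here, not part of lxml, and it assumes the src is a URL or a relative path with `/` separators.

```python
import posixpath
from urllib.parse import urlsplit


def split_src(src):
    """Split an image URL or relative path into (directory, file name)."""
    # urlsplit isolates the path component, dropping scheme, host,
    # query string and fragment if they are present.
    url_path = urlsplit(src).path
    directory, filename = posixpath.split(url_path)
    # posixpath.split drops the trailing slash; add it back so the
    # result matches the path="../images/" form used in the question.
    if directory and not directory.endswith("/"):
        directory += "/"
    return directory, filename


print(split_src("../images/test.jpg"))
print(split_src("http://ecx.images-amazon.com/images/I/412wF8LJ-uL._SL135_.jpg"))
```

For absolute URLs this returns only the path part of the URL as the directory; if the host is needed as well, keep the `netloc` field of the `urlsplit` result.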
3
You should not use regular expressions to parse HTML, for the various reasons outlined in this answer. You should use an HTML parser instead.
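If installing a third-party parser is not an option, Python's built-in `html.parser` module may be enough. The sketch below is one possible approach, not the only one; the class name `ImgSrcCollector` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased. Self-closing tags
        # such as <img ... /> are also routed here by the default
        # handle_startendtag implementation.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


sample = '<p><img src="../images/test.jpg" /><img alt="x" src=\'a/b.png\'></p>'
collector = ImgSrcCollector()
collector.feed(sample)

for src in collector.sources:
    # rpartition splits on the last "/" so the separator can stay with the path.
    path, sep, filename = src.rpartition("/")
    print(path + sep, filename)
```

The parser handles quoting styles and attribute order, so the only string handling left is splitting each src on its last slash.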
0
There are several ways to do this. You could use a capturing group:
path=("[^"]+")
or a lookbehind:
(?<=path=)"[^"]+"
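There are probably other options as well. Either way, as the previous posters said, you should probably use an HTML parser for this job. If you do use regular expressions, though, you will probably need to extract the img tags first and then run one of the regular expressions above on each tag.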
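The two-step idea (find the img tags, then pull the attribute out of each one) might look like the sketch below. It makes some assumptions of its own: the pattern targets `src=`, since that is the attribute in the question's markup, and the names `IMG_TAG`, `SRC_ATTR` and `find_images` are hypothetical. It only handles quoted attribute values, and it can be fooled by a `>` inside an attribute value or by attributes such as `data-src`.

```python
import re

# Step 1: a whole <img ...> tag, up to the first ">".
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
# Step 2: a quoted src value; group 1 is the quote character, group 2 the value.
SRC_ATTR = re.compile(r"""\bsrc\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)


def find_images(html_text):
    """Return a list of (path, file name) pairs, one per <img> tag with a src."""
    results = []
    for tag in IMG_TAG.findall(html_text):
        match = SRC_ATTR.search(tag)
        if match is None:
            continue  # <img> without a quoted src attribute
        src = match.group(2)
        # Split on the last "/" and keep the separator with the path.
        path, sep, filename = src.rpartition("/")
        results.append((path + sep, filename))
    return results


print(find_images('<img src="../images/test.jpg" /> <IMG alt="a" SRC=\'pic.gif\'>'))
```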