Find specific link w/beautifulsoup

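Hi, I can't for the life of me figure out how to find links that begin with certain text. findall('a') works fine, but it returns far too much. I just want a list of all links that begin with http://www.nhl.com/ice/boxscore.htm?id=

Can anyone help me?

Thank you very much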

Source

2011-10-11 Jen Scott

First, set up a test document and open up the parser with BeautifulSoup:

>>> from BeautifulSoup import BeautifulSoup 
>>> doc = '<html><body><div><a href="something">yep</a></div><div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div><a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a></body></html>' 
>>> soup = BeautifulSoup(doc) 
>>> print soup.prettify() 
<html> 
<body> 
    <div> 
    <a href="something"> 
    yep 
    </a> 
    </div> 
    <div> 
    <a href="http://www.nhl.com/ice/boxscore.htm?id=3"> 
    somelink 
    </a> 
    </div> 
    <a href="http://www.nhl.com/ice/boxscore.htm?id=7"> 
    another 
    </a> 
</body> 
</html>

Next, we can search for all <a> tags with an href attribute that starts with http://www.nhl.com/ice/boxscore.htm?id=. You can use a regular expression for it:

>>> import re 
>>> soup.findAll('a', href=re.compile('^http://www.nhl.com/ice/boxscore.htm\?id=')) 
[<a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a>, <a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a>]
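
The session above uses the old BeautifulSoup 3 import and the Python 2 print statement. As a rough sketch of the same idea on Python 3, assuming the bs4 package (Beautiful Soup 4) is installed, a plain prefix test can stand in for the regular expression, which avoids having to escape the dots and the question mark. PREFIX and is_boxscore are names made up here for illustration, not part of the original answer:

```python
from bs4 import BeautifulSoup

# Same test document as in the answer above.
doc = ('<html><body><div><a href="something">yep</a></div>'
       '<div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div>'
       '<a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a></body></html>')

PREFIX = "http://www.nhl.com/ice/boxscore.htm?id="  # illustrative name


def is_boxscore(href):
    # Beautiful Soup passes None when an <a> tag has no href attribute.
    return href is not None and href.startswith(PREFIX)


soup = BeautifulSoup(doc, "html.parser")
# A function given as an attribute filter is called with the attribute value.
links = [a["href"] for a in soup.find_all("a", href=is_boxscore)]
print(links)
```

Passing re.compile(...) as the href filter, as in the answer, should work the same way with find_all; the function form is just easier to read when the test is a fixed prefix.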

Source

2011-10-11 21:35:44 jterrace

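Wow, thank you. I guess the Beautiful Soup docs presuppose fluency in regular expressions. Thanks for showing me, –

@JenScott If this answered your question, you should accept it. – serk

Good, but what if your attribute name is called "class"? – Wajih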
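
On the "class" question: class is a reserved word in Python, so it cannot be written as a keyword argument. A minimal sketch, assuming Beautiful Soup 4 (the bs4 package) is installed: either use the class_ keyword or pass the attribute in an attrs dictionary. The sample markup and the variable names below are invented for illustration:

```python
from bs4 import BeautifulSoup

# Invented sample markup: one link with the class we want, two without.
html = ('<a class="boxscore" href="/a">one</a>'
        '<a class="other" href="/b">two</a>'
        '<a href="/c">three</a>')
soup = BeautifulSoup(html, "html.parser")

# Option 1: the class_ keyword (trailing underscore avoids the reserved word).
by_keyword = soup.find_all("a", class_="boxscore")

# Option 2: an attrs dictionary, where "class" is just a string key.
by_attrs = soup.find_all("a", attrs={"class": "boxscore"})

keyword_texts = [a.get_text() for a in by_keyword]
attrs_texts = [a.get_text() for a in by_attrs]
print(keyword_texts, attrs_texts)
```

The attrs dictionary form is also the one to reach for with the older BeautifulSoup 3 findAll used in the answer above.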

You might not need BeautifulSoup, since the search is specific:

>>> import re 
>>> # Raw string; [^"]+ stops at the closing quote (a greedy .+ would run to the end of the line)
>>> links = re.findall(r'http://www\.nhl\.com/ice/boxscore\.htm\?id=[^"]+', str(doc))
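
One caveat with a regular expression over the raw markup is that it matches the URL wherever it appears, not only inside href attributes. If the aim is simply to avoid a third-party dependency, a sketch along the following lines uses the standard library's html.parser instead. This is a swapped-in technique rather than what the answer proposes, and LinkCollector and PREFIX are hypothetical names:

```python
from html.parser import HTMLParser

PREFIX = "http://www.nhl.com/ice/boxscore.htm?id="  # illustrative name


class LinkCollector(HTMLParser):  # hypothetical helper, not a library class
    """Collect href values of <a> tags that start with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "href" and value and value.startswith(self.prefix):
                self.links.append(value)


# Same test document as in the first answer.
doc = ('<html><body><div><a href="something">yep</a></div>'
       '<div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div>'
       '<a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a></body></html>')

collector = LinkCollector(PREFIX)
collector.feed(doc)
collector.close()
print(collector.links)
```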

Source

2016-05-02 16:05:36 Emma
