2017-10-20 43 views

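Beautifulsoup - findAll does not find a string when a link is also in the container

I am using the findAll function in BeautifulSoup to scrape text from a web page and return the results in a list. For some reason, it does not return an entry when the td container also contains a link. For example: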

<html>
<tr> 
<td> 
    Taken at. string without link, this is found 
</td> 
</tr> 
<tr> 
<td> 
    Taken at. string followed by link, this is not found 
    <a href="http://www.thisisalink.com/index.html"> 
    text for link 
    </a> 
</td> 
</tr> 
</html> 

The first td container is returned, but the second one, which also contains a link, is not returned by the following code:

import requests 
from bs4 import BeautifulSoup 
import re 

genus = 'Parsonsia' 
species = 'straminea' 

page = requests.get("https://www.anbg.gov.au/cgi-bin/apiiName?030=" + genus + "+" + species) 
soup = BeautifulSoup(page.content, 'html.parser') 

grep_str = '^Taken at.*$' 
pattern = re.compile(grep_str) 
location = soup.findAll('td', text=pattern) 

for item in location: 
    print(item) 

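How can I get the findAll function to return both instances? The results are put into a data.frame together with other scraped data, so it is important to find all of these instances in one pass and in the correct order.

Cheers!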

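It won't return anything with the current code, not even the first 'td' container – RomanPerekhrest

Apologies, my example must have broken the code. Let me edit the question.

@RomanPerekhrest Thanks for pointing out that the code wasn't working. It should work now.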

Answers

Indeed, in the cases I tested, BeautifulSoup does not seem to give the expected result.
Use the lxml.html library instead:

import lxml.html as html 
import requests 

genus = 'Parsonsia' 
species = 'straminea' 

page = requests.get("https://www.anbg.gov.au/cgi-bin/apiiName?030=" + genus + "+" + species) 
root = html.fromstring(page.content) 
for td in root.xpath("//td[contains(text(),'Taken at')]"): 
    print(td.text_content()) 

Actual output:

Taken at. ANBG nursery 
Taken at. ANBG 
Taken at. ANBG 
Taken at. Robertson to Belmore Falls Road, NSW 
Taken at. Robertson to Belmore Falls Road, NSW 
Taken at. Robertson to Belmore Falls Road, NSW 
Taken at. Robertson to Belmore Falls Road, NSW 
Taken at. Back Hillston Rd, near Goolgowi, NSW 
Taken at. shoreline, Tuross Head, NSW 
Taken at. shoreline, Tuross Head, NSW 
Taken at. shoreline, Tuross Head, NSW 
Taken at. shoreline, Tuross Head, NSW 
Taken at. shoreline, Tuross Head, NSW 
Taken at. Tuross Head, near Memorial Gardens, Tuross, NSW 
Taken at. Tuross Head, near Memorial Gardens, Tuross, NSW 
Taken at. Lake Conjola beach, N of Ulladulla, NSW 
Taken at. Chain Valley Bay, Lake Macquarie State Conservation Area, NSW 
Taken at. Chain Valley Bay, Lake Macquarie State Conservation Area, NSW 
Taken at. Wright's Lookout walk, New England Nat Pk, NSW 
Taken at. Wright's Lookout walk, New England Nat Pk, NSW 
Taken at. Wright's Lookout walk, New England Nat Pk, NSW 
Taken at. Boondall Wet Lands, Brisbane QLD 
Taken at. Boondall Wet Lands, Brisbane QLD 
Taken at. Boondall Wet Lands, Brisbane QLD 
Taken at. see Australian Tropical Rainforest Plants Key 
Taken at. see Australian Tropical Rainforest Plants Key 
Taken at. see Australian Tropical Rainforest Plants Key 
Taken at. see Australian Tropical Rainforest Plants Key 
Taken at. see Australian Tropical Rainforest Plants Key 
Taken at. see Australian Tropical Rainforest Plants Key 

http://lxml.de/lxmlhtml.html
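
To see what the XPath query and text_content() do without hitting the live site, here is a minimal sketch on a cut-down copy of the sample HTML from the question. The variable names (snippet, cells, full_text, leading_text) are made up for this illustration. It relies on two standard behaviours: in XPath 1.0, contains(text(), ...) tests the first text node of each td, and in lxml, td.text is the text that comes before the first child element.

```python
import lxml.html as html

# Cut-down copy of the question's sample HTML (not the real ANBG page).
snippet = (
    "<html><body><table>"
    "<tr><td>Taken at. string without link</td></tr>"
    "<tr><td>Taken at. string followed by link "
    '<a href="http://www.thisisalink.com/index.html">text for link</a></td></tr>'
    "</table></body></html>"
)

root = html.fromstring(snippet)

# Both cells match, because the check looks at each td's first text node.
cells = root.xpath("//td[contains(text(),'Taken at')]")

# text_content() concatenates all descendant text, so link text is included.
full_text = [td.text_content() for td in cells]

# td.text is only the text before the first child element (the <a>, if any).
leading_text = [td.text.strip() for td in cells]

for line in leading_text:
    print(line)
```

If the link text is not wanted in the result, td.text may be the more convenient choice; text_content() is better when the whole cell is needed.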

Thanks, Roman. I will have to get familiar with this package.

This is interesting. I don't know why it happens, but here is a workaround.
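
A likely explanation, sketched on a cut-down copy of the sample HTML from the question rather than on the live page: when findAll is given a tag name together with a text pattern, BeautifulSoup compares the pattern against each tag's .string, and .string is None as soon as the tag has more than one child (here a text node plus an a tag). The names sample_html, by_tag and text_nodes are made up for this illustration, and string= is the newer spelling of the text= argument.

```python
import re
from bs4 import BeautifulSoup

# Cut-down copy of the question's sample HTML (not the real ANBG page).
sample_html = (
    "<table>"
    "<tr><td>Taken at. string without link</td></tr>"
    "<tr><td>Taken at. string followed by link"
    '<a href="http://www.thisisalink.com/index.html">text for link</a></td></tr>'
    "</table>"
)
soup = BeautifulSoup(sample_html, "html.parser")
pattern = re.compile(r"^Taken at.*$")

cells = soup.find_all("td")

# .string is only defined when a tag has a single child; the second td has
# a text node and an <a>, so its .string is None.
strings = [td.string for td in cells]

# Tag name + string pattern: only the td with a usable .string is matched.
by_tag = soup.find_all("td", string=pattern)

# Matching text nodes directly finds both strings, in document order.
text_nodes = soup.find_all(string=pattern)

# Walk back up to the enclosing td if the tag itself is needed.
parents = [node.find_parent("td") for node in text_nodes]

for node in text_nodes:
    print(node)
```

Searching text nodes and then calling find_parent keeps the results in page order, which matters if they are later lined up with other scraped columns.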

You can first pull all the td tags into a list and then filter them by the text they contain.

td_all = soup.findAll('td') 
location = list(filter(lambda td: 'Taken at.' in td.text, td_all)) 

If you are going to iterate over location only once in your code, it is better to drop the list conversion and use the filter object directly in your loop:

location = filter(lambda td: 'Taken at.' in td.text, td_all) 
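
The trade-off between the two forms is that a filter object is a one-shot iterator in Python 3. The sketch below uses plain strings as stand-ins for the td tags (the list contents are invented for the illustration) to show what happens on a second pass.

```python
# Stand-in values for td.text; invented for this illustration only.
cell_texts = [
    "Taken at. ANBG nursery",
    "some other cell",
    "Taken at. ANBG",
]

# Lazy version: nothing is evaluated until the filter object is iterated.
lazy = filter(lambda text: "Taken at." in text, cell_texts)
first_pass = list(lazy)   # consumes the iterator
second_pass = list(lazy)  # already exhausted, so this is empty

# Materialised version: a list can be iterated as often as needed.
location = list(filter(lambda text: "Taken at." in text, cell_texts))

print(first_pass)
print(second_pass)
print(len(location))
```

So keep the list conversion if location is reused later, for example when building a table column and also printing it.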

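Edit: alternative solution

The page you are trying to scrape is well structured, so it is easy to navigate (at least for the page you mention in the question).

Since each image index entry is contained within a tr, we can first pull all of those into a list. But because each of these tr elements contains nested tables, we need to get only the direct children, which can be done with the recursive=False argument of the findAll method.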

trows = soup.table.tbody.findAll('tr', recursive=False) 

for trow in trows[1:]: 
    print(trow.findAll('tr')[1].td.text) 

I iterate only from the second item in the list, because the first one is the header row.
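
The effect of recursive=False can be checked on a small nested table. The HTML below is a hypothetical stand-in for the page layout (a header row, then one outer row per image, each holding a nested two-row table); the place names and variable names are invented. It has no tbody element, so the sketch starts from soup.table directly.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the layout described above, not the real page.
nested_html = (
    "<table>"
    "<tr><th>Header</th></tr>"
    "<tr><td><table>"
    "<tr><td>Photo 1</td></tr><tr><td>Taken at. place A</td></tr>"
    "</table></td></tr>"
    "<tr><td><table>"
    "<tr><td>Photo 2</td></tr><tr><td>Taken at. place B</td></tr>"
    "</table></td></tr>"
    "</table>"
)
soup = BeautifulSoup(nested_html, "html.parser")

# Default (recursive) search also returns the rows of the nested tables.
all_rows = soup.table.find_all("tr")

# recursive=False restricts the search to direct children of the outer table.
outer_rows = soup.table.find_all("tr", recursive=False)

# Skip the header row; take the second row of each nested table.
locations = [row.find_all("tr")[1].td.text for row in outer_rows[1:]]

print(len(all_rows), len(outer_rows))
print(locations)
```

Because the loop walks the outer rows in document order, the resulting list has exactly one entry per image, in page order.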

This prints out the whole list of 30 entries:

Taken at. ANBG nursery 
Taken at. ANBG 
Taken at. ANBG 
Taken at. Robertson to Belmore Falls Road, NSW 
Taken at. Robertson to Belmore Falls Road, NSW 
Taken at. Robertson to Belmore Falls Road, NSW 
Taken at. Robertson to Belmore Falls Road, NSW 
Taken at. Back Hillston Rd, near Goolgowi, NSW 
Taken at. shoreline, Tuross Head, NSW 
Taken at. shoreline, Tuross Head, NSW 
Taken at. shoreline, Tuross Head, NSW 
Taken at. shoreline, Tuross Head, NSW 
Taken at. shoreline, Tuross Head, NSW 
Taken at. Tuross Head, near Memorial Gardens, Tuross, NSW 
Taken at. Tuross Head, near Memorial Gardens, Tuross, NSW 
Taken at. Lake Conjola beach, N of Ulladulla, NSW 
Taken at. Chain Valley Bay, Lake Macquarie State Conservation Area, NSW 
Taken at. Chain Valley Bay, Lake Macquarie State Conservation Area, NSW 
Taken at. Wright's Lookout walk, New England Nat Pk, NSW 
Taken at. Wright's Lookout walk, New England Nat Pk, NSW 
Taken at. Wright's Lookout walk, New England Nat Pk, NSW 
Taken at. Boondall Wet Lands, Brisbane QLD 
Taken at. Boondall Wet Lands, Brisbane QLD 
Taken at. Boondall Wet Lands, Brisbane QLD 
Taken at. see Australian Tropical Rainforest Plants Key 
Taken at. see Australian Tropical Rainforest Plants Key 
Taken at. see Australian Tropical Rainforest Plants Key 
Taken at. see Australian Tropical Rainforest Plants Key 
Taken at. see Australian Tropical Rainforest Plants Key 
Taken at. see Australian Tropical Rainforest Plants Key