Scraping nested tags with Python

I am scraping fixed content from a particular website. The content sits inside nested divs, as shown below:

<div class="table-info"> 
    <div> 
    <span>Time</span> 
     <div class="overflow-hidden"> 
      <strong>Full</strong> 
     </div> 
    </div> 
    <div> 
    <span>Branch</span> 
     <div class="overflow-hidden"> 
      <strong>IT</strong> 
     </div> 
    </div> 
    <div> 
    <span>Type</span> 
     <div class="overflow-hidden"> 
      <strong>Standard</strong> 
     </div> 
    </div> 
    <div> 
    <span>contact</span> 
     <div class="overflow-hidden"> 
      <strong>my location</strong> 
     </div> 
</div> 
</div>

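I want to retrieve only the content of the <strong> inside the "overflow-hidden" div whose <span> holds the string Branch. The code I am using is: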

from bs4 import BeautifulSoup 
import urllib2 
url = urllib2.urlopen("https://www.xyz.com") 
content = url.read() 
soup = BeautifulSoup(content) 
type = soup.find('div',attrs={"class":"table-info"}).findAll('span') 
print type

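I scrape every <span> inside the main "table-info" div so that I can use conditional statements to pick out the content I need. But if I try to scrape the div that goes with each span like this: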

type = soup.find('div',attrs={"class":"table-info"}).findAll('span').find('div') 
print type

the error I get is:

AttributeError: 'list' object has no attribute 'find'

Can anyone give me some ideas for retrieving the div content that belongs to a span? Thanks. I am using Python 2.7.

Source

2014-04-01 sulav_lfc

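It looks like you want the content of the second div inside the "table-info" div. However, you are trying to reach it through a tag that has nothing to do with the content you want.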

type = soup.find('div',attrs={"class":"table-info"}).findAll('span').find('div')

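raises an error because findAll returns a list of tags, and a list has no find method. Even on a single span the lookup would come back empty: each <span> contains only its label text, and the div you want is its sibling, not its child.

Try this instead: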

from bs4 import BeautifulSoup 
import urllib2 
url = urllib2.urlopen("https://www.xyz.com") 
content = url.read() 
soup = BeautifulSoup(content) 
type = soup.find('div',attrs={"class":"table-info"}).findAll('div') 
print type[2].find('strong').string
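
To make the behaviour concrete, here is a minimal sketch of why the chained call fails and why indexing the divs works. It is written for Python 3 and parses a shortened inline copy of the sample markup instead of fetching a URL; both choices, and all variable names, are assumptions made for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the sample markup, so no network access is needed.
html = """
<div class="table-info">
  <div><span>Time</span><div class="overflow-hidden"><strong>Full</strong></div></div>
  <div><span>Branch</span><div class="overflow-hidden"><strong>IT</strong></div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("div", attrs={"class": "table-info"})

# find_all (the current bs4 name for findAll) returns a list-like ResultSet.
spans = table.find_all("span")

# A ResultSet has no find() method; only individual tags do.
try:
    spans.find("div")
    chained_call_failed = False
except AttributeError:
    chained_call_failed = True

# Each <span> holds only text; the value <div> is its sibling, not its child.
span_has_div = spans[1].find("div") is not None

# Nested divs come back in document order: row, value, row, value, ...
divs = table.find_all("div")
branch_value = divs[2].find("strong").string

print(chained_call_failed, span_has_div, branch_value)
```

Indexing by position is short but brittle: if the site adds or reorders a row, index 2 will point at a different field.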

Source

2014-04-01 05:25:42 Anish

Thanks, the code works. I think I was taking a completely wrong approach to the problem.

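findAll returns a list of BeautifulSoup elements, and find is defined on a single BeautifulSoup object, not on a list of them, hence the error. The first part of your code is fine; do this instead: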

from bs4 import BeautifulSoup 
import urllib2 

url = urllib2.urlopen("https://www.xyz.com") 
content = url.read() 
soup = BeautifulSoup(content) 

table = soup.find('div',attrs={"class":"table-info"}) 
spans = table.findAll('span') 
branch_span = spans[1]
# Do your manipulation with branch_span

Or, to select the span by its text instead of its position:

from bs4 import BeautifulSoup 
import urllib2 

url = urllib2.urlopen("https://www.xyz.com") 
content = url.read() 
soup = BeautifulSoup(content) 

table = soup.find('div',attrs={"class":"table-info"}) 
spans = table.findAll('span') 

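for span in spans:
    if span.text.strip().lower() == 'branch':
        # The value sits in the <div> that follows the span, not inside it
        print span.find_next_sibling('div').find('strong').string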
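
As a sketch of where the text-matching approach can lead, the loop can collect every label and its value into a dictionary, so that any field can be looked up by name. The example below assumes Python 3 and an inline copy of the sample markup; `read_table_info` is a hypothetical helper name, not something from the original answer.

```python
from bs4 import BeautifulSoup

# Inline copy of the sample markup from the question.
html = """
<div class="table-info">
  <div><span>Time</span><div class="overflow-hidden"><strong>Full</strong></div></div>
  <div><span>Branch</span><div class="overflow-hidden"><strong>IT</strong></div></div>
  <div><span>Type</span><div class="overflow-hidden"><strong>Standard</strong></div></div>
  <div><span>contact</span><div class="overflow-hidden"><strong>my location</strong></div></div>
</div>
"""


def read_table_info(markup):
    """Map each <span> label (lowercased) to the <strong> text that follows it."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("div", attrs={"class": "table-info"})
    values = {}
    for span in table.find_all("span"):
        # The value <div> is the next sibling of the label <span>.
        value_div = span.find_next_sibling("div")
        if value_div is None:
            continue
        strong = value_div.find("strong")
        if strong is not None:
            label = span.get_text(strip=True).lower()
            values[label] = strong.get_text(strip=True)
    return values


info = read_table_info(html)
print(info["branch"])
```

Looking values up by label keeps working if rows are added or reordered, which positional indexing does not.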

Source

2014-04-01 05:04:39 shaktimaan
