2015-05-06 184 views

Scraping nested tags

I know this type of question comes up a lot, but I have been searching and have not found a similar one.

<div class="casts"> 
    <table cellpadding="0" cellspacing="0"> 
     <tbody> 
      <tr> 
       <td class=""> 
        <a class="cast"> 
         <span class="title"> 
          Nested data 1 
          <span class="schedule"> 
           Nested data 2 
          </span> 
         </span> 
        </a> 
       </td> 
      </tr> 
     </tbody> 
    </table> 
</div> 

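There are multiple td elements with the same structure, but I removed the rest for simplicity. Say I want to pull the data Nested data 1 and Nested data 2 from the spans. I am using the following: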

finda = soup.find_all('a', attrs={'class':'cast'}) 

for var in finda: 
    var2 = var.find_all('span') 

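Using:

var2[1]

I am able to pull all of the Nested data 2 values.

But I am unable to pull Nested data 1 on its own:

var2[0]

returns both Nested data 1 and Nested data 2, because the outer span contains the inner one.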
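
For anyone who wants to reproduce this, here is a minimal sketch of the behaviour described above. It assumes a trimmed copy of the markup (one cell only) and the built-in `html.parser`; the names `outer`, `inner`, `outer_text` and `inner_text` are just illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (one cell only).
html = """
<a class="cast">
  <span class="title">
    Nested data 1
    <span class="schedule">
      Nested data 2
    </span>
  </span>
</a>
"""

soup = BeautifulSoup(html, "html.parser")
spans = soup.find("a", class_="cast").find_all("span")

# find_all() returns the outer span first, then the span nested inside it.
outer, inner = spans

# get_text() walks all descendants, so the outer span reports both strings.
outer_text = " ".join(outer.get_text().split())
inner_text = " ".join(inner.get_text().split())

print(outer_text)
print(inner_text)
```

The first line printed contains both pieces of text, the second only the inner one, which matches what `var2[0]` and `var2[1]` show.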

Answers


This can be done fairly simply by iterating over the children of each span and keeping only the text nodes:

stack.html

<!DOCTYPE html> 
<html lang="en"> 
<head> 
    <title>StackO</title> 
    <meta charset="utf-8"> 
</head> 
<body> 
    <div class="casts"> 
    <table cellpadding="0" cellspacing="0"> 
     <tbody> 
     <tr> 
      <td class=""> 
      <a class="cast"> 
       <span class="title"> 
       Nested data 1 
       <span class="schedule"> 
        Nested data 2 
        <span class="moar-nesting"> 
        Nested data 3 
        </span> 
       </span> 
       Nested data 4 
       </span> 
      </a> 
      </td> 
     </tr> 
     </tbody> 
    </table> 
    </div> 
</body> 
</html> 

Meanwhile, in ipython land....

In [1]: from bs4 import BeautifulSoup, NavigableString, Comment

In [2]: with open('stack.html', 'r') as f:
   ...:     markup = f.read()
   ...:

In [3]: soup = BeautifulSoup(markup, 'html.parser')

In [4]: casts = soup.find_all('a', attrs={'class': 'cast'})

In [5]: cast = casts[0]

In [6]: for span in cast.find_all('span'):
   ...:     for child in span.children:
   ...:         # keep only non-empty text nodes that are direct children of this span
   ...:         if isinstance(child, NavigableString) and not isinstance(child, Comment) and str(child).strip() != "":
   ...:             print('"{}"'.format(str(child).strip()))
   ...:
"Nested data 1"
"Nested data 4"
"Nested data 2"
"Nested data 3"

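If you need this for every cell, the same idea can be wrapped in a small helper. This is only a sketch: `direct_text` and `by_class` are names I made up, the markup is inlined instead of being read from stack.html, and it assumes a Beautiful Soup version that accepts the `string` argument (older releases call it `text`).

```python
from bs4 import BeautifulSoup, Comment


def direct_text(tag):
    """Return the non-empty strings that are direct children of tag."""
    # string=True matches text nodes; recursive=False skips nested tags.
    pieces = tag.find_all(string=True, recursive=False)
    return [
        str(p).strip()
        for p in pieces
        if not isinstance(p, Comment) and str(p).strip()
    ]


html = """<a class="cast"><span class="title">
  Nested data 1
  <span class="schedule">
    Nested data 2
    <span class="moar-nesting">Nested data 3</span>
  </span>
  Nested data 4
</span></a>"""

soup = BeautifulSoup(html, "html.parser")
cast = soup.find("a", class_="cast")

# Map each span's first class name to the text it directly owns.
by_class = {
    span["class"][0]: direct_text(span)
    for span in cast.find_all("span")
}
print(by_class)
```

Keying by class name lets you ask for `by_class["title"]` or `by_class["schedule"]` directly instead of relying on the position of each span in the `find_all` result.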

I never would have thought of this, thank you – kayduh