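Python and BeautifulSoup: strange behavior of findAll

import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, assumed from the findAll API
soup = BeautifulSoup(html)
boxes = soup.findAll("div", { "class" : re.compile(r'\bmixDesc\b') })
I expected this to return only the divs with the class 'mixDesc'.
So I added some debugging output to make sure: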
count = 0
for box in boxes:
    count = count + 1
    print "JWORG box {0}".format(count)
    print "JWORG box len {0}".format(len(box))
    print box
The parsed HTML file contains only 10 divs with the class mixDesc,
but I get 30 boxes, and most of them (20 of the 30) are printed as
[]
Can you explain why this happens? Why does findAll pick up these empty tags? Or is there something else I am doing wrong?
Edit 1:
I am using this to write an XBMC plugin, so I am using the only version available to me.
Edit 2:
I can't copy/paste all of the HTML, but I am scraping this page: http://www.jw.org/it/video/?start=70
So you can look at the HTML source to help me.
Edit 3: This is my XBMC log. Note that I print both the counter and len(box).
20:27:54 T:5356 NOTICE: JWORG box 1
20:27:54 T:5356 NOTICE: JWORG box len 5
20:27:54 T:5356 NOTICE: [<div class="syn-img sqr mixDesc">
<a href="/it/cosa-dice-la-Bibbia/famiglia/bambini/diventa-amico-di-geova/cantici/120-felice-chi-mette-in-pratica-ci%C3%B2-che-ode/" class="jsDownload jsVideoModal jsCoverDoc" data-jsonurl="/apps/TRGCHlZRQVNYVrXF?output=json&pub=pksn&fileformat=mp4&alllangs=1&track=120&langwritten=I&txtCMSLang=I" data-coverurl="/it/cosa-dice-la-Bibbia/famiglia/bambini/diventa-amico-di-geova/cantici/120-felice-chi-mette-in-pratica-ci%C3%B2-che-ode/" data-onpagetitle="Cantico 120: Felice chi mette in pratica ciò che ode" title="Play o download | Cantico 120: Felice chi mette in pratica ciò che ode" data-mid="1102013357">
<span class="jsRespImg" data-img-type="sqr" data-img-size-lg="http://assets.jw.org/assets/m/ijw13pk/1102013357/ijw13pk_id-1102013357.art/1102013357_univ_sqr_lg.jpg" data-img-size-md="http://assets.jw.org/assets/m/ijw13pk/1102013357/ijw13pk_id-1102013357.art/1102013357_univ_sqr_md.jpg" data-img-size-sm="http://assets.jw.org/assets/m/ijw13pk/1102013357/ijw13pk_id-1102013357.art/1102013357_univ_sqr_sm.jpg" data-img-size-xs="http://assets.jw.org/assets/m/ijw13pk/1102013357/ijw13pk_id-1102013357.art/1102013357_univ_sqr_xs.jpg"></span></a><noscript><img src="http://assets.jw.org/assets/m/ijw13pk/1102013357/ijw13pk_id-1102013357_I/1102013357_univ_sqr_xs.jpg" alt="" /></noscript>
<div style="display:none;" class="jsVideoPoster mid1102013357" data-src="http://assets.jw.org/assets/m/ijw13pk/1102013357/ijw13pk_id-1102013357.art/1102013357_univ_lsr_lg.jpg" data-alt=""></div>
</div>]
20:27:54 T:5356 NOTICE: JWORG box 2
20:27:54 T:5356 NOTICE: JWORG box len 7
20:27:54 T:5356 NOTICE: []
20:27:54 T:5356 NOTICE: JWORG box 3
20:27:54 T:5356 NOTICE: JWORG box len 7
20:27:54 T:5356 NOTICE: []
Edit 4:
OK, there are 30 divs because they're nested, but why are they empty? And how can I filter those out?
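One thing worth checking about the pattern itself: `\b` treats a hyphen as a word boundary, so `\bmixDesc\b` also matches class values in which mixDesc is only part of a hyphenated name. The sketch below uses only the standard library to compare the regex with an exact class-token test. The helper name `has_class_token` is hypothetical, and only the first sample string comes from the log above; the other two are invented for illustration. Whether this explains the extra matches on this particular page is not confirmed.

```python
import re

PATTERN = re.compile(r'\bmixDesc\b')


def has_class_token(class_value, token="mixDesc"):
    """Return True only if `token` is one of the whitespace-separated class names."""
    return bool(class_value) and token in class_value.split()


# Only the first value appears in the log; the others are invented examples.
samples = [
    "syn-img sqr mixDesc",
    "mixDesc-wrap",
    "syn-body mixDescription",
]

for value in samples:
    regex_hit = PATTERN.search(value) is not None
    token_hit = has_class_token(value)
    print("{0!r}: regex={1}, token={2}".format(value, regex_hit, token_hit))
```

BeautifulSoup 3 should accept a callable in place of the compiled regex, e.g. `soup.findAll("div", {"class": has_class_token})`, though this is untested under XBMC.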
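What does your HTML look like? –
At the very least, you would be better off using `BeautifulSoup4` and `find_all()`. – alecxe
I am writing an XBMC plugin in Python. I can't copy/paste all of the HTML, but I am scraping this page: http://www.jw.org/it/video/?start=70 – realtebo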
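For reference, the BeautifulSoup4 suggestion could look roughly like the sketch below. It assumes the `bs4` package is installed, which may not be the case inside XBMC, and the sample markup is invented to mimic nested divs that share the class; it is not taken from the real page. It also shows one possible way to keep only the outermost matches.

```python
from bs4 import BeautifulSoup

# Invented sample markup: an outer mixDesc div wrapping a nested one,
# plus an unrelated div that should never match.
html = """
<div class="syn-img sqr mixDesc" id="outer">
  <div class="mixDesc" id="inner"></div>
</div>
<div class="other" id="unrelated"></div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches against individual class tokens, so no regex is needed.
boxes = soup.find_all("div", class_="mixDesc")

# Keep only the outermost matches: drop any box nested inside another mixDesc div.
top_level = [
    box for box in boxes
    if box.find_parent("div", class_="mixDesc") is None
]

print(len(boxes))
print([box["id"] for box in top_level])
```

Filtering on `find_parent` is only one option; if the nested divs on the real page do not carry the class themselves, the extra matches would need a different explanation.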