2014-12-22 69 views

Python: finding text with BeautifulSoup

I have an HTML comment at the end of my source file.

<!-- FEO DEBUG OUTPUT [TextTransAttempted:RENAME_JAVASCRIPT(18), RENAME_IMAGE(7), MINIFY_JAVASCRIPT(25), (1), JAVASCRIPT_HTML5_CACHE(19), EMBED_JAVASCRIPT(1), RENAME_CSS(3), (1), IMAGE_COMPRESSION(7), RESPONSIVE_IMAGES(6), ASYNC_JAVASCRIPT(2);TextTransApplied:RENAME_JAVASCRIPT(18), RENAME_IMAGE(7), MINIFY_JAVASCRIPT(25), (1), JAVASCRIPT_HTML5_CACHE(19), EMBED_JAVASCRIPT(1), RENAME_CSS(3), (1), IMAGE_COMPRESSION(7), RESPONSIVE_IMAGES(6), ASYNC_JAVASCRIPT(2);TagTransAttempted:(8), ASYNC_JAVASCRIPT(61);TagTransFailed:ASYNC_JAVASCRIPT(42);TagTransApplied:(8), ASYNC_JAVASCRIPT(19); ] --> 

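Now I want to check whether every value in parentheses is greater than zero. For example, I want to get the value 18 from RENAME_JAVASCRIPT and check that it is greater than zero, and likewise for the rest. Since this is a comment and not part of any HTML tag, is there a way to do this with BeautifulSoup?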


http://stackoverflow.com/questions/6062210/how-to-find-the-comment-tag-with-beautifulsoup –

Answer


I would just use re:

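import re
from bs4 import BeautifulSoup

with open("/sample_html.txt") as f:
    soup = BeautifulSoup(f.read())

# Assumes the comment is the node immediately after the closing </html> tag.
tag = soup.find("html").next_sibling
# Raw string keeps the regex escapes intact; findall returns the digits as strings.
print(all(x > 0 for x in map(int, re.findall(r"\((\d+)\)", tag))))

True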

If you want to see the names:

import re
from bs4 import BeautifulSoup

with open("/sample_html.txt") as f:
    soup = BeautifulSoup(f.read())

tag = soup.find("html").next_sibling
# Capture each NAME(number) pair and print those whose number is positive.
for name, num in re.findall(r"(\w+)\((\d+)\)", tag):
    if int(num) > 0:
        print("{}({})".format(name, num))
RENAME_JAVASCRIPT(18) 
RENAME_IMAGE(7) 
MINIFY_JAVASCRIPT(25) 
JAVASCRIPT_HTML5_CACHE(19) 
EMBED_JAVASCRIPT(1) 
RENAME_CSS(3) 
IMAGE_COMPRESSION(7) 
RESPONSIVE_IMAGES(6) 
ASYNC_JAVASCRIPT(2) 
RENAME_JAVASCRIPT(18) 
RENAME_IMAGE(7) 
MINIFY_JAVASCRIPT(25) 
JAVASCRIPT_HTML5_CACHE(19) 
EMBED_JAVASCRIPT(1) 
RENAME_CSS(3) 
IMAGE_COMPRESSION(7) 
RESPONSIVE_IMAGES(6) 
ASYNC_JAVASCRIPT(2) 
ASYNC_JAVASCRIPT(61) 
ASYNC_JAVASCRIPT(42) 
ASYNC_JAVASCRIPT(19) 
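
The flat list above mixes the TextTransAttempted, TextTransApplied and TagTrans* sections together. If you need to know which section each value came from, a rough sketch along these lines may help. It uses only `re`; the function name `parse_counts` and the shortened sample string are my own assumptions for illustration, not part of the original answer, and the format is inferred from the single comment shown in the question.

```python
import re

# Shortened sample of the comment text from the question (assumption: the real
# comment follows the same "Label:NAME(n), NAME(n);Label:..." layout).
comment = (
    " FEO DEBUG OUTPUT [TextTransAttempted:RENAME_JAVASCRIPT(18), RENAME_IMAGE(7), (1);"
    "TagTransAttempted:(8), ASYNC_JAVASCRIPT(61);"
    "TagTransFailed:ASYNC_JAVASCRIPT(42); ] "
)


def parse_counts(text):
    """Hypothetical helper: return {section: [(name, count), ...]}."""
    # Take everything between the first '[' and the last ']'.
    body = re.search(r"\[(.*)\]", text, re.S).group(1)
    result = {}
    for section in body.split(";"):
        if ":" not in section:
            continue  # skip the blank tail after the final ';'
        label, items = section.split(":", 1)
        # \w* lets the name be empty, which covers the bare "(1)" entries.
        result[label.strip()] = [
            (name, int(num)) for name, num in re.findall(r"(\w*)\((\d+)\)", items)
        ]
    return result


counts = parse_counts(comment)
all_positive = all(n > 0 for pairs in counts.values() for _, n in pairs)
print(counts)
print(all_positive)
```

Here `counts["TagTransFailed"]` holds the pairs for that section only, so a check such as "nothing failed" can be written against one section instead of the whole comment.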

This raises the following error:

Traceback (most recent call last):
  File "body_parser.py", line 119,
    print(all(x > 0 for x in map(int, re.findall("\((\d+)\)", feed))))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
– station


Oh, I see. My input will be the entire HTML source, and the comment will be at the end – station


Yes, I assumed you had already extracted the HTML you provided in your question –
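
For the case raised in the comments, where the input is the whole HTML source, one option is to ask BeautifulSoup for the comment node directly rather than relying on `next_sibling`. The sketch below is only an illustration: the inline `html` string and the "FEO DEBUG OUTPUT" marker check are assumptions based on the sample in the question, and the `string=` argument needs Beautiful Soup 4.4 or later (older releases call it `text=`). Passing something that is not a string to `re.findall` is one likely cause of the `TypeError` above, so the code converts the node with `str()` and handles the "not found" case explicitly.

```python
import re
from bs4 import BeautifulSoup, Comment

# Stand-in for the full page source (assumption: the comment follows </html>).
html = """<html><head><title>t</title></head><body><p>hi</p></body></html>
<!-- FEO DEBUG OUTPUT [TagTransAttempted:(8), ASYNC_JAVASCRIPT(61);TagTransFailed:ASYNC_JAVASCRIPT(42); ] -->
"""

soup = BeautifulSoup(html, "html.parser")

# Comment is a string subclass, so filter the document's strings by type.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# Pick the comment that carries the debug output; None if the page has none.
feo = next((c for c in comments if "FEO DEBUG OUTPUT" in c), None)

if feo is None:
    values = []
    print("no FEO comment found")
else:
    values = [int(n) for n in re.findall(r"\((\d+)\)", str(feo))]
    print(all(v > 0 for v in values))
```

Because the comment is located by type and content, this does not depend on whether whitespace or other nodes sit between `</html>` and the comment.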
