Parsing an HTML file with BeautifulSoup in Python

I am using BeautifulSoup to replace every comma in an HTML file with &sbquo;. Here is my code:

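import re
import sys
from BeautifulSoup import BeautifulSoup  # assuming BeautifulSoup 3

f = open(sys.argv[1], "r")
data = f.read()

soup = BeautifulSoup(data)

comma = re.compile(',')

# Replace the comma in every text node that contains one
for t in soup.findAll(text=comma):
    t.replaceWith(t.replace(',', '&sbquo;'))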

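This code works, except when the HTML file contains JavaScript. In that case it also replaces the commas (,) inside the JavaScript code, which I do not want. I only want to replace commas in the text content of the HTML file.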


2011-09-14 Divya

soup.findAll can take a callable:

tags_to_skip = set(["script", "style"])
# Add to this set as needed

def valid_tags(text):
    """Filter text nodes on the basis of their parent tag's name.

    If the parent's name is found in ``tags_to_skip`` then
    the text node is dropped. Otherwise, it is kept.
    """
    return text.parent.name.lower() not in tags_to_skip

# Pass the callable as ``text=`` so it is applied to text nodes, not tags
for t in soup.findAll(text=valid_tags):
    t.replaceWith(t.replace(',', '&sbquo;'))
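For anyone on the newer bs4 package, here is a minimal sketch of the same idea, not a tested port of the code above. It assumes bs4 is installed and uses the built-in html.parser; the names TAGS_TO_SKIP, is_replaceable_text and replace_commas are made up for illustration. It inserts the literal U+201A character rather than the string '&sbquo;', on the assumption that bs4 escapes a bare ampersand when it writes the document back out.

```python
from bs4 import BeautifulSoup

# Hypothetical name: tags whose text should be left untouched
TAGS_TO_SKIP = {"script", "style"}


def is_replaceable_text(node):
    """True for text nodes that contain a comma and whose parent is not skipped."""
    return "," in node and node.parent.name not in TAGS_TO_SKIP


def replace_commas(html, replacement="\u201a"):
    """Return the markup with commas replaced outside script/style tags."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so replacing nodes while looping is safe
    for node in soup.find_all(string=is_replaceable_text):
        node.replace_with(node.replace(",", replacement))
    return str(soup)


if __name__ == "__main__":
    print(replace_commas("<p>a, b</p><script>f(1, 2);</script>"))
```

The filter runs on text nodes and looks at node.parent, so adding another tag name to TAGS_TO_SKIP is enough to exclude it.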


2011-09-14 19:07:58

Cool, this is awesome. How do I skip comments? It even shows <!DOCTYPE ...>. I don't need to replace anything in the comment sections of the HTML file. – Divya
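One possible way to leave comments and the doctype alone, sketched for the bs4 package rather than BeautifulSoup 3: filter on the node type as well as on the comma. This assumes bs4 is installed; is_plain_text_with_comma and replace_commas_outside_comments are hypothetical names. Comments and the doctype are special string nodes in bs4, and replacing one with a plain string would likely turn it into ordinary text, which may be what is happening here.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, Doctype


def is_plain_text_with_comma(node):
    """True for text nodes with a comma that are neither comments nor the doctype."""
    return "," in node and not isinstance(node, (Comment, Doctype))


def replace_commas_outside_comments(html, replacement="\u201a"):
    """Return the markup with commas replaced, leaving comments and doctype intact."""
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.find_all(string=is_plain_text_with_comma):
        node.replace_with(node.replace(",", replacement))
    return str(soup)


if __name__ == "__main__":
    print(replace_commas_outside_comments("<!DOCTYPE html><!-- a, b --><p>x, y</p>"))
```

The same isinstance check could be combined with a parent-tag check if script and style blocks also need to be skipped.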

If you run `import BeautifulSoup; print BeautifulSoup.__version__`, which version number does it return? –
