2017-05-20 71 views

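Using BeautifulSoup to grab text between tags

I'm trying to use Beautiful Soup to grab each separate piece of text inside every tag (from my list of tags) in a .txt file and store the words in a dictionary. This code works, but it is very slow when I run it on large files. Is there another way to make it faster?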

from bs4 import BeautifulSoup

words_dict = dict()

# these are all of the tags in the file I'm looking for
tags_list = ['title', 'h1', 'h2', 'h3', 'b', 'strong']

def grab_file_content(file: str):
    with open(file, encoding="utf-8") as file_object:
        # entire content of the file with tags
        content = BeautifulSoup(file_object, 'html.parser')

        # if the content has content within the <body> tags...
        if content.body:
            for tag in tags_list:
                for tags in content.find_all(tag):
                    text_list = tags.get_text().strip().split(" ")
                    for words in text_list:
                        if words in words_dict:
                            words_dict[words] += 1
                        else:
                            words_dict[words] = 1
        else:
            print('no body')

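You say you want the text _between_ tags (that is, between one tag and another), but in your example you extract the words inside the tags (that is, between a tag's opening and closing markers). Which do you want? – DyZ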


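Ah, yes, I want the text in the middle of a pair of tags. So, for example, for a tag that contains "my text", I want my dictionary to store {my: 1, text: 1}. Thanks for that. – dppham1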

Answer


The following code is functionally equivalent to your code:

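from collections import Counter
from itertools import chain

from bs4 import BeautifulSoup

words_dict = Counter()  # An empty counter, used below as an accumulator

# Probably inside a loop over your files.
# Create the soup here, as in your original code
# (file_object and tags_list are the names from the question).
content = BeautifulSoup(file_object, 'html.parser')
# find_all() accepts a list of tag names, so one call covers every tag.
words_dict += Counter(chain.from_iterable(
    tag.string.split()
    for tag in content.find_all(tags_list) if tag.string))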
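
For anyone who wants to try the idea without a file on disk, here is a minimal self-contained sketch of the same approach. The function name `count_tag_words`, the constant `TAGS`, and the sample HTML are made up for illustration and are not from the question or the answer. One difference worth noting: `tag.string` is `None` when a tag has more than one child, so tags with nested markup are skipped by the snippet above. This sketch uses `get_text()`, as the question's code does, so nested text is included. It also splits on any whitespace rather than on single spaces.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Tag names from the question
TAGS = ['title', 'h1', 'h2', 'h3', 'b', 'strong']


def count_tag_words(html, tags=TAGS):
    """Count whitespace-separated words found inside the given tags.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, 'html.parser')
    counts = Counter()
    # A single find_all() call with a list of names covers every tag
    for tag in soup.find_all(tags):
        # get_text() also gathers text from nested tags
        counts.update(tag.get_text().split())
    return counts


sample = "<html><body><h1>my text</h1><b>my <i>other</i> text</b></body></html>"
print(count_tag_words(sample))
```

With the sample string, "my" and "text" are each counted twice and "other" once. If one listed tag is nested inside another (say a `b` inside an `h1`), its words are counted once per enclosing match, which is also what the question's code does.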