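Grabbing text between tags using BeautifulSoup

I'm trying to grab each piece of text inside every tag from my list in a .txt file using Beautiful Soup, and store the word counts in a dictionary. The code below works, but it is very slow on large files. Is there another way to make it faster?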

from bs4 import BeautifulSoup

words_dict = dict()

# these are all of the tags in the file I'm looking for
tags_list = ['title', 'h1', 'h2', 'h3', 'b', 'strong']

def grab_file_content(file: str):
    with open(file, encoding="utf-8") as file_object:
        # entire content of the file with tags
        content = BeautifulSoup(file_object, 'html.parser')

        # if the content has content within the <body> tags...
        if content.body:
            for tag in tags_list:
                for tags in content.find_all(tag):
                    text_list = tags.get_text().strip().split(" ")
                    for words in text_list:
                        if words in words_dict:
                            words_dict[words] += 1
                        else:
                            words_dict[words] = 1
        else:
            print('no body')

Source

2017-05-20 dppham1

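You say you want the text _between_ tags (which would be between, say, a closing tag and the next opening tag), but in your example you extract the words inside the tags (i.e., between an opening tag and its closing tag). Which do you want? – DyZ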

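Ah, yes, I want the text in the middle of the two tags. So, for example, given "My text" wrapped in one of the tags, I want my dictionary to store {My: 1, text: 1}. Thanks for that – dppham1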
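To make the distinction in the comments concrete, here is a minimal sketch. The sample markup and the variable names are made up for illustration and do not come from the asker's files. It shows that `get_text()` on a tag returns the text inside that tag, and that the stray text sitting between two tags is not picked up.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the asker's files.
sample = "<h1>My text</h1> stray words <h1>More text</h1>"
soup = BeautifulSoup(sample, "html.parser")

# Text inside each <h1> element (between the opening tag and its closing tag).
inside = [tag.get_text() for tag in soup.find_all("h1")]

# Count whitespace-separated words from the first heading only.
first_counts = Counter(inside[0].split())
print(inside)
print(first_counts)
```

The words "stray words" never appear in `inside`, because they are siblings of the headings rather than their contents.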

The following code is functionally equivalent to your code:

from collections import Counter
from itertools import chain

from bs4 import BeautifulSoup

words_dict = Counter()  # An empty counter, further used as an accumulator

# Probably a loop over files.
# Create the soup here, as in your original code;
# file_object and tags_list are the ones from the question.
content = BeautifulSoup(file_object, 'html.parser')

# find_all accepts a list of tag names, so one call covers every tag
words_dict += Counter(chain.from_iterable(
    tag.string.split() for tag in content.find_all(tags_list) if tag.string))
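For anyone who wants to try this end to end, below is a self-contained sketch of the same idea. The function name `count_words`, the constant `TAGS`, and the sample markup are assumptions made for the example. It differs from the snippet above in one respect: it uses `get_text()` as in the question, because `tag.string` is `None` when a tag has more than one child (for instance a `<b>` that contains an `<i>`), so such tags would be skipped. Whether this is noticeably faster on a given file has not been measured here.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Same tag names as in the question.
TAGS = ['title', 'h1', 'h2', 'h3', 'b', 'strong']


def count_words(markup, tags=TAGS):
    """Count whitespace-separated words found inside the given tags."""
    soup = BeautifulSoup(markup, 'html.parser')
    counts = Counter()
    # find_all accepts a list of names, so the tree is searched once
    # instead of once per tag name.
    for tag in soup.find_all(tags):
        # get_text() also covers tags with nested children,
        # where tag.string would be None.
        counts.update(tag.get_text().split())
    return counts


# Hypothetical sample document for illustration.
sample = ("<html><head><title>My page</title></head><body>"
          "<h1>My text</h1><p>skipped</p>"
          "<b>My <i>bold</i> words</b></body></html>")
result = count_words(sample)
print(result)
```

To use it on a file, pass the open file object (or its contents) as `markup`, and add the returned counter to a running `Counter` when looping over several files.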

Source

2017-05-20 22:17:46 DyZ
