Reading page source saved in a text file and extracting text

I have several text files that store the page source of websites, so each text file is one source page.

I need to extract the text of one div class from the page source saved in a text file, using the code below:

from bs4 import BeautifulSoup 
soup = BeautifulSoup(open("zing.internet.accelerator.plus.txt")) 
txt = soup.find('div' , attrs = { 'class' : 'id-app-orig-desc' }).text 
print txt

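I checked the type of my soup object to make sure that find is not the string find method when it searches for the div class. The type of the soup object: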

print type(soup) 
<class 'bs4.BeautifulSoup'>

I took a previous post as a reference and wrote the open statement inside the BeautifulSoup call.

Error:

Traceback (most recent call last): 
    File "html_desc_cleaning.py", line 13, in <module> 
    txt2 = soup.find('div' , attrs = { 'class' : 'id-app-orig-desc' }).text 
AttributeError: 'NoneType' object has no attribute 'text'

Page source:


2015-10-14 Pappu Jha

Please don't upload images of text; images are not useful. – styvane

I have solved this problem.

In my case, BeautifulSoup's default parser was 'lxml', and it could not read the complete page source.

Changing the parser to 'html.parser' worked for me.

f = open("zing.internet.accelerator.plus.txt") 
soup = f.read() 
bs = BeautifulSoup(soup,"html.parser") 
print bs.find('div',{'class' : 'id-app-orig-desc'}).text
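
A minimal sketch of the same fix that also guards against find returning None, which is what raises the AttributeError in the question. The helper name extract_description and the sample HTML are made up for illustration; only the class name id-app-orig-desc comes from the question. It assumes bs4 is installed and uses Python 3 syntax.

```python
from bs4 import BeautifulSoup


def extract_description(html, parser="html.parser"):
    """Return the text of the first div with class 'id-app-orig-desc', or None."""
    soup = BeautifulSoup(html, parser)
    div = soup.find("div", attrs={"class": "id-app-orig-desc"})
    # find() returns None when nothing matches, so check before using the result
    if div is None:
        return None
    return div.get_text(strip=True)


# Hypothetical sample standing in for the saved page source
sample = '<html><body><div class="id-app-orig-desc"> Speeds up downloads. </div></body></html>'
print(extract_description(sample))
print(extract_description("<html><body><p>no such div</p></body></html>"))
```

For a saved file, the contents would be read first and passed in as the html argument. Passing the parser name explicitly keeps the result from depending on which parsers happen to be installed.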


2015-10-14 14:04:59

Try replacing this:

soup = BeautifulSoup(open("zing.internet.accelerator.plus.txt"))

with this:

soup = BeautifulSoup(open("zing.internet.accelerator.plus.txt").read())

By the way, it is a good idea to close the file once you have finished reading it. You can use a with statement like this:

with open("zing.internet.accelerator.plus.txt") as f: 
    soup = BeautifulSoup(f.read())

The with statement closes the file automatically.
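
To see that behaviour directly, here is a small self-contained sketch (Python 3). It writes a throwaway temporary file so that it does not depend on the file from the question; the file contents are made up.

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as tmp:
    tmp.write("<div class='id-app-orig-desc'>Hey there.</div>")

with open(path) as f:
    contents = f.read()
    inside = f.closed      # False: the file is still open inside the block

after = f.closed           # True: the with block closed it on exit
print(inside, after)       # prints: False True

os.remove(path)
```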

Here is an example of why you need the .read() function:

>>> a = open('test.txt') 
>>> type(a) 
<class '_io.TextIOWrapper'> 

>>> print(a) 
<_io.TextIOWrapper name='test.txt' mode='r' encoding='UTF-8'> 

>>> b = a.read() 
>>> type(b) 
<class 'str'> 

>>> print(b) 
Hey there. 

>>> print(open('test.txt')) 
<_io.TextIOWrapper name='test.txt' mode='r' encoding='UTF-8'> 

>>> print(open('test.txt').read()) 
Hey there.


2015-10-14 06:11:06

Hey, thanks. I tried the code above with read included, but I still get the same error :(

Hmm... try 'open("zing.internet.accelerator.plus.txt").read()'

It prints the whole page source
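
Following up on the suggestion to print the raw file: rather than scanning the whole dump by eye, a quick check is whether the class name appears in the raw text at all. If it does and BeautifulSoup's find still returns None, the parser is a likely suspect; if it does not, the saved file does not contain that div. A minimal sketch; the helper name marker_context is hypothetical, and a short sample string stands in for the file contents.

```python
def marker_context(text, marker="id-app-orig-desc", width=40):
    """Return the text around the first occurrence of marker, or None if absent."""
    pos = text.find(marker)   # str.find returns -1 when the marker is missing
    if pos == -1:
        return None
    start = max(0, pos - width)
    return text[start:pos + len(marker) + width]


# Hypothetical sample standing in for the text read from the saved file
raw = '<div class="id-app-orig-desc">Hey there.</div>'
print(marker_context(raw))
print(marker_context("<p>nothing here</p>"))
```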
