How can I get all the plain text from a website with Scrapy?

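I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with the Scrapy framework. With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks!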

Source

2014-04-18 tomasyany

The easiest option would be to extract //body//text() and join everything found:

''.join(sel.xpath("//body//text()").extract()).strip()

where sel is a Selector instance.

Another option is to use nltk's clean_html() (older NLTK releases only; NLTK 3 raises NotImplementedError instead):

>>> import nltk 
>>> html = """ 
... <div class="post-text" itemprop="description"> 
... 
...   <p>I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework. 
... With <code>xpath('//body//text()')</code> I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !</p> 
... 
...  </div>""" 
>>> nltk.clean_html(html) 
"I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework.\nWith xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !"

Another option is to use BeautifulSoup's get_text():

get_text()

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.

>>> from bs4 import BeautifulSoup 
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(soup.get_text().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework. 
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
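A minimal sketch of the same idea for a whole page, assuming the bs4 package with the built-in "html.parser" backend. Depending on the Beautiful Soup version, get_text() may or may not include <script> and <style> contents, so the sketch removes those tags explicitly. SAMPLE_HTML and visible_text are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page used only for this illustration.
SAMPLE_HTML = """
<html><body>
<p>Hello <b>world</b>!</p>
<script>var hidden = 1;</script>
<style>p { color: red; }</style>
<p>Second paragraph.</p>
</body></html>
"""


def visible_text(html):
    """Return the page text without <script>/<style> content."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove non-visible elements from the tree before extracting text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Collapse runs of whitespace into single spaces.
    return " ".join(soup.get_text().split())


print(visible_text(SAMPLE_HTML))
```

In a Scrapy callback the function could be fed response.text (an assumption about how the page source is obtained in your spider).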

Another option is to use lxml.html's text_content():

.text_content()

Returns the text content of the element, including the text content of its children, with no markup.

>>> import lxml.html 
>>> tree = lxml.html.fromstring(html) 
>>> print(tree.text_content().strip())
I would like to have all the text visible from a website, after the HTML is rendered. I'm working in Python with Scrapy framework. 
With xpath('//body//text()') I'm able to get it, but with the HTML tags, and I only want the text. Any solution for this? Thanks !
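A hedged sketch of the lxml variant for a full page. Note that lxml.html.fromstring() expects the markup as a string (or bytes), so it should be given the page source rather than a selector object. The sketch also drops <script> and <style> elements first, since text_content() would otherwise include their contents. SAMPLE_HTML and visible_text are hypothetical names.

```python
import lxml.html

# Hypothetical sample page used only for this illustration.
SAMPLE_HTML = """
<html><body>
<p>Hello <b>world</b>!</p>
<script>var hidden = 1;</script>
<style>p { color: red; }</style>
<p>Second paragraph.</p>
</body></html>
"""


def visible_text(html):
    """Return the page text without <script>/<style> content."""
    tree = lxml.html.fromstring(html)  # html must be a str or bytes
    # drop_tree() removes the element and its children but keeps the
    # text that follows it (its tail).
    for element in tree.xpath("//script | //style"):
        element.drop_tree()
    # Collapse runs of whitespace into single spaces.
    return " ".join(tree.text_content().split())


print(visible_text(SAMPLE_HTML))
```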

Source

2014-04-18 15:18:56 alecxe

I have deleted my question. I used the code below: html = sel.select("//body//text()"); tree = lxml.html.fromstring(html); item['description'] = tree.text_content().strip(). But I get: is_full_html = _looks_like_full_html_unicode(html) exceptions.TypeError: expected string or buffer. What went wrong? – Backtrack

'nltk' worked best for me – user4421975

Just as an update, 'nltk' deprecated their 'clean_html' method and instead recommends: 'NotImplementedError: To remove HTML markup, use BeautifulSoup's get_text() function' – TheNastyOne

Have you tried this?

xpath('//body//text()').re(r'(\w+)')

xpath('//body//text()').extract()

Source

2014-04-18 15:08:41

This actually works quite well, but it still returns some HTML tags and other things. – tomasyany
