Trouble executing my class-based crawler

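I'm a complete newcomer to Python when it comes to using classes to scrape web data, so apologies in advance for any serious mistakes. I've written a script that parses the text of the a tags on the Wikipedia website. I tried to write the code as accurately as I could at my level, but for some reason it throws an error when I run it. My code and the error are given below for your consideration.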

Script:

import requests
from lxml.html import fromstring

class TextParser(object):

    def __init__(self):
        self.link = 'https://en.wikipedia.org/wiki/Main_Page'
        self.storage = None

    def fetch_url(self):
        self.storage = requests.get(self.link).text

    def get_text(self):
        root = fromstring(self.storage)
        for post in root.cssselect('a'):
            print(post.text)

item = TextParser()
item.get_text()

Error:

Traceback (most recent call last): 
    File "C:\Users\mth\AppData\Local\Programs\Python\Python35-32\testmatch.py", line 38, in <module> 
    item.get_text() 
    File "C:\Users\mth\AppData\Local\Programs\Python\Python35-32\testmatch.py", line 33, in get_text 
    root = fromstring(self.storage) 
    File "C:\Users\mth\AppData\Local\Programs\Python\Python35-32\lib\site-packages\lxml\html\__init__.py", line 875, in fromstring 
    is_full_html = _looks_like_full_html_unicode(html) 
TypeError: expected string or bytes-like object


2017-10-18 shayan

You are executing the following two lines:

item = TextParser() 
item.get_text()

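When TextParser is initialized, self.storage is set to None. Nothing assigns to it before you call get_text(), so it is still None at that point, and fromstring is handed None instead of a string. That is why you get this error.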
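The same situation can be reproduced without lxml or a network connection. The following is only a minimal sketch: Holder and starts_with_tag are hypothetical names, and the regular-expression check is just a stand-in for the string check that fails inside fromstring in the traceback.

```python
import re


class Holder(object):
    """Hypothetical stand-in for TextParser, without any network access."""

    def __init__(self):
        self.storage = None  # nothing has been fetched yet

    def fill(self):
        # Plays the role of fetch_url(): stores a string in self.storage.
        self.storage = '<html><body><a href="#">link</a></body></html>'


def starts_with_tag(text):
    # A regex match accepts str or bytes only; passing None raises TypeError.
    return re.match(r'\s*<', text) is not None


holder = Holder()
try:
    starts_with_tag(holder.storage)  # storage is still None here
    outcome = 'no error'
except TypeError:
    outcome = 'TypeError'
print(outcome)

holder.fill()
print(starts_with_tag(holder.storage))  # storage is now a string
```

The first call fails because the attribute was never filled; once fill() has run, the same call succeeds.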

However, if you change those two lines to the following, self.storage will be populated with a string instead of None.

item = TextParser() 
item.fetch_url() 
item.get_text()

If you want to call get_text without having to call fetch_url first, that can be done as well.
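One way to do this, in line with the suggestion in the comments, is to have get_text call fetch_url itself. This is a sketch rather than the only possible layout. It assumes you only want to download the page when nothing has been stored yet, and it uses xpath('//a') in place of cssselect('a') so that the separate cssselect package is not needed; both select every a element.

```python
import requests
from lxml.html import fromstring


class TextParser(object):

    def __init__(self):
        self.link = 'https://en.wikipedia.org/wiki/Main_Page'
        self.storage = None

    def fetch_url(self):
        # Download the page and keep its HTML as a string.
        self.storage = requests.get(self.link).text

    def get_text(self):
        # Fetch on demand: only download if nothing has been stored yet.
        if self.storage is None:
            self.fetch_url()
        root = fromstring(self.storage)
        # Print the text of every <a> element in the document.
        for post in root.xpath('//a'):
            print(post.text)
```

With this version, creating the object with item = TextParser() and then calling item.get_text() is enough; the explicit item.fetch_url() call becomes optional.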


2017-10-18 20:56:36 Jonathan

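Thank you, Jonathan, it works now. I'll accept it as the answer shortly. Please don't forget to add the suggestion on how to run the scraper without calling fetch_url(). That is what I was trying to do in the first place. Thanks a lot. – shayan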

Well, you can call fetch_url inside the get_text function. – Jonathan

Thanks a lot. That did it. – shayan
