2016-05-06 16 views
2

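Trying to get text from a certain part of a website using lxml.html

I have some Python code that is supposed to get the HTML from a certain part of a website, using the XPath of the location of the HTML tag.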

def wordorigins(word): 
    pageopen = lxml.html.fromstring("http://www.merriam-webster.com/dictionary/" + str(word)) 
    pbody = pageopen.xpath("/html/body/div[1]/div/div[4]/div/div[1]/main/article/div[5]/div[3]/div[1]/div/p[1]") 
    etybody = lxml.html.fromstring(pbody) 
    etytxt = etybody.xpath('text()') 
    etytxt = etytxt.replace("<em>", "") 
    etytxt = etytxt.replace("</em>", "") 
    return etytxt 

This code raises the following error about expecting a string or buffer:

Traceback (most recent call last): 
    File "mott.py", line 47, in <module> 
    print wordorigins(x) 
    File "mott.py", line 30, in wordorigins 
    etybody = lxml.html.fromstring(pbody) 
    File "/usr/lib/python2.7/site-packages/lxml/html/__init__.py", line 866, in fromstring 
    is_full_html = _looks_like_full_html_unicode(html) 
TypeError: expected string or buffer 

Any thoughts?

Answers

1

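The xpath() method returns a list of results, whereas fromstring() expects a string.

However, you don't need to re-parse part of the document. Just use the element you have already found: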

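import lxml.html

def wordorigins(word):
    # parse() fetches the URL (assuming it is reachable over plain HTTP);
    # fromstring() would treat the URL text itself as markup
    url = "http://www.merriam-webster.com/dictionary/" + str(word)
    pageopen = lxml.html.parse(url).getroot()
    # xpath() returns a list, so take the first matching element
    pbody = pageopen.xpath("/html/body/div[1]/div/div[4]/div/div[1]/main/article/div[5]/div[3]/div[1]/div/p[1]")[0]
    etytxt = pbody.text_content()
    etytxt = etytxt.replace("<em>", "")
    etytxt = etytxt.replace("</em>", "")
    return etytxt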

Note that I used the text_content() method instead of xpath("text()").
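The list-versus-element point can be checked without touching the network. The following is only a minimal sketch: the SAMPLE markup, the shortened XPath, and the first_text() helper are made up for illustration and are not taken from the Merriam-Webster page.

```python
import lxml.html

# Hypothetical markup standing in for the real dictionary page.
SAMPLE = (
    "<html><body><div>"
    "<p>Middle English, from <em>Latin</em> origo</p>"
    "</div></body></html>"
)

doc = lxml.html.fromstring(SAMPLE)

# A node-selecting XPath expression gives back a list, even for one match.
matches = doc.xpath("/html/body/div/p[1]")


def first_text(root, path):
    """Hypothetical helper: text of the first match, or None if nothing matched."""
    found = root.xpath(path)
    if not found:
        return None
    return found[0].text_content()


etytxt = first_text(doc, "/html/body/div/p[1]")
missing = first_text(doc, "/html/body/div/p[99]")
print(etytxt)
print(missing)
```

Checking for an empty list before indexing is a design choice worth considering here, since a long absolute XPath like the one in the question is likely to stop matching whenever the page layout changes.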

1

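As @alecxe's answer mentions, the xpath() method returns a list of the matched elements in this case, hence the error when you try to pass that list to lxml.html.fromstring(). Another thing to note is that neither XPath's text() nor lxml's text_content() method returns a string containing tags such as <em></em>. They strip the tags automatically, so the two replace() lines are not needed. You can simply use text_content() or XPath's string() function (instead of text()):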

...... 
# either of the following lines should be enough 
etytxt = pbody[0].xpath('string()') 
etytxt = pbody[0].text_content() 
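To compare the three options side by side, here is a small sketch on a made-up paragraph (the markup is an assumption for illustration, not the real page content):

```python
import lxml.html

# Hypothetical fragment with inline markup, similar in shape to an etymology line.
p = lxml.html.fromstring("<p>from <em>Latin</em> origo</p>")

text_nodes = p.xpath("text()")   # list of the paragraph's own text nodes only
as_string = p.xpath("string()")  # one string, text of descendants included
content = p.text_content()       # lxml.html equivalent of string()

print(text_nodes)
print(as_string)
print(content)
```

Note that text() only sees the text nodes that are direct children of the paragraph, so the word inside <em> is missing from that result, while string() and text_content() include it. None of the three results contains the tags themselves.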