What is wrong with my XPath expression?

I want to extract all links inside td elements whose class is u-ctitle. What is wrong with my XPath expression?

import os 
import urllib 
import lxml.html 
down='http://v.163.com/special/opencourse/bianchengdaolun.html' 
file=urllib.urlopen(down).read() 
root=lxml.html.document_fromstring(file) 
namelist=root.xpath('//td[@class="u-ctitle"]/a') 
len(namelist)

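The output is []. There are many td elements whose class is "u-ctitle", and Firebug shows them, so why can't they be extracted?

My Python version is 2.7.9.

Renaming the variable file to something else does not help.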

Source

2017-01-26 it_is_a_literature

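Can you share the HTML of the page? – Shijo

Since len() should return an integer, the output cannot be an empty list ('[]'). Also, your XPath works fine (tried on Python 3.5, with '' used in place of 'urllib'; output '34'). – Andersson

Confirmed with Python 2.7.5: it works and the list is not empty. Are you sure you get '[]' as output? –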

Your XPath is correct. The problem lies elsewhere.

If you inspect the HTML, you will see the following meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=GBK" />

And in this code:

file=urllib.urlopen(down).read() 
root=lxml.html.document_fromstring(file)

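file is actually a sequence of bytes, so the decoding of those bytes into a Unicode string, using the GBK encoding declared in the meta tag, happens inside the document_fromstring method.

The problem is that the HTML is not actually encoded as GBK, so lxml decodes it incorrectly, which leads to data loss.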

>>> file.decode('gbk') 
Traceback (most recent call last): 
    File "down.py", line 9, in <module> 
    file.decode('gbk') 
UnicodeDecodeError: 'gbk' codec can't decode bytes in position 7247-7248: illegal multibyte sequence
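One way to investigate a failure like the one above is to try a few candidate codecs in strict mode and see which ones decode the bytes without error. The following is only a minimal sketch: the helper name first_working_encoding and the sample bytes are hypothetical and stand in for the downloaded page. A clean strict decode is a hint, not proof, that a codec is the right one.

```python
def first_working_encoding(data, candidates):
    """Return the first candidate codec that decodes `data` strictly, or None."""
    for name in candidates:
        try:
            data.decode(name)  # errors='strict' is the default
        except UnicodeDecodeError:
            continue
        return name
    return None


# Hypothetical sample: two Chinese characters plus one character outside
# the Basic Multilingual Plane, which GB18030 stores as a four-byte
# sequence that the gbk codec rejects.
sample = u'\u4e2d\u6587 \U0001F600'.encode('gb18030')

print(first_working_encoding(sample, ['ascii', 'gbk', 'gb18030']))
```

The order of the candidate list matters, because the helper stops at the first codec that succeeds.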

After some trial and error, we can find that the actual encoding is GB18030. To make the script work, you need to decode the bytes manually:

root=lxml.html.document_fromstring(file.decode('GB18030'))
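The same fix can be tried offline, without fetching the page. The sketch below assumes lxml is installed and uses a small hypothetical HTML document (the href values and link texts are made up) whose meta tag claims GBK while the bytes are really GB18030. It decodes the bytes explicitly and then runs the XPath from the question.

```python
import lxml.html

# Hypothetical stand-in for the downloaded page: the meta tag claims GBK,
# but the bytes are produced with the gb18030 codec.
html_text = (
    u'<html><head>'
    u'<meta http-equiv="Content-Type" content="text/html; charset=GBK" />'
    u'</head><body><table><tr>'
    u'<td class="u-ctitle"><a href="/lesson1.html">\u4e2d\u6587 \U0001F600</a></td>'
    u'<td class="u-ctitle"><a href="/lesson2.html">Lesson 2</a></td>'
    u'</tr></table></body></html>'
)
raw = html_text.encode('gb18030')  # bytes, like the result of read()

# Decode explicitly instead of letting the parser trust the meta tag,
# then apply the XPath expression from the question.
root = lxml.html.document_fromstring(raw.decode('gb18030'))
links = root.xpath('//td[@class="u-ctitle"]/a')
hrefs = [a.get('href') for a in links]
print(hrefs)
```

Passing an already decoded string means the parser no longer has to guess the encoding from the page itself.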

Source

2017-01-26 14:42:53 alexanderlukanin13

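Can you also explain why the output differs between versions of Python/urllib/lxml? With Python 2.7.5/urllib 1.12/lxml 3.4.1 I can retrieve more HTML than with Python 2.7.9/urllib 1.18/lxml 3.6.4. Thanks! –

I would like to know that too. –

I see the same result with py3.4.3/lxml3.7.2, py2.7.6/lxml3.7.2 and py2.7.6/lxml3.4.1. In any case, I guess that different versions of lxml *may* handle this situation slightly differently. – alexanderlukanin13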
