2016-04-26 55 views

Unable to get table header elements

In Python, I have a variable holding an HTML table element, obtained like this:

import requests
from lxml import html

page = requests.get('http://www.myPage.com')
tree = html.fromstring(page.content)
table = tree.xpath('//table[@class="list"]')  # xpath() returns a list of matching elements

The table variable has this content:

<table class="list"> 
     <tr> 
     <th>Date(s)</th> 
     <th>Sport</th> 
     <th>Event</th> 
     <th>Location</th> 
     </tr> 
     <tr> 
     <td>Jan 18-31</td> 
     <td>Tennis</td> 
     <td><a href="tennis-grand-slam/australian-open/index.htm">Australia Open</a></td> 
     <td>Melbourne, Australia</td> 
     </tr> 
</table> 

I want to extract the headers like this:

rows = iter(table) 
headers = [col.text for col in next(rows)] 
print("headers are:", headers)

However, when I print the headers variable I get this:

headers are: ['\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n 
     ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n 
', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n  ', '\n 
     ', '\n  ', '\n  '] 

How do I extract the headers correctly?


Cannot reproduce the problem (gist.github.com/har07/c693eac57c79c2896881f9b6e2de2202). Can you post simple but complete code that reproduces it? – har07

Answers

0

Try this:

from lxml import html 

HTML_CODE = """<table class="list"> 
     <tr> 
     <th>Date(s)</th> 
     <th>Sport</th> 
     <th>Event</th> 
     <th>Location</th> 
     </tr> 
     <tr> 
     <td>Jan 18-31</td> 
     <td>Tennis</td> 
     <td><a href="tennis-grand-slam/australian-open/index.htm">Australia Open</a></td> 
     <td>Melbourne, Australia</td> 
     </tr> 
</table>""" 

tree = html.fromstring(HTML_CODE) 
headers = tree.xpath('//table[@class="list"]/tr/th/text()') 
print("Headers are: {}".format(', '.join(headers)))

Output:

Headers are: Date(s), Sport, Event, Location 
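
If a header cell wraps its label in another tag, `text()` on the `th` itself will miss it. A minimal sketch of a more tolerant variant, assuming the same sample table (the `<b>` wrapper around "Sport" is a made-up addition for illustration), uses `text_content()` and also pairs the headers with the data row:

```python
from lxml import html

# Same sample table as above; the <b> around "Sport" is hypothetical
SAMPLE = """<table class="list">
  <tr><th>Date(s)</th><th><b>Sport</b></th><th>Event</th><th>Location</th></tr>
  <tr><td>Jan 18-31</td><td>Tennis</td>
      <td><a href="tennis-grand-slam/australian-open/index.htm">Australia Open</a></td>
      <td>Melbourne, Australia</td></tr>
</table>"""

tree = html.fromstring(SAMPLE)
table = tree.xpath('//table[@class="list"]')[0]

# text_content() collects text from nested tags too; strip() drops padding
headers = [th.text_content().strip() for th in table.xpath('.//th')]

# Pair the cells of every row that contains <td> elements with the headers
rows = [
    dict(zip(headers, (td.text_content().strip() for td in tr.xpath('./td'))))
    for tr in table.xpath('.//tr[td]')
]

print(headers)           # ['Date(s)', 'Sport', 'Event', 'Location']
print(rows[0]["Event"])  # Australia Open
```
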
0

Using the table, assuming there is only one:

table[0].xpath(".//th/text()")  # the leading "." keeps the search inside this table
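
One thing worth knowing here: in lxml, an expression that starts with `//` is evaluated from the document root even when it is called on an element, while `.//` stays inside that element. With a single table the two give the same result. A small sketch with two made-up tables (the second table and the variable names are assumptions for illustration) shows the difference:

```python
from lxml import html

# Two hypothetical tables in one document
doc = html.fromstring("""<div>
  <table class="list"><tr><th>Date(s)</th><th>Sport</th></tr></table>
  <table class="other"><tr><th>Rank</th></tr></table>
</div>""")

first_table = doc.xpath('//table[@class="list"]')[0]

absolute = first_table.xpath('//th/text()')    # searches the whole document
relative = first_table.xpath('.//th/text()')   # searches only inside first_table

print(absolute)  # ['Date(s)', 'Sport', 'Rank']
print(relative)  # ['Date(s)', 'Sport']
```
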

Or, if you just want the headers from the table and don't plan to use it for anything else, you only need:

​​

Both will give you:

['Date(s)', 'Sport', 'Event', 'Location']
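
As for why the loop in the question prints only whitespace, a likely explanation is that `xpath()` returns a list. `iter(table)` then walks that list, `next(rows)` is the `<table>` element itself, and iterating over it yields its `<tr>` rows, whose `.text` is just the whitespace before the first cell. The sketch below reproduces this on the sample markup (the real page presumably has many more rows) and shows that indexing into the list first reaches the `<th>` cells:

```python
from lxml import html

# The sample markup from the question
SAMPLE = """<table class="list">
     <tr>
     <th>Date(s)</th>
     <th>Sport</th>
     <th>Event</th>
     <th>Location</th>
     </tr>
     <tr>
     <td>Jan 18-31</td>
     <td>Tennis</td>
     <td><a href="tennis-grand-slam/australian-open/index.htm">Australia Open</a></td>
     <td>Melbourne, Australia</td>
     </tr>
</table>"""

tree = html.fromstring(SAMPLE)
table = tree.xpath('//table[@class="list"]')  # a list, not an element

# What the question's loop does: the first item of the list is the <table>,
# and its children are the <tr> rows, not the header cells
first = next(iter(table))
row_tags = [child.tag for child in first]
row_texts = [child.text for child in first]
print(row_tags, row_texts)

# Index into the list, then take the first row to reach the <th> cells
header_row = table[0][0]
headers = [cell.text for cell in header_row]
print(headers)  # ['Date(s)', 'Sport', 'Event', 'Location']
```
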