2012-07-31 164 views
2

Parsing HTML in Python using lxml and XPath

I want to parse values out of an HTML form in Python using lxml and XPath.

Here is my HTML data:

<table> 
<tr> 
<td class="u"><input class="wide" name="record[13][name]" value="exampledomain1.com"></td> 
     <td class="u"> 
     <select name="record[13][type]"> 
     <option SELECTED value="A" >A</option> 
     <option value="AAAA" >AAAA</option> 
     <option value="CNAME" >CNAME</option> 
     <option value="HINFO" >HINFO</option> 
     <option value="MX" >MX</option> 
     <option value="NAPTR" >NAPTR</option> 
     <option value="NS" >NS</option> 
     <option value="PTR" >PTR</option> 
     <option value="SOA" >SOA</option> 
     <option value="SPF" >SPF</option> 
     <option value="SRV" >SRV</option> 
     <option value="SSHFP" >SSHFP</option> 
     <option value="TXT" >TXT</option> 
     <option value="RP" >RP</option> 
     <option value="URL" >URL</option> 
     <option value="MBOXFW" >MBOXFW</option> 
     <option value="CURL" >CURL</option> 
     </select> 
     </td> 
     <td class="u"><input class="wide" name="record[13][content]" value='10.10.10.1'></td> 

<td class="u"><input class="wide" name="record[14][name]" value="exampledomain2.com"></td> 
     <td class="u"> 
     <select name="record[14][type]"> 
     <option SELECTED value="CNAME" >A</option> 
     <option value="AAAA" >AAAA</option> 
     <option value="CNAME" >CNAME</option> 
     <option value="HINFO" >HINFO</option> 
     <option value="MX" >MX</option> 
     <option value="NAPTR" >NAPTR</option> 
     <option value="NS" >NS</option> 
     <option value="PTR" >PTR</option> 
     <option value="SOA" >SOA</option> 
     <option value="SPF" >SPF</option> 
     <option value="SRV" >SRV</option> 
     <option value="SSHFP" >SSHFP</option> 
     <option value="TXT" >TXT</option> 
     <option value="RP" >RP</option> 
     <option value="URL" >URL</option> 
     <option value="MBOXFW" >MBOXFW</option> 
     <option value="CURL" >CURL</option> 
     </select> 
     </td> 
     <td class="u"><input class="wide" name="record[14][content]" value='exampledomain1.com'></td> 

<td class="u"><input class="wide" name="record[15][name]" value="exampledomain3.com"></td> 
     <td class="u"> 
     <select name="record[15][type]"> 
     <option SELECTED value="A" >A</option> 
     <option value="AAAA" >AAAA</option> 
     <option value="CNAME" >CNAME</option> 
     <option value="HINFO" >HINFO</option> 
     <option value="MX" >MX</option> 
     <option value="NAPTR" >NAPTR</option> 
     <option value="NS" >NS</option> 
     <option value="PTR" >PTR</option> 
     <option value="SOA" >SOA</option> 
     <option value="SPF" >SPF</option> 
     <option value="SRV" >SRV</option> 
     <option value="SSHFP" >SSHFP</option> 
     <option value="TXT" >TXT</option> 
     <option value="RP" >RP</option> 
     <option value="URL" >URL</option> 
     <option value="MBOXFW" >MBOXFW</option> 
     <option value="CURL" >CURL</option> 
     </select> 
     </td> 
     <td class="u"><input class="wide" name="record[15][content]" value='10.10.10.3'></td> 
</tr> 
</table> 

What I want is to parse the values and print them like this:

exampledomain1.com A 10.10.10.1 
exampledomain2.com CNAME exampledomain1.com 
exampledomain3.com A 10.10.10.3 

Here is what I tried:

#!/usr/bin/python 
import lxml.html 
from lxml import etree 

doc = lxml.html.document_fromstring("""Here whole html data""") 
txt1 = doc.xpath('//*[@class="wide"]/@value') 
txt2 = doc.xpath('//@SELECTED/text()') 
print txt1 
print txt2 

But it does not work the way I want. Any help would be greatly appreciated.

Thanks, everyone.

+4

Running "xmllint --noout" on your HTML reports 7 errors. You should fix them before parsing it. – 2012-07-31 16:33:17

+0

How exactly does it "not work the way you want"? – 2012-07-31 17:11:49

+1

Use BeautifulSoup .. it is simple and easy – Surya 2012-08-01 14:55:47

Answers

3

I fixed your code so that it returns the following, which is very close to what you asked for:

(py26_default)[[email protected] ~]$ python parse.py 
exampledomain1.com 10.10.10.1 
exampledomain2.com exampledomain1.com 
exampledomain3.com 10.10.10.3 
(py26_default)[[email protected] ~]$ 

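I could not retrieve record[13][type] using XPath... there are other ways to iterate through this, but I leave that as an exercise for the OP. Note that I fixed the HTML from the OP's question so that it includes the <table><tr> tags...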

import lxml.html 
from lxml import etree 
from lxml.etree import XMLParser 

parser = XMLParser(ns_clean=True, recover=True) 
doc = etree.fromstring("""Here whole html data""", parser) 
elem1 = doc.xpath('//input[@name="record[13][name]"]') 
# NOTE: <option SELECTED> cannot be retrieved with xpath... SELECTED must have 
# a value to do so... 
#elem2 = doc.xpath('//select[@name="record[13][type]"]/option[@SELECTED]') 
elem3 = doc.xpath('//input[@name="record[13][content]"]') 

for idx, val in enumerate(elem1): 
    print val.attrib['value'], elem3[idx].attrib['value'] 
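
A hedged note on the SELECTED attribute: the snippet above uses an XML parser, and a valueless attribute such as SELECTED is not well-formed XML. lxml's HTML parser accepts it and, as far as I know, stores attribute names in lowercase, so the selected option can be matched as @selected. A minimal sketch, assuming a shortened hypothetical fragment (html_data) in place of the full HTML:

```python
import lxml.html

# Hypothetical, shortened fragment standing in for the full HTML from the question.
html_data = """
<table><tr>
<td><select name="record[13][type]">
<option SELECTED value="A" >A</option>
<option value="AAAA" >AAAA</option>
</select></td>
</tr></table>
"""

# Parse with the lenient HTML parser instead of an XML parser.
doc = lxml.html.fromstring(html_data)

# The HTML parser lowercases attribute names, so SELECTED is matched as "selected".
selected = doc.xpath('//select[@name="record[13][type]"]/option[@selected]/@value')
print(selected)
```

Reading @value rather than text() returns the submitted value of the option, which matters when the visible label and the value differ.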

<!-- The (fixed) html source I used --> 
<table> 
<tr> 
<td class="u"><input class="wide" name="record[13][name]" value="exampledomain1.com"></td> 
     <td class="u"> 
     <select name="record[13][type]"> 
     <option SELECTED value="A" >A</option> 
     <option value="AAAA" >AAAA</option> 
     <option value="CNAME" >CNAME</option> 
     <option value="HINFO" >HINFO</option> 
     <option value="MX" >MX</option> 
     <option value="NAPTR" >NAPTR</option> 
     <option value="NS" >NS</option> 
     <option value="PTR" >PTR</option> 
     <option value="SOA" >SOA</option> 
     <option value="SPF" >SPF</option> 
     <option value="SRV" >SRV</option> 
     <option value="SSHFP" >SSHFP</option> 
     <option value="TXT" >TXT</option> 
     <option value="RP" >RP</option> 
     <option value="URL" >URL</option> 
     <option value="MBOXFW" >MBOXFW</option> 
     <option value="CURL" >CURL</option> 
     </select> 
     </td> 
     <td class="u"><input class="wide" name="record[13][content]" value='10.10.10.1'></td> 

<td class="u"><input class="wide" name="record[13][name]" value="exampledomain2.com"></td> 
     <td class="u"> 
     <select name="record[13][type]"> 
     <option SELECTED value="CNAME" >A</option> 
     <option value="AAAA" >AAAA</option> 
     <option value="CNAME" >CNAME</option> 
     <option value="HINFO" >HINFO</option> 
     <option value="MX" >MX</option> 
     <option value="NAPTR" >NAPTR</option> 
     <option value="NS" >NS</option> 
     <option value="PTR" >PTR</option> 
     <option value="SOA" >SOA</option> 
     <option value="SPF" >SPF</option> 
     <option value="SRV" >SRV</option> 
     <option value="SSHFP" >SSHFP</option> 
     <option value="TXT" >TXT</option> 
     <option value="RP" >RP</option> 
     <option value="URL" >URL</option> 
     <option value="MBOXFW" >MBOXFW</option> 
     <option value="CURL" >CURL</option> 
     </select> 
     </td> 
     <td class="u"><input class="wide" name="record[13][content]" value='exampledomain1.com'></td> 

<td class="u"><input class="wide" name="record[13][name]" value="exampledomain3.com"></td> 
     <td class="u"> 
     <select name="record[13][type]"> 
     <option SELECTED value="A" >A</option> 
     <option value="AAAA" >AAAA</option> 
     <option value="CNAME" >CNAME</option> 
     <option value="HINFO" >HINFO</option> 
     <option value="MX" >MX</option> 
     <option value="NAPTR" >NAPTR</option> 
     <option value="NS" >NS</option> 
     <option value="PTR" >PTR</option> 
     <option value="SOA" >SOA</option> 
     <option value="SPF" >SPF</option> 
     <option value="SRV" >SRV</option> 
     <option value="SSHFP" >SSHFP</option> 
     <option value="TXT" >TXT</option> 
     <option value="RP" >RP</option> 
     <option value="URL" >URL</option> 
     <option value="MBOXFW" >MBOXFW</option> 
     <option value="CURL" >CURL</option> 
     </select> 
     </td> 
     <td class="u"><input class="wide" name="record[13][content]" value='10.10.10.3'></td> 
</tr> 
</table> 
+0

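Hi Mike, the number in the field name="record[13]..." changes for each of the other DNS records; I have corrected that in the HTML in the question. So in this case '//input[@name="record[13][name]"]' will not catch the records with different numbers. Can I define a wildcard or a range in it? – Manish 2012-08-01 15:01:42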

+0

You can solve this with [lxml's regular expressions](http://stackoverflow.com/a/2756994/667301) – 2012-08-01 15:26:26

+0

Thanks Mike, I got it working with regular expressions, but I am still stuck on getting the SELECTED value – Manish 2012-08-02 16:13:49
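
The wildcard match asked about in these comments can be expressed with the EXSLT regular-expression functions that lxml exposes in XPath. The sketch below is only an illustration: the html_data fragment is a shortened, hypothetical stand-in for the real page, and the "re" prefix is an arbitrary name bound to the EXSLT namespace.

```python
import lxml.html

# Namespace URI of the EXSLT regular-expression functions supported by lxml.
REGEXP_NS = "http://exslt.org/regular-expressions"

# Hypothetical, shortened fragment standing in for the full HTML from the question.
html_data = """
<table><tr>
<td><input class="wide" name="record[13][name]" value="exampledomain1.com"></td>
<td><input class="wide" name="record[13][content]" value='10.10.10.1'></td>
<td><input class="wide" name="record[14][name]" value="exampledomain2.com"></td>
<td><input class="wide" name="record[14][content]" value='exampledomain1.com'></td>
</tr></table>
"""

doc = lxml.html.fromstring(html_data)

# Match every input named record[<digits>][name], whatever the record number is.
names = doc.xpath(
    r'//input[re:test(@name, "^record\[\d+\]\[name\]$")]/@value',
    namespaces={"re": REGEXP_NS},
)
print(names)
```

The square brackets are escaped because they are literal characters in the field name, not a character class.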

0
import lxml.html

# lxml.html parses the HTML leniently; substitute the full HTML from the question.
tree = lxml.html.fromstring("""Here whole html data""")

# name and content are <input> elements, so read their value attribute;
# the HTML parser lowercases SELECTED, so the chosen option matches @selected.
record_13_name = tree.xpath("//input[@name='record[13][name]']/@value")
record_13_type = tree.xpath("//select[@name='record[13][type]']/option[@selected]/@value")
record_13_content = tree.xpath("//input[@name='record[13][content]']/@value")

record_14_name = tree.xpath("//input[@name='record[14][name]']/@value")
record_14_type = tree.xpath("//select[@name='record[14][type]']/option[@selected]/@value")
record_14_content = tree.xpath("//input[@name='record[14][content]']/@value")

record_15_name = tree.xpath("//input[@name='record[15][name]']/@value")
record_15_type = tree.xpath("//select[@name='record[15][type]']/option[@selected]/@value")
record_15_content = tree.xpath("//input[@name='record[15][content]']/@value")
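
Rather than writing three queries per record number, the queries can be generated in a loop. The following is a rough sketch under stated assumptions, not a definitive implementation: extract_records and NAME_FIELD are hypothetical names, html_data is a shortened stand-in for the real page (with the SELECTED flag placed on the matching option), and the selected option is read through @selected because lxml's HTML parser lowercases attribute names.

```python
import re

import lxml.html

# Hypothetical, shortened stand-in for the HTML in the question (two records only).
html_data = """
<table><tr>
<td class="u"><input class="wide" name="record[13][name]" value="exampledomain1.com"></td>
<td class="u"><select name="record[13][type]">
<option SELECTED value="A" >A</option>
<option value="CNAME" >CNAME</option>
</select></td>
<td class="u"><input class="wide" name="record[13][content]" value='10.10.10.1'></td>
<td class="u"><input class="wide" name="record[14][name]" value="exampledomain2.com"></td>
<td class="u"><select name="record[14][type]">
<option value="A" >A</option>
<option SELECTED value="CNAME" >CNAME</option>
</select></td>
<td class="u"><input class="wide" name="record[14][content]" value='exampledomain1.com'></td>
</tr></table>
"""

# Captures the record number from field names such as record[13][name].
NAME_FIELD = re.compile(r"^record\[(\d+)\]\[name\]$")


def extract_records(html_text):
    """Return (name, type, content) tuples, one per record number found."""
    tree = lxml.html.fromstring(html_text)
    records = []
    for field in tree.xpath("//input[@name]"):
        match = NAME_FIELD.match(field.get("name"))
        if match is None:
            continue  # not a [name] field; [content] inputs are looked up below
        number = match.group(1)
        rtype = tree.xpath(
            "//select[@name='record[%s][type]']/option[@selected]/@value" % number
        )
        content = tree.xpath(
            "//input[@name='record[%s][content]']/@value" % number
        )
        records.append((
            field.get("value"),
            rtype[0] if rtype else "",
            content[0] if content else "",
        ))
    return records


records = extract_records(html_data)
for record in records:
    print(" ".join(record))
```

Grouping by the record number, instead of pairing results by list position, keeps the three fields of a record together even if one of them is missing from the page.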