2013-12-13

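A similar question was asked here (Python XML Parsing), but I could not get at the content I am interested in: how to parse the deeply nested structure of an XML string using Python.

I need to extract all the information enclosed in each patent-classification tag whose classification-scheme child has the scheme value CPC. There are several such elements, all contained within the patent-classifications tag.

In the example given below there are three such values: C 07 K 16 22 I, A 61 K 2039 505 A, and C 07 K 2317 92 A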

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?> 
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink"> 
    <ops:meta name="elapsed-time" value="21"/> 
    <exchange-documents> 
     <exchange-document system="ops.epo.org" family-id="39103486" country="US" doc-number="2009234106" kind="A1"> 
      <bibliographic-data> 
       <publication-reference> 
        <document-id document-id-type="docdb"> 
         <country>US</country> 
         <doc-number>2009234106</doc-number> 
         <kind>A1</kind> 
         <date>20090917</date> 
        </document-id> 
        <document-id document-id-type="epodoc"> 
         <doc-number>US2009234106</doc-number> 
         <date>20090917</date> 
        </document-id> 
       </publication-reference> 
       <classifications-ipcr> 
        <classification-ipcr sequence="1"> 
         <text>C07K 16/ 44   A I     </text> 
        </classification-ipcr> 
       </classifications-ipcr> 
       <patent-classifications> 
        <patent-classification sequence="1"> 
         <classification-scheme office="" scheme="CPC"/> 
         <section>C</section> 
         <class>07</class> 
         <subclass>K</subclass> 
         <main-group>16</main-group> 
         <subgroup>22</subgroup> 
         <classification-value>I</classification-value> 
        </patent-classification> 
        <patent-classification sequence="2"> 
         <classification-scheme office="" scheme="CPC"/> 
         <section>A</section> 
         <class>61</class> 
         <subclass>K</subclass> 
         <main-group>2039</main-group> 
         <subgroup>505</subgroup> 
         <classification-value>A</classification-value> 
        </patent-classification> 
        <patent-classification sequence="7"> 
         <classification-scheme office="" scheme="CPC"/> 
         <section>C</section> 
         <class>07</class> 
         <subclass>K</subclass> 
         <main-group>2317</main-group> 
         <subgroup>92</subgroup> 
         <classification-value>A</classification-value> 
        </patent-classification> 
        <patent-classification sequence="1"> 
         <classification-scheme office="US" scheme="UC"/> 
         <classification-symbol>530/387.9</classification-symbol> 
        </patent-classification> 
       </patent-classifications> 
      </bibliographic-data> 
     </exchange-document> 
    </exchange-documents> 
</ops:world-patent-data> 

Answers

1

You can use the xml module from the Python standard library:

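import xml.etree.ElementTree as ET

root = ET.parse('a.xml').getroot()

# Select the parent (patent-classification) of every classification-scheme with scheme="CPC"
for node in root.iterfind(".//{http://www.epo.org/exchange}classification-scheme[@scheme='CPC']/.."):
    # Collect the text of each child; the empty classification-scheme element has none
    data = []
    for d in node:  # iterate directly; getchildren() was removed in Python 3.9
        if d.text:
            data.append(d.text)
    print(' '.join(data))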
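
Since the question is about an XML string rather than a file, here is a minimal sketch of the same idea using ET.fromstring and a namespace map. The function name cpc_codes, the prefix ex, and the trimmed SAMPLE document are made up for illustration and are not part of the original answer; the sketch also assumes each classification-scheme is a direct child of its patent-classification, as in the sample above.

```python
import xml.etree.ElementTree as ET

# Hypothetical prefix "ex" mapped to the default namespace declared on the root element.
NS = {"ex": "http://www.epo.org/exchange"}


def cpc_codes(xml_text):
    """Return one space-joined string per patent-classification whose scheme is CPC."""
    root = ET.fromstring(xml_text)
    codes = []
    for pc in root.iterfind(".//ex:patent-classification", NS):
        scheme = pc.find("ex:classification-scheme", NS)
        # Skip entries that have no scheme element or use another scheme (e.g. UC).
        if scheme is None or scheme.get("scheme") != "CPC":
            continue
        # Keep the non-empty text of each direct child, in document order.
        parts = [c.text.strip() for c in pc if c.text and c.text.strip()]
        codes.append(" ".join(parts))
    return codes


# Trimmed-down illustration of the document in the question (assumed, not the full file).
SAMPLE = """<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org">
  <exchange-documents><exchange-document><bibliographic-data>
    <patent-classifications>
      <patent-classification sequence="1">
        <classification-scheme office="" scheme="CPC"/>
        <section>C</section><class>07</class><subclass>K</subclass>
        <main-group>16</main-group><subgroup>22</subgroup>
        <classification-value>I</classification-value>
      </patent-classification>
      <patent-classification sequence="1">
        <classification-scheme office="US" scheme="UC"/>
        <classification-symbol>530/387.9</classification-symbol>
      </patent-classification>
    </patent-classifications>
  </bibliographic-data></exchange-document></exchange-documents>
</ops:world-patent-data>"""

print(cpc_codes(SAMPLE))
```

Passing a prefix map to iterfind and find avoids repeating the full {http://www.epo.org/exchange} URI in every path, and checking the scheme attribute in Python keeps the path expression simple.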

2

Install BeautifulSoup if you don't have it:

$ pip install beautifulsoup4

Then try this:

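from bs4 import BeautifulSoup

# Read the XML document from disk
with open('example.xml', 'rb') as f:
    xml = f.read()

# html.parser ships with Python; naming it avoids the "no parser specified" warning
bs = BeautifulSoup(xml, 'html.parser')

# find every patent-classification element
patents = bs.find_all('patent-classification')
# keep only the ones whose classification-scheme is CPC
for pa in patents:
    if pa.find('classification-scheme', {'scheme': 'CPC'}):
        print(pa.get_text(' ', strip=True))

+1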

Thanks, but is 'xml' being used as a variable? – user1140126

+0

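Well, the xml variable is where you load your XML. To try the exact code, create a file named 'example.xml' and paste into it what you posted in the question. I have also edited my answer; I was missing a line. Thanks – PepperoniPizza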

+0

@user1140126 Check the answer again, I have updated it. I was missing a line – PepperoniPizza