2017-05-05 39 views
1

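Parsing an XML file with many children and grandchildren

I'm very new to Python. I have a very large XML file that I want to extract some data from. Here is an excerpt: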

<program> 
    <id>38e072a7-8fc9-4f9a-8eac-3957905c0002</id> 
    <programID>3853</programID> 
    <orchestra>New York Philharmonic</orchestra> 
    <season>1842-43</season> 
    <concertInfo> 
     <eventType>Subscription Season</eventType> 
     <Location>Manhattan, NY</Location> 
     <Venue>Apollo Rooms</Venue> 
     <Date>1842-12-07T05:00:00Z</Date> 
     <Time>8:00PM</Time> 
    </concertInfo> 
    <worksInfo> 
     <work ID="52446*"> 
      <composerName>Beethoven, Ludwig van</composerName> 
      <workTitle>SYMPHONY NO. 5 IN C MINOR, OP.67</workTitle> 
      <conductorName>Hill, Ureli Corelli</conductorName> 
     </work> 
     <work ID="8834*4"> 
      <composerName>Weber, Carl Maria Von</composerName> 
      <workTitle>OBERON</workTitle> 
      <movement>"Ozean, du Ungeheuer" (Ocean, thou mighty monster), Reiza (Scene and Aria), Act II</movement> 
      <conductorName>Timm, Henry C.</conductorName> 
      <soloists> 
       <soloist> 
        <soloistName>Otto, Antoinette</soloistName> 
        <soloistInstrument>Soprano</soloistInstrument> 
        <soloistRoles>S</soloistRoles> 
       </soloist> 
      </soloists> 
     </work> 
     <work ID="3642*"> 
      <composerName>Hummel, Johann</composerName> 
      <workTitle>QUINTET, PIANO, D MINOR, OP. 74</workTitle> 
      <soloists> 
       <soloist> 
        <soloistName>Scharfenberg, William</soloistName> 
        <soloistInstrument>Piano</soloistInstrument> 
        <soloistRoles>A</soloistRoles> 
       </soloist> 
       <soloist> 
        <soloistName>Hill, Ureli Corelli</soloistName> 
        <soloistInstrument>Violin</soloistInstrument> 
        <soloistRoles>A</soloistRoles> 
       </soloist> 
       <soloist> 
        <soloistName>Derwort, G. H.</soloistName> 
        <soloistInstrument>Viola</soloistInstrument> 
        <soloistRoles>A</soloistRoles> 
       </soloist> 
       <soloist> 
        <soloistName>Boucher, Alfred</soloistName> 
        <soloistInstrument>Cello</soloistInstrument> 
        <soloistRoles>A</soloistRoles> 
       </soloist> 
       <soloist> 
        <soloistName>Rosier, F. W.</soloistName> 
        <soloistInstrument>Contrabass</soloistInstrument> 
        <soloistRoles>A</soloistRoles> 
       </soloist> 
      </soloists> 
     </work> 
     <work ID="0*"> 
      <interval>Intermission</interval> 
     </work> 
     <work ID="8834*3"> 
      <composerName>Weber, Carl Maria Von</composerName> 
      <workTitle>OBERON</workTitle> 
      <movement>Overture</movement> 
      <conductorName>Etienne, Denis G.</conductorName> 
     </work> 
     <work ID="8835*1"> 
      <composerName>Rossini, Gioachino</composerName> 
      <workTitle>ARMIDA</workTitle> 
      <movement>Duet</movement> 
      <conductorName>Timm, Henry C.</conductorName> 
      <soloists> 
       <soloist> 
        <soloistName>Otto, Antoinette</soloistName> 
        <soloistInstrument>Soprano</soloistInstrument> 
        <soloistRoles>S</soloistRoles> 
       </soloist> 
       <soloist> 
        <soloistName>Horn, Charles Edward</soloistName> 
        <soloistInstrument>Tenor</soloistInstrument> 
        <soloistRoles>S</soloistRoles> 
       </soloist> 
      </soloists> 
     </work> 
     <work ID="8837*6"> 
      <composerName>Beethoven, Ludwig van</composerName> 
      <workTitle>FIDELIO, OP. 72</workTitle> 
      <movement>"In Des Lebens Fruhlingstagen...O spur ich nicht linde," Florestan (aria)</movement> 
      <conductorName>Timm, Henry C.</conductorName> 
      <soloists> 
       <soloist> 
        <soloistName>Horn, Charles Edward</soloistName> 
        <soloistInstrument>Tenor</soloistInstrument> 
        <soloistRoles>S</soloistRoles> 
       </soloist> 
      </soloists> 
     </work> 
     <work ID="8336*4"> 
      <composerName>Mozart, Wolfgang Amadeus</composerName> 
      <workTitle>ABDUCTION FROM THE SERAGLIO,THE, K.384</workTitle> 
      <movement>"Ach Ich liebte," Konstanze (aria)</movement> 
      <conductorName>Timm, Henry C.</conductorName> 
      <soloists> 
       <soloist> 
        <soloistName>Otto, Antoinette</soloistName> 
        <soloistInstrument>Soprano</soloistInstrument> 
        <soloistRoles>S</soloistRoles> 
       </soloist> 
      </soloists> 
     </work> 
     <work ID="5543*"> 
      <composerName>Kalliwoda, Johann W.</composerName> 
      <workTitle>OVERTURE NO. 1, D MINOR, OP. 38</workTitle> 
      <conductorName>Timm, Henry C.</conductorName> 
     </work> 
    </worksInfo> 
</program> 
<program> 

What I want to do is extract the following information: programID, orchestra, season, eventType, work ID, soloistName, soloistInstrument, soloistRoles

Here is the code I'm using:

import csv 
import xml.etree.cElementTree as ET 
tree = ET.iterparse('complete.xml.txt') 
#root = tree.getroot() 


for program in root.iter('program'): 
    ID = program.findtext('id') 
    programID = program.findtext('programID') 
    orchestra = program.findtext('orchestra') 
    season = program.findtext('season') 

    for concert in program.findall('concertInfo'): 
     event = concert.findtext('eventType') 

    for worksInfo in program.findall('worksInfo'): 
     for work in worksInfo.iter('work'): 
      workid = work.get('ID') 
      for soloists in work.iter('soloists'): 
       for soloist in soloists.iter('soloist'): 
        soloname = soloist.findtext('soloistName') 
        soloinstrument = soloist.findtext('soloistInstrument') 
        solorole = soloist.findtext('soloistRoles') 
        #print(soloname, soloinstrument, solorole) 
      #print(workid) 
    #print(event)    
#print(programID , " , " , orchestra , " , " , season) 
with open("nyphil.txt","a") as nyphil: 
    nyphilwriter = csv.writer(nyphil) 
    nyphilwriter.writerow([programID, orchestra, season, event, workid, soloname.encode('utf-8'), soloinstrument, solorole]) 
nyphil.close() 

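When I run this code, I only get the last soloistName and soloistInstrument. The result I have in mind is something like repeated observations for each program, so that I would get this:

13918, New York Philharmonic, 1842-43, Subscription Season, 52446*, Otto, Antoinette, Soprano, S

13918, ..., 3642*, Scharfenberg, William, Piano, A

13918, ..., 3642*, Hill, Ureli Corelli, Violin, A

and so on, until the last work ID:

13918, ..., 8336*4, Otto, Antoinette, Soprano, S

What I get instead is only the last work:

13918, New York Philharmonic, 1842-43, Subscription Season, 8336*, Otto, Antoinette, Soprano, S

The file contains more than 15,000 programs like the example I posted. I want to parse all of them and extract the information listed above. I'm not sure how to go about this. I've searched the internet for ways to do it, but nothing I've tried has worked!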

Answers

0

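Your problem here is that you've misunderstood how loops work. In particular, the loop variable only changes while you are inside the loop: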

for x in range(10): 
    pass 

print(x) # prints 9 

versus

for x in range(10): 
    print(x) 

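These are two different things. You are doing the former: your `writerow` call sits after the loops, so it only sees the last values. What you need to do is this: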

import csv
import xml.etree.ElementTree as ET

# Assumes the file has a single root element wrapping all <program> elements
root = ET.parse('complete.xml.txt').getroot()

with open('nyphil.txt', 'w', newline='') as f:
    nyphilwriter = csv.writer(f)
    for program in root.iter('program'):
        id_ = program.findtext('id')
        program_id = program.findtext('programID')
        orchestra = program.findtext('orchestra')
        season = program.findtext('season')
        event = None  # stays None if the program has no <concertInfo>
        for concert in program.findall('concertInfo'):
            event = concert.findtext('eventType')
        for info in program.findall('worksInfo'):
            for work in info.iter('work'):
                work_id = work.get('ID')
                for soloists in work.iter('soloists'):
                    for soloist in soloists.iter('soloist'):
                        # Change this line to whatever you want to write out
                        nyphilwriter.writerow([id_, program_id, orchestra, season, event, work_id, soloist.findtext('soloistName')])
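
As a self-contained illustration of the same idea, here is a minimal sketch that produces one row per soloist and also includes the instrument and role fields the question asks for. Several things here are assumptions made for illustration: the `SAMPLE` string is a cut-down version of the excerpt wrapped in a hypothetical `<programs>` root, the `extract_rows` helper name is invented, and the rows go to an in-memory buffer rather than to `nyphil.txt`.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Cut-down sample; <programs> is a hypothetical wrapper, not part of the real file.
SAMPLE = """<programs>
<program>
    <programID>3853</programID>
    <orchestra>New York Philharmonic</orchestra>
    <season>1842-43</season>
    <concertInfo><eventType>Subscription Season</eventType></concertInfo>
    <worksInfo>
     <work ID="52446*">
      <composerName>Beethoven, Ludwig van</composerName>
     </work>
     <work ID="8835*1">
      <soloists>
       <soloist>
        <soloistName>Otto, Antoinette</soloistName>
        <soloistInstrument>Soprano</soloistInstrument>
        <soloistRoles>S</soloistRoles>
       </soloist>
       <soloist>
        <soloistName>Horn, Charles Edward</soloistName>
        <soloistInstrument>Tenor</soloistInstrument>
        <soloistRoles>S</soloistRoles>
       </soloist>
      </soloists>
     </work>
    </worksInfo>
</program>
</programs>"""


def extract_rows(root):
    """Yield one row per soloist; works without soloists yield nothing."""
    for program in root.iter('program'):
        # Program-level fields are read once and repeated on every row.
        base = [
            program.findtext('programID'),
            program.findtext('orchestra'),
            program.findtext('season'),
            program.findtext('concertInfo/eventType'),
        ]
        for work in program.iterfind('worksInfo/work'):
            for soloist in work.iterfind('soloists/soloist'):
                yield base + [
                    work.get('ID'),
                    soloist.findtext('soloistName'),
                    soloist.findtext('soloistInstrument'),
                    soloist.findtext('soloistRoles'),
                ]


rows = list(extract_rows(ET.fromstring(SAMPLE)))

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)  # one CSV line per soloist
print(buffer.getvalue())
```

Because names such as `Otto, Antoinette` contain a comma, `csv.writer` quotes those fields, so each line should still read back as eight columns.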
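
Thank you so much! This is exactly what I needed. I'm new to all of this, and the way loops work really confused me. This was a huge help, thank you!

If this answer is the one that best solved your problem, you should mark it as accepted using the check mark on the left '<-----'. If you found it (or any other answer) particularly useful, you can also upvote it with the triangle above the number.

Hi Wayne, I upvoted it, but I have less than 15 reputation, so it isn't recorded :/ But your answer was very helpful!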

0

13918 does not appear in your data. That aside, here is what I wrote, and it seems to process your data successfully.

from lxml import etree

tree = etree.parse('test.xml')
programs = tree.xpath('.//program')

for program in programs:
    programID, orchestra, season = [program.xpath(_)[0].text for _ in ['programID', 'orchestra', 'season']]
    print(programID, orchestra, season)
    works = program.xpath('worksInfo/work')
    for work in works:
        workID = work.attrib['ID']
        soloistItems = work.xpath('soloists/soloist')
        for soloistItem in soloistItems:
            print(workID, soloistItem.find('soloistName').text, soloistItem.find('soloistInstrument').text, soloistItem.find('soloistRoles').text)

The script produces the following output.

3853 New York Philharmonic 1842-43 
8834*4 Otto, Antoinette Soprano S 
3642* Scharfenberg, William Piano A 
3642* Hill, Ureli Corelli Violin A 
3642* Derwort, G. H. Viola A 
3642* Boucher, Alfred Cello A 
3642* Rosier, F. W. Contrabass A 
8835*1 Otto, Antoinette Soprano S 
8835*1 Horn, Charles Edward Tenor S 
8837*6 Horn, Charles Edward Tenor S 
8336*4 Otto, Antoinette Soprano S 

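One thing to note: I wrapped your XML in a single enclosing tag at the start and end, because the real data will contain multiple <program> elements.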
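
The wrapping step might look like the sketch below. It assumes the file is a bare sequence of `<program>` elements with no XML declaration and no root element. The `<programs>` wrapper name and the `parse_fragments` helper are hypothetical, the second program ID is made up, and the sketch uses the standard library's `xml.etree.ElementTree` rather than `lxml` to avoid a dependency. Holding the whole text in memory is also an assumption that may not suit a very large file.

```python
import xml.etree.ElementTree as ET


def parse_fragments(text, wrapper='programs'):
    """Wrap root-less XML text in one enclosing element and parse it.

    `wrapper` is a hypothetical tag name; any name not used in the data works.
    """
    return ET.fromstring('<{0}>{1}</{0}>'.format(wrapper, text))


# Two top-level <program> elements: not well-formed XML on their own.
# The second programID is made up for illustration.
fragments = """
<program><programID>3853</programID></program>
<program><programID>5178</programID></program>
"""

root = parse_fragments(fragments)
program_ids = [p.findtext('programID') for p in root.findall('program')]
print(program_ids)
```

With a real file, the text passed to `parse_fragments` would come from reading the file, for example `open('complete.xml.txt', encoding='utf-8').read()`, where the encoding is a guess.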