2012-02-28 140 views
2

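Parsing XML with Python

I have several large .xml files. I want to parse the files to do a few things.

I want to pull out just:

  • XML/title1, and save it to list A (for example)
  • XML/title2, and save it to list B
  • XML/title3, and save it to list C
  • etc., etc.

Which library would be best to import/use with Python 2.x? How would I set that up? Any suggestions?

For example: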

<PubmedArticle> 
    <MedlineCitation Owner="NLM" Status="MEDLINE"> 
     <PMID Version="1">8981971</PMID> 
     <Article PubModel="Print"> 
      <Journal> 
       <ISSN IssnType="Print">0002-9297</ISSN> 
       <JournalIssue CitedMedium="Print"> 
        <Volume>60</Volume> 
        <Issue>1</Issue> 
        <PubDate> 
         <Year>1997</Year> 
         <Month>Jan</Month> 
        </PubDate> 
       </JournalIssue> 
       <Title>American journal of human genetics</Title> 
       <ISOAbbreviation>Am. J. Hum. Genet.</ISOAbbreviation> 
      </Journal> 
      <ArticleTitle>mtDNA and Y chromosome-specific polymorphisms in modern Ojibwa: implications about the origin of their gene pool.</ArticleTitle> 
      <Pagination> 
       <MedlinePgn>241-4</MedlinePgn> 
      </Pagination> 
      <AuthorList CompleteYN="Y"> 
       <Author ValidYN="Y"> 
        <LastName>Scozzari</LastName> 
        <ForeName>R</ForeName> 
        <Initials>R</Initials> 
       </Author> 
      </AuthorList> 
     </Article> 
     <MeshHeadingList> 
      <MeshHeading> 
       <DescriptorName MajorTopicYN="N">Alleles</DescriptorName> 
      </MeshHeading> 
      <MeshHeading> 
       <DescriptorName MajorTopicYN="Y">Y Chromosome</DescriptorName> 
      </MeshHeading> 
     </MeshHeadingList> 
     <OtherID Source="NLM">PMC1712541</OtherID> 
    </MedlineCitation> 
</PubmedArticle> 
+1

I would use 'xml.dom.minidom' for this; it comes with Python and works fine. 'lxml' is another good library, but you have to install it. – kindall 2012-02-28 18:49:11
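
A minimal sketch of the minidom route mentioned in the comment above. The XML string is a trimmed stand-in for the question's record, and `texts_for_tag` is a hypothetical helper name, not part of minidom. Note that minidom builds the whole tree in memory, which may matter for large files.

```python
from xml.dom import minidom

# Trimmed stand-in for the question's record (assumption: same tag names).
XML_TEXT = """<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID Version="1">8981971</PMID>
    <Article PubModel="Print">
      <Journal>
        <Title>American journal of human genetics</Title>
      </Journal>
      <ArticleTitle>mtDNA and Y chromosome-specific polymorphisms in modern Ojibwa</ArticleTitle>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""


def texts_for_tag(dom, tag_name):
    """Hypothetical helper: text content of every <tag_name> element."""
    results = []
    for node in dom.getElementsByTagName(tag_name):
        # An element's text lives in its child TEXT_NODEs, not on the element.
        parts = [c.data for c in node.childNodes if c.nodeType == c.TEXT_NODE]
        results.append("".join(parts))
    return results


dom = minidom.parseString(XML_TEXT)  # minidom.parse(path) reads from a file
journal_titles = texts_for_tag(dom, "Title")          # "list A"
article_titles = texts_for_tag(dom, "ArticleTitle")   # "list B"
print(journal_titles)
print(article_titles)
```

Each tag of interest gets its own list, which is what the question asks for; adding a "list C" is one more call to the helper.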

Answers

2

Try Beautiful Soup. I have found this library handy. As was just pointed out, BeautifulStoneSoup is the class meant specifically for parsing XML.

+0

Specifically, BeautifulStoneSoup – Nishant 2012-02-28 18:53:28

+0

Thanks, I have updated my answer. – varunl 2012-02-28 18:56:49

+0

Thanks all, I chose BeautifulSoup as my route. I found the B.S. documentation clearer than lxml's. – oaxacamatt 2012-03-18 20:28:11

5

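Try taking a look at the lxml module.

To find the titles, you can use XPath with lxml, or you can use lxml's XML object structure to "index" your way down to the title elements.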
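
To illustrate what "indexing" down to a title element looks like, here is a rough sketch. It swaps in the standard library's xml.etree.ElementTree rather than lxml, on the assumption that nothing needs installing; lxml.etree follows the same Element interface for child indexing and `find`, so the idea carries over. The XML is a shortened stand-in for the question's sample.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the question's record (assumption: same nesting).
XML_TEXT = """<PubmedArticle>
  <MedlineCitation>
    <PMID>8981971</PMID>
    <Article>
      <Journal>
        <ISSN>0002-9297</ISSN>
        <Title>American journal of human genetics</Title>
      </Journal>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""

root = ET.fromstring(XML_TEXT)   # the <PubmedArticle> element

# Elements behave like lists of their child elements.
medline = root[0]      # <MedlineCitation>
article = medline[1]   # <Article> (index 0 is <PMID>)
journal = article[0]   # <Journal>
title = journal[1]     # <Title> (index 0 is <ISSN>)
title_text = title.text

# The same element reached by a path instead of positions.
same_title = root.find("MedlineCitation/Article/Journal/Title")
print(title_text)
```

Positional indexing breaks as soon as a record has an extra or missing child, so the path form (or XPath) is usually the more robust of the two.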

1

Try lxml with XPath expressions.

A small snippet:

>>> from lxml import etree 
>>> xml = """<foo><bar/>baz!</foo>""" 
>>> doc = etree.fromstring(xml) 
>>> doc.xpath('//foo/text()') #xpath expr 
['baz!'] 
>>> 

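If you have an XML file (or a file-like object):

from StringIO import StringIO  # Python 2.x; on Python 3 use io.StringIO instead
s = StringIO(xml)              # file-like object; etree.parse() also accepts a filename
doc = etree.parse(s) 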

You can use the Firebug add-on to get the XPath expression.

0

ElementTree is great, and it comes with Python.
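
A minimal sketch of how that could look for the question's "one list per tag" requirement. The sample XML is a cut-down stand-in for the posted record, and `collect_by_tag` is a hypothetical helper name, not something ElementTree provides.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for the question's record (assumption: same tag names).
SAMPLE_XML = """<PubmedArticle>
  <MedlineCitation>
    <Article>
      <Journal><Title>American journal of human genetics</Title></Journal>
      <ArticleTitle>mtDNA and Y chromosome-specific polymorphisms in modern Ojibwa</ArticleTitle>
      <AuthorList>
        <Author><LastName>Scozzari</LastName></Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""


def collect_by_tag(root, tags):
    """Hypothetical helper: map each tag name to a list of its text values."""
    found = {tag: [] for tag in tags}
    for elem in root.iter():  # walks the root and every descendant
        if elem.tag in found and elem.text:
            found[elem.tag].append(elem.text.strip())
    return found


root = ET.fromstring(SAMPLE_XML)  # for a file: ET.parse(path).getroot()
lists = collect_by_tag(root, ["Title", "ArticleTitle", "LastName"])
list_a = lists["Title"]
list_b = lists["ArticleTitle"]
list_c = lists["LastName"]
print(list_a, list_b, list_c)
```

A single pass over the tree fills all the lists at once, which avoids re-scanning a large document once per tag.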

2

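I'm not sure why you want each title in its own list, which is what your question leads me to believe.

How about all the titles in one list? The following example uses a trimmed-down version of your sample XML, plus an <Article/> that I duplicated, to show how lxml.etree's xpath builds your list of <Title/> elements: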

>>> import lxml.etree 

>>> xml_text = """<PubmedArticle> 
    <MedlineCitation Owner="NLM" Status="MEDLINE"> 
    <PMID Version="1">8981971</PMID> 
    <Article PubModel="Print"> 
     <Journal> 
     <ISSN IssnType="Print">0002-9297</ISSN> 
     <!-- <JournalIssue ... /> --> 
     <Title>American journal of human genetics</Title> 
     <ISOAbbreviation>Am. J. Hum. Genet.</ISOAbbreviation> 
     </Journal> 
     <ArticleTitle>mtDNA and Y chromosome-specific polymorphisms in modern Ojibwa: implications about the origin of their gene pool.</ArticleTitle> 
     <!--<Pagination> 
      ... 
      </MeshHeadingList>--> 
     <OtherID Source="NLM">PMC1712541</OtherID> 
    </Article> 
    <Article PubModel="Print"> 
     <Journal> 
     <ISSN IssnType="Print">9297-0002</ISSN> 
     <!-- <JournalIssue ... /> --> 
     <Title>American Journal of Pediatrics</Title> 
     <ISOAbbreviation>Am. J. Ped.</ISOAbbreviation> 
     </Journal> 
     <ArticleTitle>Healthy Foo, Healthy Bar</ArticleTitle> 
     <!--<Pagination> 
      ... 
      </MeshHeadingList>--> 
     <OtherID Source="NLM">PMC1712541</OtherID> 
    </Article> 
    </MedlineCitation> 
</PubmedArticle>""" 

The XPath describes the nodes you want returned; lxml.etree's xpath hands them back as a Python list of node objects:

>>> xml_obj = lxml.etree.fromstring(xml_text) 
>>> for title_obj in xml_obj.xpath('//Article/Journal/Title'): 
     print title_obj.text 

American journal of human genetics 
American Journal of Pediatrics 

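Edit 1: Now with Python's xml.etree.ElementTree

I wanted to show this solution using the bundled module, in case installing a third-party module is not feasible or is unappealing.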

>>> import xml.etree.ElementTree as ETree 
>>> element = ETree.fromstring(xml_text) 
>>> xml_obj = ETree.ElementTree(element) 
>>> for title_obj in xml_obj.findall('.//Article/Journal/Title'): 
    print title_obj.text 


American journal of human genetics 
American Journal of Pediatrics 

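It's a small thing, but this XPath is not identical to the XPath in the lxml example: there is a period (".") at the beginning. Without the period, I got this warning (with Python 2.7.2):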

>>> xml_obj.findall('//Article/Journal/Title') 

Warning (from warnings module): 
    File "__main__", line 1 
FutureWarning: This search is broken in 1.3 and earlier, and will be fixed in a future version. If you rely on the current behaviour, change it to './/Article/Journal/Title' 
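
As a rough illustration of why the leading period matters: ElementTree paths are relative to the element they are called on, and ".//" means "any descendant of this element". The sketch below calls `findall` directly on the root Element (an assumption on my part that the ElementTree wrapper is not needed here) and uses a compact stand-in for the XML above.

```python
import xml.etree.ElementTree as ET

# Compact stand-in for the two-article sample above.
XML_TEXT = """<PubmedArticle><MedlineCitation>
<Article><Journal><Title>American journal of human genetics</Title></Journal></Article>
<Article><Journal><Title>American Journal of Pediatrics</Title></Journal></Article>
</MedlineCitation></PubmedArticle>"""

root = ET.fromstring(XML_TEXT)  # the <PubmedArticle> element

# ".//" searches all descendants of root, at any depth.
titles = [t.text for t in root.findall(".//Article/Journal/Title")]

# Without ".//" the path starts at root's direct children; <Article> sits
# one level deeper (under <MedlineCitation>), so nothing matches.
direct_only = root.findall("Article/Journal/Title")

print(titles)
print(direct_only)
```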
+0

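I finally got around to looking at all the posted answers. Thanks for your efforts! I was able to install the lxml library without problems, but I had a hell of a time getting through the documentation. At that point I just couldn't wrap my head around it. I found BeautifulSoup's documentation easier to work with. – oaxacamatt 2012-03-18 20:21:38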