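Parsing XML with Python

I have several large .xml files, and I want to parse them to do a few things.

I want to extract only:

everything under XML/TITLE1, saved to list A (for example)
everything under XML/TITLE2, saved to list B
everything under XML/TITLE3, saved to list C
etc., etc.

Which library would be best to import and use with Python 2.x? How would I set it up? Any suggestions?

For example: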

<PubmedArticle> 
    <MedlineCitation Owner="NLM" Status="MEDLINE"> 
     <PMID Version="1">8981971</PMID> 
     <Article PubModel="Print"> 
      <Journal> 
       <ISSN IssnType="Print">0002-9297</ISSN> 
       <JournalIssue CitedMedium="Print"> 
        <Volume>60</Volume> 
        <Issue>1</Issue> 
        <PubDate> 
         <Year>1997</Year> 
         <Month>Jan</Month> 
        </PubDate> 
       </JournalIssue> 
       <Title>American journal of human genetics</Title> 
       <ISOAbbreviation>Am. J. Hum. Genet.</ISOAbbreviation> 
      </Journal> 
      <ArticleTitle>mtDNA and Y chromosome-specific polymorphisms in modern Ojibwa: implications about the origin of their gene pool.</ArticleTitle> 
      <Pagination> 
       <MedlinePgn>241-4</MedlinePgn> 
      </Pagination> 
      <AuthorList CompleteYN="Y"> 
       <Author ValidYN="Y"> 
        <LastName>Scozzari</LastName> 
        <ForeName>R</ForeName> 
        <Initials>R</Initials> 
       </Author> 
      </AuthorList> 
     </Article> 
     <MeshHeadingList> 
      <MeshHeading> 
       <DescriptorName MajorTopicYN="N">Alleles</DescriptorName> 
      </MeshHeading> 
      <MeshHeading> 
       <DescriptorName MajorTopicYN="Y">Y Chromosome</DescriptorName> 
      </MeshHeading> 
     </MeshHeadingList> 
     <OtherID Source="NLM">PMC1712541</OtherID> 
    </MedlineCitation> 
</PubmedArticle>

2012-02-28 oaxacamatt

I would use 'xml.dom.minidom' for this; it comes with Python and works fine. 'lxml' is another good library, but you have to install it. – kindall 2012-02-28 18:49:11
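
For readers who want to see what the minidom suggestion might look like, here is a minimal sketch using only the standard library. The sample string is a cut-down stand-in for the question's XML, and `texts_by_tag` is a hypothetical helper name, not part of minidom. It relies on `getElementsByTagName`, which matches the exact tag name anywhere in the document.

```python
from xml.dom import minidom

# Cut-down sample in the shape of the question's XML (illustrative only).
SAMPLE = """<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID Version="1">8981971</PMID>
    <Article PubModel="Print">
      <Journal>
        <Title>American journal of human genetics</Title>
      </Journal>
      <ArticleTitle>Healthy Foo, Healthy Bar</ArticleTitle>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""


def texts_by_tag(dom, tag_name):
    """Hypothetical helper: text of every element named tag_name."""
    results = []
    for node in dom.getElementsByTagName(tag_name):
        # Join the direct text children of the element.
        parts = [child.data for child in node.childNodes
                 if child.nodeType == child.TEXT_NODE]
        results.append("".join(parts))
    return results


# For a file on disk, minidom.parse("some_file.xml") takes the place of parseString.
dom = minidom.parseString(SAMPLE)
journal_titles = texts_by_tag(dom, "Title")          # "list A"
article_titles = texts_by_tag(dom, "ArticleTitle")   # "list B"
```

Note that minidom builds the whole document in memory, so for very large files this may not be the most economical choice.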

Try Beautiful Soup. I find this library very handy. As was just pointed out, BeautifulStoneSoup is the class intended specifically for parsing XML.

2012-02-28 18:49:59 varunl

Specifically, BeautifulStoneSoup – Nishant 2012-02-28 18:53:28

Thanks, I have updated my answer. – varunl 2012-02-28 18:56:49

Thanks ALL, I went with BeautifulSoup. I found the B.S. documentation clearer than lxml's. – oaxacamatt 2012-03-18 20:28:11

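Try looking at the lxml module.

To find the titles, you can use XPath with lxml, or you can use lxml's XML object structure to "index" your way down to the title elements.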

2012-02-28 18:48:54 aweis

Try lxml with XPath expressions.

A small snippet:

>>> from lxml import etree 
>>> xml = """<foo><bar/>baz!</foo>""" 
>>> doc = etree.fromstring(xml) 
>>> doc.xpath('//foo/text()') #xpath expr 
['baz!'] 
>>>

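If you have an XML file, then:

from StringIO import StringIO  # Python 2; on Python 3 use io.StringIO instead
s = StringIO(xml)  # stands in for an open file; etree.parse also accepts a filename
doc = etree.parse(s)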

You can use the Firebug add-on to get the XPath expression.

2012-02-28 18:53:57 RanRag

ElementTree is excellent, and it comes with Python.
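
As a rough illustration of how ElementTree could produce the separate lists the question asks for, here is a small sketch. The sample string is a shortened stand-in for the question's XML, and the list names are my own. It assumes the elements of interest can be identified by tag name alone, which `Element.iter` (available from Python 2.7) handles.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the shape of the question's XML (illustrative only).
SAMPLE = """<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <Article PubModel="Print">
      <Journal>
        <Title>American journal of human genetics</Title>
      </Journal>
      <ArticleTitle>Healthy Foo, Healthy Bar</ArticleTitle>
    </Article>
    <MeshHeadingList>
      <MeshHeading>
        <DescriptorName MajorTopicYN="N">Alleles</DescriptorName>
      </MeshHeading>
      <MeshHeading>
        <DescriptorName MajorTopicYN="Y">Y Chromosome</DescriptorName>
      </MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</PubmedArticle>"""

# For a file on disk: root = ET.parse("some_file.xml").getroot()
root = ET.fromstring(SAMPLE)

# One list per kind of element; iter() walks the whole subtree in document order.
journal_titles = [el.text for el in root.iter("Title")]
article_titles = [el.text for el in root.iter("ArticleTitle")]
mesh_terms = [el.text for el in root.iter("DescriptorName")]
```

Attributes are available through `el.get(...)`, so keeping only the major MeSH topics, for example, would be a matter of filtering on `el.get("MajorTopicYN") == "Y"`.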

2012-02-28 19:23:00 01100110

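I'm not sure why you would want each title in its own list, which is what your question leads me to believe.

How about all the titles in one list? The following example uses a trimmed version of your sample XML, plus an <Article/> that I duplicated, to show how lxml.etree's xpath builds the list of <Title/> elements for you: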

>>> import lxml.etree 

>>> xml_text = """<PubmedArticle> 
    <MedlineCitation Owner="NLM" Status="MEDLINE"> 
    <PMID Version="1">8981971</PMID> 
    <Article PubModel="Print"> 
     <Journal> 
     <ISSN IssnType="Print">0002-9297</ISSN> 
     <!-- <JournalIssue ... /> --> 
     <Title>American journal of human genetics</Title> 
     <ISOAbbreviation>Am. J. Hum. Genet.</ISOAbbreviation> 
     </Journal> 
     <ArticleTitle>mtDNA and Y chromosome-specific polymorphisms in modern Ojibwa: implications about the origin of their gene pool.</ArticleTitle> 
     <!--<Pagination> 
      ... 
      </MeshHeadingList>--> 
     <OtherID Source="NLM">PMC1712541</OtherID> 
    </Article> 
    <Article PubModel="Print"> 
     <Journal> 
     <ISSN IssnType="Print">9297-0002</ISSN> 
     <!-- <JournalIssue ... /> --> 
     <Title>American Journal of Pediatrics</Title> 
     <ISOAbbreviation>Am. J. Ped.</ISOAbbreviation> 
     </Journal> 
     <ArticleTitle>Healthy Foo, Healthy Bar</ArticleTitle> 
     <!--<Pagination> 
      ... 
      </MeshHeadingList>--> 
     <OtherID Source="NLM">PMC1712541</OtherID> 
    </Article> 
    </MedlineCitation> 
</PubmedArticle>"""

The nodes matched by the XPath expression are returned by lxml.etree's xpath as a Python list:

>>> xml_obj = lxml.etree.fromstring(xml_text) 
>>> for title_obj in xml_obj.xpath('//Article/Journal/Title'): 
     print title_obj.text 

American journal of human genetics 
American Journal of Pediatrics

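Edit 1: Now with Python's xml.etree.ElementTree

I wanted to show this solution using the bundled module, in case installing a third-party module is not feasible or is unappealing.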

>>> import xml.etree.ElementTree as ETree 
>>> element = ETree.fromstring(xml_text) 
>>> xml_obj = ETree.ElementTree(element) 
>>> for title_obj in xml_obj.findall('.//Article/Journal/Title'): 
    print title_obj.text 


American journal of human genetics 
American Journal of Pediatrics

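This is minor, but this XPath is not identical to the one in the lxml example: it has a period (".") at the beginning. Without the period, I got this warning (with Python 2.7.2):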

>>> xml_obj.findall('//Article/Journal/Title') 

Warning (from warnings module): 
    File "__main__", line 1 
FutureWarning: This search is broken in 1.3 and earlier, and will be fixed in a future version. If you rely on the current behaviour, change it to './/Article/Journal/Title'
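
To make the difference between the two path forms concrete, here is a small sketch using only the standard library. The sample string is a shortened stand-in for the XML above, and the variable names are my own. As far as I know, `ElementTree.findall` still rewrites a leading "/" to "./" and issues the FutureWarning in current Python versions, but treat that as an assumption rather than a guarantee.

```python
import warnings
import xml.etree.ElementTree as ET

# Shortened stand-in for the two-article sample (illustrative only).
SAMPLE = """<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <Article PubModel="Print">
      <Journal><Title>American journal of human genetics</Title></Journal>
    </Article>
    <Article PubModel="Print">
      <Journal><Title>American Journal of Pediatrics</Title></Journal>
    </Article>
  </MedlineCitation>
</PubmedArticle>"""

tree = ET.ElementTree(ET.fromstring(SAMPLE))

# Relative form: search from the root element downwards, no warning.
relative = [el.text for el in tree.findall(".//Article/Journal/Title")]

# Leading-slash form: capture any warnings instead of printing them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    absolute = [el.text for el in tree.findall("//Article/Journal/Title")]

# True if the leading-slash search triggered a FutureWarning on this Python.
warned = any(issubclass(w.category, FutureWarning) for w in caught)
```

Writing the path with the leading period from the start avoids depending on how a given version treats the absolute form.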

2012-02-28 20:31:33

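I finally got around to looking at all of the posted answers. Thank you for your efforts! I was able to install the lxml library without a problem, but I had a hell of a time getting through the documentation. At that point I couldn't wrap my head around it. I found BeautifulSoup's documentation easier to work with. – oaxacamatt 2012-03-18 20:21:38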
