Parsing an XML file with Python

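So far I have been able to query and receive an HTTP RSS page, convert it to a .txt file, and query the elements in the XML with minidom.

The next step is to create a selective list of links that meet my requirements.

Here is an example XML file with a structure similar to my file: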

<xml> 
    <Document name = "example_file.txt"> 
     <entry id = "1"> 
      <link href="http://wwww.examplesite.com/files/test_image_1_Big.jpg"/> 
     </entry> 
     <entry id = "2"> 
      <link href="http://wwww.examplesite.com/files/test_image_1.jpg"/> 
     </entry> 
     <entry id = "3"> 
      <link href="http://wwww.examplesite.com/files/test_image_1_Small.jpg"/> 
     </entry> 
     <entry id = "4"> 
      <link href="http://wwww.examplesite.com/files/test_image_1.png"/> 
     </entry> 
     <entry id = "5"> 
      <link href="http://wwww.examplesite.com/files/test_image_2_Big.jpg"/> 
     </entry> 
     <entry id = "6"> 
      <link href="http://wwww.examplesite.com/files/test_image_2.jpg"/> 
     </entry> 
     <entry id = "7"> 
      <link href="http://wwww.examplesite.com/files/test_image_2_Small.jpg"/> 
     </entry> 
     <entry id = "8"> 
      <link href="http://wwww.examplesite.com/files/test_image_2.png"/> 
     </entry> 
    </Document> 
</xml>

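With minidom I can get it down to a list of just the links, but I think I can skip that step if I can build the list from text-search parameters. I do not want all of the links, only these: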

http://wwww.examplesite.com/files/test_image_1.jpg 
http://wwww.examplesite.com/files/test_image_2.jpg

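Being new to Python, I don't know how to say "only grab the links that do not have ".png", "Big", or "Small" in the link name".

My end goal is to have Python download these files one at a time. Would a list be the best fit for this?

To make this more complicated, I am limited to the stock library of Python 2.6, so I cannot use any of the excellent third-party APIs.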

2013-12-12 Michael

With lxml and cssselect this is easy:

from pprint import pprint 


import cssselect # noqa 
from lxml.html import fromstring 


doc = fromstring(open("foo.html", "r").read()) 
links = [e.attrib["href"] for e in doc.cssselect("link")] 
pprint(links)

Output:

['http://wwww.examplesite.com/files/test_image_1_Big.jpg', 
'http://wwww.examplesite.com/files/test_image_1.jpg', 
'http://wwww.examplesite.com/files/test_image_1_Small.jpg', 
'http://wwww.examplesite.com/files/test_image_1.png', 
'http://wwww.examplesite.com/files/test_image_2_Big.jpg', 
'http://wwww.examplesite.com/files/test_image_2.jpg', 
'http://wwww.examplesite.com/files/test_image_2_Small.jpg', 
'http://wwww.examplesite.com/files/test_image_2.png']

If you only want two of the links (which two?):

links = links[:2]

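This is called slicing in Python.

Being new to Python, I don't know how to say "only grab the links that do not have ".png", "Big", or "Small" in the link name". Any help would be great.

You can filter the list as follows: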

doc = fromstring(open("foo.html", "r").read()) 
links = [e.attrib["href"] for e in doc.cssselect("link")] 
predicate = lambda l: not any([s in l for s in ("png", "Big", "Small")]) 
links = [l for l in links if predicate(l)] 
pprint(links)

This will give you:

['http://wwww.examplesite.com/files/test_image_1.jpg', 
'http://wwww.examplesite.com/files/test_image_2.jpg']
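
The filtering step itself does not depend on lxml, so it can also be written as a small named function. The following is only a sketch: the names EXCLUDED and wanted are made up for illustration, and the list of links is copied from the sample document in the question. It tests only the file name (the last path segment) and uses ".png" rather than "png", on the assumption that the host or directory part of a real URL might contain one of the markers by accident.

```python
# Hypothetical names for illustration; not part of lxml or the original answer.
EXCLUDED = (".png", "Big", "Small")


def wanted(link, excluded=EXCLUDED):
    """Return True when none of the excluded markers occur in the file name."""
    filename = link.rsplit("/", 1)[-1]  # keep only the last path segment
    return not any(marker in filename for marker in excluded)


links = [
    "http://wwww.examplesite.com/files/test_image_1_Big.jpg",
    "http://wwww.examplesite.com/files/test_image_1.jpg",
    "http://wwww.examplesite.com/files/test_image_1_Small.jpg",
    "http://wwww.examplesite.com/files/test_image_1.png",
    "http://wwww.examplesite.com/files/test_image_2_Big.jpg",
    "http://wwww.examplesite.com/files/test_image_2.jpg",
    "http://wwww.examplesite.com/files/test_image_2_Small.jpg",
    "http://wwww.examplesite.com/files/test_image_2.png",
]

# A plain list keeps the order of the document and can be looped over later.
selected = [link for link in links if wanted(link)]
print(selected)
```

Only the two plain .jpg links pass the check. A list is a reasonable container here, since the files can then be handled one at a time in a simple for loop.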

2013-12-12 02:37:20

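I was able to get this far, returning a dictionary. I only want to print the two links I listed in the original post, and I don't know how to apply the logic to get just those two. – Michael

Updated the answer to include this. –

Being new to Python, I don't know how to say "only grab the links that do not have ".png", "Big", or "Small" in the link name". Any help would be great! – Michael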

import re 
from xml.dom import minidom 

_xml = '''<?xml version="1.0" encoding="utf-8"?> 
<xml > 
    <Document name="example_file.txt"> 
     <entry id="1"> 
      <link href="http://wwww.examplesite.com/files/test_image_1_Big.jpg"/> 
     </entry> 
     <entry id="2"> 
      <link href="http://wwww.examplesite.com/files/test_image_1.jpg"/> 
     </entry> 
     <entry id="3"> 
      <link href="http://wwww.examplesite.com/files/test_image_1_Small.jpg"/> 
     </entry> 
     <entry id="4"> 
      <link href="http://wwww.examplesite.com/files/test_image_1.png"/> 
     </entry> 
     <entry id="5"> 
      <link href="http://wwww.examplesite.com/files/test_image_2_Big.jpg"/> 
     </entry> 
     <entry id="6"> 
      <link href="http://wwww.examplesite.com/files/test_image_2.jpg"/> 
     </entry> 
     <entry id="7"> 
      <link href="http://wwww.examplesite.com/files/test_image_2_Small.jpg"/> 
     </entry> 
     <entry id="8"> 
      <link href="http://wwww.examplesite.com/files/test_image_2.png"/> 
     </entry> 
    </Document> 
</xml> 
''' 

doc = minidom.parseString(_xml)  # minidom.parse(your_file_path) gives the same result
entries = doc.getElementsByTagName('entry')
# Generator over the href attribute of the first <link> in each <entry>
link_ref = (
    entry.getElementsByTagName('link').item(0).getAttribute('href')
    for entry in entries
)
plain_jpg = re.compile(r'.*\.jpg$')  # adjust this regex to your needs
result = (link for link in link_ref if plain_jpg.match(link))
print(list(result))  # the parenthesized form also works on Python 2.6

This code gives the result [u'http://wwww.examplesite.com/files/test_image_1_Big.jpg', u'http://wwww.examplesite.com/files/test_image_1.jpg', u'http://wwww.examplesite.com/files/test_image_1_Small.jpg', u'http://wwww.examplesite.com/files/test_image_2_Big.jpg', u'http://wwww.examplesite.com/files/test_image_2.jpg', u'http://wwww.examplesite.com/files/test_image_2_Small.jpg']. Note that the _Big and _Small links are still included, because the pattern only checks for a .jpg suffix.
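
To narrow the result to the two plain .jpg links, the pattern can be tightened. The sketch below is one possible pattern, not the answer author's: it uses negative lookbehinds to reject names ending in _Big.jpg or _Small.jpg. SAMPLE is a shortened copy of the sample document, and the variable names are illustrative assumptions.

```python
import re
from xml.dom import minidom

# Shortened copy of the sample document from the question.
SAMPLE = """<xml>
  <Document name="example_file.txt">
    <entry id="1"><link href="http://wwww.examplesite.com/files/test_image_1_Big.jpg"/></entry>
    <entry id="2"><link href="http://wwww.examplesite.com/files/test_image_1.jpg"/></entry>
    <entry id="3"><link href="http://wwww.examplesite.com/files/test_image_1_Small.jpg"/></entry>
    <entry id="4"><link href="http://wwww.examplesite.com/files/test_image_1.png"/></entry>
  </Document>
</xml>"""

doc = minidom.parseString(SAMPLE)

# Collect the href attribute of every <link> element in the document.
hrefs = [node.getAttribute("href") for node in doc.getElementsByTagName("link")]

# Must end in .jpg, but the text right before ".jpg" may be neither
# "_Big" nor "_Small" (two fixed-width negative lookbehinds).
plain_jpg = re.compile(r".*(?<!_Big)(?<!_Small)\.jpg$")

plain_links = [href for href in hrefs if plain_jpg.match(href)]
print(plain_links)
```

With this sample only test_image_1.jpg remains. The .png link is rejected by the suffix check, and the two size variants are rejected by the lookbehinds.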

But we can do better with xml.etree.ElementTree. etree is faster, uses less memory, and has a smarter interface.

etree is bundled in the standard library.
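
As a rough illustration of the ElementTree route, here is a minimal sketch that extracts the links and applies the same substring rule from the question. SAMPLE is a shortened copy of the sample document, and EXCLUDED and the other variable names are assumptions made for this example. It uses findall(".//link") rather than iter(), on the assumption that the code should also run on Python 2.6, where Element.iter() is not yet available.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document from the question.
SAMPLE = """<xml>
  <Document name="example_file.txt">
    <entry id="5"><link href="http://wwww.examplesite.com/files/test_image_2_Big.jpg"/></entry>
    <entry id="6"><link href="http://wwww.examplesite.com/files/test_image_2.jpg"/></entry>
    <entry id="7"><link href="http://wwww.examplesite.com/files/test_image_2_Small.jpg"/></entry>
    <entry id="8"><link href="http://wwww.examplesite.com/files/test_image_2.png"/></entry>
  </Document>
</xml>"""

EXCLUDED = (".png", "Big", "Small")  # markers the question wants to avoid

# ET.parse(path).getroot() would read the document from a file instead.
root = ET.fromstring(SAMPLE)

# ".//link" matches <link> elements at any depth below the root.
hrefs = [link.get("href") for link in root.findall(".//link")]

# Keep a link only if none of the markers occurs anywhere in it.
etree_links = [
    href for href in hrefs
    if not any(marker in href for marker in EXCLUDED)
]
print(etree_links)
```

For this sample the list contains only test_image_2.jpg. The resulting list can then be iterated to process the files one by one.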

2013-12-12 03:19:59

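from feedparser import parse

data = parse("foo.html")
for elem in data['entries']:
    if 'link' in elem.keys():
        print(elem['link'])

The "feedparser" library parses the XML content. Note that it is a third-party package, not part of the standard library.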

2017-08-02 11:59:19
