Python 3 Beautiful Soup: find a tag with a colon in its name

I want to scrape this website and get two separate tags. This is what the HTML looks like.

<url> 
    <loc> 
    http://link.com 
    </loc> 
    <lastmod>date</lastmode> 
    <changefreq>daily</changefreq> 
    <image:image> 
    <image:loc> 
    https://imagelink.com 
    <image:loc> 
    <image:title>Item title</image:title> 
    <image:image> 
</url>

The tags I'm trying to get are loc and image:title. The problem I'm having is the colon in the title tag. The code I have so far is:

import requests
from bs4 import BeautifulSoup

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

for item in soup.find_all('url'):
    print(item.loc)
    # print image title

I also tried this:

print(item.title)

but it doesn't work.

Source

2016-10-08 Ryan Bautista

This is XML, not HTML, and it is one node with a namespace, not two nodes. Where did you get it from?

You should parse it in "xml" mode instead (this requires lxml to be installed as well):

from bs4 import BeautifulSoup 

data = """ 
<url> 
    <loc> 
    http://link.com 
    </loc> 
    <lastmod>date</lastmod> 
    <changefreq>daily</changefreq> 
    <image:image> 
    <image:loc> 
    https://imagelink.com 
    </image:loc> 
    <image:title>Item title</image:title> 
    </image:image> 
</url>""" 

soup = BeautifulSoup(data, 'xml') 

for item in soup.find_all('url'): 
    print(item.title.get_text())

This prints Item title.

Note that I've applied several fixes to the XML string, since it was originally not well-formed.
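
As an alternative sketch for when lxml is not available: with the built-in 'html.parser', Beautiful Soup keeps the colon as part of the tag name, so the tag can be looked up by its full name with find('image:title'). This assumes the same corrected, well-formed snippet as above. The results list is only there for illustration and is not part of the question's code.

```python
from bs4 import BeautifulSoup

# Same corrected snippet as above (assumed well-formed).
data = """
<url>
    <loc>
    http://link.com
    </loc>
    <lastmod>date</lastmod>
    <changefreq>daily</changefreq>
    <image:image>
    <image:loc>
    https://imagelink.com
    </image:loc>
    <image:title>Item title</image:title>
    </image:image>
</url>"""

# 'html.parser' is not namespace-aware: "image:title" is simply the tag name.
soup = BeautifulSoup(data, "html.parser")

results = []  # illustrative container, not in the original code
for item in soup.find_all("url"):
    # find("loc") matches only the tag named exactly "loc", not "image:loc"
    loc = item.find("loc").get_text(strip=True)
    title = item.find("image:title").get_text(strip=True)
    results.append((loc, title))
    print(loc, title)
```

Attribute access such as item.title cannot express a name containing a colon, which is why find() with the full string is used here.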

Source

2016-10-08 15:52:38 alecxe
