How to scrape a web page that lacks tags using BeautifulSoup

I want to scrape data from this web page, which lacks tags: http://www.kitco.com/texten/texten.html

Here is the code I am using:

import requests
from bs4 import BeautifulSoup

url = "http://www.kitco.com/texten/texten.html"
r = requests.get(url)

# Doing this to force UTF-8 encoding. Not sure if this is needed...
r.encoding = "UTF-8"

soup = BeautifulSoup(r.content)
tag = soup.find_all("London Fix")
print(tag)

As you can see by viewing the page source, the term "London Fix" is not inside any tag of its own. I don't know whether this is CDATA or something else...

Any ideas on how to parse these tables?


2014-08-29 Jeffrey Stilwell

If you are using r.content, you really don't need to set r.encoding. Using r.content is exactly right, by the way. – 2014-08-29 17:20:21

I think this is too broad, but I could also justify "unclear what you're asking", because you did not specify the output you expect. – 2014-08-29 17:21:45

I suggest you read the [BeautifulSoup documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) a little more carefully, and also look at what `soup.find_all()` actually *does*. – 2014-08-29 17:22:26

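As @shaktimaan pointed out in the comments, the "London Fix" table is not a real table: it sits inside a pre tag, with the rows laid out as plain text and separated by lines of dashes.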
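To make that concrete, here is a minimal offline sketch. The HTML string is a hypothetical, heavily simplified stand-in for the real page (the heading text and the shortened rows are made up for illustration), and it uses the built-in `html.parser` so it runs without a network connection. It shows why `find_all("London Fix")` comes back empty: a plain string argument is matched against tag names, not against text.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the real page's markup.
html = """<html><body><pre>
<font size="4"><b>Some heading</b></font>
----------------
London Fix   GOLD
----------------
Aug 29,2014 1285.75
</pre></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# A plain string passed to find_all() is matched against tag *names*,
# so this looks for <London Fix> elements and finds none.
no_tags = soup.find_all("London Fix")
print(len(no_tags))  # 0

# There is no <table> element either; the "table" is just text inside <pre>.
print(soup.find("table"))  # None

# The text is still reachable as ordinary string content of the <pre> tag.
pre_text = soup.pre.get_text()
print("London Fix" in pre_text)  # True
```

So the task is really about locating a text node inside the pre tag, not about finding a tag.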

One option is to find the font tag that precedes the table and take its .next_sibling:

import requests
from bs4 import BeautifulSoup

url = "http://www.kitco.com/texten/texten.html"
r = requests.get(url)

soup = BeautifulSoup(r.content)
# The table text is the node that immediately follows the size-4 font tag inside <pre>
print(soup.body.pre.find('font', size="4").next_sibling.strip())

This prints:

--------------------------------------------------------------------------------
London Fix         GOLD        SILVER       PLATINUM         PALLADIUM
                  AM       PM                AM       PM       AM       PM
--------------------------------------------------------------------------------
Aug 29,2014  1285.75  1285.75  19.4700  1424.00  1424.00   895.00       NA
Aug 28,2014  1288.00  1292.00  19.7500  1425.00  1428.00   897.00   898.00
--------------------------------------------------------------------------------
...

Another option is to search by text (this produces the same output):

import re

# Reuses the soup object built above; matches the first text node containing "London Fix"
print(soup.body.pre.find(text=re.compile('London Fix')))
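Either way you end up with one block of plain text, so the rows still have to be split by hand. Below is a rough sketch of one way to do that, not a definitive parser. The function name `parse_london_fix` and the column names are my own assumptions, inferred from the header (GOLD AM/PM, SILVER, PLATINUM AM/PM, PALLADIUM AM/PM); it also assumes that every data row starts with a date such as `Aug 29,2014` followed by seven values, that `NA` marks a missing value, and that month abbreviations parse under an English locale.

```python
import re
from datetime import datetime

# Assumed column order, inferred from the header of the printed block.
COLUMNS = [
    "gold_am", "gold_pm", "silver",
    "platinum_am", "platinum_pm",
    "palladium_am", "palladium_pm",
]

# A data row starts with a date such as "Aug 29,2014".
ROW_RE = re.compile(r"\s*([A-Z][a-z]{2} \d{1,2},\d{4})\s+(.*)")


def parse_london_fix(block):
    """Turn the plain-text block into a list of dicts, one per data row."""
    rows = []
    for line in block.splitlines():
        match = ROW_RE.match(line)
        if match is None:
            continue  # header, separator, or blank line
        values = match.group(2).split()
        if len(values) != len(COLUMNS):
            continue  # unexpected layout; skip rather than guess
        row = {"date": datetime.strptime(match.group(1), "%b %d,%Y").date()}
        for name, raw in zip(COLUMNS, values):
            row[name] = None if raw == "NA" else float(raw)
        rows.append(row)
    return rows


# Sample text copied from the output shown in this answer.
sample = """\
--------------------------------------------------------------------------------
London Fix   GOLD   SILVER  PLATINUM   PALLADIUM
       AM  PM     AM  PM   AM  PM
--------------------------------------------------------------------------------
Aug 29,2014 1285.75 1285.75 19.4700 1424.00 1424.00 895.00 NA
Aug 28,2014 1288.00 1292.00 19.7500 1425.00 1428.00 897.00 898.00
--------------------------------------------------------------------------------
"""

rows = parse_london_fix(sample)
for row in rows:
    print(row["date"], row["gold_am"], row["palladium_pm"])
```

On the live page you would pass in the string obtained from `.next_sibling` or from the text search instead of `sample`. In Beautiful Soup 4.4.0 and later, the `text=` argument used in the search is also available under the name `string=`.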


2014-08-29 17:46:17 alecxe
