How to retrieve the values of dynamic HTML content using Python

I'm using Python 3 and trying to retrieve data from a website. However, the data is loaded dynamically, and the code I have right now doesn't work:

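from urllib import request

# eveCentralBaseURL and mineral are defined elsewhere in the program
url = eveCentralBaseURL + str(mineral)
print("URL : %s" % url)

response = request.urlopen(url)
# decode the first 10000 bytes instead of calling str() on the bytes object
data = response.read(10000).decode("utf-8", errors="replace")
print(data)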

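When I try to find a particular value, I find a template placeholder such as "{{formatPrice median}}" instead of "4.48".

How can I make it retrieve the value instead of the placeholder text?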

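Edit: This is the specific page I'm trying to extract information from. I'm trying to get the "median" value, which uses the template {{formatPrice median}}.

Edit 2: I've installed Selenium and BeautifulSoup and set up my program to use them.

The code I have now is: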

from bs4 import BeautifulSoup
from selenium import webdriver

#...

driver = webdriver.Firefox()
driver.get(url)

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")

print("Finding...")

for tag in soup.find_all('formatPrice median'):
    print(tag.text)

Here is a screenshot of the program as it executes. Unfortunately, it doesn't seem to find anything with "formatPrice median" specified.

Source

2013-07-11 Tagc

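When you visit the URL in a browser, do you get the template tags? Edit: also, how are your templates rendered? If you're using a JavaScript templating engine (e.g. Handlebars), that probably means you will get the template tags in the response.

Re edit 2: that's really a new question... Anyway, I think you need to look at the documentation for find_all, because the string you pass to find_all isn't valid. I'll update my answer below with something closer to what you need: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#arg-name

Cheers! I tried using soup.find_all(True) to get all the tags, and the information I need is in there! Now it's just a matter of finding which tag I need to search for to get that information. – Tagc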

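Assuming you are trying to get values from a page that is rendered with JavaScript templates (for instance something like Handlebars), then placeholder text is what you will get with any of the standard solutions (i.e. BeautifulSoup or requests).

This is because the browser uses JavaScript to alter the content it received and create new DOM elements. urllib does the requesting part like a browser, but not the template rendering part. A good description of the issues can be found here. The article discusses three main solutions: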

parse the AJAX JSON directly
use an offline JavaScript interpreter to process the request (SpiderMonkey, crowbar)
use a browser automation tool (splinter)

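This answer provides a few more suggestions for option 3, such as selenium or watir. I've used selenium for automated web testing and it's pretty handy.

EDIT

From your comments it looks like it is a Handlebars-driven site. I'd recommend selenium and Beautiful Soup. This answer gives a good code example which may be useful: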

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')

html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, "html.parser")

# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific CSS class
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
    print(tag.text)

Basically selenium gets the rendered HTML from the browser, and then you can parse it with BeautifulSoup from the page_source property. Good luck :)

Source
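One thing that may matter in practice: page_source can be read before the JavaScript templates have finished rendering. A minimal sketch of waiting for an element first is shown here. The tag name and CSS class ("td", "median") are hypothetical, as are the helper names; the real ones have to come from inspecting the rendered page.

```python
from bs4 import BeautifulSoup


def extract_texts(html, tag_name, css_class):
    """Return the stripped text of every <tag_name class="css_class"> element."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True)
            for tag in soup.find_all(tag_name, class_=css_class)]


def fetch_rendered_html(url, tag_name, css_class, timeout=15):
    """Load url in Firefox and return the HTML once the target element exists."""
    # Selenium is imported here so extract_texts() can be tried without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Block until JavaScript has inserted at least one matching element.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, f"{tag_name}.{css_class}")
            )
        )
        return driver.page_source
    finally:
        driver.quit()


if __name__ == "__main__":
    # Hypothetical markup standing in for the rendered page.
    sample = '<table><tr><td class="median">4.48</td></tr></table>'
    print(extract_texts(sample, "td", "median"))
```

With a real page, the two helpers would be chained: extract_texts(fetch_rendered_html(url, "td", "median"), "td", "median"). If the element never appears, WebDriverWait raises a TimeoutException rather than returning half-rendered HTML.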

2013-07-11 17:35:44

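Thanks for your help. I have very little experience with web languages or web-based programming, but if it helps, I'll link the website I'm trying to parse data from. – Tagc

I'll start looking into requests and beautifulsoup. – Tagc

I had a look at the website - it nearly broke my computer a few times while loading :) Yes, if you're in Chrome, hit F12 and go to the "Network" tab: you'll see 'Backbone', 'Underscore' and 'Handlebars' all being loaded. I think you're going to have to go with the 'selenium' approach. I'll edit my answer with some example code.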
