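Problem: reliably scraping a stock price table
My goal is to automatically scrape the table of currency prices from this stock prices website. Since the broker does not provide an API, I have to find a workaround.
To avoid reinventing the wheel and wasting time and money, I looked for an existing application for this, but unfortunately I did not find one that works with this website.
I have tried: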
R with rvest
R is known for being simple and straightforward to use. Here is the code, which is basically a copy-paste of a textbook example:
library("rvest")
url <- "https://iqoption.com/en/historical-financial-quotes?active_id=1&tz_offset=120&date=2016-12-19-19-0"
population <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="mCSB_3_container"]/table') %>%
html_table()
population
population <- population[[1]]
head(population)
This returns an empty table.
The other tools I tried, each described further down:
- JavaScript with CasperJS
- JavaScript with PhantomJS
- Python with BeautifulSoup
- pandas with its read_html()
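- Could you explain why I get an empty table with so many different web scraping and HTML parsing tools?
- What is the most reliable way to scrape this particular stock price website?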
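JavaScript with CasperJS: this option is the best so far. I was actually able to extract data, but it is very slow and eventually crashes with an "out of memory" error: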
var casper = require('casper').create({
logLevel:'debug',
verbose:true,
loadImages: false,
loadPlugins: false,
webSecurityEnabled: false,
userAgent: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11"
});
var url = 'https://eu.iqoption.com/en/historical-financial-quotes?active_id=1&tz_offset=60&date=2016-12-19-21-0';
var length;
var fs = require('fs');
var sep = ';';
//var count = 0;
casper.start(url);
//date
var today = new Date();
var dd = today.getDate();
var mm = today.getMonth()+1; //January is 0!
var hh = today.getHours();
var fff = today.getMilliseconds();
var MM = today.getMinutes();
var yyyy = today.getFullYear();
if(dd<10){
dd='0'+dd;
}
if(mm<10){
mm='0'+mm;
}
var today = yyyy +'_'+mm + '_' +dd + '_'+ hh +'_'+ MM +'_'+ fff;
casper.echo(today);
// Return the trimmed text of one cell, addressed by row and cell index.
function getCellContent(row, cell) {
return casper.evaluate(function(row, cell) {
return document.querySelectorAll('table tbody tr')[row].childNodes[cell].innerText.trim();
}, row, cell);
}
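// Count the table rows on the current page. casper.evaluate() can only pass
// serializable values back, so return the row count rather than the NodeList.
function moveNext()
{
length = this.evaluate(function() {
return document.querySelectorAll('table tbody tr').length;
});
this.echo("table length: " + length);
}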
//get 3 tables
for (var mins = 0; mins < 3; mins++)
{
url = 'https://eu.iqoption.com/en/historical-financial-quotes?active_id=1&tz_offset=60&date=2016-12-19-21-' + mins;
casper.echo(url);
casper.thenOpen(url);
casper.then(function() {
this.waitForSelector('#mCSB_3_container table tbody tr');
});
casper.then(moveNext);
casper.then(function() {
for (var i = 0; i < length; i++)
{
//this.echo("Date: " + getCellContent(i, 0));
//this.echo("Bid: " + getCellContent(i, 1));
//this.echo("Ask: " + getCellContent(i, 2));
//this.echo("Quotes: " + getCellContent(i, 4));
fs.write('prices_'+today+'.csv', getCellContent(i, 0) + sep + getCellContent(i, 1) + sep + getCellContent(i, 2) + sep + getCellContent(i, 4) + "\n", "a");
}
});
}
// Print the final message once all queued steps are done, then exit.
casper.run(function() {
this.echo("finished with processing");
this.exit();
});
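The loop in the script above requests one page per minute by changing the last field of the date parameter. The same URL pattern can be sketched in Python as below. This is only an illustration: the helper name quote_page_url is made up for this example, and the date layout (year-month-day-hour-minute, with no zero padding added) is inferred from the URLs in this post, not from any documentation.

```python
from urllib.parse import urlencode

# Base address taken from the URLs used in this post.
BASE_URL = "https://eu.iqoption.com/en/historical-financial-quotes"


def quote_page_url(active_id, tz_offset, year, month, day, hour, minute):
    """Build the URL of one page of quotes (hypothetical helper).

    Assumption: the date parameter is year-month-day-hour-minute and the
    values are passed through as plain integers, as in the URLs above.
    """
    date = "-".join(str(part) for part in (year, month, day, hour, minute))
    query = urlencode(
        {"active_id": active_id, "tz_offset": tz_offset, "date": date}
    )
    return BASE_URL + "?" + query


# Same three pages as the CasperJS loop: minutes 0, 1 and 2.
urls = [quote_page_url(1, 60, 2016, 12, 19, 21, minute) for minute in range(3)]
for url in urls:
    print(url)
```

Keeping the URL construction in one function makes it easier to change the asset, time zone offset, or time range without touching the scraping logic.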
JavaScript with PhantomJS: with this option I only get a single table:
var webPage = require('webpage');
var page = webPage.create();
page.open('https://iqoption.com/en/historical-financial-quotes?active_id=1&tz_offset=120&date=2016-12-19-19-0', function(status) {
var title = page.evaluate(function() {
return document.querySelectorAll('table tbody tr');
});
});
Python with BeautifulSoup: this returns an empty table:
from bs4 import BeautifulSoup
from urllib2 import urlopen
url = "https://iqoption.com/en/historical-financial-quotes?active_id=1&tz_offset=120&date=2016-12-19-19-0"
soup = BeautifulSoup(urlopen(url), "lxml")
table = soup.findAll('table', attrs={ "class" : "quotes-table-result"})
print("table length is: "+ str(len(table)))
I also tried the Scrapy shell, but got an empty table.
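One check that might narrow this down is to count the table rows present in the HTML exactly as the server sends it, before any JavaScript runs. The sketch below uses only the Python standard library. The sample markup is hypothetical and stands in for a downloaded page; whether the real page behaves this way is an assumption I have not confirmed.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count <tr> elements that appear inside any <table>."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1
        elif tag == "tr" and self.table_depth > 0:
            self.rows += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1


def count_table_rows(html):
    """Return the number of table rows found in a static HTML string."""
    parser = RowCounter()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page: the table exists, but its rows would only be added
# by a script in the browser, so a static parser sees none.
EMPTY_SHELL = """
<html><body>
<table class="quotes-table-result"><tbody></tbody></table>
<script>/* rows would be inserted here at run time */</script>
</body></html>
"""

# Hypothetical page where the rows are part of the served HTML.
STATIC_TABLE = "<table><tr><td>1.04</td></tr><tr><td>1.05</td></tr></table>"

print(count_table_rows(EMPTY_SHELL))
print(count_table_rows(STATIC_TABLE))
```

If the count for the downloaded page is zero while the browser shows rows, that would suggest the rows are not in the served HTML, which would be consistent with every static parser above returning an empty table.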
With pandas
I get the following error:
ValueError: No tables found matching pattern '.+'
The code:
import pandas as pd
import html5lib
f_states = pd.read_html("https://iqoption.com/en/historical-financial-quotes?active_id=1&tz_offset=120&date=2016-12-19-19-0")
print(f_states)
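Regarding the questions above:
Note: it is possible that the website is trying to block web scraping. I looked at robots.txt, but it appears to contain only specific instructions for particular crawlers such as Googlebot.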
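For reference, robots.txt rules can also be checked programmatically instead of by eye. The sketch below uses urllib.robotparser from the Python standard library. The rules in SAMPLE_ROBOTS are made up for illustration and are not the contents of this site's robots.txt; the helper name is_allowed and the user agent "MyScraper" are hypothetical as well.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only, not the real file of any site.
SAMPLE_ROBOTS = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /en/historical-financial-quotes
"""


def is_allowed(robots_text, user_agent, url):
    """Return True if robots_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


page = "https://example.com/en/historical-financial-quotes?active_id=1"
print(is_allowed(SAMPLE_ROBOTS, "Googlebot", page))
print(is_allowed(SAMPLE_ROBOTS, "MyScraper", page))
```

Note that robots.txt only states what crawlers are asked to do. It does not by itself explain an empty table, since a site that ignores or blocks a scraper does so by other means.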
Try Python with Selenium: http://selenium-python.readthedocs.io/installation.html – Prabhakar
Try Scrapy + Splash in Python. @Prabhakar Selenium is good, but it is too slow. – parik
Python + pandas read_html is also good. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html –