2015-04-25

Scoring a very large dataset

Using R/Python on a 1-2% sample of the data, I have fitted a machine-learning classifier, and I am quite happy with the accuracy measures (precision, recall and F-score).

Now I want to score a huge database, 70 million rows, with this classifier, which is coded in R.

Information about the dataset, which resides in a Hadoop/Hive environment:

70万元X 40个变量(列):大约18个变量是分类的,其余22个是数字(包括整数)

How should I go about this? Any suggestions?

Things I have thought of doing:

a) Chunk the data out of the Hadoop system as CSV files in 1M-row increments and feed them to R

b) Some type of batch processing.

It is not a real-time system, so it does not need to run daily, but I would still like to score it within 2-3 hours.
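Option (a), chunked scoring of CSV exports, could look roughly like the sketch below (Python, standard library only; `score_row` is a hypothetical stand-in for the R classifier, which would be invoked per chunk in the real pipeline):

```python
import csv
from typing import Iterator, List

def score_row(row: List[str]) -> float:
    """Hypothetical stand-in for the R classifier: in the real pipeline
    this would hand the row (or a whole chunk) to the fitted model."""
    return float(len(row))  # toy rule: score = number of fields

def score_in_chunks(path: str, chunk_size: int = 1_000_000) -> Iterator[List[float]]:
    """Stream a CSV exported from Hadoop/Hive and yield scores one chunk
    at a time, so the full 70M-row table never has to fit in memory."""
    with open(path, newline="") as fh:
        chunk: List[float] = []
        for row in csv.reader(fh):
            chunk.append(score_row(row))
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        if chunk:  # emit the final, partial chunk
            yield chunk
```

Each yielded chunk can then be written back out (or loaded into Hive) before the next one is read, keeping memory use bounded by the chunk size.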

Answers


If you can install an R runtime on all the data nodes, you can write a simple Hadoop streaming, map-only job that calls your R code.
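A map-only streaming mapper has a very simple shape: read raw rows on stdin, emit "key&lt;TAB&gt;score" on stdout. The sketch below is illustrative; `score_record` is a hypothetical stand-in for the call into the R classifier:

```python
#!/usr/bin/env python3
"""Sketch of a map-only Hadoop streaming mapper. Hadoop feeds each
mapper a slice of the input file on stdin; whatever the mapper writes
to stdout becomes the job output."""
import sys

def score_record(fields):
    # Hypothetical stand-in for the R classifier call.
    # Toy rule: fraction of non-empty fields.
    return sum(1.0 for f in fields if f) / max(len(fields), 1)

def run(stdin, stdout):
    for line in stdin:
        fields = line.rstrip("\n").split("\t")
        key, values = fields[0], fields[1:]
        stdout.write(f"{key}\t{score_record(values)}\n")

if __name__ == "__main__":
    run(sys.stdin, sys.stdout)
```

Such a job would typically be submitted through the Hadoop streaming jar with zero reducers (map-only), with the R runtime and the serialized model shipped to, or pre-installed on, each data node.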

You can also take a look at SparkR.


I infer that you want to run your R code (the classifier) on the full dataset, not on a sample dataset.

So we are looking at executing R code on a massively distributed system.

Moreover, it has to integrate tightly with the Hadoop components.

So RHadoop fits your problem statement well.

http://www.rdatamining.com/big-data/r-hadoop-setup-guide


The classifier was built using a sample dataset, i.e. only about 1% of the data. But I will look into RHadoop. –

Scoring 80 million rows in 8.5 seconds

The code below was run on an off-lease Dell T7400 workstation with 64 GB RAM, dual quad-core 3 GHz Xeons, and two RAID 0 SSD arrays on separate channels, which I purchased for $600. I also use the free SPDE engine to partition the dataset. 

For small datasets like your 80 million rows, you might want to consider SAS or WPS. 
The code below scores 80 million 40-character records in about 9 seconds. 

The combination of in-memory R and SAS/WPS makes a great combination. Many SAS users consider datasets of less than 1 TB to be small. 

I ran 8 parallel processes (SAS 9.4 64-bit on Windows Pro 64-bit). 


%let pgm=utl_score_spde; 

libname spde spde 'd:/tmp' 
    datapath=("f:/wrk/spde_f" "e:/wrk/spde_e" "g:/wrk/spde_g") 
    partsize=4g; 

* clear any previous copy of the test dataset; 
proc datasets library=spde; 
delete littledata_spde; 
run;quit; 

* build the 80-million-row test dataset: 20 numerics and 20 $4 characters; 
data spde.littledata_spde (compress=char drop=idx); 
    retain primary_key; 
    array num[20] n1-n20; 
    array chr[20] $4 c1-c20; 
    do primary_key=1 to 80000000; 
    do idx=31 to 50; 
     num[idx-30]=uniform(-1); 
     chr[idx-30]=repeat(byte(idx),40); 
    end; 
    output; 
    end; 
run;quit; 



%let _s=%sysfunc(compbl(C:\Progra~1\SASHome\SASFoundation\9.4\sas.exe -sysin c:\nul -nosplash -sasautos c:\oto -autoexec c:\oto\Tut_Oto.sas)); 

* score it; 


data _null_;file "c:\oto\utl_scoreit.sas" lrecl=512;input;put _infile_;putlog _infile_; 
cards4; 
%macro utl_scoreit(beg=1,end=10000000); 

    libname spde spde 'd:/tmp' 
    datapath=("f:/wrk/spde_f" "e:/wrk/spde_e" "g:/wrk/spde_g") 
    partsize=4g; 

    libname out "G:/wrk"; 

    data keyscore; 

    set spde.littledata_spde(firstobs=&beg obs=&end 
     keep= 
      primary_key 
      n1 
      n12 
      n3 
      n14 
      n5 
      n16 
      n7 
      n18 
      n9 
      n10 
      c18 
      c19 
      c12); 
    score= (.1*n1 + 
      .1*n12 + 
      .1*n3 + 
      .1*n14 + 
      .1*n5 + 
      .1*n16 + 
      .1*n7 + 
      .1*n18 + 
      .1*n9 + 
      .1*n10 + 
      (c18='0000') + 
      (c19='0000') + 
      (c12='0000'))/3 ; 
    keep primary_key score; 
    run; 

%mend utl_scoreit; 
;;;; 
run;quit; 

%utl_scoreit; 


%let tym=%sysfunc(time()); 
systask kill sys101 sys102 sys103 sys104 sys105 sys106 sys107 sys108; 
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=1,end=10000000);) -log G:\wrk\sys101.log" taskname=sys101; 
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=10000001,end=20000000);) -log G:\wrk\sys102.log" taskname=sys102 ; 
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=20000001,end=30000000);) -log G:\wrk\sys103.log" taskname=sys103 ; 
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=30000001,end=40000000);) -log G:\wrk\sys104.log" taskname=sys104 ; 
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=40000001,end=50000000);) -log G:\wrk\sys105.log" taskname=sys105 ; 
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=50000001,end=60000000);) -log G:\wrk\sys106.log" taskname=sys106 ; 
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=60000001,end=70000000);) -log G:\wrk\sys107.log" taskname=sys107 ; 
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=70000001,end=80000000);) -log G:\wrk\sys108.log" taskname=sys108 ; 
waitfor _all_ sys101 sys102 sys103 sys104 sys105 sys106 sys107 sys108; 
systask list; 
%put %sysevalf(%sysfunc(time()) - &tym); 

8.56500005719863 

NOTE: AUTOEXEC processing completed. 

NOTE: Libref SPDE was successfully assigned as follows: 
     Engine:  SPDE 
     Physical Name: d:\tmp\ 
NOTE: Libref OUT was successfully assigned as follows: 
     Engine:  V9 
     Physical Name: G:\wrk 

NOTE: There were 10000000 observations read from the data set SPDE.LITTLEDATA_SPDE. 
NOTE: The data set WORK.KEYSCORE has 10000000 observations and 2 variables. 
NOTE: DATA statement used (Total process time): 
     real time   7.05 seconds 
     cpu time   6.98 seconds 



NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414 
NOTE: The SAS System used: 
     real time   8.34 seconds 
     cpu time   7.36 seconds 
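For readers without SAS, the partition-the-key-range-and-score-in-parallel pattern that the `systask` calls implement above can be sketched in Python with `multiprocessing`; the 0.1 weights and the divide-by-3 mirror the shape of the SAS scoring formula, but the model itself is illustrative:

```python
from multiprocessing import Pool

def score(row):
    """Linear score in the same spirit as the SAS data step above;
    the 0.1 weights and the /3 are illustrative, not a real model."""
    return sum(0.1 * x for x in row) / 3

def score_range(args):
    """Score one disjoint slice [beg, end) of the rows; this plays the
    role of one %utl_scoreit(beg=..., end=...) child SAS session."""
    rows, beg, end = args
    return [(i, score(rows[i])) for i in range(beg, end)]

def parallel_score(rows, n_workers=8):
    """Split the key range into disjoint slices, score them in
    parallel worker processes, and concatenate the (key, score) pairs."""
    n = len(rows)
    step = -(-n // n_workers)  # ceiling division
    slices = [(rows, beg, min(beg + step, n)) for beg in range(0, n, step)]
    with Pool(len(slices)) as pool:
        parts = pool.map(score_range, slices)
    return [pair for part in parts for pair in part]
```

Because the slices are disjoint, the workers never contend on the same rows, which is the same property the eight SAS child sessions rely on.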

I could use SAS; my partner co. is a big analytics shop. But how do I move a RandomForest or Naive-Bayes model into the SAS ecosystem? –
