2017-06-02

Fitting multiple numeric columns to a spark-ml model in PySpark

I'm working with Spark 1.6.2, and I have a DataFrame with 102 columns:

f0, f1,....,f101 

f0 contains an index and f101 contains the label; the other columns are numeric features (floats).

I want to train a random forest model (spark-ml) on this DataFrame.

So I used VectorAssembler to produce a single features column in order to fit the model:

from pyspark.ml.feature import VectorAssembler 
ignore = ['f0', 'f101'] 
assembler = VectorAssembler(inputCols=[x for x in df.columns if x not in ignore], outputCol='features') 

assembler.transform(df) 
df.show() 

But without success; it raised the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o255.transform. 
: org.apache.spark.SparkException: VectorAssembler does not support the StringType type 

Is there another way to fit these models with multiple columns?

Here are the first two rows of my DataFrame (note that all of my columns are of type string, which may be the cause of the problem):

+---+--------------------+-------------------+---+-------------------+----+ 
| f0|                  f1|                 f2|...|               f100|f101| 
+---+--------------------+-------------------+---+-------------------+----+ 
|  0|-0.38672998547554016|-1.5183000564575195|...|-0.6098300218582153| 361| 
|  1|  0.6452699899673462|  0.528219997882843|...|0.01594099961221218|1047| 
+---+--------------------+-------------------+---+-------------------+----+ 
(columns f3 through f99 truncated for readability) 

Have you tried assigning the result of the list comprehension outside of 'VectorAssembler()' and then passing it in as an argument? – mtoto


Yes, and I got the same error. I also tried passing the list ['f1','f2'] and the same error occurred. – abdelkarim


Could you please add your DataFrame? – eliasah

Answer


We will build a string representation of the vector with `concat_ws`, and parse it back into a vector with a `parse_` udf that we define here:

from pyspark.sql import functions as F 
from pyspark.sql.functions import udf 
from pyspark.mllib.linalg import Vectors, VectorUDT 

rdd = sc.parallelize(['0|-0.38672998547554016|-1.5183000564575195|0.21291999518871307| 1.2288000583648682|0.7216399908065796', 
                      '1|0.6452699899673462|0.528219997882843|-0.5653899908065796|-0.4328500032424927|0.9352899789810181']).map(lambda x: x.split('|')) 

df = sqlContext.createDataFrame(rdd, ['f1','f2','f3','f4','f5','f6']) 

ignore = ['f1','f4'] # columns to ignore 
keep = [x for x in df.columns if x not in ignore] # columns to keep 

parse_ = udf(Vectors.parse, VectorUDT()) 
parsed = df.withColumn("features", F.concat(F.lit('['), F.concat_ws(",", *keep), F.lit(']'))) \ 
           .withColumn("features", parse_("features")) 

parsed.show(truncate=False) 

parsed.show(truncate=False) 
# +---+--------------------+-------------------+-------------------+-------------------+------------------+--------------------------------------------------------------------------------+ 
# |f1 |f2                  |f3                 |f4                 |f5                 |f6                |features                                                                        | 
# +---+--------------------+-------------------+-------------------+-------------------+------------------+--------------------------------------------------------------------------------+ 
# |0  |-0.38672998547554016|-1.5183000564575195|0.21291999518871307| 1.2288000583648682|0.7216399908065796|[-0.38672998547554016,-1.5183000564575195,1.2288000583648682,0.7216399908065796]| 
# |1  |0.6452699899673462  |0.528219997882843  |-0.5653899908065796|-0.4328500032424927|0.9352899789810181|[0.6452699899673462,0.528219997882843,-0.4328500032424927,0.9352899789810181]   | 
# +---+--------------------+-------------------+-------------------+-------------------+------------------+--------------------------------------------------------------------------------+ 
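What `concat_ws` builds here is a string literal such as `[-0.38,...,0.72]`, which `Vectors.parse` then turns into a dense vector. A minimal pure-Python sketch of the same round trip (no Spark required; the helper names `to_vector_string` and `parse_vector` are made up for illustration):

```python
def to_vector_string(row, keep):
    """Join the kept string-typed fields into a '[v1,v2,...]' literal,
    mimicking concat('[', concat_ws(',', *keep), ']')."""
    return '[' + ','.join(row[c] for c in keep) + ']'

def parse_vector(s):
    """Parse a '[v1,v2,...]' literal into a list of floats,
    mimicking what Vectors.parse does for a dense vector."""
    return [float(v) for v in s.strip('[]').split(',')]

row = {'f1': '0', 'f2': '-0.3867', 'f3': '-1.5183', 'f5': '1.2288', 'f6': '0.7216'}
keep = ['f2', 'f3', 'f5', 'f6']

s = to_vector_string(row, keep)
print(s)                # [-0.3867,-1.5183,1.2288,0.7216]
print(parse_vector(s))  # [-0.3867, -1.5183, 1.2288, 0.7216]
```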

That should do it. I just used a smaller example than yours.
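Alternatively, since the error comes from the columns being StringType, you can cast each feature column to double first (in Spark, `df.withColumn(c, df[c].cast('double'))` for each column in `keep`) and then use `VectorAssembler` directly. A plain-Python sketch of that cast-then-assemble idea, using made-up row data:

```python
# Cast string-typed feature values to float, then collect them into a
# single features list - the same idea as casting each Spark column to
# double before running VectorAssembler.
rows = [
    {'f0': '0', 'f1': '-0.3867', 'f2': '-1.5183', 'f101': '361'},
    {'f0': '1', 'f1': '0.6452',  'f2': '0.5282',  'f101': '1047'},
]
ignore = ['f0', 'f101']  # index and label columns

def assemble(row, ignore):
    """Cast every non-ignored field to float and collect the features."""
    return [float(row[c]) for c in row if c not in ignore]

for row in rows:
    print(assemble(row, ignore))
# [-0.3867, -1.5183]
# [0.6452, 0.5282]
```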
