
I want to summarize a DataFrame. I already have the individual outputs, and I want to merge three DataFrames into a single DataFrame with the same layout as the first one, i.e. append the other DataFrames' rows to the first.

Here is what I did.

// Compute column summary statistics.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val dataframe = spark.read.option("header", true).option("inferSchema", true).format("com.databricks.spark.csv").load("C:/Users/mhattabi/Desktop/donnee/cassandraTest_1.csv")
val colNames = dataframe.columns
val data = dataframe.describe()
data.show()

+-------+-------------------+-------------------+-------------------+-------------------+-------------------+
|summary|               Col0|               Col1|               Col2|               Col3|               Col4|
+-------+-------------------+-------------------+-------------------+-------------------+-------------------+
|  count|               9999|               9999|               9999|               9999|               9999|
|   mean| 0.4976937166129511| 0.5032998128645433| 0.5002933978916888| 0.5008783202471074|0.49977372871783293|
| stddev| 0.2893201326892155|0.28767789122296994|0.29041197844235034|0.28989958496291496| 0.2881033430504947|
|    min|4.92436811557243E-6|3.20277176946531E-5|1.41602940923349E-5|6.53252937203857E-5| 5.4864212896146E-5|
|    max|  0.999442967120299|    0.9999608020298|  0.999968873336897|  0.999836584087385|  0.999822016805327|
+-------+-------------------+-------------------+-------------------+-------------------+-------------------+
println("Skewness") 
val Skewness = dataframe.columns.map(c => skewness(c).as(c)) 
val Skewness_ = dataframe.agg(Skewness.head, Skewness.tail: _*).show() 

Skewness

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                Col0|                Col1|                Col2|                Col3|                Col4|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|0.015599787007160271|-0.00740111491496...|0.006096695102089171|0.003614431405637598|0.007869663345343194|
+--------------------+--------------------+--------------------+--------------------+--------------------+

println("Kurtosis")
val Kurtosis = dataframe.columns.map(c => kurtosis(c).as(c))
val Kurtosis_ = dataframe.agg(Kurtosis.head, Kurtosis.tail: _*)
Kurtosis_.show() // kurtosis
Kurtosis 
+-------------------+-------------------+-------------------+-------------------+------------------+ 
|               Col0|               Col1|               Col2|               Col3|              Col4|
+-------------------+-------------------+-------------------+-------------------+------------------+ 
|-1.2187774053075133|-1.1861812968784207|-1.2107252263053805|-1.2108988817869097|-1.199054929668751| 
+-------------------+-------------------+-------------------+-------------------+------------------+ 

I want to append the skewness and kurtosis DataFrames to the first one (the describe() output), with their names added in the first (summary) column.

Thanks in advance.


I misread your question and posted a wrong answer, sorry. I deleted it, and I'll see whether I can come up with a meaningful answer. – stefanobaghino

Answers


You need the summary column; add it with withColumn:

val Skewness_ = dataframe.agg(Skewness.head, Skewness.tail: _*).withColumn("summary", lit("Skewness")) 

Do the same for kurtosis:

val Kurtosis_ = dataframe.agg(Kurtosis.head, Kurtosis.tail: _*).withColumn("summary", lit("Kurtosis")) 

Use select on both the skewness and kurtosis DataFrames so that their columns come in the same order as the summary table's columns:

val orderColumn = Vector("summary", "col0", "col1", "col2", "col3", "col4") 
val Skewness_ordered = Skewness_.select(orderColumn.map(col):_*) 
val Kurtosis_ordered = Kurtosis_.select(orderColumn.map(col):_*) 

Then union them onto the summary DataFrame produced by describe(), which has the same six columns in the same order:

val combined = dataframe.describe().union(Skewness_ordered).union(Kurtosis_ordered) 
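
Putting the steps together, here is a minimal end-to-end sketch. It reuses dataframe and the Skewness/Kurtosis expression arrays from the question; the summary, columnOrder, skewnessRow and kurtosisRow names are just illustrative, not from the original code:

import org.apache.spark.sql.functions.{col, lit}

val summary = dataframe.describe()              // summary, Col0 .. Col4
val columnOrder = summary.columns.map(col)      // keep describe()'s column order

val skewnessRow = dataframe
  .agg(Skewness.head, Skewness.tail: _*)
  .withColumn("summary", lit("Skewness"))
  .select(columnOrder: _*)

val kurtosisRow = dataframe
  .agg(Kurtosis.head, Kurtosis.tail: _*)
  .withColumn("summary", lit("Kurtosis"))
  .select(columnOrder: _*)

val combined = summary.union(skewnessRow).union(kurtosisRow)
combined.show()

Deriving columnOrder from summary.columns keeps the select correct even if the CSV's column names change.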

This takes a lot of time, especially on large datasets. Is there a way to make it faster? Thanks a lot –


I think that's all I know. :) –
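
Regarding the speed concern in the comment above: computing skewness and kurtosis in a single agg call lets Spark scan the data once for both statistics instead of twice, which may help on large inputs. A rough sketch; the skew_/kurt_ prefixes and the statsRow/skewnessRow/kurtosisRow names are only illustrative, not from the answers above:

import org.apache.spark.sql.functions.{col, kurtosis, lit, skewness}

val skewCols = dataframe.columns.map(c => skewness(col(c)).as("skew_" + c))
val kurtCols = dataframe.columns.map(c => kurtosis(col(c)).as("kurt_" + c))
val allStats = skewCols ++ kurtCols

// Single pass over the data for both statistics.
val statsRow = dataframe.agg(allStats.head, allStats.tail: _*)

// Split the one result row into two labelled rows matching describe()'s layout.
val skewnessRow = statsRow.select(lit("Skewness") +: dataframe.columns.map(c => col("skew_" + c)): _*)
val kurtosisRow = statsRow.select(lit("Kurtosis") +: dataframe.columns.map(c => col("kurt_" + c)): _*)

val combined = dataframe.describe().union(skewnessRow).union(kurtosisRow)
combined.show()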


An elegant way to combine your skewness and kurtosis DataFrames with the initial summary into a new DataFrame:

import org.apache.spark.sql.functions._ 

val result = dataframe.describe()
     .union(Skewness_.select(lit("Skewness"), Skewness_.col("*")))
     .union(Kurtosis_.select(lit("Kurtosis"), Kurtosis_.col("*")))

result.show() 
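
One thing to be aware of with both approaches: describe() returns its statistics as strings, so I would expect the union to widen the numeric skewness and kurtosis values to strings as well; if your Spark version rejects the union over the type mismatch, cast those columns to string explicitly before unioning, and cast them back to double if you need to compute with them afterwards.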

unionAll is deprecated – Gevorg
