如何将列分成两组？

我有一个CSV输入文件。我们阅读使用以下内容如何将列分成两组？

val rawdata = spark. 
    read. 
    format("csv"). 
    option("header", true). 
    option("inferSchema", true). 
    load(filename)

这整齐地读取数据并构建架构。

下一步是将列拆分为字符串和整数列。怎么样？

如果以下是我的数据集的架构......

scala> rawdata.printSchema 
root 
|-- ID: integer (nullable = true) 
|-- First Name: string (nullable = true) 
|-- Last Name: string (nullable = true) 
|-- Age: integer (nullable = true) 
|-- DailyRate: integer (nullable = true) 
|-- Dept: string (nullable = true) 
|-- DistanceFromHome: integer (nullable = true)

我想这分为两个变量（StringCols，IntCols）其中：

StringCols应有“姓”，“姓”，“部门”
IntCols应该有“ID”，“年龄”，“DailyRate”，“DistanceFromHome”

这是我曾尝试：

val names = rawdata.schema.fieldNames 
val types = rawdata.schema.fields.map(r => r.dataType)

现在types，我想循环并找到所有StringType和地名查找起来的列名，同样为IntegerType。

来源

2017-06-05 Balaji Krishnan

在这里你去，你可以使用基础schema和dataType

import org.apache.spark.sql.types.{IntegerType, StringType} 

val stringCols = df.schema.filter(c => c.dataType == StringType).map(_.name) 
val intCols = df.schema.filter(c => c.dataType == IntegerType).map(_.name) 

val dfOfString = df.select(stringCols.head, stringCols.tail : _*) 
val dfOfInt = df.select(intCols.head, intCols.tail : _*)

来源

2017-06-05 12:01:22 eliasah

谢谢，这就是我正在看。我做的错误是，不包括sql.types。{IntegerType，StringType} 我在做以下并获得空列表 VAL TST = rawdata.schema.filter（C => c.dataType == “StringType”）代替 VAL TST = rawdata.schema.filter （c => c.dataType == StringType）非常感谢。 Regards Bala –

使用dtypes运营商过滤按类型的列：

dtypes：数组[（字符串，字符串）]将所有列名及其数据类型作为数组返回。

这会给你一个更习惯于处理数据集模式的方式。

val rawdata = Seq(
    (1, "First Name", "Last Name", 43, 2000, "Dept", 0) 
).toDF("ID", "First Name", "Last Name", "Age", "DailyRate", "Dept", "DistanceFromHome") 
scala> rawdata.dtypes.foreach(println) 
(ID,IntegerType) 
(First Name,StringType) 
(Last Name,StringType) 
(Age,IntegerType) 
(DailyRate,IntegerType) 
(Dept,StringType) 
(DistanceFromHome,IntegerType)

我想这个分成两个变量（StringCols，IntCols）

（我宁愿坚持使用不可变值而不是如果你不介意）

val emptyPair = (Seq.empty[String], Seq.empty[String]) 
val (stringCols, intCols) = rawdata.dtypes.foldLeft(emptyPair) { case ((strings, ints), (name: String, typ)) => 
    typ match { 
    case _ if typ == "StringType" => (name +: strings, ints) 
    case _ if typ == "IntegerType" => (strings, name +: ints) 
    } 
}

StringCols应该有“First Name”，“Last Name”，“Dept”和IntCols应该有“ID”，“Age”，“DailyRate”，“DistanceFromHome”

您可以reverse集合，但我宁愿避免这样做，因为性能昂贵并且不会给您带来任何回报。

来源

2017-06-05 15:34:37

如何将列分成两组？

回答

相关问题