2017-08-16

How to convert an array-type column of a Dataset to string type in Apache Spark (Java)

I have an array-type column in my dataset that needs to be converted to string type. I have tried it the conventional way, but I feel it can be done better. Can you guide me?

Input dataset:

    +------------------+-----------+-------------------------------------------------------------------------------------------------+
    |ManufacturerSource|upcSource  |productDescriptionSource                                                                         |
    +------------------+-----------+-------------------------------------------------------------------------------------------------+
    |3M                |51115665883|[c, gdg, whl, t27, 5, x, 1, 4, x, 7, 8, grindig, flap, wheels, 36, grit, 12, 250, rpm]           |
    |3M                |51115665937|[c, gdg, whl, t27, q, c, 6, x, 1, 4, x, 5, 8, 11, grinding, flap, wheels, 36, grit, 10, 200, rpm]|
    |3M                |0          |[3mite, rb, cloth, 3, x, 2, wd]                                                                  |
    |3M                |0          |[trizact, disc, cloth, 237aaa16x5, hole]                                                         |
    +------------------+-----------+-------------------------------------------------------------------------------------------------+

Expected output dataset:

    +------------------+-----------+--------------------------------------------------------------------------+
    |ManufacturerSource|upcSource  |productDescriptionSource                                                  |
    +------------------+-----------+--------------------------------------------------------------------------+
    |3M                |51115665883|c gdg whl t27 5 x 1 4 x 7 8 grinding flap wheels 36 grit 12 250 rpm       |
    |3M                |51115665937|c gdg whl t27 q c 6 x 1 4 x 5 8 11 grinding flap wheels 36 grit 10 200 rpm|
    |3M                |0          |3mite rb cloth 3 x 2 wd                                                   |
    |3M                |0          |trizact disc cloth 237aaa16x5 hole                                        |
    +------------------+-----------+--------------------------------------------------------------------------+

Conventional approach 1:

    Dataset<Row> afterStopwordsRemoved =
        stopwordsRemoved.select("productDescriptionSource");
    afterStopwordsRemoved.show();

    List<Row> individualRows = afterStopwordsRemoved.collectAsList();

    System.out.println("After flatmap\n");
    List<String> temp;
    for (Row individualRow : individualRows) {
        temp = individualRow.getList(0);
        System.out.println(String.join(" ", temp));
    }
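The per-row conversion in the loop above reduces to joining a token list with single spaces. A minimal plain-Java illustration (no Spark needed) of that step, which also shows why `Arrays.toString` is not a substitute, since it keeps the brackets and commas:

```java
import java.util.Arrays;
import java.util.List;

public class JoinDemo {
    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("3mite", "rb", "cloth", "3", "x", "2", "wd");

        // String.join produces the desired space-separated form
        System.out.println(String.join(" ", tokens));          // 3mite rb cloth 3 x 2 wd

        // Arrays.toString keeps brackets and commas
        System.out.println(Arrays.toString(tokens.toArray())); // [3mite, rb, cloth, 3, x, 2, wd]
    }
}
```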

Approach 2 (does not produce the result):

Exception: Failed to execute user defined function ($anonfun$27: (array) => string)

    UDF1 untoken = new UDF1<String, String[]>() {
        public String call(String[] token) throws Exception {
            //return types.replaceAll("[^a-zA-Z0-9\\s+]", "");
            return Arrays.toString(token);
        }

        @Override
        public String[] call(String t1) throws Exception {
            // TODO Auto-generated method stub
            return null;
        }
    };

    sqlContext.udf().register("unTokenize", untoken, DataTypes.StringType);

    source.createOrReplaceTempView("DataSetOfTokenize");
    Dataset<Row> newDF = sqlContext.sql("select *, unTokenize(productDescriptionSource) FROM DataSetOfTokenize");
    newDF.show(4000, false);
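A likely cause of the exception above: Spark does not hand an ArrayType column to a Java UDF as `String[]`; it arrives as a Scala sequence (a `WrappedArray`). A sketch of a corrected UDF under that assumption (Spark 2.x Java API; `sqlContext` is the same context used above, and the manual loop avoids any Scala-to-Java collection converters):

```java
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import scala.collection.Seq;

// Array columns reach Java UDFs as a Scala Seq (WrappedArray), not String[]
UDF1<Seq<String>, String> untoken = tokens -> {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < tokens.size(); i++) {
        if (i > 0) sb.append(' ');
        sb.append(tokens.apply(i)); // Seq.apply(i) is positional access
    }
    return sb.toString();
};

sqlContext.udf().register("unTokenize", untoken, DataTypes.StringType);
```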

Answer

I would use concat_ws:

sqlContext.sql("select *, concat_ws(' ', productDescriptionSource) FROM DataSetOfTokenize"); 

or:

import static org.apache.spark.sql.functions.*;

df.withColumn("foo", concat_ws(" ", col("productDescriptionSource")));

Thanks for the reply, it works... –
