2017-08-16

How to convert an array-type column of a Dataset to string type in Apache Spark (Java)

I have an array-type column in my dataset that needs to be converted to string type. I have tried it the conventional way, but I feel it can be done better. Can you guide me?

Input dataset:

    +------------------+-----------+-------------------------------------------------------------------------------------------------+
    |ManufacturerSource|upcSource  |productDescriptionSource                                                                         |
    +------------------+-----------+-------------------------------------------------------------------------------------------------+
    |3M                |51115665883|[c, gdg, whl, t27, 5, x, 1, 4, x, 7, 8, grindig, flap, wheels, 36, grit, 12, 250, rpm]           |
    |3M                |51115665937|[c, gdg, whl, t27, q, c, 6, x, 1, 4, x, 5, 8, 11, grinding, flap, wheels, 36, grit, 10, 200, rpm]|
    |3M                |0          |[3mite, rb, cloth, 3, x, 2, wd]                                                                  |
    |3M                |0          |[trizact, disc, cloth, 237aaa16x5, hole]                                                         |
    +------------------+-----------+-------------------------------------------------------------------------------------------------+

Expected output dataset:

    +------------------+-----------+--------------------------------------------------------------------------+
    |ManufacturerSource|upcSource  |productDescriptionSource                                                  |
    +------------------+-----------+--------------------------------------------------------------------------+
    |3M                |51115665883|c gdg whl t27 5 x 1 4 x 7 8 grinding flap wheels 36 grit 12 250 rpm       |
    |3M                |51115665937|c gdg whl t27 q c 6 x 1 4 x 5 8 11 grinding flap wheels 36 grit 10 200 rpm|
    |3M                |0          |3mite rb cloth 3 x 2 wd                                                   |
    |3M                |0          |trizact disc cloth 237aaa16x5 hole                                        |
    +------------------+-----------+--------------------------------------------------------------------------+

Conventional approach 1:

    Dataset<Row> afterStopwordsRemoved =
        stopwordsRemoved.select("productDescriptionSource");
    afterStopwordsRemoved.show();

    List<Row> individualRows = afterStopwordsRemoved.collectAsList();

    System.out.println("After flatmap\n");
    List<String> temp;
    for (Row individualRow : individualRows) {
        temp = individualRow.getList(0);
        System.out.println(String.join(" ", temp));
    }
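The per-row conversion in the loop above reduces to joining a token list with single spaces. A minimal plain-Java illustration (no Spark needed) of that step, which also shows why `Arrays.toString` is not a substitute, since it keeps the brackets and commas:

```java
import java.util.Arrays;
import java.util.List;

public class JoinDemo {
    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("3mite", "rb", "cloth", "3", "x", "2", "wd");

        // String.join produces the desired space-separated form
        System.out.println(String.join(" ", tokens));          // 3mite rb cloth 3 x 2 wd

        // Arrays.toString keeps brackets and commas
        System.out.println(Arrays.toString(tokens.toArray())); // [3mite, rb, cloth, 3, x, 2, wd]
    }
}
```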

Approach 2 (does not produce the result):

Exception: Failed to execute user defined function ($anonfun$27: (array) => string)

    UDF1 untoken = new UDF1<String, String[]>() {
        public String call(String[] token) throws Exception {
            //return types.replaceAll("[^a-zA-Z0-9\\s+]", "");
            return Arrays.toString(token);
        }

        @Override
        public String[] call(String t1) throws Exception {
            // TODO Auto-generated method stub
            return null;
        }
    };

    sqlContext.udf().register("unTokenize", untoken, DataTypes.StringType);

    source.createOrReplaceTempView("DataSetOfTokenize");
    Dataset<Row> newDF = sqlContext.sql("select *, unTokenize(productDescriptionSource) FROM DataSetOfTokenize");
    newDF.show(4000, false);
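A likely cause of the exception above: Spark does not hand an ArrayType column to a Java UDF as `String[]`; it arrives as a Scala sequence (a `WrappedArray`). A sketch of a corrected UDF under that assumption (Spark 2.x Java API; `sqlContext` is the same context used above, and the manual loop avoids any Scala-to-Java collection converters):

```java
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import scala.collection.Seq;

// Array columns reach Java UDFs as a Scala Seq (WrappedArray), not String[]
UDF1<Seq<String>, String> untoken = tokens -> {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < tokens.size(); i++) {
        if (i > 0) sb.append(' ');
        sb.append(tokens.apply(i)); // Seq.apply(i) is positional access
    }
    return sb.toString();
};

sqlContext.udf().register("unTokenize", untoken, DataTypes.StringType);
```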

Answer

I would use concat_ws:

sqlContext.sql("select *, concat_ws(' ', productDescriptionSource) FROM DataSetOfTokenize"); 

or:

import static org.apache.spark.sql.functions.*;

df.withColumn("foo", concat_ws(" ", col("productDescriptionSource")));

Thanks for the reply, it works... –
