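I want to write a program that takes a dataset with several columns, like the one shown below, and splits it into sub-datasets based on the distinct values of one column: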
+-----------+---------------+-------------------+-----------------------+
|Item_ID |Product_Name |Manufacturer_Name |Product_Description |
+-----------+---------------+-------------------+-----------------------+
|12345 |Pen |Cello |Ball Pen Soft Nib... |
|12346 |Pencil |Nataraja |Pencil HB Extra D... |
|42345 |Ruler |Nataraja |Scale No.1103 15c... |
|12677 |Sharpener |Nataraja |Pencil Shraperner... |
|12987 |Pen |Reynolds |Dot Pen Extra Gr... |
|44326 |Pen |Reynolds |Gel Pen German T... |
|13456 |Pen |Cello |Dot Pen 0.5mm Nib... |
|19876 |Eraser |Cello |Dust free Eraser ... |
|43246 |Ink Pen |Hero |Ink Pen Smooth Ha... |
+-----------+---------------+-------------------+-----------------------+
I want to group the dataset by Manufacturer_Name, as shown below:
Manufacturer = Cello
+-----------+---------------+-------------------+-----------------------+
|Item_ID |Product_Name |Manufacturer_Name |Product_Description |
+-----------+---------------+-------------------+-----------------------+
|12345 |Pen |Cello |Ball Pen Soft Nib... |
|13456 |Pen |Cello |Dot Pen 0.5mm Nib... |
|19876 |Eraser |Cello |Dust free Eraser ... |
+-----------+---------------+-------------------+-----------------------+
Manufacturer = Nataraja
+-----------+---------------+-------------------+-----------------------+
|Item_ID |Product_Name |Manufacturer_Name |Product_Description |
+-----------+---------------+-------------------+-----------------------+
|12346 |Pencil |Nataraja |Pencil HB Extra D... |
|42345 |Ruler |Nataraja |Scale No.1103 15c... |
|12677 |Sharpener |Nataraja |Pencil Shraperner... |
+-----------+---------------+-------------------+-----------------------+
Manufacturer = Reynolds
+-----------+---------------+-------------------+-----------------------+
|Item_ID |Product_Name |Manufacturer_Name |Product_Description |
+-----------+---------------+-------------------+-----------------------+
|12987 |Pen |Reynolds |Dot Pen Extra Gr... |
|44326 |Pen |Reynolds |Gel Pen German T... |
+-----------+---------------+-------------------+-----------------------+
Manufacturer = Hero
+-----------+---------------+-------------------+-----------------------+
|Item_ID |Product_Name |Manufacturer_Name |Product_Description |
+-----------+---------------+-------------------+-----------------------+
|43246 |Ink Pen |Hero |Ink Pen Smooth Ha... |
+-----------+---------------+-------------------+-----------------------+
I tried the code below, but it does not give good results. Please help me improve this program. Here is the code I used:
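// Collect the distinct manufacturer names to the driver.
Dataset<Row> countsBy = src.select("Manufacturer_Name").distinct();
List<Row> lsts = countsBy.collectAsList();
for (Row lst : lsts) {
    // getString(0) returns the bare value; Row.toString() would wrap it in brackets, e.g. "[Cello]".
    String man = lst.getString(0);
    System.out.println("Records of " + man + " only");
    // Declared inside the loop, so each subset is only visible within this iteration.
    Dataset<Row> mandataset = src.filter("Manufacturer_Name='" + man + "'");
    mandataset.show();
}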
Can you be more specific about what you mean by bad results? Is it slow, or is it wrong?
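I want the subsets of the dataset to be usable outside the loop. Because the variable is declared locally and overwritten on every iteration, I cannot use any of the subsets except the one produced in the last iteration. @AugustinBocken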
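One way to keep every subset reachable after the loop is to store each one in a map keyed by manufacturer name instead of in a loop-local variable. The following is only a minimal sketch of that pattern, written in plain Java so it runs without Spark. The class `SubsetsByKey` and the helpers `splitByKey` and `filterByManufacturer` are hypothetical names introduced for illustration, and the in-memory `List<String[]>` is a stand-in for `Dataset<Row>`. With Spark, the map would presumably be a `Map<String, Dataset<Row>>` and the callback would be the `src.filter(...)` call from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class SubsetsByKey {

    // Builds one subset per key and keeps all of them in a map, so they stay
    // available after the loop ends. `makeSubset` is a hypothetical callback;
    // with Spark it could be: man -> src.filter("Manufacturer_Name='" + man + "'")
    public static <T> Map<String, T> splitByKey(List<String> keys, Function<String, T> makeSubset) {
        Map<String, T> subsets = new LinkedHashMap<>();
        for (String key : keys) {
            subsets.put(key, makeSubset.apply(key));
        }
        return subsets;
    }

    // Plain-Java stand-in for src.filter(...): keeps the rows whose third
    // column (the manufacturer name in this sketch) equals `man`.
    public static List<String[]> filterByManufacturer(List<String[]> rows, String man) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            if (row[2].equals(man)) {
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A few rows from the question: Item_ID, Product_Name, Manufacturer_Name.
        List<String[]> src = Arrays.asList(
            new String[] {"12345", "Pen", "Cello"},
            new String[] {"12346", "Pencil", "Nataraja"},
            new String[] {"12987", "Pen", "Reynolds"},
            new String[] {"13456", "Pen", "Cello"},
            new String[] {"43246", "Ink Pen", "Hero"});
        List<String> manufacturers = Arrays.asList("Cello", "Nataraja", "Reynolds", "Hero");

        Map<String, List<String[]>> byManufacturer =
            splitByKey(manufacturers, man -> filterByManufacturer(src, man));

        // Every subset is still reachable here, outside the loop that built it.
        for (Map.Entry<String, List<String[]>> e : byManufacturer.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().size() + " row(s)");
        }
    }
}
```

A `LinkedHashMap` is used here only so that iteration follows the order in which the keys were supplied. Note that with Spark each stored `Dataset<Row>` is evaluated lazily, so one filter per manufacturer may rescan the source each time a subset is used; caching `src` first might help if that becomes a concern.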