Sparklyr handling categorical variables

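I come from an R background, where I am used to categorical variables being handled in the backend (as factors). With Sparklyr, using string_indexer or onehotencoder is quite confusing.

For example, I have some variables that are encoded as numeric variables in the original dataset but are actually categorical. I want to use them as categorical variables, but I am not sure whether I am doing it correctly.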

library(sparklyr)
library(dplyr)
sessionInfo()
# Connect to a local Spark instance using the default installed version
sc <- spark_connect(master = "local")
spark_version(sc)
set.seed(1)
exampleDF <- data.frame(ID = 1:10, Resp = sample(c(100:205), 10, replace = TRUE),
                        Numb = sample(1:10, 10))

# Copy to Spark, index Resp as a categorical label, fit a tree, and predict
example <- copy_to(sc, exampleDF)
pred <- example %>% mutate(Resp = as.character(Resp)) %>%
  sdf_mutate(Resp_cat = ft_string_indexer(Resp)) %>%
  ml_decision_tree(response = "Resp_cat", features = "Numb") %>%
  sdf_predict()
pred

The model's predictions are not categorical; see below. Does this mean I also have to convert from prediction back to Resp_cat, and then to Resp?

R version 3.4.0 (2017-04-21) 
Platform: x86_64-redhat-linux-gnu (64-bit) 
Running under: CentOS Linux 7 (Core) 

spark_version(sc) 
[1] ‘2.1.1.2.6.1.0’ 

Source:   table<sparklyr_tmp_74e340c5607c> [?? x 6]
Database: spark_connection
      ID  Numb  Resp Resp_cat id74e35c6b2dbb prediction
   <int> <int> <chr>    <dbl>          <dbl>      <dbl>
1      1    10   150        8              0   8.000000
2      2     3   191        4              1   4.000000
3      3     4   146        9              2   9.000000
4      4     9   125        5              3   5.000000
5      5     8   107        2              4   2.000000
6      6     2   110        1              5   1.000000
7      7     5   133        3              6   5.333333
8      8     7   154        6              7   5.333333
9      9     1   170        0              8   0.000000
10    10     6   143        7              9   5.333333


2017-08-14 Kevin Zheng

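In general, Spark relies on column metadata when handling categorical data. In your pipeline this is handled by StringIndexer (ft_string_indexer). ML always predicts the label (the encoded index), not the original string. Normally you would use the IndexToString transformer, which is provided by ft_index_to_string.

In Spark, IndexToString can use either a provided list of labels or Column metadata. Unfortunately, the sparklyr implementation is limited in two ways:

It can use only metadata, which is not set on the prediction column.
ft_string_indexer discards the trained model, so it cannot be used to extract the labels.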
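To make the encoding concrete, here is a minimal sketch in plain R (no Spark involved) of what an index/label round trip looks like. It assumes the default StringIndexer ordering, where the most frequent label gets index 0; the vector `resp` and the other variable names are made up for illustration, and the example avoids ties because tie-breaking depends on the Spark version.

```r
# Hypothetical string labels; "150" occurs 3 times, "191" twice, "146" once
resp <- c("150", "191", "150", "146", "191", "150")

# Order the distinct labels by descending frequency
freq <- sort(table(resp), decreasing = TRUE)
labels <- names(freq)

# Encode: zero-based position of each value in the label list
index <- match(resp, labels) - 1

# Decode: look the index up in the same label list
recovered <- labels[index + 1]

print(data.frame(resp, index, recovered))
```

Decoding only works if the label list is available, which is why losing the fitted indexer (or the column metadata) forces a manual mapping.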

I may be missing something, but it looks like you have to map the predictions manually by joining with the transformed data, for example:

pred %>% 
    select(prediction=Resp_cat, Resp_prediction=Resp) %>% 
    distinct() %>% 
    right_join(pred)

Joining, by = "prediction"
# Source:   lazy query [?? x 9]
# Database: spark_connection
   prediction Resp_prediction    ID  Numb  Resp Resp_cat id777a79821e1e
        <dbl>           <chr> <int> <int> <chr>    <dbl>          <dbl>
1           7             171     1     3   171        7              0
2           0             153     2    10   153        0              1
3           3             132     3     8   132        3              2
4           5             122     4     7   122        5              3
5           6             198     5     4   198        6              4
6           2             164     6     9   164        2              5
7           4             137     7     6   137        4              6
8           1             184     8     5   184        1              7
9           0             153     9     1   153        0              8
10          1             184    10     2   184        1              9
# ... with more rows, and 2 more variables: rawPrediction <list>,
#   probability <list>

Explanation:

pred %>% 
    select(prediction=Resp_cat, Resp_prediction=Resp) %>% 
    distinct()

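creates a mapping from the prediction (encoded label) to the original label. We rename Resp_cat to prediction so that it can serve as the join key, and Resp to Resp_prediction to avoid a conflict with the actual Resp.

Finally, we apply a right equi-join: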

... %>% right_join(pred)
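The same mapping can be tried on a small local data frame, without a Spark connection. This is only a sketch: `pred_local` and its values are invented for illustration, and the join key is passed explicitly through `by` rather than inferred.

```r
library(dplyr)

# Hypothetical local stand-in for the Spark table `pred`
pred_local <- data.frame(
  ID = 1:4,
  Resp = c("150", "191", "146", "150"),
  Resp_cat = c(0, 1, 2, 0),
  prediction = c(0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Lookup table: encoded label -> original label
lookup <- pred_local %>%
  select(prediction = Resp_cat, Resp_prediction = Resp) %>%
  distinct()

# Attach the decoded prediction to every row of pred_local
mapped <- lookup %>%
  right_join(pred_local, by = "prediction") %>%
  arrange(ID)

print(mapped)
```

Writing it as `pred_local %>% left_join(lookup, by = "prediction")` should give the same rows with a different column order.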

Note:

You should specify the type of the tree:

ml_decision_tree(
    response = "Resp_cat", features = "Numb", type = "classification")
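A likely reason this matters: the fractional predictions in the question's output look like what a regression tree would produce, namely the mean of the indexed labels that fall into one leaf. Rows 7, 8 and 10 there have Resp_cat values 3, 6 and 7 and all receive the prediction 5.333333. The check below is plain arithmetic; that these three rows share a leaf is an inference from the output, not something the thread confirms.

```r
# Resp_cat values of rows 7, 8 and 10 in the question's output
leaf_labels <- c(3, 6, 7)

# A regression tree predicts the mean of the responses in a leaf
regression_prediction <- mean(leaf_labels)

print(round(regression_prediction, 6))
```

A classification tree would instead return one of the existing label indices, which can then be mapped back to Resp.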


2017-08-14 16:30:23 user6910411

This is a good workaround. Thanks! I wish Sparklyr would handle this internally, and I have opened a [ticket](https://github.com/rstudio/sparklyr/issues/928) for it.
