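Levels in R - setting them correctly for a new data set

I trained a model on a data set that contains a factor variable. This variable has the following levels: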

[1] "Economics" "Engineering" "Medicine" 
[4] "Accounting" "Biology"  "Computer Science" 
[7] "Physics"  "Law"   "Chemistry"

My evaluation set has a subset of those levels:

[1] "Law"   "Medicine"

The randomForest package requires the levels to be the same, so I tried:

levels(evaluationSet$course) <- levels(trainingSet$course)

But when I inspect rows in the evaluation set, the values have changed:

evaluationSet[1:3,c('course')] 
# Gives "[1] Economics Engineering Economics", should give "[1] Law Medicine Law"

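I'm new to R, but I think what is happening here is that factors are enumerated sets. In the evaluation set, "Law" and "Medicine" are represented as numbers in the factor (1 and 2 respectively). When I apply the new levels, it changes the values those numbers map to.

I found a few similar questions and tried their suggestions, but with no luck: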

evaluationSet <- droplevels(evaluationSet) 
levels(evaluationSet$course) <- levels(trainingSet$course) 
evaluationSet$course <- factor(evaluationSet$course)

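How can I set the levels to be the same as in the training set while preserving the values of my data?

Edit: Adding the result of head(evaluationSet) before and after levels(evaluationSet$course) <- levels(trainingSet$course):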

timestamp score age takenBefore course 
1 1374910975 0.87 18   0  law 
2 1374910975 0.81 21   0 medicine 
3 1374910975 0.88 21   0  law 
4 1374910975 0.88 21   0  law 
5 1374910975 0.74 22   0  law 
6 1374910975 0.76 23   1 medicine 

    timestamp score age takenBefore  course 
1 1374910975 0.87 18   0 economics 
2 1374910975 0.81 21   0 engineering 
3 1374910975 0.88 21   0 economics 
4 1374910975 0.88 21   0 economics 
5 1374910975 0.74 22   0 economics 
6 1374910975 0.76 23   1 engineering

2013-07-30 pricj004

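Could you show us head(evaluationSet) before and after you try to set the levels? – Marius

@Marius Question edited. – pricj004

You're right, a factor is just an enumeration in which each number carries a string label.

If you set the levels explicitly inside factor(), you should have better luck: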

eval = read.table(text=" timestamp score age takenBefore course 
1 1374910975 0.87 18   0  law 
2 1374910975 0.81 21   0 medicine 
3 1374910975 0.88 21   0  law 
4 1374910975 0.88 21   0  law 
5 1374910975 0.74 22   0  law 
6 1374910975 0.76 23   1 medicine", header=TRUE) 
eval$course = factor(eval$course, levels=c("economics", "engineering", "medicine", "law"))

Result:

> eval$course 
[1] law  medicine law  law  law  medicine 
Levels: economics engineering medicine law
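Rather than typing the levels by hand, the same idea can presumably be applied with the training set's levels directly. The following is a minimal sketch, assuming data frames named trainingSet and evaluationSet as in the question; the toy values are made up for illustration and are not the asker's real data.

```r
# Toy stand-ins for the data in the question (illustrative values only).
trainingSet <- data.frame(
  course = factor(c("Economics", "Engineering", "Medicine", "Law"),
                  levels = c("Economics", "Engineering", "Medicine", "Law"))
)
evaluationSet <- data.frame(
  course = factor(c("Law", "Medicine", "Law"))
)

# Rebuild the factor from its labels, supplying the full set of training levels.
# Values are matched by label, so every row keeps its original course.
evaluationSet$course <- factor(as.character(evaluationSet$course),
                               levels = levels(trainingSet$course))

as.character(evaluationSet$course)  # labels are unchanged
levels(evaluationSet$course)        # now identical to the training levels
```

One thing to watch for: any value in the evaluation set that is not among the training levels becomes NA under this approach.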

2013-07-30 01:09:32 Marius

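That worked. Thanks. – pricj004

Your intuition is basically correct. The crucial point is the order of the levels: they aren't a set, they're more of a mapping.

Here is an example: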

> f <- factor(sample(letters[4:6],20,replace = TRUE)) 
> f 
[1] d e e d e e f d d f e e d d e e f e d d 
Levels: d e f 
> levels(f) 
[1] "d" "e" "f" 
> levels(f) <- letters[1:6] 
> f 
[1] a b b a b b c a a c b b a a b b c b a a 
Levels: a b c d e f

Note that when we added levels here, the "first" three levels were replaced. In contrast,

> f <- factor(sample(letters[4:6],20,replace = TRUE)) 
> f 
[1] d f f e e d d f d d f d d e e e e f d e 
Levels: d e f 
> levels(f) <- c(letters[4:6],letters[1:3]) 
> f 
[1] d f f e e d d f d d f d d e e e e f d e 
Levels: d e f a b c

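So you simply need to respect the ordering of the current levels in your evaluation set.

One way to think about it is that a factor is really just an integer vector. Whatever R codes as 1 will correspond to the first level. And since R orders the levels alphabetically by default, you can scramble that relationship when you add levels.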
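The integer-vector view can be checked directly by comparing the underlying codes before and after each way of setting the levels. This is a small sketch using made-up values that mirror the question; the variable names f, g and h are only for illustration.

```r
f <- factor(c("law", "medicine", "law"))
levels(f)        # "law" "medicine" (sorted alphabetically by default)
as.integer(f)    # 1 2 1

# Assigning to levels() relabels the codes by position; the codes stay the same.
g <- f
levels(g) <- c("economics", "engineering", "medicine", "law")
as.integer(g)    # still 1 2 1
as.character(g)  # "economics" "engineering" "economics"

# factor(..., levels = ) matches by label instead: the codes change, the labels survive.
h <- factor(as.character(f),
            levels = c("economics", "engineering", "medicine", "law"))
as.integer(h)    # 4 3 4
as.character(h)  # "law" "medicine" "law"
```

The first approach reproduces the problem in the question, while the second keeps each row's label and only changes which integer stands for it.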

2013-07-30 01:12:13 joran

'as.numeric(your_factor)' might also help you understand this, if you try it before and after the different ways of setting the levels. – Marius
