2011-11-08 43 views


Num1 text1 topic1 proportion1 topic2 proportion2 topic3 proportion3, etc. 
Num2 text2 topic1 proportion1 topic2 proportion2 topic3 proportion3, etc. 
Num3 text3 topic1 proportion1 topic2 proportion2 topic3 proportion3, etc. 


> dat[1:5,1:10] 

    V1 V2 V3 V4 V5  V6 V7  V8 V9  V10 
1 0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624 
2 1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928 
3 2 1002.txt 16 0.3341721 2 0.1747023 10 0.1360454 12 0.07507119 
4 3 1003.txt 12 0.5366148 8 0.2255179 18 0.1388561 0 0.01867091 
5 4 1005.txt 16 0.2363206 0 0.2214441 24 0.1914769 7 0.17760521 


 topic1  topic2  topic3 
text1 proportion1 proportion2 proportion3 
text2 proportion1 proportion2 proportion3 


  0   2   7   8   10  12  13  16  18  20  21   23  24   27 
10.txt  0   0   0   0   0   0   0   0   0  0.1315621 0.03632624 0.3040853 0   0.4560785   
1001.txt 0   0   0   0.1699586 0   0.2099153 0.1692292 0   0  0.2660085 0   0   0   0 
1002.txt 0   0.1747023 0   0   0.1360454 0.0750711 0   0.3341721 0  0   0   0   0   0 
1003.txt 0.0186709 0   0   0.2255179 0   0.5366148 0   0   0.138856 0   0   0   0   0 
1005.txt 0.2214441 0   0.1776052 0   0   0   0   0.2363206 0  0   0   0   0.1914769 0 


dat<-read.table("topics.txt", header=F, sep="\t") 
datnames<-subset(dat, select=2) 
dat2<-subset(dat, select=3:length(dat)) 
y <- data.frame(topic=character(0),proportion=character(0),text=character(0)) 
for(i in seq(1, length(dat2), 2)){ 
z<-i+1 
x<-dat2[,i:z] 
x<-cbind(x, datnames) 
colnames(x)<-c("topic","proportion", "text") 
y<-rbind(y, x) 
} 

# Right at this step at the end of the block 
# I get this message that may indicate the problem: 
# Error in c("topic", "proportion", "text") : unused argument(s) ("text") 

y[is.na(y)] <- 0 
xdat<-xtabs(proportion ~ text+topic, data=y) 
write.table(xdat, file="topicMatrix.txt", sep="\t", eol = "\n", quote=TRUE, col.names=TRUE, row.names=TRUE) 

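I would be very grateful for any suggestions on how to get this code to work. My question may be related to this one, and possibly to this one, but I don't yet have the skills to put the answers to those questions to immediate use.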


You won't get much help unless you provide the real data structure ... one with actual numbers for those proportions. Use dput(head(dat, 20)).


Thanks for the tip, I've added some. – Ben


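I should also add that running `rm(list = ls(all = TRUE))` changed the problem slightly: the error message at the end of the block is now "Error in `[.data.frame`(dat2, , i:z) : undefined columns selected". In any case, I think @Ramnath's answer is a promising option. – Ben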
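One plausible cause of the "undefined columns selected" error, offered as a guess rather than a diagnosis: if every line of the file ends in a trailing tab, `read.table(..., sep = "\t")` adds an all-NA column, `dat2` ends up with an odd number of columns, and the last `i:z` pair points past the end. The sketch below uses a made-up two-row data frame (the values and the `V7` column are assumptions for illustration) and drops all-NA columns before pairing them up.

```r
# Hypothetical stand-in for the real file: two topic/proportion pairs
# plus a trailing all-NA column (assumed to come from a trailing tab).
dat <- data.frame(V1 = 0:1, V2 = c("10.txt", "1001.txt"),
                  V3 = c(27, 20), V4 = c(0.46, 0.27),
                  V5 = c(23, 12), V6 = c(0.30, 0.21),
                  V7 = NA)

dat2 <- dat[, 3:ncol(dat)]
ncol(dat2)   # 5: odd, so the last pair (columns 5:6) does not exist

# Reproduce the failure on the last pair without stopping the script
res <- try(dat2[, 5:6], silent = TRUE)
failed <- inherits(res, "try-error")

# Drop columns that are entirely NA before pairing them up
keep <- colSums(!is.na(dat2)) > 0
dat2 <- dat2[, keep, drop = FALSE]

# Stack each (topic, proportion) pair next to the file names
y <- do.call(rbind, lapply(seq(1, ncol(dat2), by = 2), function(i) {
  x <- cbind(dat2[, i:(i + 1)], dat$V2)
  colnames(x) <- c("topic", "proportion", "text")
  x
}))
```

If the real file has no such trailing column, this is not the cause and the check on `ncol(dat2)` will simply report an even number.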




dat <- read.table(as.is = TRUE, header = FALSE, textConnection(
    "Num1 text1 topic1 proportion1 topic2 proportion2 topic3 proportion3 
    Num2 text2 topic1 proportion1 topic2 proportion2 topic3 proportion3 
    Num3 text3 topic1 proportion1 topic2 proportion2 topic3 proportion3")) 

# number of topic/proportion pairs per row 
NTOPICS <- 3 
nam <- c('num', 'text', 
    paste(c('topic', 'proportion'), rep(1:NTOPICS, each = 2), sep = "")) 

# wide to long: sep = "" splits names such as topic1 into topic + 1 
dat_l <- reshape(setNames(dat, nam), varying = 3:length(nam), direction = 'long', 
    sep = "") 
# long to wide; current reshape2 names this argument value.var 
reshape2::dcast(dat_l, num + text ~ topic, value.var = 'proportion') 

num text  topic1  topic2  topic3 
1 Num1 text1 proportion1 proportion2 proportion3 
2 Num2 text2 proportion1 proportion2 proportion3 
3 Num3 text3 proportion1 proportion2 proportion3 



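Thanks for your suggestion, I can reproduce your example and get it to work on my full data set. If we change `dat_l <- reshape(setNames(dat, nam), varying = 3:8, direction = 'long', sep = "")` to `dat_l <- reshape(setNames(dat, nam), varying = 3:((NTOPICS*2)+2), direction = 'long', sep = "")`, that seems to make it more general and efficient when dealing with different numbers of topics. – Ben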


You're right. I've edited my solution to reflect that. – Ramnath


Even better, thanks very much! – Ben
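As a sketch of how the same idea carries over to the numeric sample from the question, `NTOPICS` can be derived from the column count instead of being typed in. This assumes the first two columns identify the document and every remaining column belongs to a topic/proportion pair. To keep the example free of package dependencies, base R `reshape(direction = "wide")` stands in for `reshape2::dcast`; that swap is mine, not part of the answer above.

```r
dat <- read.table(header = FALSE, as.is = TRUE, text = "
0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928
2 1002.txt 16 0.3341721 2 0.1747023 10 0.1360454 12 0.07507119")

# Assumption: columns 1-2 identify the document, the rest are pairs
NTOPICS <- (ncol(dat) - 2) / 2
nam <- c("num", "text",
         paste0(c("topic", "proportion"), rep(seq_len(NTOPICS), each = 2)))

# Wide to long: one row per (document, topic slot)
dat_l <- reshape(setNames(dat, nam), varying = 3:length(nam),
                 direction = "long", sep = "")

# Long to wide: one row per text, one column per topic id
wide <- reshape(dat_l[, c("text", "topic", "proportion")],
                idvar = "text", timevar = "topic", direction = "wide")

# Topics a document does not list come back as NA; treat them as 0
wide[is.na(wide)] <- 0
```

The resulting columns are named `proportion.<topic id>`, in order of first appearance of each topic id in the long data.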


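You can get it into long format, but going any further requires real data. Edit, after the data was provided: I'm still not sure about the overall structure of the MALLET output, but at least the R functions have been demonstrated. If there are overlapping topics, this approach has the "feature" of summing the proportions. Depending on the data layout, that may be an advantage or a disadvantage.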

dat <- read.table(textConnection(" V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 
1 0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624 
2 1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928 
3 2 1002.txt 16 0.3341721 2 0.1747023 10 0.1360454 12 0.07507119 
4 3 1003.txt 12 0.5366148 8 0.2255179 18 0.1388561 0 0.01867091 
5 4 1005.txt 16 0.2363206 0 0.2214441 24 0.1914769 7 0.17760521"), header = TRUE) 
ldat <- reshape(dat, idvar=1:2, varying=list(topics=c("V3", "V5", "V7", "V9"), 
              props=c("V4", "V6", "V8", "V10")), 
       direction="long") 
> ldat 
      V1  V2 time V3   V4 
0.10.txt.1 0 10.txt 1 27 0.45607850 
1.1001.txt.1 1 1001.txt 1 20 0.26600850 
2.1002.txt.1 2 1002.txt 1 16 0.33417210 
3.1003.txt.1 3 1003.txt 1 12 0.53661480 
4.1005.txt.1 4 1005.txt 1 16 0.23632060 
0.10.txt.2 0 10.txt 2 23 0.30408530 
1.1001.txt.2 1 1001.txt 2 12 0.20991530 
2.1002.txt.2 2 1002.txt 2 2 0.17470230 
3.1003.txt.2 3 1003.txt 2 8 0.22551790 
4.1005.txt.2 4 1005.txt 2 0 0.22144410 
0.10.txt.3 0 10.txt 3 20 0.13156210 
1.1001.txt.3 1 1001.txt 3 8 0.16995860 
2.1002.txt.3 2 1002.txt 3 10 0.13604540 
3.1003.txt.3 3 1003.txt 3 18 0.13885610 
4.1005.txt.3 4 1005.txt 3 24 0.19147690 
0.10.txt.4 0 10.txt 4 21 0.03632624 
1.1001.txt.4 1 1001.txt 4 13 0.16922928 
2.1002.txt.4 2 1002.txt 4 12 0.07507119 
3.1003.txt.4 3 1003.txt 4 0 0.01867091 
4.1005.txt.4 4 1005.txt 4 7 0.17760521 


> xtabs(V4 ~ V3 + V2, data=ldat) 
V3  10.txt 1001.txt 1002.txt 1003.txt 1005.txt 
    0 0.00000000 0.00000000 0.00000000 0.01867091 0.22144410 
    2 0.00000000 0.00000000 0.17470230 0.00000000 0.00000000 
    7 0.00000000 0.00000000 0.00000000 0.00000000 0.17760521 
    8 0.00000000 0.16995860 0.00000000 0.22551790 0.00000000 
    10 0.00000000 0.00000000 0.13604540 0.00000000 0.00000000 
    12 0.00000000 0.20991530 0.07507119 0.53661480 0.00000000 
    13 0.00000000 0.16922928 0.00000000 0.00000000 0.00000000 
    16 0.00000000 0.00000000 0.33417210 0.00000000 0.23632060 
    18 0.00000000 0.00000000 0.00000000 0.13885610 0.00000000 
    20 0.13156210 0.26600850 0.00000000 0.00000000 0.00000000 
    21 0.03632624 0.00000000 0.00000000 0.00000000 0.00000000 
    23 0.30408530 0.00000000 0.00000000 0.00000000 0.00000000 
    24 0.00000000 0.00000000 0.00000000 0.00000000 0.19147690 
    27 0.45607850 0.00000000 0.00000000 0.00000000 0.00000000 
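The summing "feature" mentioned in this answer is easy to see on a small made-up long table. In this sketch (hypothetical data, not MALLET output) the document `a.txt` lists topic 3 twice, and `xtabs` adds the two proportions into one cell; a companion `table()` call counts how many rows feed each cell so overlaps can be spotted.

```r
# Hypothetical long-format data: topic 3 appears twice for a.txt
ldat <- data.frame(
  doc   = c("a.txt", "a.txt", "a.txt", "b.txt"),
  topic = c(3, 3, 5, 5),
  prop  = c(0.2, 0.1, 0.4, 0.6)
)

# Cross-tabulate: cells with several rows are summed, empty cells become 0
tab <- xtabs(prop ~ doc + topic, data = ldat)
tab

# Count the rows behind each cell; any count above 1 marks an overlap
n_rows <- table(ldat$doc, ldat$topic)
has_overlap <- any(n_rows > 1)
```

Note that a 0 in the `xtabs` result does not distinguish "topic absent" from "proportion exactly 0", which may matter depending on how the matrix is used downstream.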

Thanks for your quick suggestion. I can reproduce your results. How can it be generalized to 30 (or 100 or more) topics? – Ben


If the column names are very regular, then the `varying` arguments could be `topics = paste("V", seq(1, 100, by = 2), sep = "")` and `props = paste("V", seq(2, 100, by = 2), sep = "")`.
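For the sample layout in this thread, where topic ids sit in V3, V5, ... and proportions in V4, V6, ..., the two vectors can also be taken from the column positions rather than hard-coded. The following is a minimal sketch under two assumptions: the first two columns identify the document, and there is no extra trailing column. The names `topic_cols`, `prop_cols` and `wide` are illustrative only.

```r
dat <- read.table(header = TRUE, text = " V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624
2 1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928")

# Assumption: columns 3..ncol alternate topic id, proportion
topic_cols <- names(dat)[seq(3, ncol(dat), by = 2)]
prop_cols  <- names(dat)[seq(4, ncol(dat), by = 2)]

# Wide to long, with readable names for the two stacked variables
ldat <- reshape(dat, idvar = c("V1", "V2"),
                varying = list(topics = topic_cols, props = prop_cols),
                v.names = c("topic", "proportion"),
                direction = "long")

# Documents as rows, topics as columns
wide <- xtabs(proportion ~ V2 + topic, data = ldat)
```

Putting `V2` first in the formula gives one row per document, which is the orientation asked for in the question.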


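Thanks for your quick help. Unfortunately, I can't see why your suggestion doesn't work for me, but @Ramnath's code does the job, so I'm happy to close the case. Thanks again. – Ben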



dat <- read.table(text = "V1 V2 V3 V4 V5  V6 V7  V8 V9  V10 
1 0 10.txt 27 0.4560785 23 0.3040853 20 0.1315621 21 0.03632624 
2 1 1001.txt 20 0.2660085 12 0.2099153 8 0.1699586 13 0.16922928 
3 2 1002.txt 16 0.3341721 2 0.1747023 10 0.1360454 12 0.07507119 
4 3 1003.txt 12 0.5366148 8 0.2255179 18 0.1388561 0 0.01867091 
5 4 1005.txt 16 0.2363206 0 0.2214441 24 0.1914769 7 0.17760521") 

dat$V11 <- rep(NA, 5) # my real data has this extra unwanted col 
library(data.table) 
dat <- data.table(dat) 

# get document number 
docnum <- dat$V1 
# get text number 
txt <- dat$V2 

# remove doc num and text num so we just have topic and props 
dat1 <- dat[ ,c("V1","V2", paste0("V", ncol(dat))) := NULL] 
# get topic numbers 
n <- ncol(dat1) 
tops <- apply(dat1, 1, function(i) i[seq(1, n, 2)]) 
# get props 
props <- apply(dat1, 1, function(i) i[seq(2, n, 2)]) 

# put topics and props together 
tp <- lapply(1:ncol(tops), function(i) data.frame(tops[,i], props[,i])) 
names(tp) <- txt 
# make into long table 
dt <- data.table::rbindlist(tp) 
dt$doc <- unlist(lapply(txt, function(i) rep(i, ncol(dat1)/2))) 
dt$docnum <- unlist(lapply(docnum, function(i) rep(i, ncol(dat1)/2))) 

# reshape to wide 
setkey(dt, tops...i., doc) 
out <- dt[CJ(unique(tops...i.), unique(doc))][, as.list(props...i.), by=tops...i.] 
setnames(out, c("topic", as.character(txt))) 

# transpose to have table of docs (rows) and columns (topics) 
tout <- data.table(t(out)) 
setnames(tout, unname(as.character(tout[1,]))) 
tout <- tout[-1,] 
row.names(tout) <- txt 

# replace NA with zero 
tout[is.na(tout)] <- 0 



      0   2   7   8  10   12  13  16  18 
1: 0.00000000 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.0000000 
2: 0.00000000 0.0000000 0.0000000 0.1699586 0.0000000 0.20991530 0.1692293 0.0000000 0.0000000 
3: 0.00000000 0.1747023 0.0000000 0.0000000 0.1360454 0.07507119 0.0000000 0.3341721 0.0000000 
4: 0.01867091 0.0000000 0.0000000 0.2255179 0.0000000 0.53661480 0.0000000 0.0000000 0.1388561 
5: 0.22144410 0.0000000 0.1776052 0.0000000 0.0000000 0.00000000 0.0000000 0.2363206 0.0000000 
      20   21  23  24  27 
1: 0.1315621 0.03632624 0.3040853 0.0000000 0.4560785 
2: 0.2660085 0.00000000 0.0000000 0.0000000 0.0000000 
3: 0.0000000 0.00000000 0.0000000 0.0000000 0.0000000 
4: 0.0000000 0.00000000 0.0000000 0.0000000 0.0000000 
5: 0.0000000 0.00000000 0.0000000 0.1914769 0.0000000 