2016-04-24 228 views
0

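Create new data frame rows based on a column in another data frame

I have two data frames. In one, the first column is a list of items (df A). In the other, the first column contains items from that list, but in some cases there are multiple items per row (df B). What I want to do is go through and create a new row for each item from df A that occurs in the first column of df B.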

df A

dfA
  Index X
1 1     alpha
2 2     beta
3 3     gamma
4 4     delta

df B

dfB
  list   X
1 1 4    alpha
2 3 2 1  beta
3 4 1 2  gamma
4 3      delta

Desired output

dfC
  Index X
1 1     alpha
2 4     alpha
3 3     beta
4 2     beta
5 1     beta
6 4     gamma
7 1     gamma
8 2     gamma
9 3     delta

The actual data I am working with: df A

dput(head(allwines)) 
structure(list(Wine = c("Albariño", "Aligoté", "Amarone", "Arneis", 
"Asti Spumante", "Auslese"), Description = c("Spanish white wine grape that makes crisp, refreshing, and light-bodied wines.", 
"White wine grape grown in Burgundy making medium-bodied, crisp, dry wines with spicy character.", 
"From Italy’s Veneto Region a strong, dry, long- lived red, made from a blend of partially dried red grapes.", 
"A light-bodied dry wine the Piedmont Region of Italy", "From the Piedmont Region of Italy, A semidry sparkling wine produced from the Moscato di Canelli grape in the village of Asti", 
"German white wine from grapes that are very ripe and thus high in sugar" 
)), .Names = c("Wine", "Description"), row.names = c(NA, 6L), class = "data.frame") 

df B

> dput(head(cheesePairing)) 
structure(list(Wine = c("Cabernet Sauvignon\r\n        \r\n       \r\n      \r\n       \r\n        \r\n         Pinot Noir\r\n        \r\n       \r\n      \r\n       \r\n        \r\n         Sauvignon Blanc\r\n        \r\n       \r\n      \r\n       \r\n        \r\n         Zinfandel", 
"Chianti\r\n        \r\n       \r\n      \r\n       \r\n        \r\n         Pinot Noir\r\n        \r\n       \r\n      \r\n       \r\n        \r\n         Sangiovese", 
"Chardonnay", "Bardolino\r\n        \r\n       \r\n      \r\n       \r\n        \r\n         Malbec\r\n        \r\n       \r\n      \r\n       \r\n        \r\n         Riesling\r\n        \r\n       \r\n      \r\n       \r\n        \r\n         Rioja\r\n        \r\n       \r\n      \r\n       \r\n        \r\n         Sauvignon Blanc", 
"Tempranillo", "Asti Spumante"), Cheese = c("Abbaye De Belloc Cheese", 
"Ardrahan cheese", "Asadero cheese", "Asiago cheese", "Azeitao", 
"Baby Swiss Cheese"), Suggestions = c("Pair with apples, sliced pears OR a sampling of olives and thin sliced salami. Pass around slices of baguette.", 
"Serve with a substantial wheat cracker and apples or grapes.", 
"Rajas (blistered chile strips) fresh corn tortillas", "Table water crackers, raw nuts (almond, walnuts)", 
"Nutty brown bread, grapes", "Server with dried fruits, whole grain, nutty breads, nuts" 
)), .Names = c("Wine", "Cheese", "Suggestions"), row.names = c(NA, 
6L), class = "data.frame") 
+0

It would be helpful if you could edit your question to include your example data in an R-parseable format, e.g. `dput(dfA)` and `dput(dfB)`. –

+0

@CurtF. I had added my sample data, but I was worried it might be too messy, so I removed it and boiled it down into the example. –

+1

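I'm not sure what the purpose of `dfA` is. The wines in `dfB` have some extra whitespace between them, so you can replace it with commas: `cheesePairing$Wine <- gsub('\\s{2,}', ',', cheesePairing$Wine)`. Then use [this question](http://stackoverflow.com/questions/28285169/split-comma-separated-column-entry-into-rows) or one of the other similar answers. – rawr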
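As a rough sketch of what the comment above suggests: the two-row `pairing` data frame below is a made-up stand-in for `cheesePairing`, and the base R split shown is only one of several ways to do it. It assumes no wine name contains a comma or two consecutive whitespace characters.

```r
# Hypothetical stand-in for cheesePairing, with the same kind of whitespace runs
pairing <- data.frame(
  Wine = c("Chianti\r\n        \r\n         Pinot Noir\r\n        \r\n         Sangiovese",
           "Chardonnay"),
  Cheese = c("Ardrahan cheese", "Asadero cheese"),
  stringsAsFactors = FALSE
)

# Collapse every run of two or more whitespace characters into a single comma
pairing$Wine <- gsub("\\s{2,}", ",", pairing$Wine)

# Split on the commas and repeat each cheese once per wine it is paired with
wines <- strsplit(pairing$Wine, ",", fixed = TRUE)
long <- data.frame(
  Wine = unlist(wines),
  Cheese = rep(pairing$Cheese, lengths(wines)),
  stringsAsFactors = FALSE
)
long
```

Single spaces inside a name such as "Pinot Noir" are left alone because the pattern only matches runs of at least two whitespace characters.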

Answers

2

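Building on Curt's answer, I think I found a more efficient solution, assuming I am interpreting your goal correctly.

My commented code is below. You should be able to run it as is and get the desired dfC. One thing to note: I assume (based on your actual data) that the delimiter separating the entries in dfB$Index is "\r\n".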

# set up fake data 
dfA<-data.frame(Index=c('1','2','3','4'), X=c('alpha','beta','gamma','delta')) 
dfB<-data.frame(Index=c('1 \r\n 4','3 \r\n 2 \r\n 1','4 \r\n 1 \r\n 2','3'), X=c('alpha','beta','gamma','delta')) 

dfA$Index<-as.character(dfA$Index) 
dfA$X<-as.character(dfA$X) 
dfB$Index<-as.character(dfB$Index) 
dfB$X<-as.character(dfB$X) 


dfB_index_parsed<-strsplit(x=dfB$Index,"\r\n") # split Index of dfB by delimiter "\r\n" and store in a list 
names(dfB_index_parsed)<-dfB$X 
dfB_split_num<-lapply(dfB_index_parsed, length) # find the number of splits per row of dfB and store in a list 
dfB_split_num_vec<-do.call('c', dfB_split_num) # convert number of splits above from list to vector 

g<-do.call('c',dfB_index_parsed) # store all split values in a single vector 
g<-gsub(' ','',g) # remove trailing/leading spaces that occur after split 
names(g)<-rep(names(dfB_split_num_vec), dfB_split_num_vec) # associate each split Index from dfB with X from dfB 
g<-g[g %in% dfA$Index] # check which dfB$Index are in dfA$Index 

dfC<-data.frame(Index=g, X=names(g)) # construct data.frame 
+0

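When I run this I end up with an empty data frame, but the first few steps seem to be heading in the right direction. For some reason the scraping produced a lot of extra \r\n, so by the third step the split counts are completely off. I'll try removing all the whitespace and see if that helps. –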

+0

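Oh, strange. I just copied my code into another R session and it ran fine. Anyway, glad to hear it helped. I think strsplit() and gsub() are essential to any strategy for solving this problem. regexpr() may also help. Also check whether the scraping package you are using has built-in functions for cleaning up these scraped results. – AOGSTA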
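One possible explanation for the split counts being off, offered as an assumption rather than a diagnosis: in the scraped data the separator is a run of several "\r\n" with spaces in between, so splitting on a single "\r\n" leaves whitespace-only pieces behind. A minimal sketch that trims each piece and drops the empty ones (the `raw` vector is made-up example data):

```r
# Made-up strings imitating the scraped column: several "\r\n" per separator
raw <- c("1\r\n        \r\n       \r\n         4", "3")

# Split on the literal delimiter, then trim each piece and drop the blanks
pieces <- strsplit(raw, "\r\n", fixed = TRUE)
pieces <- lapply(pieces, function(p) {
  p <- trimws(p)   # strip leading and trailing whitespace
  p[nzchar(p)]     # keep only the non-empty pieces
})
pieces
```

After this cleanup, `lengths(pieces)` gives the number of real items per row, which is the count the later `rep()` step depends on.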

+0

Is there any evidence that this is actually more efficient? Either way, nice answer and +1 from me. –

0

First, let me offer a function-based answer to your question. I doubt my answer is very efficient, but it works.

# construct toy data 
dfA <- data.frame(index = 1:4, X = letters[1:4]) 

dfB <- data.frame(X = letters[1:4]) 
dfB$list_elements <- list(c(1, 4), c(3, 2, 1), c(4, 1, 2), c(3)) 

# define function that provides solution 

unlist_merge_df <- function(listed_df, reference_df) {
  # listed_df is assumed to have columns "X" and "list_elements"
  # reference_df (columns "X" and "index") is accepted but not used below
  df_out <- data.frame(index = c(), X = c())
  my_list <- listed_df$list_elements
  for (idx in seq_along(my_list)) {
    # one output row per element of this list entry, all tagged with the row's X
    df_out <- rbind(df_out,
                    data.frame(index = my_list[[idx]],
                               X = listed_df[idx, "X"]))
  }
  return(df_out)
}

# call the function 
dfC <- unlist_merge_df(dfB, dfA) 

# show output in human and R-parseable formats 
dfC 

dput(dfC) 

The output is:

  index X
1     1 a
2     4 a
3     3 b
4     2 b
5     1 b
6     4 c
7     1 c
8     2 c
9     3 d

structure(list(index = c(1, 4, 3, 2, 1, 4, 1, 2, 3), X = structure(c(1L, 
1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L), .Label = c("a", "b", "c", "d" 
), class = "factor")), .Names = c("index", "X"), row.names = c(NA, 
9L), class = "data.frame") 

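Second, let me say that the situation you are in is not very desirable. If you can avoid it, you probably should. Either skip data frames entirely and work only with lists, or avoid building the data frame with a list column in the first place (if you can) and construct the desired output directly.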
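If the list column already exists, it can at least be flattened without a loop. The following is a minimal sketch on the same toy data as above, using `unlist()`, `rep()` and `lengths()` from base R. It is an illustration, not a benchmarked replacement for the function above.

```r
# Same toy data as in the answer above
dfB <- data.frame(X = letters[1:4], stringsAsFactors = FALSE)
dfB$list_elements <- list(c(1, 4), c(3, 2, 1), c(4, 1, 2), c(3))

# Flatten the list column and repeat each X once per list element
dfC <- data.frame(
  index = unlist(dfB$list_elements),
  X = rep(dfB$X, lengths(dfB$list_elements)),
  stringsAsFactors = FALSE
)
dfC
```

Unlike the loop, this builds the result in one step instead of growing a data frame with repeated `rbind()` calls, and `X` stays a character column rather than a factor.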

+1

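Thanks, I know this is not an ideal situation. I got the data through web scraping and am trying to make it usable for a database, but it looks like I may have to do the proper shaping in the database query and be more explicit. –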
