2016-11-26

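Text mining in R: search and extract information

Is there a way to search for patterns in rows of data and then store the matches in separate columns of a new table? For example, I need to extract the amount, bills, and coins from the body column below. Is this achievable in R?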

user_id |  ts |     body     | address |  
3633|  2016-09-29| A wallet with amount = $ 100 has been found with 4 bills and 5 coins| TEST |  
4266|  2016-07-20| A purse having amount = $ 150 has been found with 40 bills and 15 coins| NAME | 
7566|  2016-07-20| A pocket having amount = $ 200 has been found with 4 bills and 5 coins| HELLO | 

Desired result:

user_id | Amount | Bills| Coins| 
3633  | $100 | 4 |  5| 
4266  | $150 | 40 | 15| 
7566  | $200 | 4 |  5| 

Yes, this is possible. You will want to use regular expressions; see '?regex'. There is something to this effect at http://stackoverflow.com/questions/14159690/regex-grep-strings-containing-us-currency.
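As a minimal sketch of the regular-expression idea in the comment above, the following uses only base R to pull every run of digits out of a single message. The variable names `body`, `digits`, and `values` are illustrative and do not come from the thread.

```r
# Base R only: extract every run of digits from one message.
body <- "A wallet with amount = $ 100 has been found with 4 bills and 5 coins"

# gregexpr() locates all matches of the pattern;
# regmatches() returns the matched substrings as a list (one element per input string).
digits <- regmatches(body, gregexpr("[0-9]+", body))[[1]]

# Convert the matched text to numbers: amount, bills, coins, in order of appearance.
values <- as.numeric(digits)
print(values)
```

This relies on the three numbers always appearing in the same order in the text; a message with an extra number would shift the fields.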

Answer

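Here is one solution using stringr and lapply, though there are surely others. First, subset the data to only the user.id and body columns, which will look something like this: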

df <- data.frame(user.id = c(3633, 4266, 7566), 
     body = c("A wallet with amount = $ 100 has been found with 4 bills and 5 coins", 
       "A purse having amount = $ 150 has been found with 40 bills and 15 coins", 
       "A pocket having amount = $ 200 has been found with 4 bills and 5 coins"), 
     stringsAsFactors = FALSE)  # keep body as character (the default before R 4.0 was factor) 

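Now we apply the regular expression to every row of body to pull out the numbers as a list, unlist it, reshape it into a three-column matrix filled row by row, assign column names, and cbind the user.id column from the original data frame.

library(stringr) 
# Extract every run of digits from body; each row yields three matches 
# (amount, bills, coins), so fill a 3-column matrix one row at a time. 
nums <- unlist(str_match_all(df$body, "[0-9]+")) 
mat <- matrix(as.numeric(nums), ncol = 3, byrow = TRUE) 
colnames(mat) <- c("Amount", "Bills", "Coins") 
outputdf <- cbind(df[1], mat) 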

This gives:

> outputdf 
# user.id Amount Bills Coins 
#1 3633 100  4  5 
#2 4266 150 40 15 
#3 7566 200  4  5 

I'm sure there are more appropriate ways of doing this, too.
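One possible alternative, sketched here in base R under the assumption that every message follows the "amount = $ N ... N bills and N coins" wording of the sample rows, is to anchor each number to its keyword with capture groups instead of relying on position alone. The names `pattern`, `fields`, and `result` are illustrative, and the "$" prefix is added only to mirror the desired result in the question.

```r
# Keyword-anchored extraction in base R. Sample rows are copied from the question.
df <- data.frame(
  user.id = c(3633, 4266, 7566),
  body = c(
    "A wallet with amount = $ 100 has been found with 4 bills and 5 coins",
    "A purse having amount = $ 150 has been found with 40 bills and 15 coins",
    "A pocket having amount = $ 200 has been found with 4 bills and 5 coins"
  ),
  stringsAsFactors = FALSE
)

# One capture group per field; the surrounding words tie each number to its meaning.
pattern <- "amount = \\$ *([0-9]+)[^0-9]*([0-9]+) bills and ([0-9]+) coins"

# regexec() records the full match plus each capture group;
# regmatches() turns those positions into text (character(0) when nothing matches).
m <- regmatches(df$body, regexec(pattern, df$body))

# Build an n x 3 numeric matrix; rows that do not match the pattern become NA.
fields <- t(vapply(m, function(x) {
  if (length(x) == 4) as.numeric(x[2:4]) else rep(NA_real_, 3)
}, numeric(3)))

result <- data.frame(
  user.id = df$user.id,
  Amount = ifelse(is.na(fields[, 1]), NA_character_, paste0("$", fields[, 1])),
  Bills = fields[, 2],
  Coins = fields[, 3],
  stringsAsFactors = FALSE
)
print(result)
```

Compared with extracting all digits positionally, this version yields NA for a row whose text deviates from the expected wording rather than silently shifting values between columns.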