2017-08-31

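R - loop through data.frames in a list - modify characters of a column (list element)

I have a few thousand *.csv files (all with unique names), but the header columns in each file are the same, for example "Timestamp", "System_Name", "CPU_ID", and so on.
My question is: how can I replace "System_Name" (a system name such as "as12535.org.at", or any other combination of characters) and anonymize it? I would appreciate any hints or pointers in the right direction.
The structure of the CSV files is shown below.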

"Timestamp","System_Name","CPU_ID","User_CPU","User_Nice_CPU","System_CPU","Idle_CPU","Busy_CPU","Wait_IO_CPU","User_Sys_Pct" 
"1161025010002000","as06240.org.xyz:LZ","-1","1.83","0.00","0.56","97.28","2.72","0.33","3.26" 
"1161025010002000","as06240.org.xyz:LZ","-1","1.83","0.00","0.56","97.28","2.72","0.33","3.26" 
"1161025010002000","as06240.org.xyz:LZ","-1","1.83","0.00","0.56","97.28","2.72","0.33","3.26" 

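I tried the R package anonymizer, which works fine at the vector level, but I ran into trouble because I am reading thousands of CSV files into R. What I tried is the following: create a list that holds every CSV file as a data frame.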

# initialize a list 
r.path <- setwd("mypath") 
ldf <- list() 

# creates the list of all the csv files in my directory - but filter for 
# files with Unix in the filename for testing. 
listcsv <- dir(pattern = ".UnixM.") 

# seq_along() is safe even when no file matches the pattern 
for (i in seq_along(listcsv)){ 
ldf[[i]] <- read.csv(file = listcsv[i]) 
} 

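I am racking my brain because I cannot anonymize the System_Name column, or even replace certain characters in it (pseudo-anonymization), by looping over the list (ldf) and the data frames it contains.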

My list ldf (one data frame per CSV file) looks like this:

summary(ldf) 
Length Class  Mode 
[1,] 5  data.frame list 
[2,] 5  data.frame list 
[3,] 5  data.frame list 

This shows the structure of my list, which holds each file's contents as a data frame.

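How can I now read all the CSV files and change or anonymize all, or even just part, of the "System_Name" column, and do this for every CSV in my directory in a loop in R? It does not need to be super elegant - I am happy as long as it works :-)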


Use `lapply` to apply the function you want to the list. I don't know how anonymizer works; assuming the function is something like `anonymizer(column)`: `lapply(list, function(x) anonymizer(x$System_Name))`

Answer


A common pattern for this kind of job is:

df <- do.call(
    rbind, 
    lapply(dir(pattern = "UnixM"), 
     read.csv, stringsAsFactors = FALSE) 
) 
df$System_Name <- anonymizer::anonymize(df$System_Name) 

This differs from what you tried in that it first binds all the data frames together and then anonymizes the combined result.
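
One consequence of binding first is that the combined data frame no longer records which file each row came from. Below is a minimal sketch of one way to keep that information. It is built on assumptions: the `source_file` column name is hypothetical, the example files are made up and written to a temporary directory, and `pseudonymize` is a hypothetical base-R stand-in for `anonymizer::anonymize` so that the snippet runs without the package installed.

```r
# Create two small example files in a temporary directory (made-up data).
tmp <- file.path(tempdir(), "unixm_bind_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(Timestamp = c("1", "2"),
                     System_Name = c("as06240.org.xyz:LZ", "as06241.org.xyz:LZ")),
          file.path(tmp, "a_UnixM_1.csv"), row.names = FALSE)
write.csv(data.frame(Timestamp = "3",
                     System_Name = "as06240.org.xyz:LZ"),
          file.path(tmp, "b_UnixM_2.csv"), row.names = FALSE)

# Hypothetical stand-in for anonymizer::anonymize(): each distinct value
# gets a numbered label in order of first appearance.
pseudonymize <- function(x) sprintf("host_%03d", match(x, unique(x)))

files <- dir(tmp, pattern = "UnixM", full.names = TRUE)

df <- do.call(
  rbind,
  lapply(files, function(f) {
    d <- read.csv(f, stringsAsFactors = FALSE)
    d$source_file <- basename(f)  # remember the originating file
    d
  })
)
df$System_Name <- pseudonymize(df$System_Name)
```

Because the labels are assigned on the combined data, the same system receives the same label no matter which file it appeared in, and `split(df, df$source_file)` would recover one data frame per file if needed.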

Of course, you can also keep everything in a list, as @S Rivero suggested. That would look like:

listdf <- lapply(
    dir(pattern = "UnixM"), 
    function(filename) { 
        df <- read.csv(filename, stringsAsFactors = FALSE) 
        df$System_Name <- anonymizer::anonymize(df$System_Name) 
        df 
    } 
)
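
To get anonymized files back on disk, the list can be named by file and written out again. The following is a sketch under stated assumptions rather than a definitive solution: the `anonymized` output directory is a hypothetical name, the input file is made up, and `pseudonymize` is a hypothetical base-R stand-in for `anonymizer::anonymize`. It replaces only the host part before the first dot, which is one possible reading of the "part of the column" variant in the question.

```r
# Made-up example input in a temporary directory.
in_dir  <- file.path(tempdir(), "unixm_list_demo")
out_dir <- file.path(in_dir, "anonymized")  # hypothetical output location
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(System_Name = c("as06240.org.xyz:LZ", "as12535.org.at:LZ"),
                     CPU_ID = c(-1, -1)),
          file.path(in_dir, "x_UnixM_1.csv"), row.names = FALSE)

# Hypothetical stand-in for anonymizer::anonymize(): replace only the host
# part (everything before the first dot) with a numbered label.
pseudonymize <- function(x) {
  host <- sub("\\..*$", "", x)
  label <- sprintf("host%03d", match(host, unique(host)))
  paste0(label, substring(x, nchar(host) + 1))
}

files <- dir(in_dir, pattern = "UnixM.*\\.csv$")
listdf <- lapply(file.path(in_dir, files), function(f) {
  df <- read.csv(f, stringsAsFactors = FALSE)
  df$System_Name <- pseudonymize(df$System_Name)
  df
})
names(listdf) <- files

# Write each data frame to the output directory under its original file name.
for (f in names(listdf)) {
  write.csv(listdf[[f]], file.path(out_dir, f), row.names = FALSE)
}
```

Note that this stand-in numbers hosts separately within each file, so its labels are not comparable across files. A hash-based approach with a fixed salt, or building the lookup once over all files, would avoid that limitation.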