2017-04-24

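I have 80,000 XML files that are supposed to share the same format. That is clearly not the case, so I am trying to identify every node and child node that occurs in the files. In other words, I want to determine all possible parents and children in a list of lists.

I imported the XML files as lists using the XML package, and I describe my input and desired output below.

Input (a list of lists):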

XML1 <- list(name = "Company Number 1", 
      adress = list(street = "JP Street", number = "12"), 
      product = "chicken") 

XML2 <- list(name = "Company Number 2", 
      company_adress = list(street = "House Street", number = "93"), 
      invoice = list(quantity = "2", product = "phone")) 

XML3 <- list(company_name = "Company Number 3", 
      adress = list(street = "Lake Street", number = "1"), 
      invoice = list(quantity = "2", product = "phone", list(note = "Phones are refurbished"))) 

Output (a tree structure merged across files, with the number of occurrences at the leaves):

List of 6 
$ name   : num 2 
$ company_name : num 1 
$ adress  :List of 2 
    ..$ street: num 2 
    ..$ number: num 2 
$ company_adress:List of 2 
    ..$ street: num 1 
    ..$ number: num 1 
$ invoice  :List of 3 
    ..$ quantity: num 2 
    ..$ product : num 2 
    ..$   :List of 1 
    .. ..$ note: num 1 
$ product  : num 1 

Is there a package that does something along these lines, or do I need to write my own function for this?

Answer


I wrote a recursive function that solves the problem. It is not elegant, but it does the trick.

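The function takes a nested list (`tree`) and an initially empty vector (`position`) that records the path from the root to the current node. The counts accumulate in the global `summary_tree`.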

# Summary tree for storing results (modified globally by tree_merger)
summary_tree <- list()

# Recursively walk `tree` and count every node name in `summary_tree`.
# `position` is a character vector with the path from the root to `tree`.
tree_merger <- function(tree, position = character(0)) {
    # Stop at a leaf: in these XML files leaves are strings or NULL
    if (is.character(tree) || is.null(tree)) {
        return(invisible(NULL))
    }

    # Names of the child nodes; unnamed children get a placeholder name
    tree_names <- names(tree)
    if (is.null(tree_names)) {
        tree_names <- rep("", length(tree))
    }
    tree_names[tree_names == ""] <- "<unnamed>"

    for (i in seq_along(tree)) {
        # Full path to this child inside summary_tree
        path <- c(position, tree_names[i])

        # Each node is stored as list(count, child1 = ..., child2 = ...)
        if (is.null(summary_tree[[path]])) {
            summary_tree[[path]] <<- list(1)
        } else {
            # Update only the count so that existing children are kept
            summary_tree[[path]][[1]] <<- summary_tree[[path]][[1]] + 1
        }

        # Recurse by index so unnamed and duplicated names are handled
        tree_merger(tree[[i]], path)
    }
    invisible(NULL)
}

# Example: merge the three lists from the question, then inspect the result
for (xml in list(XML1, XML2, XML3)) tree_merger(xml)
str(summary_tree)
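
For reference, `[[` applied to a list accepts a character vector and indexes recursively, so a position vector can address a nested node directly. A minimal sketch with a made-up `counts` list (an assumption for illustration, shaped like `summary_tree`):

```r
# Made-up summary with the same shape as summary_tree: list(count, children...)
counts <- list(adress = list(1, street = list(1), number = list(1)))

# Path to a nested node, as it would be collected in `position`
path <- c("adress", "street")

# Recursive indexing: same as counts[["adress"]][["street"]][[1]]
counts[[path]][[1]]

# Increment only the count stored at that node
counts[[path]][[1]] <- counts[[path]][[1]] + 1
counts[[path]][[1]]
```

The parent's own count and the sibling `number` are left untouched by the update.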

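If anyone runs into the same problem, note that the condition for exiting the recursion will probably need to change. In my XML files every "leaf" ends in a string or NULL. In other lists, leaves may hold other types of values.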
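
One way to make the exit condition independent of the leaf type is to treat anything that is not a list as a leaf. The sketch below does that and also returns the summary instead of modifying a global variable, with counts stored directly at the leaves as in the question's desired output. The helper name `count_nodes` and the `"<unnamed>"` placeholder are my own assumptions, not part of the code above.

```r
# Sample input from the question
XML1 <- list(name = "Company Number 1",
             adress = list(street = "JP Street", number = "12"),
             product = "chicken")
XML2 <- list(name = "Company Number 2",
             company_adress = list(street = "House Street", number = "93"),
             invoice = list(quantity = "2", product = "phone"))
XML3 <- list(company_name = "Company Number 3",
             adress = list(street = "Lake Street", number = "1"),
             invoice = list(quantity = "2", product = "phone",
                            list(note = "Phones are refurbished")))

# Hypothetical helper: merge one tree into `summary` and return the result.
# Leaves are stored as counts; inner nodes are stored as named lists.
# If a name is a leaf in one file and a list in another, the later file wins.
count_nodes <- function(summary, tree) {
    child_names <- names(tree)
    if (is.null(child_names)) child_names <- rep("", length(tree))
    child_names[child_names == ""] <- "<unnamed>"

    for (i in seq_along(tree)) {
        nm <- child_names[i]
        child <- tree[[i]]
        old <- summary[[nm]]
        if (is.list(child)) {
            # Inner node: recurse into the existing subtree or start a new one
            if (!is.list(old)) old <- list()
            summary[[nm]] <- count_nodes(old, child)
        } else {
            # Leaf (any non-list value, including NULL): increment its count
            summary[[nm]] <- if (is.numeric(old)) old + 1 else 1
        }
    }
    summary
}

# Fold the helper over all files, starting from an empty summary
result <- Reduce(count_nodes, list(XML1, XML2, XML3), list())
str(result)
```

Note that `is.list()` is also TRUE for data frames, so this test only suits plain nested lists like the ones shown here.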