Importing and analysing non-rectangular .csv files in R

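I'm moving to R from Mathematica, where I don't need to anticipate data structures during import; in particular, I don't need to anticipate whether my data is rectangular before importing it.

I have many .csv files formatted as follows: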

tasty,chicken,cinnamon 
not_tasty,butter,pepper,onion,cardamom,cayenne 
tasty,olive_oil,pepper 
okay,olive_oil,onion,potato,black_pepper 
not_tasty,tomato,fenugreek,pepper,onion,potato 
tasty,butter,cheese,wheat,ham

Rows have different lengths and will only contain strings.

In R, how should I approach this problem?

What have you tried?

I've tried read.table:

dataImport <- read.table("data.csv", header = FALSE) 
class(dataImport) 
##[1] "data.frame" 
dim(dataImport) 
##[1] 6 1 
dataImport[1] 
##[1] tasty,chicken,cinnamon 
##6 Levels: ...

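From the documentation, I interpret this as a single column with each list of ingredients as a distinct row. I can extract the first three rows as follows; each row is of class factor but appears to contain more data than I expect: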

dataImport[c(1,2,3),1] 
## my rows 
rowOne <- dataImport[c(1),1]; 
class(rowOne) 
## "factor" 
rowOne 
## [1] tasty,chicken,cinnamon 
## 6 Levels: not_tasty,butter,cheese [...]
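For context, the single column most likely arises because read.table splits on whitespace by default (sep = ""), and these lines contain commas but no whitespace. The extra "Levels" are just how a factor prints: every factor value carries the full set of levels. On R versions before 4.0.0 strings were converted to factors by default; newer versions return character columns instead. The sketch below writes the sample data from the question to a temporary file (the path variable is only for illustration) and shows both the one-column result and what happens when sep = "," is set without fill = TRUE.

```r
# Sample data from the question, written to a temporary file for illustration
lines <- c(
  "tasty,chicken,cinnamon",
  "not_tasty,butter,pepper,onion,cardamom,cayenne",
  "tasty,olive_oil,pepper",
  "okay,olive_oil,onion,potato,black_pepper",
  "not_tasty,tomato,fenugreek,pepper,onion,potato",
  "tasty,butter,cheese,wheat,ham"
)
path <- tempfile(fileext = ".csv")
writeLines(lines, path)

# Default separator is whitespace, so each whole line becomes one field
one_col <- read.table(path, header = FALSE)
ncol(one_col)

# With sep = "," but without fill = TRUE, ragged rows raise an error;
# capture the message instead of stopping
result <- tryCatch(
  read.table(path, sep = ","),
  error = function(e) conditionMessage(e)
)
result
```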

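This is as far as I've pursued the problem; for now I'd appreciate advice on the suitability of read.table for this data structure.

My goal is to group the data by the first element of each row and analyse the differences between each type of recipe. In case it helps influence data structure advice, in Mathematica I would do the following: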

dataImport=Import["data.csv"]; 
tasty = Cases[dataImport, {"tasty", ingr__} :> {ingr}]

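Discussion of Answers

@G.Grothendieck has provided a solution using read.table and subsequent processing with the reshape2 package. This looks immensely useful and I will investigate it later. The general advice here solved my problem, hence the acceptance.

@MrFlick's suggestion of the tm package was also useful; I used it with DataframeSource.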

Source: 2015-05-03, Martin John Hadley

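What do you want to do with the data after you import it? R's "data" structures and base functions mostly work best with rectangular data. Do you just want a list of character vectors? How do you want to analyze differences? – MrFlick

@MrFlick I'm interested in analysing which ingredients are most common in each category (tasty, not_tasty), which requires tallying etc. I have trivialised the problem slightly to shorten the question. The actual data I'm working with in Mathematica is part of a semi-machine-learning example. –

Does this actually have anything to do with Mathematica? Could you remove that tag? – agentp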

read.table

Try read.table with fill = TRUE:

d1 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE)

giving:

> d1
         V1        V2        V3     V4           V5      V6
1     tasty   chicken  cinnamon
2 not_tasty    butter    pepper  onion     cardamom cayenne
3     tasty olive_oil    pepper
4      okay olive_oil     onion potato black_pepper
5 not_tasty    tomato fenugreek pepper        onion  potato
6     tasty    butter    cheese  wheat          ham
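One caveat worth knowing: read.table decides how many columns to create by looking at the first five lines of input (unless col.names is supplied and is longer). In the sample file the widest row is line 2, so fill = TRUE works as is, but a file whose longest row appears later could be read incorrectly. A possible safeguard, sketched here under the assumption that the sample data sits in a temporary file (the path and n_cols names are illustrative), is to count the fields on every line with count.fields and pass explicit column names.

```r
# Sample data from the question, written to a temporary file for illustration
lines <- c(
  "tasty,chicken,cinnamon",
  "not_tasty,butter,pepper,onion,cardamom,cayenne",
  "tasty,olive_oil,pepper",
  "okay,olive_oil,onion,potato,black_pepper",
  "not_tasty,tomato,fenugreek,pepper,onion,potato",
  "tasty,butter,cheese,wheat,ham"
)
path <- tempfile(fileext = ".csv")
writeLines(lines, path)

# Count the fields on every line, then size the column names to the widest row
n_cols <- max(count.fields(path, sep = ","))

d1 <- read.table(path, sep = ",", as.is = TRUE, fill = TRUE,
                 col.names = paste0("V", seq_len(n_cols)))
dim(d1)
```

With real files, replacing path with "data.csv" should be all that is needed.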

read.table with NAs

or, to fill the empty cells with NA values, add na.strings = "":

d2 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE, na.strings = "")

giving:

> d2
         V1        V2        V3     V4           V5      V6
1     tasty   chicken  cinnamon   <NA>         <NA>    <NA>
2 not_tasty    butter    pepper  onion     cardamom cayenne
3     tasty olive_oil    pepper   <NA>         <NA>    <NA>
4      okay olive_oil     onion potato black_pepper    <NA>
5 not_tasty    tomato fenugreek pepper        onion  potato
6     tasty    butter    cheese  wheat          ham    <NA>

Long form

If you want it in long form:

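library(reshape2)
# d2 has no row identifier yet; add one so melt() can use it as an id variable
d2$id <- seq_len(nrow(d2))
# melt to one row per (id, V1, ingredient), drop the "variable" column and the NA rows
long <- na.omit(melt(d2, id.vars = c("id", "V1"))[-3])
long <- long[order(long$id), ]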

giving:

> long
   id        V1        value
1   1     tasty      chicken
7   1     tasty     cinnamon
2   2 not_tasty       butter
8   2 not_tasty       pepper
14  2 not_tasty        onion
20  2 not_tasty     cardamom
26  2 not_tasty      cayenne
3   3     tasty    olive_oil
9   3     tasty       pepper
4   4      okay    olive_oil
10  4      okay        onion
16  4      okay       potato
22  4      okay black_pepper
5   5 not_tasty       tomato
11  5 not_tasty    fenugreek
17  5 not_tasty       pepper
23  5 not_tasty        onion
29  5 not_tasty       potato
6   6     tasty       butter
12  6     tasty       cheese
18  6     tasty        wheat
24  6     tasty          ham
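Once the data is in long form, the tallying mentioned in the question (which ingredients are most common in each category) can be done with base R's table(). The sketch below is one way to do it, not necessarily what the answer's author had in mind. So that it runs on its own, it rebuilds an equivalent long data frame in base R from the sample lines; with the long object above, only the last few lines are needed.

```r
# Sample data from the question
lines <- c(
  "tasty,chicken,cinnamon",
  "not_tasty,butter,pepper,onion,cardamom,cayenne",
  "tasty,olive_oil,pepper",
  "okay,olive_oil,onion,potato,black_pepper",
  "not_tasty,tomato,fenugreek,pepper,onion,potato",
  "tasty,butter,cheese,wheat,ham"
)
rows <- strsplit(lines, ",")
n_ingr <- lengths(rows) - 1  # number of ingredients per recipe

# One row per (recipe id, category, ingredient), same columns as `long`
long <- data.frame(
  id    = rep(seq_along(rows), n_ingr),
  V1    = rep(vapply(rows, `[`, character(1), 1), n_ingr),
  value = unlist(lapply(rows, `[`, -1)),
  stringsAsFactors = FALSE
)

# Cross-tabulate category against ingredient
counts <- table(long$V1, long$value)

# Ingredients of one category, most frequent first
sort(counts["not_tasty", ], decreasing = TRUE)
```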

Wide form with 0/1 binary variables

To represent the variable part as 0/1 binary variables, try this:

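# reshape2 provides dcast() for data frames (cast() belongs to the older reshape package)
wide <- dcast(long, id + V1 ~ value, value.var = "value")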
wide[-(1:2)] <- 0 + !is.na(wide[-(1:2)])

giving this:

[screenshot of the resulting wide 0/1 data frame]

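List in a data frame

A different representation is the following list within a data frame, where the value column is a list of character vectors: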

ag <- aggregate(value ~., transform(long, value = as.character(value)), c) 
ag <- ag[order(ag$id), ] 

giving: 

> ag 
    id  V1         value 
4 1  tasty      chicken, cinnamon 
1 2 not_tasty butter, pepper, onion, cardamom, cayenne 
5 3  tasty      olive_oil, pepper 
3 4  okay olive_oil, onion, potato, black_pepper 
2 5 not_tasty tomato, fenugreek, pepper, onion, potato 
6 6  tasty    butter, cheese, wheat, ham 

> str(ag) 
'data.frame': 6 obs. of 3 variables: 
$ id : int 1 2 3 4 5 6 
$ V1 : chr "tasty" "not_tasty" "tasty" "okay" ... 
$ value:List of 6 
    ..$ 15: chr "chicken" "cinnamon" 
    ..$ 1 : chr "butter" "pepper" "onion" "cardamom" ... 
    ..$ 17: chr "olive_oil" "pepper" 
    ..$ 11: chr "olive_oil" "onion" "potato" "black_pepper" 
    ..$ 6 : chr "tomato" "fenugreek" "pepper" "onion" ... 
    ..$ 19: chr "butter" "cheese" "wheat" "ham"

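Source: 2015-05-03 17:23:48

Thanks for this, it certainly makes my data usable in a 'data.frame'. I'll hold off on accepting so as not to discourage others, as you never know what suggestions people might come up with. –

Have added long and wide forms. –

I don't think shoving your data into a data.frame or data.table will help you much, since both of those forms generally assume rectangular data. If you just want a list of character vectors, you can read them in with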

strsplit(readLines("data.csv"), ",")

It all depends on what you are going to do with the data after you read it in. If you plan to use existing functions, what input do they expect?
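As one illustration of working with such a list directly, the recipes can be grouped by their first element, roughly analogous to the Mathematica Cases call in the question. This is only a sketch: the names labels, ingredients and by_label are made up for the example, and the sample lines are inlined so it runs without a file.

```r
# Sample data from the question; with a real file use readLines("data.csv")
lines <- c(
  "tasty,chicken,cinnamon",
  "not_tasty,butter,pepper,onion,cardamom,cayenne",
  "tasty,olive_oil,pepper",
  "okay,olive_oil,onion,potato,black_pepper",
  "not_tasty,tomato,fenugreek,pepper,onion,potato",
  "tasty,butter,cheese,wheat,ham"
)
rows <- strsplit(lines, ",")

# First element of each row is the category, the rest are ingredients
labels <- vapply(rows, `[`, character(1), 1)
ingredients <- lapply(rows, `[`, -1)

# Named list: one entry per category, each holding its recipes
by_label <- split(ingredients, labels)
by_label[["tasty"]]

# Ingredient frequencies within one category
sort(table(unlist(by_label[["not_tasty"]])), decreasing = TRUE)
```

Keeping the data as a plain list avoids padding short rows with NA, at the cost of having to use lapply/vapply style functions rather than data frame tools.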

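It sounds like you might be tracking the terms in each of these recipes. Perhaps an appropriate data structure would be a "corpus" from the tm package for text mining.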

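Source: 2015-05-03 17:32:19, MrFlick

I ended up using 'tm' for the analysis, but used 'DataframeSource' to build the corpus. Could I ask why your advice was not to go down this route? –

It's just that your data doesn't look like a data.frame. It's not rectangular; the columns don't have any meaning or significance. There's no benefit to putting it in a data.frame. You just add a bunch of NA values to fill it out. There are other sources you can use to get data into 'tm', and you can even define your own source for what your data looks like. Just because it's "data" doesn't mean it has to go in a data.frame. – MrFlick

Understood that not all data is meant for a data.frame; I'm just unfamiliar with R data structures. In Mathematica I'd import each row as a List and then rearrange them into an Association, which is essentially a dictionary of the data, so that 'data["tasty"]' would give me all of the tasty keys. Thanks for your reply –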
