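How to parse a text file with several section headers in R?

I need to parse an ASCII file that consists of several sections, each with its own header. A schematic fragment looks like this: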

Name1 | header1 | header2 | header3 
header1| 11 | x1 
Name2 | header1 | header2 | header3 
header1| 2.5 | x2 
header1| 3.7 | x3 
header1| 4.2 | x4 
Name3 | header1 | header2 | header3 
header1| 34 | x5 
header1| 37 | x6 
etc.

My task is to compute the variance of the header1 data for each name:

Names | Variances 
------------------------- 
Name1 | var(11) # =NA 
Name2 | var(c(2.5,3.7,4.2)) 
Name3 | var(c(34,37)) 
etc.

How can I process files like this in R?

My real file is more complicated:

HD 4478 | velocities |typ| Value R m.e. |A (Nmes)|na,Q,dom , res D| Obs.date | Rem. |Or|  Reference  | 
velocities |V | -23.00  5.20 |D ( )|s , ,O ,  |   |  | |1992A&AS...95..541F| 
BD +41 43| velocities |typ| Value R m.e. |A (Nmes)|na,Q,dom , res D| Obs.date | Rem. |Or|  Reference  | 
velocities |V | 18.40  7.40 |D ( )|s , ,O ,  |   |  | |2007AN....328..889K| 
velocities |v | 18.4    |D ( 3)| , , ,  |   |NN  | |1979IAUS...30...57E| 
velocities |v | 15.2    | ( 4)| , , ,  |   |  | |1970MmRAS..72..233H| 
HIP 8855 | velocities |typ| Value R m.e. |A (Nmes)|na,Q,dom , res D| Obs.date | Rem. |Or|  Reference  | 
velocities |V | -10.00  7.40 |D ( )|s , ,O ,  |   |  | |1999A&AS..137..451G| 
HD 215441 | velocities |typ| Value R m.e. |A (Nmes)|na,Q,dom , res D| Obs.date | Rem. |Or|  Reference  | 
velocities |v | -5.5    | ( 11)| , , ,  |   |  | |1969ApJ...156..967P| 
velocities |v |      | ( 18)| , , ,  |   |V  | |1960ApJ...132..521B| 
HD 147010 | velocities |typ| Value R m.e. |A (Nmes)|na,Q,dom , res D| Obs.date | Rem. |Or|  Reference  | 
velocities |V | -3.96  1.41 |B ( )|s , ,O ,  |   |  | |2012ApJ...745...56D| 
velocities |V | -8.20  3.10 |C ( )|s , ,O ,  |   |  | |2006AstL...32..759G| 
velocities |v | -9     |C ( 3)| , , ,  |   |NN  | |1953GCRV..C......0W| 
velocities |v | -8.8    | ( 3)| , , ,  |   |  | |1950ApJ...111..221W|

The desired result is:

Names | Variances 
------------------------- 
HD 4478 | var(-23.00) # =NA 
BD +41 43| var(c(18.40,18.4,15.2)) 
HIP 8855 | var(-10.00) # =NA 
HD 215441| var(-5.5) # =NA 
HD 147010| var(c(-3.96,-8.20,-9,-8.8))

Source

2015-09-16 drastega

Are you on Windows? –

No, I'm running OS X Yosemite. – drastega

'readLines' is always an option. – DGKarlsson

The main problem is reading the data correctly. Maybe this format is specified somewhere? Still, your sample data can be read in a few lines:

# read your ascii file from the clipboard (readClipboard() is Windows-only;
# elsewhere use readLines() on the file, e.g. a hypothetical "yourfile.txt")
asciitxt = readClipboard()
# find the headers (starting with "Name")
headers = which(grepl("^Name", asciitxt))
# split asciitxt into groups, one per header
asciitxt = split(asciitxt, cumsum(seq_along(asciitxt) %in% headers))
# read each group as a data frame; strip.white trims the padding around "|"
l.in = lapply(asciitxt, function(x) read.table(text=x, header=TRUE, sep="|", fill=TRUE, strip.white=TRUE, stringsAsFactors=FALSE))
# name the list elements after the first column name of each block
names(l.in) = sapply(l.in, function(x) names(x)[1])
# do your calculations
sapply(l.in, function(x) var(x$header1))

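The problem with your real data is that the values you need are not separated into a variable of their own. For example, in line 2 the variable "typ" contains not just the value "-23.00" but the string "-23.00 5.20". After read.table you have to reduce your variable "typ" somehow. Have a look at tidyr::extract.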
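As a rough illustration of that last step, here is a base-R sketch that skips read.table and tidyr and pulls the first number out of the third "|"-separated field by hand. It is not the method from the answer above, only one possible approach. It assumes that header lines can be recognised by the literal text "|typ|" and that the velocity is always the first token of the third field. The names `lines`, `star`, `value` and `result` are made up for this example, and the input is a handful of lines copied from the question.

```r
# A few lines copied from the real file shown in the question
lines <- c(
  "HD 4478 | velocities |typ| Value R m.e. |A (Nmes)|na,Q,dom , res D| Obs.date | Rem. |Or|  Reference  | ",
  "velocities |V | -23.00  5.20 |D ( )|s , ,O ,  |   |  | |1992A&AS...95..541F| ",
  "BD +41 43| velocities |typ| Value R m.e. |A (Nmes)|na,Q,dom , res D| Obs.date | Rem. |Or|  Reference  | ",
  "velocities |V | 18.40  7.40 |D ( )|s , ,O ,  |   |  | |2007AN....328..889K| ",
  "velocities |v | 18.4    |D ( 3)| , , ,  |   |NN  | |1979IAUS...30...57E| ",
  "velocities |v | 15.2    | ( 4)| , , ,  |   |  | |1970MmRAS..72..233H| ",
  "HD 215441 | velocities |typ| Value R m.e. |A (Nmes)|na,Q,dom , res D| Obs.date | Rem. |Or|  Reference  | ",
  "velocities |v | -5.5    | ( 11)| , , ,  |   |  | |1969ApJ...156..967P| ",
  "velocities |v |      | ( 18)| , , ,  |   |V  | |1960ApJ...132..521B| "
)

# Assumption: only header lines contain the literal text "|typ|"
is_header <- grepl("|typ|", lines, fixed = TRUE)
group <- cumsum(is_header)  # running section number for every line

# Object name = everything before the first "|" on a header line
star <- trimws(sub("\\|.*$", "", lines[is_header]))

# Third field of each data line, e.g. " -23.00  5.20 "
fields <- strsplit(lines[!is_header], "|", fixed = TRUE)
third <- trimws(vapply(fields, function(f) f[3], character(1)))

# Keep only the first token; an empty field becomes NA
value <- as.numeric(sub("\\s.*$", "", third))

# Variance per section; sections with fewer than two values give NA
section <- factor(group[!is_header], levels = seq_along(star))
variances <- tapply(value, section, var, na.rm = TRUE)

result <- data.frame(Names = star, Variances = as.vector(variances))
print(result)
```

For a complete file, `lines` would come from readLines() instead of being typed in. Missing velocities are dropped via `na.rm = TRUE`, so an object with a single usable measurement ends up with NA, which matches the desired output in the question.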

Source

2015-09-17 12:26:37 MarkusN

asciitxt = split(asciitxt, cumsum(grepl("^Name", asciitxt))) also splits asciitxt into groups. Thanks @MarkusN – drastega
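To see why the cumsum(grepl(...)) idiom groups the lines, here is a small sketch using the schematic sample from the question. The variable names `starts`, `group` and `blocks` are made up for this illustration.

```r
# Schematic sample lines from the question
asciitxt <- c(
  "Name1 | header1 | header2 | header3",
  "header1| 11 | x1",
  "Name2 | header1 | header2 | header3",
  "header1| 2.5 | x2",
  "header1| 3.7 | x3",
  "header1| 4.2 | x4"
)

# TRUE on every line that opens a new section
starts <- grepl("^Name", asciitxt)

# The running count of TRUEs gives each line its section number
group <- cumsum(starts)
print(group)  # 1 1 2 2 2 2

# One character vector per section, header line first
blocks <- split(asciitxt, group)
print(lengths(blocks))
```

Each TRUE raises the running count by one, so every line inherits the number of the most recent header above it, and split() then collects the lines that share a number.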
