2015-09-30 75 views
0

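R: read the first column, then the rest

I have a file containing codes and their descriptions. Each code is a short (3-6 character) string, separated by a space from the description that follows it. The description is usually several words long, so it also contains spaces. Here is an example: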

LIISS License Issued 
LIMOD License Modified 
LIPASS License Assigned (Partial Assignment) 
LIPND License Assigned (Partition/Disaggregation) 
LIPPND License Issued from a Partial/P&D Assignment 
LIPUR License Purged 
LIREIN License Reinstated 
LIREN License Renewed 

I want to read it into a two-column data frame, with the code in the first column and the description in the second. How can I do this in R?

+2

Please post a reproducible example.

Answers

2

You can use stri_split_fixed() from the stringi package.

library(stringi) 
as.data.frame(stri_split_fixed(readLines("x.txt"), " ", n = 2, simplify = TRUE)) 
#  V1           V2 
# 1 LIISS        License Issued 
# 2 LIMOD        License Modified 
# 3 LIPASS  License Assigned (Partial Assignment) 
# 4 LIPND License Assigned (Partition/Disaggregation) 
# 5 LIPPND License Issued from a Partial/P&D Assignment 
# 6 LIPUR        License Purged 
# 7 LIREIN       License Reinstated 
# 8 LIREN        License Renewed 

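Here we read the file (represented by "x.txt") with readLines(). Then stri_split_fixed() splits each line on a space, and n = 2 asks for at most two pieces in return, so the split happens only at the first space. simplify = TRUE returns a matrix instead of a list.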
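The same split-at-the-first-space idea can also be sketched in base R, without stringi. This is only an illustration: the vector `lines` and the column names `code` and `description` are made up here, and the sketch assumes every line contains at least one space.

```r
# Hypothetical input standing in for readLines("x.txt")
lines <- c("LIISS License Issued",
           "LIPASS License Assigned (Partial Assignment)")

# Position of the first space in each line (-1 if a line has none)
pos <- regexpr(" ", lines, fixed = TRUE)

# Text before the first space is the code, text after it the description
out <- data.frame(code = substr(lines, 1, pos - 1),
                  description = substr(lines, pos + 1, nchar(lines)),
                  stringsAsFactors = FALSE)
out
```

Because only the first space is located, any later spaces stay inside the description column.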

Data: x.txt

writeLines("LIISS License Issued 
LIMOD License Modified 
LIPASS License Assigned (Partial Assignment) 
LIPND License Assigned (Partition/Disaggregation) 
LIPPND License Issued from a Partial/P&D Assignment 
LIPUR License Purged 
LIREIN License Reinstated 
LIREN License Renewed", "x.txt") 
+0

Worked! Thanks, Richard!

2

We can read the file with readLines and then split each line with sub:

#read the lines with readLines 
lines <- readLines('pavel.txt') 
#match one or more whitespace characters followed by the rest of the line 
#replace with `''` to extract the non-space characters at the beginning. 
str1 <- sub('\\s+.*', '', lines) 
#match non space characters from the beginning (`^[^ ]+`) followed by space 
#replace with `''` to extract the characters that follow after the space. 
str2 <- sub('^[^ ]+\\s+', '', lines) 
out <- data.frame(v1= str1, v2=str2, stringsAsFactors=FALSE) 
head(out,3) 
#  v1         v2 
#1 LIISS      License Issued 
#2 LIMOD      License Modified 
#3 LIPASS License Assigned (Partial Assignment) 

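This creates a two-column data.frame. Another option is to read the dataset as a single column and then use extract from library(tidyr). We use capture groups to pull out the characters needed for each column. Here ([^ ]+) matches one or more non-space characters and captures them in parentheses; it is followed by one or more whitespace characters (which we drop), and the second capture group then extracts the remaining characters.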

library(tidyr) 
extract(read.table('pavel.txt', sep=','), V1, 
       into= c('V1', 'V2'), '([^ ]+)\\s+(.*)') 
#  V1           V2 
#1 LIISS        License Issued 
#2 LIMOD        License Modified 
#3 LIPASS  License Assigned (Partial Assignment) 
#4 LIPND License Assigned (Partition/Disaggregation) 
#5 LIPPND License Issued from a Partial/P&D Assignment 
#6 LIPUR        License Purged 
#7 LIREIN       License Reinstated 
#8 LIREN        License Renewed 
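For readers without tidyr, the capture-group idea can be tried in base R with utils::strcapture (available since R 3.4.0). This is a sketch rather than the answer's own method; the vector `lines` and the column names in `proto` are assumptions made for the example.

```r
# Hypothetical input standing in for the lines of 'pavel.txt'
lines <- c("LIMOD License Modified",
           "LIPPND License Issued from a Partial/P&D Assignment")

# proto supplies the names and types of the output columns (names chosen here)
proto <- data.frame(code = character(), description = character(),
                    stringsAsFactors = FALSE)

# Group 1: leading run of non-spaces; group 2: everything after the whitespace
out <- utils::strcapture("^([^ ]+)\\s+(.*)$", lines, proto)
out
```

Each capture group becomes one column, in the order the groups appear in the pattern.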

Alternatively, we can replace the first space with a ',' and then use read.csv/read.table with sep=','

read.table(text=sub(' ', ',', readLines('pavel.txt')), sep=',') 
#  V1           V2 
#1 LIISS        License Issued 
#2 LIMOD        License Modified 
#3 LIPASS  License Assigned (Partial Assignment) 
#4 LIPND License Assigned (Partition/Disaggregation) 
#5 LIPPND License Issued from a Partial/P&D Assignment 
#6 LIPUR        License Purged 
#7 LIREIN       License Reinstated 
#8 LIREN        License Renewed 
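One caveat with the comma trick: a description that itself contains a comma would be split into extra fields. A possible variation, sketched below, uses a tab as the separator and switches off quote and comment handling. The sample lines are invented for the illustration, and the sketch assumes the file contains no tab characters.

```r
# Hypothetical lines; the second description contains a comma and a '#'
lines <- c("LIREN License Renewed",
           "LIPUR License Purged, see note #3")

# sub() changes only the first match, so only the first space becomes a tab
tabbed <- sub(" ", "\t", lines, fixed = TRUE)

# quote = "" and comment.char = "" keep quotes and '#' as literal text
out <- read.table(text = tabbed, sep = "\t", quote = "", comment.char = "",
                  col.names = c("code", "description"),
                  stringsAsFactors = FALSE)
out
```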

If we are on Linux, awk can be piped into either fread from data.table or read.csv/read.table.

library(data.table) 
# newer data.table versions expect shell commands via the cmd= argument
fread(cmd = "awk '{sub(\" \", \",\", $0)}1' pavel.txt", header=FALSE) 
#  V1           V2 
#1: LIISS        License Issued 
#2: LIMOD        License Modified 
#3: LIPASS  License Assigned (Partial Assignment) 
#4: LIPND License Assigned (Partition/Disaggregation) 
#5: LIPPND License Issued from a Partial/P&D Assignment 
#6: LIPUR        License Purged 
#7: LIREIN       License Reinstated 
#8: LIREN        License Renewed 