R: Read the first column, then read the rest

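I have a file containing codes and their descriptions. The code is always a short (3-6 character) string, separated from the description that follows by a space. The description is usually several words, so it contains spaces too. Here is an example: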

LIISS License Issued 
LIMOD License Modified 
LIPASS License Assigned (Partial Assignment) 
LIPND License Assigned (Partition/Disaggregation) 
LIPPND License Issued from a Partial/P&D Assignment 
LIPUR License Purged 
LIREIN License Reinstated 
LIREN License Renewed

I want to read it into a two-column data frame, with the code in the first column and the description in the second. How can I do this in R?

Source

2015-09-30 Pavel Anni

Please post a reproducible example.

You can use stri_split_fixed() from the stringi package:

library(stringi) 
as.data.frame(stri_split_fixed(readLines("x.txt"), " ", n = 2, simplify = TRUE)) 
#  V1           V2 
# 1 LIISS        License Issued 
# 2 LIMOD        License Modified 
# 3 LIPASS  License Assigned (Partial Assignment) 
# 4 LIPND License Assigned (Partition/Disaggregation) 
# 5 LIPPND License Issued from a Partial/P&D Assignment 
# 6 LIPUR        License Purged 
# 7 LIREIN       License Reinstated 
# 8 LIREN        License Renewed

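Here we read the file (named "x.txt") with readLines(). stri_split_fixed() then splits each line on a single space, and n = 2 limits the result to two pieces, so the split happens only at the first space. simplify = TRUE returns a matrix rather than a list.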
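To make the role of n = 2 concrete, here is a minimal sketch on an in-memory vector instead of a file. The column names `code` and `description` are illustrative choices, not part of the original answer.

```r
library(stringi)

# Two sample lines taken from the question's data
x <- c("LIISS License Issued",
       "LIPASS License Assigned (Partial Assignment)")

# Without n, every space is a split point, so the descriptions break apart
pieces_all <- stri_split_fixed(x, " ")
lengths(pieces_all)

# With n = 2, splitting stops after the first space;
# simplify = TRUE gives a character matrix with one row per line
m <- stri_split_fixed(x, " ", n = 2, simplify = TRUE)

# Build the data frame with explicit (illustrative) column names
df <- data.frame(code = m[, 1], description = m[, 2],
                 stringsAsFactors = FALSE)
df
```

Naming the columns at construction time avoids the default V1/V2 names that as.data.frame() assigns to a matrix.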

Data: x.txt

writeLines("LIISS License Issued 
LIMOD License Modified 
LIPASS License Assigned (Partial Assignment) 
LIPND License Assigned (Partition/Disaggregation) 
LIPPND License Issued from a Partial/P&D Assignment 
LIPUR License Purged 
LIREIN License Reinstated 
LIREN License Renewed", "x.txt")

Source

2015-09-30 03:23:44

It works! Thanks, Richard!

We can read the file with readLines and then use sub:

#read the lines with readLines 
lines <- readLines('pavel.txt') 
#match one or more whitespace characters followed by the rest of the line 
#replace with `''` to extract the non-space characters at the beginning. 
str1 <- sub('\\s+.*', '', lines) 
#match non space characters from the beginning (`^[^ ]+`) followed by space 
#replace with `''` to extract the characters that follow after the space. 
str2 <- sub('^[^ ]+\\s+', '', lines) 
out <- data.frame(v1= str1, v2=str2, stringsAsFactors=FALSE) 
head(out,3) 
#  v1         v2 
#1 LIISS      License Issued 
#2 LIMOD      License Modified 
#3 LIPASS License Assigned (Partial Assignment)

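This creates a two-column data.frame. Another option is to read the dataset as a single column and then use extract from library(tidyr), with capture groups pulling out the characters needed for each column. Here ([^ ]+) matches one or more non-space characters and captures them with the parentheses. It is followed by one or more whitespace characters (which are dropped), and then a second capture group extracts the remaining characters.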

library(tidyr) 
extract(read.table('pavel.txt', sep=','), V1, 
       into= c('V1', 'V2'), '([^ ]+)\\s+(.*)') 
#  V1           V2 
#1 LIISS        License Issued 
#2 LIMOD        License Modified 
#3 LIPASS  License Assigned (Partial Assignment) 
#4 LIPND License Assigned (Partition/Disaggregation) 
#5 LIPPND License Issued from a Partial/P&D Assignment 
#6 LIPUR        License Purged 
#7 LIREIN       License Reinstated 
#8 LIREN        License Renewed
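The same two-capture-group pattern can also be tried in base R, which may help when checking what each group matches. This is only a sketch on an in-memory vector: the anchors ^ and $ and the column names `code` and `description` are additions for illustration, not part of the answer above.

```r
# Two sample lines taken from the question's data
lines <- c("LIISS License Issued",
           "LIPND License Assigned (Partition/Disaggregation)")

# regexec() reports the full match and each capture group;
# regmatches() turns those positions into strings
m <- regmatches(lines, regexec("^([^ ]+)\\s+(.*)$", lines))

# Each list element holds: full match, group 1 (code), group 2 (description)
out <- data.frame(code = vapply(m, `[`, character(1), 2),
                  description = vapply(m, `[`, character(1), 3),
                  stringsAsFactors = FALSE)
out
```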

Alternatively, we can replace the first space with a ',' and then use read.table with sep=','.

read.table(text=sub(' ', ',', readLines('pavel.txt')), sep=',') 
#  V1           V2 
#1 LIISS        License Issued 
#2 LIMOD        License Modified 
#3 LIPASS  License Assigned (Partial Assignment) 
#4 LIPND License Assigned (Partition/Disaggregation) 
#5 LIPPND License Issued from a Partial/P&D Assignment 
#6 LIPUR        License Purged 
#7 LIREIN       License Reinstated 
#8 LIREN        License Renewed
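This works because sub() replaces only the first match, whereas gsub() would replace every space. One caveat is that a comma inside a description would create extra fields. The sketch below shows the difference and uses a tab as the delimiter instead, on the assumption that the descriptions never contain tabs. The column names are illustrative.

```r
# Two sample lines taken from the question's data
lines <- c("LIISS License Issued",
           "LIPPND License Issued from a Partial/P&D Assignment")

first_only <- sub(" ", ",", lines)   # only the first space becomes a comma
every_one  <- gsub(" ", ",", lines)  # every space becomes a comma

# Assumption: no tabs occur in the data, so a tab is a safe delimiter.
# quote = "" and comment.char = "" keep quotes and '#' as literal text.
df <- read.table(text = sub(" ", "\t", lines), sep = "\t",
                 quote = "", comment.char = "",
                 col.names = c("code", "description"),
                 stringsAsFactors = FALSE)
df
```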

If we are on Linux, the output of awk can be piped into fread from data.table, or into read.csv/read.table.

library(data.table) 
fread("awk '{sub(\" \", \",\", $0)}1' pavel.txt", header=FALSE) 
#  V1           V2 
#1: LIISS        License Issued 
#2: LIMOD        License Modified 
#3: LIPASS  License Assigned (Partial Assignment) 
#4: LIPND License Assigned (Partition/Disaggregation) 
#5: LIPPND License Issued from a Partial/P&D Assignment 
#6: LIPUR        License Purged 
#7: LIREIN       License Reinstated 
#8: LIREN        License Renewed
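Recent versions of data.table also accept the shell command through the explicit cmd argument of fread. The following is a sketch under a few assumptions: awk is available on the PATH, a temporary file stands in for pavel.txt, and the column names are illustrative.

```r
library(data.table)

# Write a small sample to a temporary file (stand-in for pavel.txt)
tmp <- tempfile(fileext = ".txt")
writeLines(c("LIISS License Issued", "LIMOD License Modified"), tmp)

# awk replaces the first space on each line with a comma, then prints the line
cmd <- sprintf("awk '{sub(\" \", \",\"); print}' %s", shQuote(tmp))

# Pass the command explicitly via cmd= and name the two columns
dt <- fread(cmd = cmd, header = FALSE, sep = ",",
            col.names = c("code", "description"))
dt
```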

Source

2015-09-30 03:16:33 akrun
