Is there a more efficient way to split a column?

When I run the read.table call below, a few values are imported incorrectly:

hs.industry <- read.table("https://download.bls.gov/pub/time.series/hs/hs.industry", header = TRUE, fill = TRUE, sep = "\t", quote = "", stringsAsFactors = FALSE)

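Specifically, in a few rows the industry_code and industry_name values end up combined into a single value in the industry_code column (I don't know why). Since every industry_code is 4 digits long, my approach to splitting and correcting them is: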

for (i in seq_len(nrow(hs.industry))) {
    if (isTRUE(nchar(hs.industry$industry_code[i]) > 4)) {
        hs.industry$industry_name[i] <- gsub("[[:digit:]]", "", hs.industry$industry_code[i])
        hs.industry$industry_code[i] <- gsub("[^0-9]", "", hs.industry$industry_code[i])
    }
}

I feel this is very inefficient, but I don't know what approach would be better.

Thanks!

2017-03-06 Michael

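The problem is that lines 29 and 30 of the file (rows 28 and 29 if we don't count the header) are malformed. They use 4 spaces instead of the proper tab character between the code and the name, so extra data cleaning is needed.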
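One way to locate such malformed lines before cleaning them is to count the tab-separated fields on each line and flag any line whose count differs from the header's. This is only a minimal sketch on a made-up three-line sample (the codes, names, and the `raw_lines`, `n_fields`, and `bad_lines` variables are hypothetical, not taken from the BLS file):

```r
# Made-up sample mimicking the layout described above; not real BLS data
raw_lines <- c(
  "industry_code\tindustry_name\tdisplay_level",
  "0001\tFirst industry\t3",
  "0002    Second industry\t3"   # 4 spaces where a tab should be
)

# Count the tab-separated fields on each line
n_fields <- lengths(strsplit(raw_lines, "\t", fixed = TRUE))

# Lines whose field count differs from the header line are suspect
bad_lines <- which(n_fields != n_fields[1])
print(raw_lines[bad_lines])
```

On the real file, the same check could be run on the output of readLines to confirm which lines need fixing.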

Use readLines to read in the raw text, fix the malformed lines, then read in the cleaned table:

# read in each line of the file as an element of a character vector
hs.industry <- readLines('https://download.bls.gov/pub/time.series/hs/hs.industry')

# replace any run of 4 or more spaces with a tab character
hs.industry <- gsub(' {4,}', '\t', hs.industry)

# collapse the vector into one string, with lines separated by a newline (\n)
hs.industry <- paste(hs.industry, collapse = '\n')

# read in the cleaned table
hs.industry <- read.table(text = hs.industry, sep = '\t', header = TRUE, quote = '', stringsAsFactors = FALSE)

2017-03-06 18:45:48 jdobres

Thanks! Could you explain why the collapse is necessary? – Michael

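When you use read.table with the text argument, the text is read through a text connection. Collapsing the character vector (where each element is one line of the original text) with newline characters gives a single string that is parsed line by line. In practice read.table(text = ...) also accepts a character vector with one line per element, so the collapse is a safe default rather than a strict requirement. – jdobres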
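To make the collapse step concrete, here is a small sketch that runs the same clean-up on a made-up sample instead of the downloaded file. The sample rows and the names `raw_lines`, `fixed_lines`, `collapsed`, and `demo` are assumptions for illustration only; `colClasses = "character"` is added here just so the made-up codes keep their leading zeros.

```r
# Made-up sample standing in for the output of readLines(); not real BLS data
raw_lines <- c(
  "industry_code\tindustry_name",
  "0001\tFirst industry",
  "0002    Second industry"   # malformed: 4 spaces instead of a tab
)

# Replace runs of 4 or more spaces with a tab
fixed_lines <- gsub(" {4,}", "\t", raw_lines)

# Collapse the character vector into a single newline-separated string
collapsed <- paste(fixed_lines, collapse = "\n")

# Parse the string as a tab-separated table
demo <- read.table(text = collapsed, sep = "\t", header = TRUE, quote = "",
                   colClasses = "character")
print(demo)
```

After the collapse, `collapsed` is a character vector of length one, and both sample rows land in the right columns.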

You shouldn't loop over every row. Instead, identify only the problematic entries and apply gsub to just those:

replace_indx <- which(nchar(hs.industry$industry_code) > 4) 
hs.industry$industry_name[replace_indx] <- gsub("\\d+\\s+", "", hs.industry$industry_code[replace_indx]) 
hs.industry$industry_code[replace_indx] <- gsub("\\D+", "", hs.industry$industry_code[replace_indx])

I also used "\\d+\\s+" to improve the string replacement, so that the whitespace after the code is removed as well:

gsub("[[:digit:]]","",hs.industry$industry_code[replace_indx]) 
# [1] " Dimension stone"   " Crushed and broken stone" 

gsub("\\d+\\s+", "", hs.industry$industry_code[replace_indx]) 
# [1] "Dimension stone"   "Crushed and broken stone"
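A slightly stricter variant of the same vectorized idea anchors the pattern to the start of the string, so that digits appearing later in a name would be left alone. The sketch below uses a made-up two-row data frame (the codes and the `hs_demo` and `bad` names are hypothetical) and relies on the question's statement that every code is 4 digits:

```r
# Made-up data frame reproducing the problem; the codes are not real BLS values
hs_demo <- data.frame(
  industry_code = c("0001", "0002    Crushed and broken stone"),
  industry_name = c("Dimension stone", ""),
  stringsAsFactors = FALSE
)

# Flag rows where the code column holds more than the 4-digit code
bad <- nchar(hs_demo$industry_code) > 4

# Name: drop the leading 4 digits and the whitespace that follows them
hs_demo$industry_name[bad] <- sub("^\\d{4}\\s+", "", hs_demo$industry_code[bad])

# Code: keep only the first 4 characters
hs_demo$industry_code[bad] <- substr(hs_demo$industry_code[bad], 1, 4)

print(hs_demo)
```

The name is extracted before the code is overwritten, since both are derived from the original industry_code value.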

2017-03-06 18:44:10 Djork
