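The code below should work under the following assumptions about how the data are formatted:
- Each start date is in "YYYY-MM-DD" format and is followed by a comma,
- Each end date is in the same format as the start date and is preceded by a comma,
- Each variable name follows an end date and contains no digits.
As Oriol Mirosa hinted, these assumptions may not hold. For example, several variable names in the string have no dates between them, so they are extracted as a single name.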
# Your string
string = "1942-10-06,1996-03-31Snow Depth (in/mm)1942-11-01,1996-03-31Snowfall (in/mm)1942-10-01,1997-12-27Growing Degree DaysHeating Degree DaysAverage Temperature (F/C)Maximum Temperature (F/C)1950-08-01,1970-03-31Observation Time Temperature (F/C)1942-10-01,1997-12-27Minimum Temperature (F/C)1942-10-01,1996-03-31Precipitation (in/mm)"
# Extract text matching Assumptions 1-3, respectively, above
library(stringr)
start_dates = str_extract_all(string, "[0-9]{4}-[0-9]{2}-[0-9]{2},")
end_dates = str_extract_all(string, ",[0-9]{4}-[0-9]{2}-[0-9]{2}")
var_names = str_extract_all(string,
                            ",[0-9]{4}-[0-9]{2}-[0-9]{2}[^0-9]+")
# Remove the irrelevant bits (e.g., leading/trailing commas)
start_dates = as.Date(gsub(",", "", unlist(start_dates))) #remove ","
end_dates = as.Date(gsub(",", "", unlist(end_dates))) #remove ","
var_names = gsub(",[0-9]{4}-[0-9]{2}-[0-9]{2}", "", unlist(var_names))
# Put into table
X = data.frame("Start_date" = start_dates,
               "End_date" = end_dates,
               "Var_name" = var_names)
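What have you tried? This looks like a simple job for a regular expression.
Is 'Growing Degree DaysHeating Degree DaysAverage Temperature (F/C)' a single variable? If not, then the pattern doesn't really repeat.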
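Under the same three assumptions as the answer's code, one alternative is a single pattern with capture groups, so that each start date, end date, and name come from the same match and cannot drift out of alignment. This is only a sketch: the shortened sample string and the names `date_re`, `pattern`, `m`, and `X2` are made up for illustration. Like the code above, it treats a run of names with no dates between them as one entry.

```r
library(stringr)

# Shortened sample in the same layout as the original string (illustrative only)
string <- "1942-10-06,1996-03-31Snow Depth (in/mm)1942-11-01,1996-03-31Snowfall (in/mm)"

# One pattern, three capture groups: start date, end date, variable name
date_re <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"
pattern <- paste0("(", date_re, "),(", date_re, ")([^0-9]+)")

# str_match_all() returns a list with one character matrix per input string;
# column 1 holds the full match, columns 2 to 4 hold the capture groups
m <- str_match_all(string, pattern)[[1]]

# Build the table from the capture groups
X2 <- data.frame(Start_date = as.Date(m[, 2]),
                 End_date   = as.Date(m[, 3]),
                 Var_name   = m[, 4],
                 stringsAsFactors = FALSE)
print(X2)
```

Because every row comes from one match, a date pair that lacks a name (or the reverse) is skipped rather than shifting the remaining rows.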