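How do I convert one data frame into another in R?

I have downloaded a .txt file from Kenneth R. French's data library, available at http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/det_48_ind_port.html.

I need to use these so-called SIC codes to split my sample into different portfolios by industry. The downloaded file looks like this: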

 1 Food 
     0100-0199 Agric production - crops 
     0200-0299 Agric production - livestock 
     0700-0799 Agricultural services 
     0900-0999 Fishing, hunting & trapping 
     2000-2009 Food and kindred products 
     2010-2019 Meat products 
     2020-2029 Dairy products 
     2030-2039 Canned-preserved fruits-vegs 
     2040-2046 Flour and other grain mill products 
     2047-2047 Dog and cat food 
     2048-2048 Prepared feeds for animals 
     2050-2059 Bakery products 
     2060-2063 Sugar and confectionery products 
     2064-2068 Candy and other confectionery 
     2070-2079 Fats and oils 
     2080-2080 Beverages 
     2082-2082 Malt beverages 
     2083-2083 Malt 
     2084-2084 Wine 
     2085-2085 Distilled and blended liquors 
     2086-2086 Bottled-canned soft drinks 
     2087-2087 Flavoring syrup 
     2090-2092 Misc food preps 
     2095-2095 Roasted coffee 
     2096-2096 Potato chips 
     2097-2097 Manufactured ice 
     2098-2099 Misc food preparations 
     5140-5149 Wholesale - groceries & related prods 
     5150-5159 Wholesale - farm products 
     5180-5182 Wholesale - beer, wine 
     5191-5191 Wholesale - farm supplies 

     2 Mines 
     1000-1009 Metal mining 
     1010-1019 Iron ores 
     1020-1029 Copper ores 
     1030-1039 Lead and zinc ores 
     1040-1049 Gold & silver ores 
     1060-1069 Ferroalloy ores 
     1080-1089 Mining services 
     1090-1099 Misc metal ores 
     1200-1299 Bituminous coal 
     1400-1499 Mining and quarrying non-metalic minerals 
     5050-5052 Wholesale - metals and minerals 

     3 Oil 
     1300-1300 Oil and gas extraction 
     1310-1319 Crude petroleum & natural gas 
     1320-1329 Natural gas liquids 
     1380-1380 Oil and gas field services 
     1381-1381 Drilling oil & gas wells 
     1382-1382 Oil-gas field exploration 
     1389-1389 Oil and gas field services 
     2900-2912 Petroleum refining 
     5170-5172 Wholesale - petroleum and petro prods 

     4 Clths 
     2200-2269 Textile mill products 
     2270-2279 Floor covering mills 
     2280-2284 Yarn and thread mills 
     2290-2295 Misc textile goods 
     2296-2296 Tire cord and fabric 
     2297-2297 Nonwoven fabrics 
     2298-2298 Cordage and twine 
     2299-2299 Misc textile products 
     2300-2390 Apparel and other finished products 
     2391-2392 Curtains, home furnishings 
     2393-2395 Textile bags, canvas products 
     2396-2396 Auto trim 
     2397-2399 Misc textile products 
     3020-3021 Rubber and plastics footwear 
     3100-3111 Leather tanning and finishing 
     3130-3131 Boot, shoe cut stock, findings 
     3140-3149 Footware except rubber 
     3150-3151 Leather gloves and mittens 
     3963-3965 Fasteners, buttons, needles, pins 
     5130-5139 Wholesale - apparel

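What I want to do is create a data frame whose first column gives the industry name (for example, Food, Mines, and so on) and whose second column lists all the SIC (Standard Industrial Classification) codes that belong to that industry. Most SIC codes are given as ranges such as 5130-5139, which makes this harder.

Such a data frame would make my analysis much easier to implement.

Any suggestions would be greatly appreciated.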

Source

2014-02-28 Jack

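I would consider a real data preprocessing tool such as Google Refine (offline and free). R is not well suited to this kind of task; you can do it in R, but it will be more painful. – ATN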

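I think it would be better to handle this with another program, because your data does not look like a data frame (you have things like "4 Clths"). It is not a very efficient approach, but you could do it manually. I can see that all the SIC codes have the form xxxx-xxxx followed by a space. So if you read the file with sep = " ", the first column should be your SIC codes, the second column should be your industry names (I am not sure whether all the names are single strings; in your example they are), and the rest is what they sell? –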
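The observation in the comment above, that every code line has the form xxxx-xxxx followed by a space, can be checked with a regular expression before any parsing is attempted. The following is only a minimal sketch: the `lines` vector is a hypothetical stand-in for the downloaded file, and the header pattern assumes that each industry header starts with a one- or two-digit number followed by a short name.

```r
# Hypothetical sample lines mimicking the layout shown in the question.
lines <- c(" 1 Food",
           "     0100-0199 Agric production - crops",
           "     2047-2047 Dog and cat food",
           "",
           "     2 Mines",
           "     1000-1009 Metal mining")

# A code line: optional blanks, four digits, a hyphen, four digits, a space.
is_code <- grepl("^ *[0-9]{4}-[0-9]{4} ", lines)

# A header line (assumed form): optional blanks, an industry number, a name.
is_header <- grepl("^ *[0-9]{1,2} +[A-Za-z]", lines)

print(data.frame(line = trimws(lines), is_code, is_header))
```

Any non-blank line that matches neither pattern would be worth inspecting by hand before going further.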

The following produces a two-column data frame, df.new, with the codes in column 2 as a comma-separated string:

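# read fixed-width columns: industry number, short name, remainder of the line
df <- read.fwf("Siccodes48.txt", widths=c(3, 7, 60), stringsAsFactors=FALSE)
df <- df[!is.na(df$V3), ]  # drop blank lines
library(zoo)
df$V1 <- na.locf(df$V1)  # carry the industry number down to its code lines
l <- split(df, df$V1)  # one chunk per industry
l <- setNames(lapply(l, function(x) {
    m <- regexec("([0-9]{4})-([0-9]{4}) .*", x$V3[-1]) # omit headline
    r <- regmatches(x$V3[-1], m)
    fromTo <- t(sapply(r, "[", 2:3))  # two columns: start and end of each range
    # expand each range into individual codes and zero-pad them to four digits
    paste(sprintf("%04d", unlist(mapply(":", fromTo[, 1], fromTo[, 2]))), collapse=", ")
}), sapply(l, "[", 1, 3))  # name each element after its headline text
df.new <- data.frame(name=names(l), sic=unlist(l))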
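If the goal is to match firms to industries by SIC code, a long layout with one row per code may be easier to join against than a comma-separated string. The sketch below is an alternative under stated assumptions, not the method above: it works on a hypothetical in-memory `lines` vector instead of the file, uses only base R, and assumes the first non-blank line is an industry header. The name `sic_long` is made up for illustration.

```r
# Hypothetical sample lines mimicking the layout shown in the question.
lines <- c(" 1 Food",
           "     2047-2047 Dog and cat food",
           "     2064-2068 Candy and other confectionery",
           "",
           "     2 Mines",
           "     5050-5052 Wholesale - metals and minerals")

# Drop blank lines and strip the indentation.
lines <- trimws(lines[nzchar(trimws(lines))])
is_code <- grepl("^[0-9]{4}-[0-9]{4}( |$)", lines)

# Carry each header's name down to the code lines that follow it.
header_name <- sub("^[0-9]+ +", "", lines[!is_code])
industry <- header_name[cumsum(!is_code)][is_code]

# Expand every "from-to" range into its individual codes.
from <- as.integer(substr(lines[is_code], 1, 4))
to <- as.integer(substr(lines[is_code], 6, 9))
codes <- Map(seq, from, to)

# One row per (industry, code) pair, codes zero-padded to four digits.
sic_long <- data.frame(
  industry = rep(industry, lengths(codes)),
  sic = sprintf("%04d", unlist(codes)),
  stringsAsFactors = FALSE
)
print(sic_long)
```

A sample that stores SIC codes as four-character strings could then be joined to `sic_long` with `merge()` on the `sic` column.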

Source

2014-02-28 14:11:02 lukeA

I'm amazed. It is very powerful. Thank you. – Jack

How about this?

df <- readLines("Siccodes48.txt")
df <- data.frame(t = df[df != ""])        # delete blanks and make data frame
blank <- strrep(" ", 10)                   # assumes code rows begin with 10 blanks
df$prefix <- substr(df$t, 1, 10)           # break out the prefix (first 10 char)
df$index <- cumsum(df$prefix != blank)     # make an index
ind <- df[df$prefix != blank, ]            # make an index table
ind$desc <- substring(ind$t, 11, 100)      # parse descriptions
final <- merge(ind[, c("index", "desc")],  # merge the index table
               df[df$prefix == blank, c("index", "t")],  # with all non-title rows of the list
               by = "index")               # by index

head(final,10) 

   index desc          t
1      1 Agriculture   0100-0199 Agric production - crops
2      1 Agriculture   0200-0299 Agric production - livestock
3      1 Agriculture   0700-0799 Agricultural services
4      1 Agriculture   0910-0919 Commercial fishing
5      1 Agriculture   2048-2048 Prepared feeds for animals
6      2 Food Products 2000-2009 Food and kindred products
7      2 Food Products 2010-2019 Meat products
8      2 Food Products 2020-2029 Dairy products
9      2 Food Products 2030-2039 Canned-preserved fruits-vegs
10     2 Food Products 2040-2046 Flour and other grain mill products

You can also add this to break the codes out into a separate column:

final$codes <- substr(sub("^ +", "", final$t), 1, 9)  # strip leading blanks, keep "xxxx-xxxx"
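Once the codes sit in their own column, one possible next step is to split each range into numeric bounds and look up which industry a given SIC code falls into. This is a rough sketch only: the small `final` data frame below is a hypothetical stand-in built from the first rows of the output above, `sic_to_industry` is a made-up helper, and the SIC codes in `firm_sic` are invented for illustration.

```r
# Hypothetical stand-in for a few rows of `final` (columns index, desc, codes).
final <- data.frame(
  index = c(1, 1, 2, 2),
  desc = c("Agriculture", "Agriculture", "Food Products", "Food Products"),
  codes = c("0100-0199", "0200-0299", "2000-2009", "2010-2019"),
  stringsAsFactors = FALSE
)

# Split "from-to" into two integer columns.
final$from <- as.integer(substr(final$codes, 1, 4))
final$to <- as.integer(substr(final$codes, 6, 9))

# Hypothetical helper: return the industry whose range contains `sic`,
# or NA when no range matches.
sic_to_industry <- function(sic, ranges) {
  sic <- as.integer(sic)
  hit <- which(ranges$from <= sic & sic <= ranges$to)
  if (length(hit) == 0) NA_character_ else ranges$desc[hit[1]]
}

firm_sic <- c("0150", "2015", "9999")  # made-up SIC codes for illustration
print(vapply(firm_sic, sic_to_industry, character(1), ranges = final))
```

Keeping the bounds as integers avoids expanding every range, at the cost of a lookup per firm; for large samples a join against a fully expanded table may be simpler.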

Source

2014-02-28 14:15:43 Troy

Thank you very much for your time! It gave me valuable insight. – Jack
