2017-04-10 42 views

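Unable to convert character columns to the factor data type in R

I am trying to convert two columns of character data to factors so that I can analyze their levels.

The problem is at the end of the code. One of the two columns is handled fine: when I run `levels` on it, I get the strings I expect.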

> levels(austinCrime2014_data_selected_zips$highestOffenseDesc) 
[1] "AGG ROBBERY BY ASSAULT" "AGG ROBBERY/DEADLY WEAPON" "BURG NON RESIDENCE SHEDS" "BURGLARY NON RESIDENCE" 
[5] "BURGLARY OF RESIDENCE"  "ROBBERY BY ASSAULT"  "ROBBERY BY THREAT" 

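When I run `levels` on the other column, it looks as though something went wrong while converting the data from character to the factor data type.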

> levels(austinCrime2014_data_selected_zips$NIBRS_OffenseDesc) 
[1] "Burglary/\nBreaking & Entering" "Robbery" 

I am hoping someone can help me understand what is happening here and how to correct it.

Here is the code I am working with:

library(data.table) 
library(readr) 
library(dplyr) 

#### 
#### Import 2014 neighborhood economic data 
#### 
# Import data 
austin2014_data_raw <- read_csv('https://data.austintexas.gov/resource/hcnj-rei3.csv', na = '') 
glimpse(austin2014_data_raw) 
nrow(austin2014_data_raw) 

# Clean it: Remove NAs 
austin2014_data <- na.omit(austin2014_data_raw) 
nrow(austin2014_data) # now there's one less row. 

columnSelection <- c("Zip Code", "Population below poverty level", "Median household income", "Unemployment", "Median rent", "Percentage of rental units in poor condition") 

## Our neighborhood economic data subset 
austin2014_data_selection <- subset(austin2014_data, select=columnSelection) 
names(austin2014_data_selection) 

# Extract the zip codes for mapping & comparison with crime data 
zipCodesOfData <- austin2014_data_selection$`Zip Code` 



#### 
#### Import crime data 
#### 

# Import data 
austinCrime2014_data_raw <- read_csv('https://data.austintexas.gov/resource/7g8v-xxja.csv', na = '') 
glimpse(austinCrime2014_data_raw) 
nrow(austinCrime2014_data_raw) 

# Select and rename required columns 
columnSelection_Crime <- c("GO Location Zip", "GO Highest Offense Desc", "Highest NIBRS/UCR Offense Description") 
austinCrime_dataset <- select(austinCrime2014_data_raw, one_of(columnSelection_Crime)) 
names(austinCrime_dataset) <- c("zipcode", "highestOffenseDesc", "NIBRS_OffenseDesc") 
glimpse(austinCrime_dataset) 
nrow(austinCrime_dataset) 

# Filter crime data by zipcodes available in the neighborhood economic data subset 
austinCrime2014_data_selected_zips <- filter(austinCrime_dataset, zipcode %in% zipCodesOfData) 
glimpse(austinCrime2014_data_selected_zips) 
nrow(austinCrime2014_data_selected_zips) 
typeof(austinCrime2014_data_selected_zips) 

#### 
#### Convert our crime data subset from string/char data into factorized data so we can see levels 
#### 

# let's make the character data columns c("highestOffenseDesc", "NIBRS_OffenseDesc") into factors so we can check its levels 
glimpse(austinCrime2014_data_selected_zips) # characters 
cols <- c("highestOffenseDesc", "NIBRS_OffenseDesc") # columns with character datatype to convert to factor datatype 
austinCrime2014_data_selected_zips[cols] <- lapply(austinCrime2014_data_selected_zips[cols], factor) 
glimpse(austinCrime2014_data_selected_zips) # factors 

View(austinCrime2014_data_selected_zips) 
levels(austinCrime2014_data_selected_zips$highestOffenseDesc) #--> looks good 
levels(austinCrime2014_data_selected_zips$NIBRS_OffenseDesc) # output is weird: "Burglary/\nBreaking & Entering" "Robbery" 

The problem is that you need to do more cleaning of the character data and get rid of the \n. – Elin

Answers


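There is nothing wrong with the conversion. It is simply showing you what is actually there: that "cell" of the data table contains a newline character: \n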
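As a minimal sketch of that point, the snippet below uses a made-up vector (`offense` is hypothetical, not the Austin data) to show that `factor()` keeps an embedded newline as part of the level, and that `levels()` only prints it in escaped form:

```r
# Hypothetical stand-in for the NIBRS_OffenseDesc column
offense <- c("Burglary/\nBreaking & Entering", "Robbery", "Robbery")
offense_f <- factor(offense)

# levels() prints the newline as the escape sequence \n
levels(offense_f)

# cat() writes the raw characters, so the line break becomes visible
cat(levels(offense_f)[1], "\n")

# The level is a single string that contains a newline character
grepl("\n", levels(offense_f)[1], fixed = TRUE)
```

The factor has the expected two levels; only the printed representation of the first one looks unusual.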

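If you want to clean it up, you can use `gsub` to replace the escape character. Alternatively, you could just assign a new name to that level.

Take a look here: Remove escapes from a string, or, "how can I get \ out of the way?"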
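The two clean-up options mentioned in this answer could look roughly like the sketch below. It again uses a hypothetical vector `offense` rather than the real column, and dropping the newline entirely (instead of replacing it with a space) is an assumption about the desired result:

```r
# Hypothetical stand-in for the NIBRS_OffenseDesc column
offense <- c("Burglary/\nBreaking & Entering", "Robbery", "Robbery")

# Option 1: clean the character data first, then convert to a factor
offense_clean <- gsub("\n", "", offense, fixed = TRUE)
f1 <- factor(offense_clean)
levels(f1)

# Option 2: convert first, then rename the affected level in place
f2 <- factor(offense)
levels(f2)[levels(f2) == "Burglary/\nBreaking & Entering"] <- "Burglary/Breaking & Entering"
levels(f2)
```

Option 1 is usually easier to apply to a whole column before the `lapply(..., factor)` step, while option 2 is convenient when only one known level needs fixing.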


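Thanks. As a related follow-up, I have a problem with a variable in the same code. The variable is `zipCodesOfData`. When I run `View(zipCodesOfData)` I get this strange output: http://imgur.com/tvbK0wz I am hoping you might know what is causing this. It is very strange because it is as if there were a "ghost" cell containing the entire list of zip codes as a string. –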


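@PatrickMeaney, what is `class(zipCodesOfData)`? My guess is that it is a tibble and you were expecting numbers. What you are seeing is not a cell in the data but the column header. Either that, or 'view' really does do strange things with factors. Personally I never use 'view', so I am not the best person to ask. –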
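One way to run the check suggested in this comment is sketched below. The data frame `df` and its zip codes are made up for illustration, and a base `data.frame` is used instead of the tibble returned by `read_csv`:

```r
# Hypothetical data; the real zip codes come from the Austin data set
df <- data.frame(zip = c(78701L, 78702L, 78705L))

class(df)          # class of the container
class(df$zip)      # class of a column extracted with `$`
class(df["zip"])   # class when single-bracket indexing is used

# str() gives a compact view of what the object actually holds
str(df$zip)
```

Comparing these outputs for `zipCodesOfData` should show whether the object is a bare vector or still a one-column table, which would explain a header-like entry in the viewer.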
