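Unable to convert character columns to the factor data type in R
I am trying to convert two columns of character data to factors so that I can analyze their "levels".
The problem is at the end of the code. One of the two columns is handled fine: when I run the "levels" command on it, I get the strings I expect.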
> levels(austinCrime2014_data_selected_zips$highestOffenseDesc)
[1] "AGG ROBBERY BY ASSAULT" "AGG ROBBERY/DEADLY WEAPON" "BURG NON RESIDENCE SHEDS" "BURGLARY NON RESIDENCE"
[5] "BURGLARY OF RESIDENCE" "ROBBERY BY ASSAULT" "ROBBERY BY THREAT"
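When I run "levels" on the other column, it looks as if the data had trouble being converted from the character data type to the factor data type.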
> levels(austinCrime2014_data_selected_zips$NIBRS_OffenseDesc)
[1] "Burglary/\nBreaking & Entering" "Robbery"
I hope someone can help me understand what is happening here and how to correct it.
Here is the code I am working with:
library(data.table)
library(readr)
library(dplyr)
####
#### Import 2014 neighborhood economic data
####
# Import data
austin2014_data_raw <- read_csv('https://data.austintexas.gov/resource/hcnj-rei3.csv', na = '')
glimpse(austin2014_data_raw)
nrow(austin2014_data_raw)
# Clean it: Remove NAs
austin2014_data <- na.omit(austin2014_data_raw)
nrow(austin2014_data) # now there's one less row.
columnSelection <- c("Zip Code", "Population below poverty level", "Median household income", "Unemployment", "Median rent", "Percentage of rental units in poor condition")
## Our neighborhood economic data subset
austin2014_data_selection <- subset(austin2014_data, select=columnSelection)
names(austin2014_data_selection)
# Extract the zip codes for mapping & comparison with crime data
zipCodesOfData <- austin2014_data_selection$`Zip Code`
####
#### Import crime data
####
# Import data
austinCrime2014_data_raw <- read_csv('https://data.austintexas.gov/resource/7g8v-xxja.csv', na = '')
glimpse(austinCrime2014_data_raw)
nrow(austinCrime2014_data_raw)
# Select and rename required columns
columnSelection_Crime <- c("GO Location Zip", "GO Highest Offense Desc", "Highest NIBRS/UCR Offense Description")
austinCrime_dataset <- select(austinCrime2014_data_raw, one_of(columnSelection_Crime))
names(austinCrime_dataset) <- c("zipcode", "highestOffenseDesc", "NIBRS_OffenseDesc")
glimpse(austinCrime_dataset)
nrow(austinCrime_dataset)
# Filter crime data by zipcodes available in the neighborhood economic data subset
austinCrime2014_data_selected_zips <- filter(austinCrime_dataset, zipcode %in% zipCodesOfData)
glimpse(austinCrime2014_data_selected_zips)
nrow(austinCrime2014_data_selected_zips)
typeof(austinCrime2014_data_selected_zips)
####
#### Convert our crime data subset from string/char data into factorized data so we can see levels
####
# let's make the character data columns c("highestOffenseDesc", "NIBRS_OffenseDesc") into factors so we can check its levels
glimpse(austinCrime2014_data_selected_zips) # characters
cols <- c("highestOffenseDesc", "NIBRS_OffenseDesc") # columns with character datatype to convert to factor datatype
austinCrime2014_data_selected_zips[cols] <- lapply(austinCrime2014_data_selected_zips[cols], factor)
glimpse(austinCrime2014_data_selected_zips) # factors
View(austinCrime2014_data_selected_zips)
levels(austinCrime2014_data_selected_zips$highestOffenseDesc) #--> looks good
levels(austinCrime2014_data_selected_zips$NIBRS_OffenseDesc) # output is weird: "Burglary/\nBreaking & Entering" "Robbery"
The problem is that you need to clean the character data more and get rid of the \n. – Elin
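A minimal sketch of that cleanup, assuming the only defect is a line break embedded in the source string. The factor conversion in the question appears to have worked: levels() returns two levels, and the "\n" is how R prints a newline character stored inside the first one. The `crime` data frame below is a hypothetical stand-in for austinCrime2014_data_selected_zips, built from the two level strings shown in the question so the example runs without downloading anything. Dropping the line break outright (rather than replacing it with a space) is an assumption about the desired label.

```r
# Hypothetical toy data mirroring the two levels shown in the question
crime <- data.frame(
  NIBRS_OffenseDesc = c("Burglary/\nBreaking & Entering",
                        "Robbery",
                        "Burglary/\nBreaking & Entering"),
  stringsAsFactors = FALSE
)

# Remove carriage returns and newlines from the character data
# (assumption: the line break should simply be dropped)
cleaned <- gsub("[\r\n]+", "", crime$NIBRS_OffenseDesc)

# Convert to a factor only after the strings are clean
crime$NIBRS_OffenseDesc <- factor(cleaned)

levels(crime$NIBRS_OffenseDesc)
# [1] "Burglary/Breaking & Entering" "Robbery"
```

With the real data, the same gsub() call could be applied to austinCrime2014_data_selected_zips$NIBRS_OffenseDesc before the lapply(..., factor) step in the question.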