2017-10-13




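Edit: here is what the data I'm working with looks like. I'm trying to replicate the analysis from Silge and Robinson's book Tidy Text, but using Italian opera librettos.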

library(dplyr) 
library(tidytext) 

character <- c("FIGARO", "SUSANNA", "CONTE", "CHERUBINO") 
line <- c("Cinque... dieci.... venti... trenta... trentasei...quarantatre", "Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello.", "Susanna, mi sembri agitata e confusa.", "Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia!") 
# Keep the text as character, not factor, so it can be tokenized 
sample_df <- data.frame(character, line, stringsAsFactors = FALSE) 

character line 
FIGARO Cinque... dieci.... venti... trenta... trentasei...quarantatre 
SUSANNA Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello. 
CONTE  Susanna, mi sembri agitata e confusa. 
CHERUBINO Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia! 


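# Tokenize: one word per row, keeping the character column 
tribble <- sample_df %>% 
  unnest_tokens(word, line) 
# Get rid of stop words 
# I had to make my own list of stop words for 18th century Italian opera 
itstopwords <- tibble(word = mystopwords) 
# The original snippet is cut off after the pipe; anti_join() is assumed here, 
# following the stop-word removal used in the book 
tribble2 <- tribble %>% 
  anti_join(itstopwords, by = "word") 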


character word 
FIGARO cinque 
FIGARO dieci 
FIGARO venti 
FIGARO trenta 



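Hi, please read [this](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and edit your question. Knowing more about what your data looks like and what you have tried will make it possible for other users to help you. – shea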






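library(tidyverse) 
library(tidytext) 

# One row per word, remembering which line of which book it came from 
tidy_austen <- janeaustenr::austen_books() %>% 
    group_by(book) %>% 
    mutate(linenumber = row_number()) %>% 
    ungroup() %>% 
    unnest_tokens(word, text) 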

#> # A tibble: 725,055 x 3 
#>     book linenumber  word 
#>     <fctr>  <int>  <chr> 
#> 1 Sense & Sensibility   1  sense 
#> 2 Sense & Sensibility   1   and 
#> 3 Sense & Sensibility   1 sensibility 
#> 4 Sense & Sensibility   3   by 
#> 5 Sense & Sensibility   3  jane 
#> 6 Sense & Sensibility   3  austen 
#> 7 Sense & Sensibility   5  1811 
#> 8 Sense & Sensibility   10  chapter 
#> 9 Sense & Sensibility   10   1 
#> 10 Sense & Sensibility   13   the 
#> # ... with 725,045 more rows 


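# Gather the words of each line into a list-column, then paste them back together 
# (current tidyr expects the nested columns to be named: nest(data = c(word))) 
nested_austen <- tidy_austen %>% 
    nest(data = c(word)) %>% 
    mutate(text = map(data, unlist), 
           text = map_chr(text, paste, collapse = " ")) 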

#> # A tibble: 62,272 x 4 
#>     book linenumber    data 
#>     <fctr>  <int>   <list> 
#> 1 Sense & Sensibility   1 <tibble [3 x 1]> 
#> 2 Sense & Sensibility   3 <tibble [3 x 1]> 
#> 3 Sense & Sensibility   5 <tibble [1 x 1]> 
#> 4 Sense & Sensibility   10 <tibble [2 x 1]> 
#> 5 Sense & Sensibility   13 <tibble [12 x 1]> 
#> 6 Sense & Sensibility   14 <tibble [13 x 1]> 
#> 7 Sense & Sensibility   15 <tibble [11 x 1]> 
#> 8 Sense & Sensibility   16 <tibble [12 x 1]> 
#> 9 Sense & Sensibility   17 <tibble [11 x 1]> 
#> 10 Sense & Sensibility   18 <tibble [15 x 1]> 
#> # ... with 62,262 more rows, and 1 more variables: text <chr> 
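
To see what the `nest()` and `map_chr()` steps do without loading the whole Austen corpus, here is a minimal sketch on a made-up five-row tibble. The `tokens` data and the `collapsed` name are invented for illustration; only the nest-then-paste pattern comes from the code above.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy stand-in for tidy_austen (made-up values, one token per row)
tokens <- tibble(
  book       = c("A", "A", "A", "B", "B"),
  linenumber = c(1L, 1L, 2L, 1L, 1L),
  word       = c("sense", "and", "chapter", "by", "jane")
)

collapsed <- tokens %>%
  # One row per (book, linenumber); the words move into a list-column
  nest(data = c(word)) %>%
  # Paste each nested word column back into a single string
  mutate(text = map_chr(data, ~ paste(.x$word, collapse = " ")))

print(collapsed)
```

Every column that is not nested (`book` and `linenumber` here) acts as the grouping key, so each original line gets its words back in their original order.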


nested_austen %>% 
    select(text) 

#> # A tibble: 62,272 x 1 
#>                 text 
#>                 <chr> 
#> 1            sense and sensibility 
#> 2              by jane austen 
#> 3                1811 
#> 4               chapter 1 
#> 5 the family of dashwood had long been settled in sussex their estate 
#> 6 was large and their residence was at norland park in the centre of 
#> 7  their property where for many generations they had lived in so 
#> 8 respectable a manner as to engage the general good opinion of their 
#> 9 surrounding acquaintance the late owner of this estate was a single 
#> 10 man who lived to a very advanced age and who for many years of his 
#> # ... with 62,262 more rows
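
Applied to the sample data from the question, the same idea might look like the sketch below. Several things here are assumptions: the two-word `mystopwords` vector is a stand-in for the asker's real list, `line_id` is a hypothetical helper column that keeps lines apart when a character speaks more than once, and `summarise()` with `paste()` is swapped in for `nest()` as an equivalent way to glue the tokens back together.

```r
library(dplyr)
library(tidytext)

# Two lines from the question's sample data
sample_df <- tibble(
  character = c("FIGARO", "CONTE"),
  line = c("Cinque... dieci.... venti... trenta... trentasei...quarantatre",
           "Susanna, mi sembri agitata e confusa.")
)

# Hypothetical stop-word list, for illustration only
mystopwords <- c("mi", "e")

rebuilt <- sample_df %>%
  mutate(line_id = row_number()) %>%                    # remember the original line
  unnest_tokens(word, line) %>%                         # one lowercase word per row
  anti_join(tibble(word = mystopwords), by = "word") %>% # drop the stop words
  group_by(line_id, character) %>%
  summarise(text = paste(word, collapse = " "), .groups = "drop")

print(rebuilt)
```

Note that `unnest_tokens()` lowercases the words and strips punctuation by default, so the rebuilt text is not identical to the original libretto line.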