Simple section labeling for plain text input with tidytext

I'm analyzing some text data with tidytext (and the tidyverse), as in Tidy Text Mining with R.

My input text file, myfile.txt, looks like this:

# Section 1 Name 
Lorem ipsum dolor 
sit amet ... (et cetera) 
# Section 2 Name 
<multiple lines here again>

with around 60 sections.

I want to generate a column section_name with the string "Category 1 Name" or "Category 2 Name" as the value for the corresponding lines. For example, I have

library(tidyverse) 
library(tidytext) 
library(stringr) 

fname <- "myfile.txt" 
all_text <- readLines(fname) 
all_lines <- tibble(text = all_text) 
tidiedtext <- all_lines %>% 
    mutate(linenumber = row_number(), 
     section_id = cumsum(str_detect(text, regex("^#", ignore_case = TRUE)))) %>% 
    filter(!str_detect(text, regex("^#"))) %>% 
    ungroup()

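This adds a column to tidiedtext holding the section number that each line belongs to.

Is it possible to add one line to the mutate() call to add such a column? Or is there another approach I should use?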

2017-02-23 weinerjm

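Here's an approach using grepl (for simplicity), if_else, and tidyr::fill, but there's nothing wrong with the original approach; it's very similar to what the tidytext book uses. Also note that filtering after adding line numbers leaves gaps in the numbering, because the numbers assigned to the header lines are dropped along with those lines. If that matters, add the line numbers after the filter.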

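library(tidyverse)

# Sample input, inlined so the example is self-contained
text <- '# Section 1 Name
Lorem ipsum dolor
sit amet ... (et cetera)
# Section 2 Name
<multiple lines here again>'

# read_lines() treats a string containing newlines as literal data;
# tibble() replaces the deprecated data_frame()
all_lines <- tibble(text = read_lines(text))

tidied <- all_lines %>%
    # keep the header text on header lines, NA everywhere else
    mutate(line = row_number(),
           section = if_else(grepl('^#', text), text, NA_character_)) %>%
    fill(section) %>%              # carry each header down to the lines below it
    filter(!grepl('^#', text))     # drop the header lines themselves

tidied
#> # A tibble: 3 × 3
#>                          text  line          section
#>                         <chr> <int>            <chr>
#> 1           Lorem ipsum dolor     2 # Section 1 Name
#> 2    sit amet ... (et cetera)     3 # Section 1 Name
#> 3 <multiple lines here again>     5 # Section 2 Name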
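
If the gaps in the line numbers matter, the numbering step can be moved after the filter. The following is a minimal sketch of that variation, not a replacement for the code above; it assumes the same five sample lines, supplied here as an in-memory vector, and the result name renumbered is only illustrative.

```r
library(dplyr)
library(tidyr)

# Same sample lines as above, as an in-memory vector (assumption)
all_lines <- tibble(text = c(
  "# Section 1 Name",
  "Lorem ipsum dolor",
  "sit amet ... (et cetera)",
  "# Section 2 Name",
  "<multiple lines here again>"
))

renumbered <- all_lines %>%
  mutate(section = if_else(grepl("^#", text), text, NA_character_)) %>%
  fill(section) %>%
  filter(!grepl("^#", text)) %>%
  mutate(line = row_number())  # numbered after filtering, so no gaps

renumbered
```

With this ordering the line column counts only the body lines, so it no longer points back to positions in the original file.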

If you only want to format the section numbers you already have, just add section_name = paste('Category', section_id, 'Name') to your mutate call.
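
Applied to the pipeline from the question, that suggestion might look like the sketch below. The in-memory vector stands in for readLines("myfile.txt") and is an assumption made so the example runs on its own; the ignore_case argument and the final ungroup() from the question are left out because they have no effect here.

```r
library(dplyr)
library(stringr)

# Stand-in for readLines("myfile.txt") (assumption)
all_lines <- tibble(text = c(
  "# Section 1 Name",
  "Lorem ipsum dolor",
  "sit amet ... (et cetera)",
  "# Section 2 Name",
  "<multiple lines here again>"
))

tidiedtext <- all_lines %>%
  mutate(linenumber = row_number(),
         # running count of header lines seen so far
         section_id = cumsum(str_detect(text, "^#")),
         # build the label from the section number
         section_name = paste("Category", section_id, "Name")) %>%
  filter(!str_detect(text, "^#"))

tidiedtext
```

Because mutate() evaluates its arguments in order, section_name can refer to section_id created earlier in the same call.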

2017-02-23 21:34:41 alistaire

Thanks! This is almost exactly what I was looking for. – weinerjm

I wouldn't want you to rewrite your whole script, but I found the question interesting and wanted to add a base R attempt:

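parse_data <- function(file_name) {
    all_rows <- readLines(file_name)
    # positions of the section header lines (lines starting with '#')
    indices <- which(grepl('^#', all_rows))
    # label every line with the position of the header that opens its section
    splitter <- rep(indices, diff(c(indices, length(all_rows) + 1)))
    lst <- split(all_rows, splitter)
    # the first element of each chunk is the header, the rest is the body
    lst <- lapply(lst, function(x) {
        data.frame(text = x[-1], section = x[1], stringsAsFactors = FALSE)
    })
    line_nums <- seq_along(all_rows)[-indices]
    df <- do.call(rbind.data.frame, lst)
    cbind.data.frame(df, linenumber = line_nums)
}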

Testing with a file named ipsum_data.txt:

parse_data('ipsum_data.txt')

Yields:

text                         section           linenumber
Lorem ipsum dolor            # Section 1 Name  2
sit amet ... (et cetera)     # Section 1 Name  3
<multiple lines here again>  # Section 2 Name  5

The file ipsum_data.txt contains:

# Section 1 Name 
Lorem ipsum dolor 
sit amet ... (et cetera) 
# Section 2 Name 
<multiple lines here again>

I hope this proves useful.
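
For comparison, the same labeling can probably be done in base R without split(), by using cumsum() to index into the header lines. This is only a sketch under stated assumptions: the lines are given as an in-memory vector rather than read from ipsum_data.txt, and the names is_header, headers, and result are illustrative.

```r
# Stand-in for readLines('ipsum_data.txt') (assumption)
all_rows <- c(
  "# Section 1 Name",
  "Lorem ipsum dolor",
  "sit amet ... (et cetera)",
  "# Section 2 Name",
  "<multiple lines here again>"
)

is_header  <- grepl("^#", all_rows)
section_id <- cumsum(is_header)          # 0 for any lines before the first header

# Prepend NA so that section_id 0 maps to a missing section name
headers <- c(NA_character_, all_rows[is_header])

result <- data.frame(
  text       = all_rows[!is_header],
  section    = headers[section_id[!is_header] + 1],
  linenumber = which(!is_header),
  stringsAsFactors = FALSE
)

result
```

Lines that appear before the first header get NA as their section rather than causing a length mismatch.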

2017-02-23 22:17:51 Abdou

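Thanks for the reply. This is very helpful. Rewriting the script is no big deal for me, but I think the other solution is closer to what I was looking for in terms of conciseness. – weinerjm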
