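Count whether a word appears in each row of a 4-million-observation dataset

I am using R to write a script that checks whether any one of about 2,000 words appears in each row of a data file with 4 million observations. The dataset of observations (df) contains two columns, one containing text (df$lead_paragraph) and the other containing a date (df$date).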

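Using the following, I can check whether any of the words in a list (p) appears in each row of the lead_paragraph column of df, and output the answer as a new column.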

df$pcount<-((rowSums(sapply(p, grepl, df$lead_paragraph, 
    ignore.case=TRUE) == TRUE, na.rm=T) > 0) * 1)

However, if I include too many words in the list p, running the code crashes R.

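My fallback strategy is simply to break the list into pieces, but I am wondering whether there is a better, more elegant coding solution to use here. My inclination is to use a for loop, but everything I have read suggests that this is not preferred in R. I am new to R and not a very good coder, so I apologize if this is unclear.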

df$pcount1<-((rowSums(sapply(p[1:100], grepl, df$lead_paragraph, 
    ignore.case=TRUE) == TRUE, na.rm=T) > 0) * 1) 
df$pcount2<-((rowSums(sapply(p[101:200], grepl, df$lead_paragraph, 
    ignore.case=TRUE) == TRUE, na.rm=T) > 0) * 1) 
... 
df$pcount22<-((rowSums(sapply(p[2101:2200], grepl, df$lead_paragraph, 
    ignore.case=TRUE) == TRUE, na.rm=T) > 0) * 1)

Source

2017-08-28 chydock

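A few things/hints, but definitely not a solution (yet). First, the bigger the data, the better off you are moving away from base R (perhaps use 'data.table'?). Second, I would use the 'any' function, in which case you can skip the 'rowSums' part as well as the inequality and the multiplication. Third, do you know whether these words appear at random, or whether there is some pattern, i.e. at the beginning or the end? If so, that would simplify things considerably. Finally, try to parse the text to get rid of unnecessary memory usage.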

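Is the goal to count the number of occurrences of any string in 'p' present in each row? That is: 'for each row of data frame x, count the N occurrences of any string in p and total them in a new column'?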

@CarlBoneri - Yes. Ultimately I only need to know whether any string in p appears in a given row of data (binary, true/false), but a count would do. – chydock

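I haven't finished this... but it should point you in the right direction. Using the data.table package would be faster, but hopefully this gives you an idea of the process.

I recreated your dataset using random dates and strings extracted from http://www.norvig.com/big.txt into a data.frame named nrv_df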

library(stringi) 

> head(nrv_df) 
                  lead_para  date 
1  The Project Gutenberg EBook of The Adventures of Sherlock Holmes 2018-11-16 
2           by Sir Arthur Conan Doyle 2019-06-05 
3       15 in our series by Sir Arthur Conan Doyle 2017-08-08 
4 Copyright laws are changing all over the world Be sure to check the 2014-12-17 
5 copyright laws for your country before downloading or redistributing 2016-09-13 
6       this or any other Project Gutenberg eBook 2015-06-15 

> dim(nrv_df) 
[1] 103598  2 

I then randomly sampled words from the entire body to get 2000 unique words 
> length(p) 
[1] 2000 
> head(p) 
[1] "The"  "Project" "Gutenberg" "EBook"  "of"   "Adventures" 
> tail(p) 
[1] "accomplice" "engaged" "guessed" "row"  "moist"  "red"

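Then, to take advantage of the stringi package and use a regular expression that matches whole words only, I wrap every string in the vector p in word boundaries and collapse them with |, so that we are looking for any of the words with a word boundary before and after it: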

> p_join2 <- stri_join(sprintf("\\b%s\\b", p), collapse = "|") 
> p_join2 

[1] "\\bThe\\b|\\bProject\\b|\\bGutenberg\\b|\\bEBook\\b|\\bof\\b|\\bAdventures\\b|\\bSherlock\\b|\\bHolmes\\b|\\bby\\b|\\bSir\\b|\\bArthur\\b|\\bConan\\b|\\bDoyle\\b|\\b15\\b|\\bin\\b|\\bour\\b|\\bseries\\b|\\bCopyright\\b|\\blaws\\b|\\bare\\b|\\bchanging\\b|\\ball\\b|\\bover\\b|\\bthe\\b|\\bworld\\b|\\bBe\\b|\\bsure\\b|\\bto\\b|\\bcheck\\b|\\bcopyright\\b|\\bfor\\b|\\byour\\b|\\bcountry\\b|..."

Then simply count the words. You can assign the result with nrv_df$counts <- to add it as a column...

> stri_count_regex(nrv_df$lead_para[25000:26000], p_join2, stri_opts_regex(case_insensitive = TRUE)) 
[1] 12 11 8 13 7 7 6 7 6 8 12 1 6 7 8 3 5 3 5 5 5 4 7 5 5 5 5 5 10 2 8 13 5 8 9 7 6 5 7 5 9 8 7 5 7 8 5 6 0 8 6 
[52] 3 4 0 10 7 9 8 4 6 8 8 7 6 6 6 0 3 5 4 7 6 5 7 10 8 10 10 11
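
Since only a yes/no answer per row is ultimately needed, the same kind of combined pattern can be passed to `stri_detect_regex` instead of `stri_count_regex`. A small sketch on made-up data, assuming the words in `p` contain no regex metacharacters (otherwise they would need escaping first); `txt`, `has_word` and `pcount` are hypothetical names, with `txt` standing in for `nrv_df$lead_para`.

```r
library(stringi)

# Made-up stand-in for nrv_df$lead_para
txt <- c("The Adventures of Sherlock Holmes",
         "by Sir Arthur Conan Doyle",
         "nothing relevant here")
p <- c("holmes", "doyle")

# One alternation with word boundaries around every term
p_join2 <- stri_join(sprintf("\\b%s\\b", p), collapse = "|")

# TRUE/FALSE per row, then converted to 1/0
has_word <- stri_detect_regex(txt, p_join2, case_insensitive = TRUE)
pcount <- as.integer(has_word)
```

Detection can stop at the first match in each string, whereas counting has to scan for every match, so this form may be somewhat cheaper when the count itself is not needed.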

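Edit:

Since it turns out that the number of matches is not needed... First, a function that does the work for each paragraph and detects whether any of the strings in p2 is present in lead_paragraph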

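f <- function(i, j){ 
    # i: a single paragraph; j: character vector of words to look for. 
    # stri_detect_fixed() returns one logical per word; it has no 
    # omit_no_match argument, so that argument is dropped here. 
    if(any(stri_detect_fixed(i, j))){ 
        1 
    } else { 
        0 
    } 
}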
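
To see what `f` relies on: given one string and a vector of patterns, `stri_detect_fixed` returns one logical per pattern, and `any` collapses them into a single answer. A toy illustration with made-up inputs (`para`, `words`, `per_word` and `found` are hypothetical names). Note that fixed matching is case-sensitive and ignores word boundaries, so it can give different answers from the regex approach.

```r
library(stringi)

para  <- "by Sir Arthur Conan Doyle"
words <- c("Doyle", "Holmes", "thur")

# One TRUE/FALSE per word; "thur" matches inside "Arthur"
per_word <- stri_detect_fixed(para, words)

# 1 if at least one word occurs in the paragraph, else 0
found <- as.integer(any(per_word))
```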

Now... using the parallel library on Linux, and testing only 1000 rows since this is just an example, gives us:

library(parallel) 
library(stringi) 
> rst <- mcmapply(function(x){ 
    f(i = x, j = p2) 
}, vdf2$lead_paragraph[1:1000], 
mc.cores = detectCores() - 2, 
USE.NAMES = FALSE) 
> rst 
    [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
    [70] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[139] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 
[208] 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[277] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[346] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 0 1 1 1 1 
[415] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[484] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[553] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[622] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[691] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[760] 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[829] 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
[898] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 
[967] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Source

2017-08-29 00:26:12

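Both solutions work well, thank you. As needed, neither crashes when I include the whole dataset (4 million rows), and there is no need to split up the analysis. – chydock

This also works, and is faster than the previous solution: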

library(corpus) 

# simulate the problem as in @carl-boneri's answer 
lead_para <- readLines("http://www.norvig.com/big.txt") 

# get a random sample of 2000 word types 
types <- text_types(lead_para, collapse = TRUE) 
p <- sample(types, 2000) 

# find whether each entry has at least one of the terms in `p` 
ix <- text_detect(lead_para, p)

Even using only a single core, it is more than 20 times faster:

system.time(ix <- text_detect(lead_para, p)) 
## user system elapsed 
## 0.231 0.008 0.240 

system.time(rst <- mcmapply(function(x) f(i = x, j = p_join2), 
          lead_para, mc.cores = detectCores() - 2, 
          USE.NAMES = FALSE)) 
## user system elapsed 
## 11.604 0.240 5.805

Source

2017-10-04 22:44:36
