Are there text-processing functions in R that operate at the word level?

I am trying to find a set of functions in R that operate at the word level, for example a function that returns the position of a word. Given the following sentence and query:

sentence <- "A sample sentence for demo" 
query <- "for"

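The function would return 4, because for is the 4th word.
It would be great to have a utility function that lets me extend query to the left or to the right. For example, extend(query, 'right') would return for demo, and extend(query, 'left') would return sentence for.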

I have already gone through functions such as grep, gregexpr, and word from the stringr package. All of them seem to operate at the character level.

Source

2017-04-02 Imran Ali

Check out stringr::word, e.g. word(string, start = 1L, end = start, sep = fixed(" ")). You can also use negative indices, e.g. start = -2L with end = -1L, to get the last two words. – p0bs

I wrote my own functions. indexOf returns the index of word if it is found in sentence and -1 otherwise, much like Java's indexOf().

indexOf <- function(sentence, word) {
    # Split on single spaces and flatten the result to a character vector
    sentenceAsVector <- unlist(strsplit(sentence, split = " "))

    if (!(word %in% sentenceAsVector)) {
        result <- -1
    } else {
        # Returns every matching position if the word occurs more than once
        result <- which(sentenceAsVector == word)
    }
    return(result)
}
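If only the first occurrence is needed, base R's match() can express the same lookup more compactly. The following is a minimal sketch rather than the code from the answer above; the name indexOfFirst is made up for illustration, and it assumes that words are separated by single spaces.

```r
# Hypothetical helper: position of the first occurrence of `word`,
# or -1 if the word does not appear in the sentence.
indexOfFirst <- function(sentence, word) {
    # fixed = TRUE treats the separator as a literal space, not a regex
    words <- strsplit(sentence, split = " ", fixed = TRUE)[[1]]
    # match() returns the first matching index, or `nomatch` when absent
    match(word, words, nomatch = -1L)
}

sentence <- "A sample sentence for demo"
indexOfFirst(sentence, "for")      # 4
indexOfFirst(sentence, "missing")  # -1
```

Unlike which(), match() always returns a single value, which makes the result safe to use in an if() condition.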

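The extend function works, but it is long and does not look like R code. If query sits on the boundary of the sentence, i.e. it is the first or the last word, the first two words or the last two words are returned.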

extend <- function(sentence, query, direction) {
    sentenceAsVector <- unlist(strsplit(sentence, split = " "))
    lengthOfSentence <- length(sentenceAsVector)
    # Assumes query occurs at most once; indexOf returns -1 when it is absent
    location <- indexOf(sentence, query)
    if (location == -1) {
        return(NA_character_)  # query not found, nothing to extend
    }
    boundary <- location == 1 || location == lengthOfSentence

    if (!boundary) {
        if (direction == "right") {
            # query plus the word that follows it
            return(paste(sentenceAsVector[location],
                         sentenceAsVector[location + 1]))
        } else if (direction == "left") {
            # the word before query plus query
            return(paste(sentenceAsVector[location - 1],
                         sentenceAsVector[location]))
        }
    } else {
        if (location == 1) {
            # query is the first word: return the first two words
            return(paste(sentenceAsVector[1], sentenceAsVector[2]))
        }
        if (location == lengthOfSentence) {
            # query is the last word: return the last two words
            return(paste(sentenceAsVector[lengthOfSentence - 1],
                         sentenceAsVector[lengthOfSentence]))
        }
    }
}
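The boundary handling described above can probably be written more compactly by computing where a two-word window starts and clamping that index to the sentence. This is only a sketch under a few assumptions: words are separated by single spaces, only the first occurrence of query matters, and the name extendQuery is hypothetical.

```r
# Hypothetical compact variant of extend(): returns query together with
# its right or left neighbour, falling back to the first or last two
# words when query sits on the sentence boundary.
extendQuery <- function(sentence, query, direction = c("right", "left")) {
    direction <- match.arg(direction)
    words <- strsplit(sentence, split = " ", fixed = TRUE)[[1]]
    location <- match(query, words)        # first occurrence, NA if absent
    if (is.na(location)) return(NA_character_)
    if (length(words) < 2) return(words)   # no neighbour to extend with

    # Start of the two-word window, clamped to the valid range
    start <- if (direction == "right") location else location - 1
    start <- min(max(start, 1), length(words) - 1)
    paste(words[start:(start + 1)], collapse = " ")
}

sentence <- "A sample sentence for demo"
extendQuery(sentence, "for", "right")  # "for demo"
extendQuery(sentence, "for", "left")   # "sentence for"
extendQuery(sentence, "A", "left")     # "A sample"
```

match.arg() rejects any direction other than "right" or "left", so a typo fails with an error instead of silently returning NULL.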

Source

2017-04-06 19:46:09

As I mentioned in my comment, stringr is useful in these situations.

library(stringr) 

sentence <- "A sample sentence for demo" 
wordNumber <- 4L 

fourthWord <- word(string = sentence,
                   start = wordNumber)

previousWords <- word(string = sentence,
                      start = wordNumber - 1L,
                      end = wordNumber)

laterWords <- word(string = sentence,
                   start = wordNumber,
                   end = wordNumber + 1L)

This yields:

> fourthWord 
[1] "for" 
> previousWords 
[1] "sentence for" 
> laterWords 
[1] "for demo"

I hope this helps.

Source

2017-04-02 16:32:25 p0bs

If you use scan, it splits the input on whitespace:

> s.scan <- scan(text=sentence, what="") 
Read 5 items 
> which(s.scan == query) 
[1] 4

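The what="" argument tells scan to expect character rather than numeric input. If your input consists of complete English sentences, you may need to use gsub with patt="[[:punct:]]" to replace the punctuation. If you are trying to classify parts of speech or process large documents, you may also want to look at the tm (text mining) package.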
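The punctuation step might look like the sketch below. It uses only base R; the example sentence and the case-insensitive comparison via tolower are assumptions added for illustration, not part of the answer above.

```r
# Example input with punctuation attached to some words (made up here)
sentence <- "For one, for two! For three?"
query <- "for"

# Drop every punctuation character before splitting into words
cleaned <- gsub(pattern = "[[:punct:]]", replacement = "", x = sentence)

# quiet = TRUE suppresses the "Read N items" message
words <- scan(text = cleaned, what = "", quiet = TRUE)

# Compare in lower case so that "For" and "for" both match
positions <- which(tolower(words) == query)
positions  # 1 3 5
```

Removing punctuation this way also removes apostrophes and hyphens inside words, which may or may not be what you want.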

Source

2017-04-02 17:52:53

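The answer depends on what you mean by a "word". If you mean whitespace-separated tokens, then @imran-ali's answer works fine. If you mean words as defined by Unicode, with special attention to punctuation, then you need something more sophisticated.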

The following handles punctuation correctly:

library(corpus) 
sentence <- "A sample sentence for demo" 
query <- "for" 

# use text_locate to find all instances of the query, with context 
text_locate(sentence, query) 
## text    before    instance    after    
## 1 1     A sample sentence for  demo    

# find the number of tokens before, then add 1 to get the position 
text_ntoken(text_locate(sentence, query)$before) + 1 
## 4

This also works if there are multiple matches:

sentence2 <- "for one, for two! for three? for four" 
text_ntoken(text_locate(sentence2, query)$before) + 1 
## [1] 1 4 7 10

We can confirm that this is correct:

text_tokens(sentence2)[[1]][c(1, 4, 7, 10)] 
## [1] "for" "for" "for" "for"
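For comparison, here is a base R sketch of what happens when the same sentence is split on whitespace only. It is an illustration of the difference between the two notions of "word", not part of the corpus package.

```r
sentence2 <- "for one, for two! for three? for four"
query <- "for"

# Split on spaces only; punctuation stays attached to the neighbouring word
tokens <- strsplit(sentence2, split = " ", fixed = TRUE)[[1]]

# The second token is "one,", so an exact match on "one" fails
hasOne <- "one" %in% tokens

# Positions of query counted in whitespace-separated tokens
spacePositions <- which(tokens == query)
```

Here hasOne is FALSE and spacePositions is 1 3 5 7, whereas the positions reported above are 1 4 7 10. The gap suggests that the tokenizer used above counts each punctuation mark as its own token, so the two approaches number words differently as soon as punctuation is present.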

Source

2017-10-04 22:04:08
