2017-04-02 15 views

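Is there a text-processing function in R that operates at the word level?

I am trying to find a set of functions in R that operate at the word level, for example a function that returns the position of a word. Given the following sentence and query: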

sentence <- "A sample sentence for demo" 
query <- "for" 
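  1. The function would return 4, because for is the 4th word.

  2. It would be great to have a utility function that lets me extend query to the left or to the right. For example, extend(query, 'right') would return "for demo" and extend(query, 'left') would return "sentence for".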

I have already gone through functions such as grep, gregexpr, and word from the stringr package. All of them seem to operate at the character level.

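Check out stringr::word, whose signature is word(string, start = 1L, end = start, sep = fixed(" ")). Negative indices count backwards from the last word, so start = -2L with end = -1L gets the last two words. – p0bs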

Answers

0

I wrote my own functions. The indexOf function returns the index of word if it is found in sentence and -1 otherwise, much like Java's indexOf().

indexOf <- function(sentence, word){
    # split the sentence on single spaces into a character vector of words
    sentenceAsVector <- unlist(strsplit(sentence, split = " "))

    if(!(word %in% sentenceAsVector)){
        # word not found
        result <- -1
    }
    else{
        # position of every occurrence of word
        result <- which(sentenceAsVector == word)
    }
    return(result)
}

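The extend function works fine, but it is long and does not look like idiomatic R code. If query is at the boundary of the sentence, i.e. it is the first or the last word, the first two words or the last two words are returned.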

extend <- function(sentence, query, direction){
    sentenceAsVector <- unlist(strsplit(sentence, split = " "))
    lengthOfSentence <- length(sentenceAsVector)
    # indexOf() may return several positions; only the first match is used
    location <- indexOf(sentence, query)[1]
    if(location == -1){
        # query not found: nothing to extend
        return(NA_character_)
    }
    if(location == 1){
        # first word: return the first two words, whatever the direction
        return(paste(sentenceAsVector[1], sentenceAsVector[2], sep = " "))
    }
    if(location == lengthOfSentence){
        # last word: return the last two words, whatever the direction
        return(paste(sentenceAsVector[lengthOfSentence - 1],
                     sentenceAsVector[lengthOfSentence], sep = " "))
    }
    if(direction == "right"){
        return(paste(sentenceAsVector[location],
                     sentenceAsVector[location + 1], sep = " "))
    }
    else if(direction == "left"){
        return(paste(sentenceAsVector[location - 1],
                     sentenceAsVector[location], sep = " "))
    }
}
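
The boundary rule described above (fall back to the first two or last two words) can also be expressed by clamping the start of a two-word window. The following is only a sketch in base R: extend_word is a hypothetical name, not part of the answer's code, and returning NA when the query is missing is an assumption about the desired behaviour.

```r
# Hypothetical compact variant of extend(); base R only.
extend_word <- function(sentence, query, direction = c("right", "left")) {
  direction <- match.arg(direction)
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  pos <- which(words == query)
  # Assumption: a missing query yields NA rather than an error
  if (length(pos) == 0L) return(NA_character_)
  pos <- pos[1]                      # use the first match only
  start <- if (direction == "left") pos - 1L else pos
  # Clamp the two-word window so it stays inside the sentence
  start <- max(1L, min(start, length(words) - 1L))
  paste(words[start:min(start + 1L, length(words))], collapse = " ")
}

sentence <- "A sample sentence for demo"
extend_word(sentence, "for", "right")
extend_word(sentence, "for", "left")
extend_word(sentence, "A", "left")
```

Clamping removes the separate boundary branches, at the cost of being a little less explicit about what happens at the first and last word.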
0

As I mentioned in my comment, stringr is useful in these cases.

library(stringr) 

sentence <- "A sample sentence for demo" 
wordNumber <- 4L 

fourthWord <- word(string = sentence,
                   start = wordNumber)

previousWords <- word(string = sentence,
                      start = wordNumber - 1L,
                      end = wordNumber)

laterWords <- word(string = sentence,
                   start = wordNumber,
                   end = wordNumber + 1L)

This yields:

> fourthWord 
[1] "for" 
> previousWords 
[1] "sentence for" 
> laterWords 
[1] "for demo" 
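
The example above hard-codes the word number. As a rough sketch, the position could instead be derived from the query with base R's match() and then passed to word(). This assumes the stringr package is installed; splitting on single spaces and using only the first match are assumptions of this sketch.

```r
library(stringr)

sentence <- "A sample sentence for demo"
query <- "for"

# Position of the first word equal to the query (NA if it is absent)
words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
wordNumber <- match(query, words)

# Reuse the same word() calls as above with the derived position
previousWords <- word(sentence, start = wordNumber - 1L, end = wordNumber)
laterWords <- word(sentence, start = wordNumber, end = wordNumber + 1L)
```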

I hope that helps.

1

If you use scan, it will split the input at whitespace:

> s.scan <- scan(text=sentence, what="") 
Read 5 items 
> which(s.scan == query) 
[1] 4 

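The what="" argument tells scan to expect character rather than numeric input. If your input consists of full English sentences, you may need to strip punctuation first, using gsub with patt="[[:punct:]]". If you are trying to classify parts of speech or are handling large documents, you may also want to look at the tm (text mining) package.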
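
A minimal sketch of the punctuation step mentioned above, in base R. The punctuated sentence is made up for illustration, and quiet = TRUE is added only to suppress the "Read 5 items" message.

```r
# Illustrative input; the original example sentence has no punctuation
sentence <- "A sample sentence, for demo!"
query <- "for"

# Remove punctuation, then let scan() split on whitespace
clean <- gsub("[[:punct:]]", "", sentence)
tokens <- scan(text = clean, what = "", quiet = TRUE)

which(tokens == query)
```

Note that this also removes apostrophes and hyphens inside words, which may or may not be what you want.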

0

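The answer depends on what you mean by a "word". If you mean whitespace-separated tokens, then @imran-ali's answer works fine. If you mean words as defined by Unicode, with special attention to punctuation, then you need something more sophisticated.

The following handles punctuation correctly: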

library(corpus) 
sentence <- "A sample sentence for demo" 
query <- "for" 

# use text_locate to find all instances of the query, with context 
text_locate(sentence, query) 
## text    before    instance    after    
## 1 1     A sample sentence for  demo    

# find the number of tokens before, then add 1 to get the position 
text_ntoken(text_locate(sentence, query)$before) + 1 
## 4 

This also works when there are multiple matches:

sentence2 <- "for one, for two! for three? for four" 
text_ntoken(text_locate(sentence2, query)$before) + 1 
## [1] 1 4 7 10 

We can confirm that this is correct:

text_tokens(sentence2)[[1]][c(1, 4, 7, 10)] 
## [1] "for" "for" "for" "for" 
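
The positions 1, 4, 7, 10 follow from each punctuation mark counting as its own token. A rough base R approximation that makes this visible is sketched below. It uses a simple regular expression rather than Unicode word segmentation, so it is not equivalent to what the corpus package does and only illustrates the counting.

```r
sentence2 <- "for one, for two! for three? for four"
query <- "for"

# Treat runs of letters/digits and single punctuation marks as tokens
tokens <- regmatches(sentence2,
                     gregexpr("[[:alnum:]]+|[[:punct:]]", sentence2))[[1]]

which(tokens == query)
```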