String matching in R: finding the best match

I have two vectors of words.

Corpus<- c('animalada', 'fe', 'fernandez', 'ladrillo') 

Lexicon<- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')

I need to find the best match between the Lexicon and the Corpus. I have tried many approaches; this is one of them.

library(stringr) 

match<- paste(Lexicon,collapse= '|^') # I use the stemming method (snowball), so the words in Lexicon are root of words 

test<- str_extract_all(Corpus, match, simplify= T) 

test 

[,1] 
[1,] "animal" 
[2,] "fe" 
[3,] "fe" 
[4,] "ladr"

However, the match should be:

[1,] "animalada" 
[2,] "fe" 
[3,] "fernandez" 
[4,] "ladrillo"

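Instead, each word is matched with the first Lexicon entry, in alphabetical order, that fits. By the way, these vectors are samples of much larger lists that I have.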
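To see why the shorter entries win, it may help to print the pattern that paste() builds. The sketch below uses base R with perl = TRUE as a stand-in for stringr (an assumption made only to keep the example dependency-free). Both engines try alternatives from left to right and stop at the first one that fits. The names pattern and first_hit are illustrative.

```r
Corpus  <- c('animalada', 'fe', 'fernandez', 'ladrillo')
Lexicon <- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')

# collapse only puts the separator BETWEEN elements, so the first
# alternative ('animal') ends up without the '^' anchor.
pattern <- paste(Lexicon, collapse = '|^')
print(pattern)

# perl = TRUE selects PCRE, which tries the alternatives left to right
# and keeps the first one that matches, not the longest one.
first_hit <- regmatches(Corpus, regexpr(pattern, Corpus, perl = TRUE))
print(first_hit)
```

Because 'animal' comes before 'animalada' in the pattern, the engine never gets as far as the longer entry.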

I have not tried regex() because I am not sure how it works. Maybe that is where the solution lies.

Can you help me solve this problem? Thank you for your help.

Source

2017-09-23 pch919

You can order the Lexicon patterns by number of characters, in decreasing order, so that the best (longest) match comes first:

match<- paste(Lexicon[order(-nchar(Lexicon))], collapse = '|^') 

test<- str_extract_all(Corpus, match, simplify= T) 

test 
#  [,1]  
#[1,] "animalada" 
#[2,] "fe"  
#[3,] "fernandez" 
#[4,] "ladrillo"
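One detail worth noting: with collapse = '|^' the first alternative still has no '^' in front of it. A possible variant, sketched here in base R, wraps all alternatives in a single anchored group. It assumes the Lexicon entries contain no regex metacharacters. The helper best_prefix and the extra words 'tambien' and 'bien' are made up for illustration.

```r
# Hypothetical helper: longest Lexicon entry that starts each Corpus word.
best_prefix <- function(corpus, lexicon) {
  # Longest entries first, so the engine meets them before their prefixes.
  by_length <- lexicon[order(-nchar(lexicon))]
  # One '^' in front of a non-capturing group anchors every alternative.
  pattern <- paste0('^(?:', paste(by_length, collapse = '|'), ')')
  hits <- regexpr(pattern, corpus, perl = TRUE)
  found <- hits != -1                       # regexpr() returns -1 for no match
  out <- rep(NA_character_, length(corpus))
  out[found] <- regmatches(corpus, hits)    # only matched elements are returned
  out
}

Corpus  <- c('animalada', 'fe', 'fernandez', 'ladrillo', 'tambien')
Lexicon <- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo', 'bien')

best <- best_prefix(Corpus, Lexicon)
print(best)
```

Words with no entry at their start come back as NA, which keeps the result aligned with Corpus.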

Source

2017-09-23 01:54:24 Psidom

I am testing your answers with the real Lexicon. I will report the results later. Thank you both – pch919

You can simply use the match function.

Index <- match(Corpus, Lexicon) 

Index 
[1] 2 3 4 6 

Lexicon[Index] 
[1] "animalada" "fe" "fernandez" "ladrillo"
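Keep in mind that match() compares whole strings, so this only works while every Corpus word appears verbatim in Lexicon. A small sketch of what happens otherwise follows; 'ladrillos' is a made-up inflected form added for illustration.

```r
Lexicon <- c('animal', 'animalada', 'fe', 'fernandez', 'ladr', 'ladrillo')

# 'ladrillos' is a hypothetical word with no identical Lexicon entry.
Corpus <- c('animalada', 'ladrillos')

Index <- match(Corpus, Lexicon)
print(Index)            # NA marks a word with no identical entry
print(Lexicon[Index])   # indexing with NA yields NA
```

With a stemmed Lexicon, where entries are roots rather than full words, most Corpus words would likely fall into the NA case.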

Source

2017-09-23 01:59:20 Santosh

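I tried both methods, and the one that works is the one suggested by @Psidom. If you use the function match(), a match is found in any part of the word, not necessarily at the beginning. For example: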

Corpus<- c('tambien') 
Lexicon<- c('bien') 
match(Corpus,Lexicon)

The result is 'tambien', but that is not correct.
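For anyone reproducing this: base R's match() compares whole strings, so the call above would be expected to return NA. A hit inside 'tambien' is what an unanchored regular expression produces. The sketch below contrasts the three cases; the variable names are illustrative.

```r
Corpus  <- c('tambien')
Lexicon <- c('bien')

# Whole-string comparison: 'tambien' is not identical to 'bien'.
exact <- match(Corpus, Lexicon)

# Unanchored regex: 'bien' is found in the middle of the word.
unanchored <- regmatches(Corpus, regexpr('bien', Corpus, perl = TRUE))

# Anchored regex: 'bien' must sit at the start of the word, so no match.
anchored <- grepl('^bien', Corpus, perl = TRUE)

print(exact)
print(unanchored)
print(anchored)
```

Either way, the '^' anchor is what rules out the mid-word hit.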

Thanks again for your help!

Source

2017-09-27 03:16:36 pch919
