2017-04-22

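PHP: Unicode character problem when splitting a string into n-grams

I am trying to generate the n-grams of a string in PHP. For that I use this function, taken from https://gist.github.com/Xeoncross/5366393: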

function Bigrams($word){ 
    $ngrams = array(); 
    $len = strlen($word); 
    for($i=0;$i+1<$len;$i++){ 
     $ngrams[$i]=$word[$i].$word[$i+1]; 
    } 
    return $ngrams; 
} 

$word = "abcdefg"; 

print_r(Bigrams($word)); 

That works and returns the expected n-grams:

[0] => ab 
[1] => bc 
[2] => cd 
[3] => de 
[4] => ef 
[5] => fg 

But for strings containing some Unicode characters it does not return the expected result.

For example, $word = "Lòria" returns:

[0] => L� 
[1] => ò 
[2] => �r 
[3] => ri 

Or $word = "пожалуйста" returns:

[0] => п 
[1] => �� 
[2] => о 
[3] => �� 
[4] => ж 
[5] => �� 
[6] => а 
[7] => �� 
[8] => л 

Any idea how to solve this?

Answers


Use the multibyte (Unicode-aware) string functions, mb_strlen and mb_substr, instead of strlen and byte indexing:

function Bigrams($word){ 
    $ngrams = array(); 
    $len = mb_strlen($word); 
    for($i=0;$i+1<$len;$i++){ 
     $ngrams[$i]=mb_substr($word, $i, 2); 
    } 
    return $ngrams; 
} 

$word = "пожалуйста"; 

print_r(Bigrams($word)); 

Result:

Array 
(
    [0] => по 
    [1] => ож 
    [2] => жа 
    [3] => ал 
    [4] => лу 
    [5] => уй 
    [6] => йс 
    [7] => ст 
    [8] => та 
)
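
Some background on why the original version breaks: strlen() and $word[$i] work on bytes, and in UTF-8 each Cyrillic letter (and "ò") takes two bytes, so the byte-wise loop cuts characters in half. The mb_* functions count characters instead. They use the default internal encoding unless one is passed, so passing 'UTF-8' explicitly is a little safer.

If n-grams of other sizes are needed, the same idea can be generalized. The following is only a sketch, not part of the answer above: the name ngrams() and its parameters are made up for illustration, and it assumes the input is valid UTF-8. It splits the string into characters once with preg_split() and the /u modifier, then joins sliding windows of $n characters.

```php
<?php
// Hypothetical helper (illustration only): n-grams over UTF-8 characters.
function ngrams(string $word, int $n = 2): array
{
    if ($n < 1) {
        throw new InvalidArgumentException('n must be at least 1');
    }

    // Split into characters (code points) once. The /u modifier makes
    // PCRE treat both pattern and subject as UTF-8.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        // preg_split() returns false when the subject is not valid UTF-8.
        throw new InvalidArgumentException('input is not valid UTF-8');
    }

    // Slide a window of $n characters over the array.
    $ngrams = [];
    $last = count($chars) - $n;
    for ($i = 0; $i <= $last; $i++) {
        $ngrams[] = implode('', array_slice($chars, $i, $n));
    }
    return $ngrams;
}

// Byte length versus character length of the same string.
echo strlen('пожалуйста'), ' bytes, ',
     mb_strlen('пожалуйста', 'UTF-8'), " characters\n";

// Trigrams: пож, ожа, жал, алу, луй, уйс, йст, ста
print_r(ngrams('пожалуйста', 3));
```

Note that this works on code points. A letter written as a base character plus a combining accent would still be split in two. If that matters, the grapheme_* functions from the intl extension may be a better fit.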