2011-07-01 136 views
-1

PHP: optimizing performance

I have the following code, but it is too slow.

<?php 
class Ngram { 

const SAMPLE_DIRECTORY = "samples/"; 
const GENERATED_DIRECTORY = "languages/"; 
const SOURCE_EXTENSION = ".txt"; 
const GENERATED_EXTENSION = ".lng"; 
const N_GRAM_MIN_LENGTH = 1; 
const N_GRAM_MAX_LENGTH = 6; 

public function __construct() { 
    mb_internal_encoding('UTF-8'); 
    $this->generateNGram(); 
} 

private function getFilePath() { 
    // Collect every source file located directly inside the sample directory. 
    $filesPath = array(); 
    $path = rtrim(self::SAMPLE_DIRECTORY, '/' . DIRECTORY_SEPARATOR); 
    foreach (array_diff(scandir($path), array('.', '..')) as $file) { 
     $fullPath = $path . DIRECTORY_SEPARATOR . $file; 
     // The original called an undefined fetchdir() with an undefined $callback here; 
     // subdirectories are simply skipped instead. 
     if (is_dir($fullPath)) 
      continue; 
     // Keep only files whose name ends with the source extension. 
     if (substr($file, -strlen(self::SOURCE_EXTENSION)) === self::SOURCE_EXTENSION) 
      $filesPath[] = $fullPath; 
    } 
    return $filesPath; 
} 
protected function removeUniCharCategories($string){ 
    // Replace "other punctuation" (' " # % & ! . : , ? ¿) with a space, 
    // e.g. 'You&me' becomes 'You me' (lowercased further down). 
    $string = preg_replace("/\p{Po}/u", " ", $string); 
    // Drop everything that is neither a letter nor a space separator. 
    // \p{L} covers Ll, Lm, Lo, Lt and Lu; the "|" inside the original 
    // character class was a literal pipe, not an alternation. 
    $string = preg_replace("/[^\p{L}\p{Zs}]/u", "", $string); 
    $string = trim($string); 
    $string = mb_strtolower($string,'UTF-8'); 
    return $string; 
} 
private function generateNGram() { 
    $files = $this->getFilePath(); 
    foreach($files as $file) { 
     // The second argument of file_get_contents() is $use_include_path, so FILE_TEXT is dropped. 
     $file_content = file_get_contents($file); 
     $file_content = $this->removeUniCharCategories($file_content); 
     $words = explode(" ", $file_content); 
     $tokens = array(); 
     foreach ($words as $word) { 
      // Pad each word so that n-grams also record word boundaries. 
      $word = "_" . $word . "_"; 
      $length = mb_strlen($word, 'UTF-8'); 
      for ($i = self::N_GRAM_MIN_LENGTH, $max = min(self::N_GRAM_MAX_LENGTH, $length); $i <= $max; $i++) { 
       for ($j = 0, $li = $length - $i; $j <= $li; $j++) { 
        $token = mb_substr($word, $j, $i, 'UTF-8'); 
        // Skip tokens that consist of padding only. 
        if (trim($token, "_") !== '') { 
         $tokens[] = $token; 
        } 
       } 
      } 
     } 
     // Rank the n-grams by frequency, most frequent first. 
     $tokens = array_count_values($tokens); 
     arsort($tokens); 
     $ngrams = array_keys($tokens); 
     $target = self::GENERATED_DIRECTORY . basename($file, self::SOURCE_EXTENSION) . self::GENERATED_EXTENSION; 
     file_put_contents($target, implode(PHP_EOL, $ngrams)); 
    } 
} 
} 
$ii = new Ngram(); 
?> 

How can I make it faster? Thanks.
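For anyone who wants to experiment with the hot loop in isolation, here is a minimal sketch of the token extraction as a standalone function. The name addWordNgrams() and its parameters are hypothetical and not part of the class above. It splits each padded word into characters once and counts tokens in place rather than collecting them in a long list first. Whether that is actually faster depends on the input and the PHP version, so it should be timed rather than assumed.

```php
<?php
// Hypothetical helper, not part of the original class.
// Adds the n-grams of one word (lengths $min..$max) to the running $counts map.
function addWordNgrams(array &$counts, $word, $min = 1, $max = 6) {
    // Split the padded word into UTF-8 characters once.
    $chars = preg_split('//u', '_' . $word . '_', -1, PREG_SPLIT_NO_EMPTY);
    $length = count($chars);
    for ($n = $min, $top = min($max, $length); $n <= $top; $n++) {
        for ($start = 0; $start + $n <= $length; $start++) {
            $token = implode('', array_slice($chars, $start, $n));
            // Skip tokens that consist of padding only.
            if (trim($token, '_') === '') {
                continue;
            }
            // Count in place instead of building a long token list first.
            if (isset($counts[$token])) {
                $counts[$token]++;
            } else {
                $counts[$token] = 1;
            }
        }
    }
}

// Usage: count the 1- to 3-grams of a single word, most frequent first.
$counts = array();
addWordNgrams($counts, 'ab', 1, 3);
arsort($counts);
print_r($counts);
```

The resulting map has the same shape as the output of array_count_values() in generateNGram(), so arsort() and array_keys() can be applied to it unchanged.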

+0

Code Review (http://codereview.stackexchange.com/) might be a better place to post this question... – Xaerxess

+0

Thanks :) Sorry for posting in the wrong place. – Ahmad

Answers

-1

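PHP's foreach {} is much slower than for {} (up to 16 times). Try replacing those in the generateNGram() function.

Also, you could copy the code from the generateNGram() function into your constructor. That would avoid a useless function call.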

+0

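"PHP's foreach {} is much slower than for {} (up to 16 times)": source needed. "Also, you could copy the code from the generateNGram() function into your constructor. That would avoid a useless function call.": the saving is negligible, and putting too much into a constructor is a very bad habit. – KingCrunch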

+0

I agree about the constructor, but as far as I know foreach is not a good choice unless you want to walk a multidimensional array while keeping the IDs: foreach ($this as $id => $that) {} –

+0

OK, since you didn't, I searched myself and found something that says the exact opposite of what you are claiming: http://www.phpbench.com/ (you need to scroll down a bit). – KingCrunch
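The disagreement above is easy to check locally. Below is a rough micro-benchmark sketch that sums the same array once with foreach and once with for. The array size and variable names are arbitrary choices for illustration, and the timings vary with the PHP version and the machine, so treat the numbers as indicative only.

```php
<?php
// Rough micro-benchmark sketch: sum the same array with foreach and with for.
$data = range(1, 50000);

// Time the foreach loop.
$start = microtime(true);
$sumForeach = 0;
foreach ($data as $value) {
    $sumForeach += $value;
}
$foreachSeconds = microtime(true) - $start;

// Time the equivalent indexed for loop.
$start = microtime(true);
$sumFor = 0;
for ($i = 0, $n = count($data); $i < $n; $i++) {
    $sumFor += $data[$i];
}
$forSeconds = microtime(true) - $start;

// Both loops compute the same sum; only the timings differ.
printf("foreach: %.5fs, for: %.5fs\n", $foreachSeconds, $forSeconds);
```

A single run is noisy, so repeating the measurement several times and comparing the fastest runs gives a fairer picture.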