2012-09-20 22 views

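Cleaning up a word array

I have a function that strips the HTML from a page, puts the words into an array, and then calls array_count_values. I'm trying to report the number of occurrences of each word. The array output is very messy. I've tried to clean it up, and I'm getting nowhere. I'd like to remove phone numbers, and for some reason phrases are being pushed together. The first array entry also seems to be empty, but isset() or empty() doesn't seem to unset it.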

$body = $this->get_response($domain); 
       $body = preg_replace('/<body(.*?)>/i', '<body>', $body); 
       $body = preg_replace('#</body>#i', '</body>', $body); 

       $openTag = '<body>'; 
       $start = strpos($body, $openTag); 
       $start += strlen($openTag); 

       $closeTag = '</body>'; 
       $end = strpos($body, $closeTag); 

       // Return if cannot cut-out the body 
       if ($end <= $start || $start === false || $end === false) { 
        $this->setValue(''); 
        return; 
       } 

       $body = substr($body, $start, $end - $start); 
       $body = preg_replace(array(
         '@<script[^>]*?>.*?</script>@si', // Strip out javascript 
         '@<style[^>]*?>.*?</style>@siU',  // Strip style tags properly 
         '@<![\s\S]*?--[ \t\n\r]*>@',   // Strip multi-line comments including CDATA 
         '/style=([\"\']??)([^\">]*?)\\1/siU',// Strip inline style attribute 
         ), '', $body); 

       $body = strip_tags($body); 
       $body = array_filter(explode(' ', $body), create_function('$str', 'return strlen($str) > 2;')); 
       $body = array_map('trim', $body); 
       $words = $body; 

       $i = 0; 

       $words = array_count_values($words); 

       foreach($words as $word){ 

        if (empty($word)) unset($words[$i]); 
        $i++; 

       } 

       echo "<pre>"; 
        print_r($words); 
        echo "</pre>"; 

Output

Array 
(
    [] => 28 
    [333.444.5555] => 1 
    [facebook] => 2 
    [twitter] => 2 
    [linkedin] => 2 
    [youtube 

       googleplus] => 1 
    [About 

    History 
    Our] => 1 
    [Mission 
    Who] => 1 
    [This 
    That 
    Other] => 1 
    [Us 


English 

    FA 
    Football] => 1 
    [Media 
    Pay] => 2 
    [Per] => 4 
    [Think 
    Fast] => 2 
    [Marketing 
    Design] => 1 
    [Consulting 


Case] => 2 

1 Answer

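I'm afraid explode(' ', $body) is not enough, because a space is not the only whitespace character: the text also contains tabs and newlines, which is why several words end up glued together under one key. Try preg_split instead.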

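// Split on any run of whitespace, then keep only tokens longer than 2 characters.
// A closure is used here because create_function() was removed in PHP 8.0.
$body = array_filter(preg_split('/\s+/', $body), 
      function ($str) { return strlen($str) > 2; });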
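
The question also asks about phone numbers and the empty first key, so here is a minimal sketch that covers both. The function name count_words, the $minLength parameter, and the phone-number pattern are my own assumptions, not something from the original code; the pattern simply treats any token made up only of digits and phone punctuation as a number to drop, which may be too aggressive for some pages. It expects text that has already been through strip_tags().

```php
<?php
// Hypothetical helper (not from the original post): counts the words in
// text that has already had its HTML tags stripped.
function count_words($text, $minLength = 3)
{
    // Split on runs of any whitespace (spaces, tabs, newlines).
    // PREG_SPLIT_NO_EMPTY prevents empty tokens from being returned.
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $words = array_filter($tokens, function ($token) use ($minLength) {
        // Assumption: a token consisting only of digits and phone
        // punctuation, e.g. 333.444.5555 or (333)444-5555, is a phone number.
        if (preg_match('/^[\d().+-]+$/', $token)) {
            return false;
        }
        // Keep only tokens of at least $minLength characters.
        return strlen($token) >= $minLength;
    });

    // Keys of the result are the words, values are their counts.
    return array_count_values($words);
}

$sample = "333.444.5555 facebook twitter\n\n   youtube\tgoogleplus facebook  Us";
print_r(count_words($sample));
```

As for the cleanup loop in the question: array_count_values returns an array keyed by the words themselves, so inside the foreach $word is actually a count and $words[$i] refers to an integer key that does not exist. That is why the unset never removed anything. The [] entry most likely came from whitespace-only chunks that passed the strlen check and were then trimmed down to empty strings; if such a key ever does appear, unset($words['']) removes it directly.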

That did it. Awesome. Thanks! – madphp