2013-10-06 28 views
3

How do I get the box plot key numbers from an array in PHP?

Say I have an array of values like this:

$values = array(48,30,97,61,34,40,51,33,1); 

And I would like to get the values needed to draw a box plot, something like the following:

$box_plot_values = array(
    'lower_outlier' => 1, 
    'min'   => 8, 
    'q1'    => 32, 
    'median'   => 40, 
    'q3'    => 56, 
    'max'   => 80, 
    'higher_outlier' => 97, 
); 

How would I do this in PHP?

Answers

5
function box_plot_values($array) 
{ 
    $return = array(
     'lower_outlier' => 0, 
     'min'   => 0, 
     'q1'    => 0, 
     'median'   => 0, 
     'q3'    => 0, 
     'max'   => 0, 
     'higher_outlier' => 0, 
    ); 

    $array_count = count($array); 
    sort($array, SORT_NUMERIC); 

    $return['min']   = $array[0]; 
    $return['lower_outlier'] = $return['min']; 
    $return['max']   = $array[$array_count - 1]; 
    $return['higher_outlier'] = $return['max']; 
    $middle_index    = floor($array_count/2); 
    $return['median']   = $array[$middle_index]; // Assume an odd # of items 
    $lower_values    = array(); 
    $higher_values   = array(); 

    // If we have an even number of values, we need some special rules 
    if ($array_count % 2 == 0) 
    { 
     // Handle the even case by averaging the middle 2 items 
     $return['median'] = round(($return['median'] + $array[$middle_index - 1])/2); 

     foreach ($array as $idx => $value) 
     { 
      if ($idx < ($middle_index - 1)) $lower_values[] = $value; // We need to remove both of the values we used for the median from the lower values 
      elseif ($idx > $middle_index) $higher_values[] = $value; 
     } 
    } 
    else 
    { 
     foreach ($array as $idx => $value) 
     { 
      if ($idx < $middle_index)  $lower_values[] = $value; 
      elseif ($idx > $middle_index) $higher_values[] = $value; 
     } 
    } 

    $lower_values_count = count($lower_values); 
    $lower_middle_index = floor($lower_values_count/2); 
    $return['q1']  = $lower_values[$lower_middle_index]; 
    if ($lower_values_count % 2 == 0) 
     $return['q1'] = round(($return['q1'] + $lower_values[$lower_middle_index - 1])/2); 

    $higher_values_count = count($higher_values); 
    $higher_middle_index = floor($higher_values_count/2); 
    $return['q3']  = $higher_values[$higher_middle_index]; 
    if ($higher_values_count % 2 == 0) 
     $return['q3'] = round(($return['q3'] + $higher_values[$higher_middle_index - 1])/2); 

    // Cap min and max at one IQR beyond the quartiles; the raw extremes stay in the outlier keys
    $iqr = $return['q3'] - $return['q1']; // Interquartile range (IQR)
    if ($return['q1'] - $return['min'] > $iqr) $return['min'] = $return['q1'] - $iqr;
    if ($return['max'] - $return['q3'] > $iqr) $return['max'] = $return['q3'] + $iqr;

    return $return; 
} 
+0

Improvements are very welcome – Lilleman

+1

This is really great! –

+0

+1 Excellent solution. – Johnny

1

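Lilleman's code is brilliant. I really appreciate the way he handles the median and q1/q3. If I had answered this question first, I would have handled odd and even numbers of values in a harder and unnecessary way, namely with four if-branches for the four cases of mod(count(values), 4). His way is concise and neat, and I admire his work.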

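I would like to make some improvements to max, min, higher_outlier and lower_outlier. Because q1 - 1.5 * IQR is only the lower bound, we should take the smallest value that is not below this bound as 'min'. The same goes for 'max'. Also, there may be more than one outlier. So I made some changes based on Lilleman's work. Thanks.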
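As a quick check of the arithmetic before the full function, here is a minimal sketch of the two bounds for the sample data from the question. It assumes the quartiles the first answer produces (q1 = 32, q3 = 56); the variable names `$lowerFence` and `$upperFence` are only illustrative and do not appear in the answers.

```php
<?php
// Quartiles the first answer produces for the question's sample data
// (taken as given here, not recomputed).
$q1 = 32;
$q3 = 56;

$iqr        = $q3 - $q1;          // 24
$lowerFence = $q1 - 1.5 * $iqr;   // 32 - 36 = -4
$upperFence = $q3 + 1.5 * $iqr;   // 56 + 36 = 92

// Sorted sample data: 1, 30, 33, 34, 40, 48, 51, 61, 97
// 1 lies above the lower fence, so it is the 'min' whisker, not an outlier.
// 97 lies above the upper fence, so it is an outlier and 'max' becomes 61.
printf("fences: [%s, %s]\n", $lowerFence, $upperFence);
```

With the 1.5 * IQR rule, the sample data therefore has a single outlier (97) rather than one on each side.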

function box_plot_values($array) 
{ 
    $return = array(
    'lower_outlier' => 0, 
    'min'   => 0, 
    'q1'    => 0, 
    'median'   => 0, 
    'q3'    => 0, 
    'max'   => 0, 
    'higher_outlier' => 0, 
); 

$array_count = count($array); 
sort($array, SORT_NUMERIC); 

$return['min']   = $array[0]; 
$return['lower_outlier'] = array(); 
$return['max']   = $array[$array_count - 1]; 
$return['higher_outlier'] = array(); 
$middle_index    = floor($array_count/2); 
$return['median']   = $array[$middle_index]; // Assume an odd # of items 
$lower_values    = array(); 
$higher_values   = array(); 

// If we have an even number of values, we need some special rules 
if ($array_count % 2 == 0) 
{ 
    // Handle the even case by averaging the middle 2 items 
    $return['median'] = round(($return['median'] + $array[$middle_index - 1])/2); 

    foreach ($array as $idx => $value) 
    { 
     if ($idx < ($middle_index - 1)) $lower_values[] = $value; // We need to remove both of the values we used for the median from the lower values 
     elseif ($idx > $middle_index) $higher_values[] = $value; 
    } 
} 
else 
{ 
    foreach ($array as $idx => $value) 
    { 
     if ($idx < $middle_index)  $lower_values[] = $value; 
     elseif ($idx > $middle_index) $higher_values[] = $value; 
    } 
} 

$lower_values_count = count($lower_values); 
$lower_middle_index = floor($lower_values_count/2); 
$return['q1']  = $lower_values[$lower_middle_index]; 
if ($lower_values_count % 2 == 0) 
    $return['q1'] = round(($return['q1'] + $lower_values[$lower_middle_index - 1])/2); 

$higher_values_count = count($higher_values); 
$higher_middle_index = floor($higher_values_count/2); 
$return['q3']  = $higher_values[$higher_middle_index]; 
if ($higher_values_count % 2 == 0) 
    $return['q3'] = round(($return['q3'] + $higher_values[$higher_middle_index - 1])/2); 

// Compute the lower whisker and collect the lower outliers
$iqr = $return['q3'] - $return['q1']; // Interquartile range (IQR)

$return['min'] = $return['q1'] - 1.5*$iqr; // (q1 - 1.5*IQR) is only the lower bound.
              // Compare every value in the lower half to it:
              // values below the bound are outliers, and the
              // smallest value at or above the bound is the
              // 'min' for the box plot.
foreach($lower_values as $idx => $value)
{
    if($value < $return['min']) // the value is below the bound, so it is an outlier
    {
     $return['lower_outlier'][$idx] = $value ; // keeping the index is optional, but it lets
                // callers see which positions in the sorted
                // lower half hold the outliers
    }else
    {
     $return['min'] = $value; // first value at or above the bound
     break; // stop here so 'min' stays the smallest value inside the bound
    }
}

$return['max'] = $return['q3'] + 1.5*$iqr; // (q3 + 1.5*IQR) is the upper bound; same logic as above.
foreach(array_reverse($higher_values, true) as $idx => $value) // true preserves the keys, so $idx still refers to $higher_values
{ 
    if($value > $return['max']) 
    { 
     $return['higher_outlier'][$idx] = $value ; 
    }else 
    { 
     $return['max'] = $value; 
     break; 
    } 
} 
    return $return; 
} 

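I hope this helps anyone who is interested in this problem. If there is a better way to find out which values are outliers, please leave me a comment. Thanks!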
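One possible way to see which values are outliers is to filter the original, unsorted array, since `array_filter` preserves keys. The sketch below is only an illustration: `find_outliers` is a hypothetical helper, not part of the answer above, and it assumes q1 and q3 have already been computed (32 and 56 for the sample data).

```php
<?php
/**
 * Hypothetical helper: returns the values outside the 1.5 * IQR bounds,
 * keyed by their position in the original (unsorted) input array.
 */
function find_outliers(array $values, $q1, $q3)
{
    $iqr   = $q3 - $q1;
    $lower = $q1 - 1.5 * $iqr;
    $upper = $q3 + 1.5 * $iqr;

    // array_filter preserves keys, so each outlier keeps its original index.
    return array_filter($values, function ($v) use ($lower, $upper) {
        return $v < $lower || $v > $upper;
    });
}

$values   = array(48, 30, 97, 61, 34, 40, 51, 33, 1);
$outliers = find_outliers($values, 32, 56);
print_r($outliers); // Array ( [2] => 97 )
```

Because the keys come from the input array, no extra sorting step is needed to map an outlier back to its original position.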

0

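I have a different solution for calculating the lower and upper whiskers. Like ShaoE's solution, it finds the smallest value that is greater than or equal to the lower bound (Q1 - 1.5 * IQR), and vice versa for the upper whisker.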

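I use array_filter, which iterates over an array, passes each value to a callback function, and returns an array containing only the values for which the callback returns true (see php.net's array_filter manual). In this case, the values greater than or equal to the lower bound are returned and used as the input to min, which in turn returns the smallest of them.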

// get lower whisker 
$whiskerMin = min(array_filter($array, function($value) use($quartile1, $iqr) { 
     return $value >= $quartile1 - 1.5 * $iqr; 
    })); 
// get higher whisker vice versa 
$whiskerMax = max(array_filter($array, function($value) use($quartile3, $iqr) { 
     return $value <= $quartile3 + 1.5 * $iqr; 
    })); 

Note that it leaves out the outliers, and I have only tested it with positive values.
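The snippet above relies on `$quartile1`, `$quartile3` and `$iqr` being defined elsewhere. A self-contained sketch might look like the following. It assumes quartiles are the medians of the lower and upper halves (leaving out the overall median when the count is odd) and, unlike the earlier answers, does not round them; `median_of_sorted` is a hypothetical helper, and `intdiv` requires PHP 7 or later.

```php
<?php
// Hypothetical helper: median of an already sorted, non-empty list.
function median_of_sorted(array $sorted)
{
    $n   = count($sorted);
    $mid = intdiv($n, 2);
    return ($n % 2 === 0)
        ? ($sorted[$mid - 1] + $sorted[$mid]) / 2
        : $sorted[$mid];
}

$array = array(48, 30, 97, 61, 34, 40, 51, 33, 1);
sort($array, SORT_NUMERIC);
$n    = count($array);
$half = intdiv($n, 2);

// Assumption: quartiles are the medians of the lower and upper halves.
$quartile1 = median_of_sorted(array_slice($array, 0, $half));   // 31.5
$quartile3 = median_of_sorted(array_slice($array, $n - $half)); // 56
$iqr       = $quartile3 - $quartile1;                           // 24.5

// Lower whisker: smallest value at or above Q1 - 1.5 * IQR.
$whiskerMin = min(array_filter($array, function ($value) use ($quartile1, $iqr) {
    return $value >= $quartile1 - 1.5 * $iqr;
}));
// Upper whisker: largest value at or below Q3 + 1.5 * IQR.
$whiskerMax = max(array_filter($array, function ($value) use ($quartile3, $iqr) {
    return $value <= $quartile3 + 1.5 * $iqr;
}));

var_dump($whiskerMin, $whiskerMax); // int(1), int(61)
```

Since both bounds are compared against the data directly, this should behave the same for negative values, although an empty input array would still need to be handled separately before calling min or max.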