Algorithm for detecting corrupted data?

I'm not sure this is the right place to ask, so forgive me if this seems off-topic. Here is my situation:

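My dataset is continuous in time, and it contains some erroneous data that I need to handle. Compared with their neighbors, the erroneous values show a sudden increase.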

Below is part of the dataset. As you can see, the fourth value jumps suddenly (28.3). The values are in the last column.

19741212,0700,200,1,N, 4.6 
19741212,0800,190,1,N, 4.6 
19741212,0900,180,1,N, 5.7 
19741212,1000,160,1,N, 28.3 # wrong data, needs to be interpolated from neighbors 
19741212,1100,170,1,N, 4.6 
19741212,1200,200,1,N, 5.1 
19741212,1300,230,1,N, 5.1

I need to identify these values and then replace them by interpolating from the nearby data. Is there an existing algorithm for this?

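If I had to implement it from scratch, my approach would be:

Compute the deltas between adjacent data points
Choose a suitable threshold for detecting the corrupted data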

But I don't know whether this is good enough. I may have overlooked something that would lead to a huge number of false positives.
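The two steps listed above could be sketched roughly as follows. This is only an illustration on the seven sample values: the `max_jump` threshold is a made-up number, and requiring a jump relative to both neighbors is an added assumption meant to limit false positives on the good points next to a spike.

import pandas as pd

# Values from the last column of the sample data in the question.
values = pd.Series([4.6, 4.6, 5.7, 28.3, 4.6, 5.1, 5.1])

# Hypothetical threshold: the largest neighbor-to-neighbor jump still considered plausible.
max_jump = 10.0

# Step 1: absolute deltas to the previous and to the next point.
delta_prev = (values - values.shift(1)).abs()
delta_next = (values - values.shift(-1)).abs()

# Step 2: flag a point only if it jumps away from BOTH neighbors.
# The first and last points compare against NaN, so they are never flagged.
is_spike = (delta_prev > max_jump) & (delta_next > max_jump)

# Replace flagged points with NaN, then fill them linearly from their neighbors.
cleaned = values.mask(is_spike).interpolate(method="linear")
print(cleaned.tolist())

A check like this only catches isolated single-point spikes. Two consecutive bad values, or a bad value at either end of the series, would pass through undetected, which is one reason a purely delta-based test may not be sufficient.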

Also, I'm using Python and Pandas to process the data, so related resources would be welcome.

Source

2015-08-29 cqcn1991

You can also identify outliers by testing how far they are from the mean and setting a standard deviation threshold.

Based on https://stackoverflow.com/a/11686764/2477491, you can set the outliers to NaN with:

import numpy as np

def reject_outliers(data, m=2):  # m is the std threshold; tune it to your needs
    return data[abs(data - np.mean(data)) < m * np.std(data)]

# data is a pandas DataFrame: rejected rows are re-aligned on the index and become NaN
data[6] = reject_outliers(data[5])

          0     1    2  3  4     5    6
0  19741212   700  200  1  N   4.6  4.6
1  19741212   800  190  1  N   4.6  4.6
2  19741212   900  180  1  N   5.7  5.7
3  19741212  1000  160  1  N  28.3  NaN
4  19741212  1100  170  1  N   4.6  4.6
5  19741212  1200  200  1  N   5.1  5.1
6  19741212  1300  230  1  N   5.1  5.1

If your series has a trend, you may prefer to apply this test over a moving time window rather than over the whole series.
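A minimal sketch of such a moving-window test, using pandas `rolling`. The window length and the threshold `m` are assumptions chosen for the sample values, not recommendations. One thing to keep in mind: within a window of n samples, no point can lie more than (n-1)/sqrt(n) sample standard deviations from the window mean (about 1.79 for n = 5), so a threshold of 2 would never flag anything in a window this short.

import pandas as pd

# Values from the last column of the sample data.
values = pd.Series([4.6, 4.6, 5.7, 28.3, 4.6, 5.1, 5.1])

window = 5  # assumed window length, in samples
m = 1.5     # assumed threshold, in local standard deviations

# Centered window; near the edges at least 3 samples are required.
rolling = values.rolling(window=window, center=True, min_periods=3)
local_mean = rolling.mean()
local_std = rolling.std()  # sample standard deviation (ddof=1)

# Flag points that deviate from the local mean by more than m local stds.
is_outlier = (values - local_mean).abs() > m * local_std

# Same convention as above: outliers become NaN, valid values are kept.
valid = values.mask(is_outlier)
print(valid)

Because the outlier itself inflates the local mean and standard deviation, this test is fairly insensitive in short windows. A longer window, or a robust statistic such as a median, may behave better on real data.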

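To apply a custom function over a window, I usually use scipy.ndimage.filters.generic_filter. It also works on 1d arrays and returns, for each position, the scalar obtained by applying a function over a moving window defined by a footprint. Below is an example that fills NaNs with the mean of their neighbors over a 1x3 footprint: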

import numpy as np
from scipy import ndimage as im

def interpNan(win):  # win is the flattened 1x3 window
    if np.isnan(win[1]):  # the center of the footprint is NaN
        return round(np.nanmean(win), 1)  # mean of the non-NaN neighbors
    else:
        return round(win[1], 1)

footprint = np.array([1, 1, 1])
data[7] = im.generic_filter(data[6], interpNan, footprint=footprint)

          0     1    2  3  4     5    6    7
0  19741212   700  200  1  N   4.6  4.6  4.6
1  19741212   800  190  1  N   4.6  4.6  4.6
2  19741212   900  180  1  N   5.7  5.7  5.7
3  19741212  1000  160  1  N  28.3  NaN  5.2
4  19741212  1100  170  1  N   4.6  4.6  4.6
5  19741212  1200  200  1  N   5.1  5.1  5.1
6  19741212  1300  230  1  N   5.1  5.1  5.1

[7 rows x 8 columns]

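You could also merge the two functions together, but for quality analysis I don't: I always keep the raw data, the valid data, and the interpolated data side by side.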
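For comparison, the same detect-then-fill pipeline can be sketched with pandas alone, still keeping the three versions of the data in separate columns. The column names `raw`, `valid` and `interpolated` are made up for this illustration, and linear interpolation is swapped in for the neighbor mean used above (for a single isolated gap the two give the same value, before rounding).

import pandas as pd

# Values from the last column of the sample data.
df = pd.DataFrame({"raw": [4.6, 4.6, 5.7, 28.3, 4.6, 5.1, 5.1]})

m = 2  # standard-deviation threshold, as in reject_outliers

# Distance of each value from the global mean.
deviation = (df["raw"] - df["raw"].mean()).abs()

# Keep values within m population standard deviations; the rest become NaN.
df["valid"] = df["raw"].where(deviation < m * df["raw"].std(ddof=0))

# Fill the NaN gaps linearly from the neighboring valid values.
df["interpolated"] = df["valid"].interpolate(method="linear")
print(df)

Keeping the three columns makes it easy to audit afterwards which values were rejected and what they were replaced with.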

Source

2015-08-29 12:36:30 Delforge

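One way to detect corrupted data or outliers is to first compute the rolling median of the series (the median is robust to outliers), and then compute the distance between each actual observation and the rolling median. Filter out the observations whose distance exceeds a threshold.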

# your data 
# ==================================== 
print(df) 


      A B C D  E 
19741212 700 200 1 N 4.6 
19741212 800 190 1 N 4.6 
19741212 900 180 1 N 5.7 
19741212 1000 160 1 N 28.3 
19741212 1100 170 1 N 4.6 
19741212 1200 200 1 N 5.1 
19741212 1300 230 1 N 5.1 

# rolling median, 3-term moving window 
# ================================================= 
# pd.rolling_median() was removed from pandas; Series.rolling() is the current API 
res = df['E'].rolling(window=3, center=True).median() 
print(res) 

19741212 NaN 
19741212 4.6 
19741212 5.7 
19741212 5.7 
19741212 5.1 
19741212 5.1 
19741212 NaN 
dtype: float64 

# threshold 20% from rolling median 
threshold = 0.2 
mask = abs(df['E'] - res)/res > threshold 
# replace outliers with rolling medians 
df.loc[mask, 'E'] = res[mask] 

print(df) 

      A B C D E 
19741212 700 200 1 N 4.6 
19741212 800 190 1 N 4.6 
19741212 900 180 1 N 5.7 
19741212 1000 160 1 N 5.7 
19741212 1100 170 1 N 4.6 
19741212 1200 200 1 N 5.1 
19741212 1300 230 1 N 5.1

Source

2015-08-29 10:58:47
