2014-03-02

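Grouping/sorting a dataset of geographic coordinates

I have a large dataset with two columns: a timestamp and a latitude/longitude pair. I want to group the coordinates to some degree so that I can count the number of distinct locations recorded, treating everything within a certain distance of each other as one location. Basically, I want to figure out how many distinct "places" there are in this dataset. A good visual example is this; that is where I want to end up, but I don't know where the clusters are in my dataset.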


You need a clustering algorithm; see, for example, [here](http://scikit-learn.org/stable/modules/clustering.html#clustering) –




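from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# X = your geo array, shape (n_samples, 2), e.g. [[lat, lon], ...]

# Standardize features by removing the mean and scaling to unit variance
# (after this step, eps below is in standardized units, not degrees or km)
X = StandardScaler().fit_transform(X)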

# Compute DBSCAN 
db = DBSCAN(eps=0.3, min_samples=3).fit(X) 

# eps -- The maximum distance between two samples 
# for them to be considered as in the same neighborhood. 
# min_samples -- The number of samples in a neighborhood for a point 
# to be considered as a core point. 

core_samples = db.core_sample_indices_ 
labels = db.labels_ 

# Number of clusters in labels, ignoring noise if present. 
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0) 
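
Because the snippet above standardizes the coordinates, `eps` is not a physical distance. If the goal is "everything within N km counts as one place", one option is to run the same algorithm on great-circle distances instead. The following is a minimal, dependency-free sketch of that idea, not scikit-learn's implementation: the names `haversine_km` and `dbscan` are made up for illustration, points are assumed to be `(lat, lon)` pairs in degrees, and the neighbor search is O(n²), so it only suits small datasets.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumption: spherical Earth)


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def dbscan(points, eps_km, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Neighborhoods include the point itself, so min_samples counts it too.
    neighbors = [[j for j in range(n)
                  if haversine_km(points[i], points[j]) <= eps_km]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep growing
    return labels


points = [
    (48.8566, 2.3522), (48.8570, 2.3530), (48.8560, 2.3515),     # Paris
    (51.5074, -0.1278), (51.5080, -0.1270), (51.5069, -0.1285),  # London
    (40.0, -3.0),                                                # isolated
]
labels = dbscan(points, eps_km=0.5, min_samples=3)
n_places = len(set(labels)) - (1 if -1 in labels else 0)
```

Here `n_places` is computed exactly as `n_clusters_` is above. For larger datasets, scikit-learn's `DBSCAN` also accepts `metric='haversine'` with coordinates in radians, in which case `eps` is the distance in km divided by the Earth's radius.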

Thanks for the extra detail. It was enough to take me from "oh, this is a lot of math" to "OK, I can do this". –



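To use the function, pass in a set of points and the number of partitions along one axis (e.g., X). The same number of partitions is used along the other axis (e.g., Y), so specifying 3 creates 9 (3 * 3) equally sized partitions. The function first walks the set of points to find the outermost X and Y coordinates (minimum and maximum) that bound the whole set. It then divides the distance between the outermost coordinates on each axis by the number of partitions to determine the grid cell size.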



function TallyPoints(array points, int npartitions) 
    array partition = new Array(); 

    // Coordinates can be negative and fractional, so start from the extremes
    float max_x = -MAX_FLOAT, max_y = -MAX_FLOAT;
    float min_x = MAX_FLOAT, min_y = MAX_FLOAT;

    // Find the bounding box of the points
    foreach point in points
     if (point.X > max_x)
      max_x = point.X;
     if (point.X < min_x)
      min_x = point.X;
     if (point.Y > max_y)
      max_y = point.Y;
     if (point.Y < min_y)
      min_y = point.Y;

    // Get the X and Y axis lengths of the partitions
    float partition_length_x = ((float) (max_x - min_x))/npartitions; 
    float partition_length_y = ((float) (max_y - min_y))/npartitions; 

    // Reduce the points to one point in each grid partition,
    // visiting all npartitions * npartitions cells
    for (int i = 0; i < npartitions; i++)
     for (int j = 0; j < npartitions; j++)
      int n = i * npartitions + j; // index of this cell in the partition array

      // Get the boundary of this grid partition
      float cell_min_x = min_x + (i * partition_length_x);
      float cell_min_y = min_y + (j * partition_length_y);
      float cell_max_x = min_x + ((i + 1) * partition_length_x);
      float cell_max_y = min_y + ((j + 1) * partition_length_y);

      // reduce and tally points
      // (points exactly on the outer max edge need an inclusive upper bound in the last cell)
      int tally = 0;
      boolean reduce = false; // set to true after finding the first point in the partition
      foreach point in points
       // the point is in the grid partition
       if (point.X >= cell_min_x && point.X < cell_max_x &&
        point.Y >= cell_min_y && point.Y < cell_max_y)
        // first point found
        if (false == reduce)
         reduce = true;
         partition[ n ].point = point; // keep this as the single point for the grid

        // increment the tally count
        tally++;

      // store the tally for the grid
      partition[ n ].tally = tally;

      // visualize the tallied point here (e.g., marker on Google Map)
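
For reference, here is a rough Python sketch of the same grid-tally idea. It is an interpretation rather than a line-by-line port: instead of scanning every point once per cell, it computes each point's cell index directly, which takes a single pass. The names `tally_points` and `Cell` are made up for illustration, and points are assumed to be plain `(x, y)` tuples.

```python
from collections import namedtuple

# One representative point per occupied cell, plus how many points fell in it.
Cell = namedtuple("Cell", ["point", "tally"])


def tally_points(points, npartitions):
    """Bucket (x, y) points into an npartitions x npartitions grid.

    Returns a dict mapping (col, row) -> Cell(first point seen, count).
    Empty cells are simply absent from the dict.
    """
    if not points:
        return {}

    # Bounding box of the whole set
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)

    # Cell size on each axis; fall back to 1.0 if all points share a coordinate
    len_x = (max_x - min_x) / npartitions or 1.0
    len_y = (max_y - min_y) / npartitions or 1.0

    cells = {}
    for x, y in points:
        # Clamp so points on the max edge land in the last cell
        col = min(int((x - min_x) / len_x), npartitions - 1)
        row = min(int((y - min_y) / len_y), npartitions - 1)
        first, count = cells.get((col, row), ((x, y), 0))
        cells[(col, row)] = Cell(first, count + 1)
    return cells


cells = tally_points([(0, 0), (1, 1), (9, 9), (10, 10), (5, 5)], 3)
occupied = len(cells)  # number of grid cells that contain at least one point
```

The number of occupied cells gives a crude count of distinct "places", but unlike a density-based clustering it depends on where the grid lines happen to fall: two nearby points on opposite sides of a cell boundary are counted separately.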