2014-03-02 52 views

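Grouping/sorting a dataset of geographic coordinates

I have a large dataset with two columns: timestamp and latitude/longitude. I would like to group the coordinates in some way to determine how many different locations were recorded, treating everything within a certain distance of each other as one location. Basically, I want to figure out how many distinct "places" there are in this dataset. A good visual example is this; it is what I would like to end up with, but I do not know where the clusters are in my dataset.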


You need a clustering algorithm; see, for example, [here](http://scikit-learn.org/stable/modules/clustering.html#clustering).

Answers


Elaborating on behzad.nouri's reference:

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# X = your geo array, one [latitude, longitude] row per record

# Standardize features by removing the mean and scaling to unit variance
X = StandardScaler().fit_transform(X)

# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=3).fit(X)

# HERE
# eps -- The maximum distance between two samples
# for them to be considered as in the same neighborhood.
# Note: after standardization, eps is measured in standard
# deviations of each column, not in degrees or metres.
# min_samples -- The number of samples in a neighborhood for a point
# to be considered as a core point.

core_samples = db.core_sample_indices_
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
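
To make the distance threshold concrete, here is a minimal standard-library sketch that is not part of the answer above. It swaps DBSCAN for plain distance-threshold chaining (roughly what DBSCAN does when `min_samples` is 1) and measures distance with the haversine formula, so the threshold is in kilometres. The names `haversine_km` and `count_places` and the sample coordinates are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def count_places(coords, max_km):
    """Count groups of points chained together by links no longer than max_km."""
    parent = list(range(len(coords)))

    def find(i):
        # Follow parent links to the group representative (with path halving)
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # O(n^2) pairwise scan: fine for a sketch, too slow for very large inputs
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if haversine_km(coords[i], coords[j]) <= max_km:
                parent[find(i)] = find(j)

    return len({find(i) for i in range(len(coords))})


# Hypothetical sample: two points in central London, two in central Paris
sample = [(51.5074, -0.1278), (51.5080, -0.1290),
          (48.8566, 2.3522), (48.8570, 2.3530)]
n_places = count_places(sample, max_km=1.0)
```

For a genuinely large dataset the pairwise loop will not scale; scikit-learn's `DBSCAN` also accepts `metric='haversine'` on coordinates given in radians, which is likely the more practical route.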

Thanks for the extra detail. It was enough to take me from "oh, that's a lot of math" to "OK, I can do this".


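This pseudocode demonstrates how to reduce a set of points to one point per grid partition while tallying the number of points in each partition. This can be useful if you have a set of points where some areas are sparse and others are dense, but you want to display an even distribution of points (for example, on a map).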

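To use the function, pass in the set of points and the number of partitions along one axis (e.g., X). The same number of partitions is used along the other axis (e.g., Y), so specifying 3 creates 9 (3 * 3) equal-sized partitions. The function first walks through the set of points to find the outermost X and Y coordinates (minimum and maximum) that bound the whole set. The distance between the outermost coordinates on each axis is then divided by the number of partitions to determine the grid size.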
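
As a small worked example of the bounding-box and grid-size step, here is a minimal Python sketch. The function name `grid_spec` and the sample points are assumptions made for illustration; points are taken to be `(x, y)` tuples.

```python
def grid_spec(points, npartitions):
    """Return (min_x, min_y, cell_w, cell_h) for an npartitions x npartitions grid."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    # Each axis is split into the same number of equal-sized partitions
    cell_w = (max_x - min_x) / npartitions
    cell_h = (max_y - min_y) / npartitions
    return min_x, min_y, cell_w, cell_h


# Hypothetical points spanning 0..9 on X and 0..6 on Y
pts = [(0.0, 0.0), (9.0, 3.0), (4.5, 6.0)]
spec = grid_spec(pts, 3)  # 3 partitions per axis, 9 cells in total
```

With these points, each of the 9 cells is 3.0 units wide and 2.0 units high.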

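The function then steps through each grid partition and checks whether each point in the set lies within it. If a point lies within the partition, the function checks whether it is the first point encountered there. If it is, a flag is set to indicate that the first point has been found. Otherwise, the point is not the first in the partition, and it is removed from the set of points.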

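For each point found in a partition, the function increments a tally. Finally, once the reduction and tallying of a grid partition is complete, the tallied point can be visualized (e.g., shown on a map as a single marker with a tally indicator):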

function TallyPoints(array points, int npartitions) 
{ 
    array partition = new Array(); 

    // Coordinates may be negative or fractional (e.g., longitudes),
    // so use floats and start the maxima at the lowest possible value
    float max_x = -MAX_FLOAT, max_y = -MAX_FLOAT;
    float min_x = MAX_FLOAT, min_y = MAX_FLOAT;

    // Find the bounding box of the points
    foreach point in points
    {
     if (point.X > max_x)
      max_x = point.X;
     if (point.X < min_x)
      min_x = point.X;
     if (point.Y > max_y)
      max_y = point.Y;
     if (point.Y < min_y)
      min_y = point.Y;
    }

    // Get the X and Y axis lengths of the partitions
    float partition_length_x = (max_x - min_x)/npartitions;
    float partition_length_y = (max_y - min_y)/npartitions;

    // Reduce the points to one point in each grid partition,
    // visiting all npartitions * npartitions cells
    for (int ix = 0; ix < npartitions; ix++)
    {
     for (int iy = 0; iy < npartitions; iy++)
     {
      int n = iy * npartitions + ix; // index of this grid partition

      // Get the boundary of this grid partition
      float cell_min_x = min_x + (ix * partition_length_x);
      float cell_min_y = min_y + (iy * partition_length_y);
      float cell_max_x = min_x + ((ix + 1) * partition_length_x);
      float cell_max_y = min_y + ((iy + 1) * partition_length_y);

      // the last column/row also takes points lying on the outer edge
      boolean last_x = (ix == npartitions - 1);
      boolean last_y = (iy == npartitions - 1);

      // reduce and tally points
      int tally = 0;
      boolean reduce = false; // set to true after finding the first point in the partition
      foreach point in copy of points // iterate over a copy so removal is safe
      {
       // the point is in the grid partition
       if (point.X >= cell_min_x && (point.X < cell_max_x || last_x) &&
        point.Y >= cell_min_y && (point.Y < cell_max_y || last_y))
       {
        // first point found
        if (false == reduce)
        {
         reduce = true;
         partition[ n ].point = point; // keep this as the single point for the grid
        }
        else
         points.Remove(point); // remove the point from the list

        // increment the tally count
        tally++;
       }
      }

      // store the tally for the grid
      partition[ n ].tally = tally;

      // visualize the tallied point here (e.g., marker on Google Map)
     }
    }
} 
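
For comparison, here is a minimal runnable Python sketch of the same reduce-and-tally idea. It is an assumption-laden variant rather than a line-by-line port: instead of scanning every point for every partition, it computes each point's cell index directly in a single pass. The name `tally_points`, the `(x, y)` tuple layout, and the returned dictionary shape are all choices made for illustration.

```python
def tally_points(points, npartitions):
    """Keep the first point seen in each grid cell and count points per cell.

    Returns a dict mapping (col, row) -> [representative_point, count].
    """
    if not points:
        return {}
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    # Fall back to 1.0 when all points share a coordinate (zero-size cells)
    cell_w = (max(xs) - min_x) / npartitions or 1.0
    cell_h = (max(ys) - min_y) / npartitions or 1.0

    cells = {}
    for x, y in points:
        # Clamp so points on the upper edges fall into the last cell
        col = min(int((x - min_x) / cell_w), npartitions - 1)
        row = min(int((y - min_y) / cell_h), npartitions - 1)
        if (col, row) in cells:
            cells[(col, row)][1] += 1          # later point: only counted
        else:
            cells[(col, row)] = [(x, y), 1]    # first point: kept as the marker
    return cells


# Hypothetical sample points on a 3 x 3 grid
pts = [(0.0, 0.0), (0.5, 0.4), (9.0, 6.0), (8.0, 5.5), (4.5, 3.0)]
result = tally_points(pts, 3)
```

Only non-empty cells appear in the result, and the input list is left unmodified; each entry could then be drawn as one map marker labelled with its count.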