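DBSCAN evaluation - need true_labels
I'm trying to follow this example with some of my own data: http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py
I'm having trouble figuring out how to obtain my "labels_true" variable for evaluating the DBSCAN predictions.
This is the first line that needs it: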
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
I have latitude and longitude columns, and I'm using the data as follows:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.cluster import DBSCAN

# as_matrix() was removed in pandas 1.0; select the columns and call to_numpy()
coords = X_train[['latitude', 'longitude']].to_numpy()
kms_per_radian = 6371.0088
epsilon = 1.5 / kms_per_radian  # 1.5 km expressed in radians
db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
cluster_labels = db.labels_
num_clusters = len(set(cluster_labels))
clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])
print(num_clusters)
# returns 60
and
print("Homogeneity: %0.3f" % metrics.homogeneity_score(coords, cluster_labels))
is the line that doesn't work for me.
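For context on why that call is rejected: `homogeneity_score` expects `labels_true` to be a 1-D array with one class label per sample, whereas `coords` is an (n_samples, 2) array of positions. In the linked scikit-learn example, `labels_true` comes from `make_blobs`, which returns the generating blob of each point alongside the data. A minimal sketch on synthetic data follows; the centers, `eps`, and `min_samples` values are illustrative assumptions, not taken from the question.

```python
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic data: make_blobs returns the points AND the index of the blob
# each point was drawn from, which serves as ground truth here.
# (Centers and cluster_std are assumed values for illustration.)
X, labels_true = make_blobs(
    n_samples=300,
    centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.4,
    random_state=0,
)

# eps and min_samples are assumptions chosen for this toy data.
labels = DBSCAN(eps=0.5, min_samples=5).fit(X).labels_

# X is 2-D (one row of features per sample); both label arrays are 1-D
# (one label per sample). The metric compares the two label arrays only.
print(X.shape, labels_true.shape, labels.shape)

homogeneity = metrics.homogeneity_score(labels_true, labels)
print("Homogeneity: %0.3f" % homogeneity)
```

The point of the sketch is the shapes: the features never enter the metric, only two label vectors of equal length do.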
X_train.head():
| | bathrooms | bedrooms | building_id | description | features | interest_level | latitude | longitude | manager_id | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 1.5 | 3.0 | 53a5b119ba8f7b61d4e010512e0dfc85 | A Brand New 3 Bedroom 1.5 bath ApartmentEnjoy ... | [] | medium | 40.7145 | -73.9425 | 5ba989232d0489da1b5f2c45f6688adc | 3000.0 |
| 10000 | 1.0 | 2.0 | c5c8a357cba207596b04d1afd1e4f130 | | [Doorman, Elevator, Fitness Center, Cats Allow... | low | 40.7947 | -73.9667 | 7533621a882f71e25173b27e3139d83d | 5465.0 |
| 100004 | 1.0 | 1.0 | c3ba40552e2120b0acfc3cb5730bb2aa | Top Top West Village location, beautiful Pre-w... | [Laundry In Building, Dishwasher, Hardwood Flo... | high | 40.7388 | -74.0018 | d9039c43983f6e564b1482b273bd7b01 | 2850.0 |
| 100007 | 1.0 | 1.0 | 28d9ad350afeaab8027513a3e52ac8d5 | Building Amenities - Garage - Garden - fitness... | [Hardwood Floors, No Fee] | low | 40.7539 | -73.9677 | 1067e078446a7897d2da493d2f741316 | 3275.0 |
| 100013 | 1.0 | 4.0 | 0 | Beautifully renovated 3 bedroom flex 4 bedroom... | [Pre-War] | low | 40.8241 | -73.9493 | 98e13ad4b495b9613cef886d79a6291f | 3350.0 |
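As I understand it, db.labels_ is the predicted cluster number that each point belongs to. For the metrics, I'd like to end up with one array containing the 60 predicted cluster labels and another containing the 60 true cluster labels, rather than the old latitude/longitude of each point.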
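True labels cannot be derived from the coordinates themselves; they have to come from some external grouping of the rows. As a sketch only: if the data had a column recording the real group of each listing, it could be turned into integer labels with `pd.factorize`. The `neighborhood` column and the five rows below are hypothetical and do not exist in the `X_train` shown above.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.cluster import DBSCAN

# Hypothetical data: 'neighborhood' is an ASSUMED ground-truth column,
# invented for illustration; the posted X_train has no such column.
df = pd.DataFrame({
    'latitude':  [40.7145, 40.7150, 40.7947, 40.7950, 40.7388],
    'longitude': [-73.9425, -73.9430, -73.9667, -73.9660, -74.0018],
    'neighborhood': ['A', 'A', 'B', 'B', 'C'],
})

coords = df[['latitude', 'longitude']].to_numpy()
kms_per_radian = 6371.0088
epsilon = 1.5 / kms_per_radian  # 1.5 km expressed in radians

db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree',
            metric='haversine').fit(np.radians(coords))

# Predicted labels: one integer per row, in the same order as df.
labels_pred = db.labels_

# True labels: map each category to an integer code, one per row.
labels_true, group_names = pd.factorize(df['neighborhood'])

homogeneity = metrics.homogeneity_score(labels_true, labels_pred)
print("Homogeneity: %0.3f" % homogeneity)
```

Both arrays are aligned with the rows of the DataFrame, so `df['cluster'] = labels_pred` would also attach the prediction to each listing if that is more convenient.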
See [this page](http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation) and look for metrics that do not require ground-truth data.
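One metric of that kind is the silhouette coefficient, which scores a clustering from the data and the predicted labels alone. A hedged sketch, assuming a recent scikit-learn that provides `sklearn.metrics.pairwise.haversine_distances`: the coordinates below are randomly generated stand-ins, not the question's data, and the precomputed haversine matrix is one possible way to keep the metric consistent with the DBSCAN distance.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import haversine_distances

# Stand-in coordinates (ASSUMED, not the question's data): three tight
# groups of 30 points each around made-up latitude/longitude centers.
rng = np.random.default_rng(0)
centers = np.array([[40.71, -73.94], [40.79, -73.97], [40.74, -74.00]])
coords = np.vstack([c + rng.normal(scale=0.002, size=(30, 2)) for c in centers])

kms_per_radian = 6371.0088
epsilon = 1.5 / kms_per_radian  # 1.5 km expressed in radians
coords_rad = np.radians(coords)  # haversine expects [lat, lon] in radians

db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree',
            metric='haversine').fit(coords_rad)
cluster_labels = db.labels_
num_clusters = len(set(cluster_labels))

# Pairwise great-circle distances (in radians), passed as a precomputed
# matrix so the silhouette uses the same distance as the clustering.
dist = haversine_distances(coords_rad)
score = silhouette_score(dist, cluster_labels, metric='precomputed')

print(num_clusters)
print("Silhouette Coefficient: %0.3f" % score)
```

The silhouette coefficient is only defined when there are at least 2 clusters and fewer clusters than samples. The full pairwise matrix grows quadratically with the number of points, so for a large dataset the `sample_size` argument of `silhouette_score` may be worth considering.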