Generally, we represent a cluster \(C_i\) by its center (centroid) \(c_i\).
Euclidean distance (in \(\mathbb{R}^2\)): \(dist(a, b)=\sqrt{(a_x-b_x)^2+(a_y-b_y)^2}\)
Within-cluster variation (used to measure the quality of a partition into clusters \(C_1,\dots,C_k\)): \(E=\sum_{i=1}^k\sum_{p\in C_i}dist(p,c_i)^2\), where \(c_i\) is the centroid of cluster \(C_i\).
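As a sanity check, here is a minimal sketch of computing \(E\) on a hypothetical toy data set (the points and labels below are illustrative, not from the original post):

    import numpy as np

    # Toy example: four points in R^2, already assigned to 2 clusters.
    points = np.array([[0.0, 0.0], [1.0, 0.0],       # cluster 0
                       [10.0, 10.0], [11.0, 10.0]])  # cluster 1
    labels = np.array([0, 0, 1, 1])
    centroids = np.array([points[labels == i].mean(axis=0) for i in range(2)])

    # E sums the squared distance of every point to the centroid of its cluster.
    E = sum(((points[labels == i] - centroids[i])**2).sum() for i in range(2))
    print(E)  # 1.0 -- each point lies 0.5 from its centroid: 4 * 0.5**2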
Question: since within-cluster variation measures the quality of a clustering, why not simply enumerate all possible partitions and pick the best one?
Answer: no. Finding the optimal partition is NP-hard in general, and even with the number of clusters (#clusters) and the dimensionality (#dimensions) fixed, exhaustive search remains prohibitively expensive, so we resort to a greedy algorithm instead.
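To get a feel for why enumeration is hopeless: the number of ways to partition \(n\) objects into \(k\) non-empty clusters is the Stirling number of the second kind, \(S(n,k)=\frac{1}{k!}\sum_{j=0}^{k}(-1)^j\binom{k}{j}(k-j)^n\), which explodes even for toy inputs. A small sketch (this is the standard formula; the example values are ours, not the post's):

    from math import comb, factorial

    def stirling2(n, k):
        # Partitions of n objects into k non-empty clusters.
        return sum((-1)**j * comb(k, j) * (k - j)**n
                   for j in range(k + 1)) // factorial(k)

    print(stirling2(25, 3))  # 141197991025 -- ~1.4e11 for a tiny data set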
Algorithm: k-means, a partitioning algorithm in which each cluster is represented by the mean of the objects it contains.
Input: a data set D of n objects and k, the number of clusters
Output: a set of k clusters
Approach:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat:
(3)   (re)assign each object to the cluster whose center it is nearest to;
(4)   update the cluster centers, i.e. recompute the mean of each cluster;
(5) until the assignments no longer change.
import numpy as np
import pandas as pd

data = pd.read_csv('../trip.csv')
data = data.iloc[:, 2:]  # drop the first two (non-feature) columns
class KMeans():
    def __init__(self, data, k, r):
        self.data = data
        self.k = k
        self.label = np.empty(data.shape[0])   # cluster assignment of each object
        self.E = np.empty(data.shape[0])       # squared distance to assigned centroid
        self.centroids = self.__init(data, k, r)
        self.__convergence = False

    def __distance(self, x, y):
        '''
        Return the Euclidean distance between x and y.
        '''
        return sum((x - y)**2)**(1/2)

    def __init(self, data, k_cluster, random_state):
        '''
        Select k distinct objects from the data set as initial centroids.
        '''
        rs = np.random.RandomState(random_state)
        num, dim = data.shape
        # replace=False guards against picking the same object twice,
        # which would produce duplicate centroids and an empty cluster
        rand_id = rs.choice(num, size=k_cluster, replace=False)
        centroids = np.empty((k_cluster, dim))
        for i, v in enumerate(rand_id):
            centroids[i, :] = data.iloc[v, :]
        return centroids

    def __converge(self, pre):
        # converged when no centroid moved during the last update
        return np.array_equal(pre, self.centroids)

    def __update(self):
        # recompute each centroid as the mean of the objects assigned to it
        for i in range(self.k):
            self.centroids[i] = self.data.loc[self.label == i].mean(axis=0)

    def fit(self):
        while not self.__convergence:
            pre_centroids = self.centroids.copy()
            num, dim = self.data.shape
            # assignment step: attach every object to its nearest centroid
            for i in range(num):
                min_dist, neighbour = np.inf, -1
                for j in range(self.k):
                    dist = self.__distance(self.data.iloc[i, :], self.centroids[j])
                    if dist < min_dist:
                        min_dist, neighbour = dist, j
                self.label[i] = neighbour
                self.E[i] = min_dist**2
            # update step: move centroids to the new cluster means
            self.__update()
            self.__convergence = self.__converge(pre_centroids)
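A quick usage sketch (the synthetic two-blob data below is an assumption for illustration, since trip.csv is not included with the post):

    rng = np.random.RandomState(0)
    # two well-separated 2-D blobs as a hypothetical stand-in for trip.csv
    blob = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 10])
    toy = pd.DataFrame(blob, columns=['x', 'y'])

    km = KMeans(toy, k=2, r=42)
    km.fit()
    print(km.centroids)  # one centroid near (0, 0), the other near (10, 10)
    print(km.E.sum())    # within-cluster variation E of the final partition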
Source: https://www.cnblogs.com/Frankhff/p/12466429.html