An improvement on k-means: choosing the initial cluster centers with the k-means++ algorithm, and reservoir sampling

Date: 2014-02-16 06:31:28

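The problem to solve

One problem with the k-means algorithm is that the initial centers are chosen at random, so the clustering result is random as well. The usual workaround is to repeat the whole clustering process several times and keep the run with the best clustering quality. The k-means++ algorithm handles the choice of initial points well. This post briefly summarizes and implements it. The code is still rough in many places and is offered for reference only; criticism is welcome.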

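Algorithm steps

a). Pick one point from the data set at random as the first center and add it to the center set centers.

b). For each point i in the data set, compute its distance to every point in centers and keep the smallest as d[i]. Once every point has been processed, compute sum(d[i]).

c). Draw a random value random that falls within sum(d[i]), then apply random -= d[i] for successive i until random < 0. That i is the next center, so add the point to centers. A point far from every existing center has a large d[i] and is therefore more likely to be chosen.

d). Repeat steps b and c until all centers have been selected.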
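Step c is the part that is easiest to get wrong, so here is a minimal sketch of it in isolation. The function name `pick_weighted_index` and the sample distances are hypothetical, chosen only for illustration; the loop follows the "subtract until random < 0" rule described above.

```python
import random


def pick_weighted_index(weights, r):
    """Subtract weights from r in order; return the index where r first drops below 0."""
    for i, w in enumerate(weights):
        r -= w
        if r < 0:
            return i
    # r never dropped below 0 (r equal to the total, or rounding error):
    # fall back to the last index.
    return len(weights) - 1


# Hypothetical nearest-center distances; point 0 is already a center, so d[0] == 0.
d = [0.0, 2.0, 1.0, 5.0]
r = random.uniform(0, sum(d))
next_center = pick_weighted_index(d, r)
print(next_center)
```

Because the comparison is strict, a point with d[i] == 0 (one that already is a center) cannot be picked by the loop, and index 3 is picked most often because it carries 5 of the 8 units of total weight.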

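Algorithm analysis

Selecting the initial points resembles weighted reservoir sampling, where each point's weight depends on its distance to the nearest center already chosen. (The standard k-means++ formulation weights by the squared distance; the steps above use the distance itself.) The complexity of the implementation here is O(k*k*m*n), where k is the number of cluster centers, m is the number of samples in the data set, and n is the dimensionality of the samples: each of the k rounds scans all m points, compares each point against up to k centers, and each comparison costs O(n).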
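One factor of k can be removed by remembering each point's nearest distance and comparing only against the newest center in each round, which gives O(k*m*n). The sketch below illustrates that idea under stated assumptions: the names `init_centers_cached` and `euclidean` are hypothetical, Euclidean distance is assumed as the metric, and the weights are the plain distances used in the steps above rather than squared distances.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two equal-length sequences (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def init_centers_cached(points, k, dist=euclidean, rng=random):
    """Pick k initial centers, caching each point's distance to its nearest center."""
    centers = [points[rng.randrange(len(points))]]
    min_dist = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        r = rng.uniform(0, sum(min_dist))
        # Fallback if r never drops below 0: take the farthest point.
        chosen = max(range(len(points)), key=min_dist.__getitem__)
        for i, d in enumerate(min_dist):
            r -= d
            if r < 0:
                chosen = i
                break
        centers.append(points[chosen])
        # Only the new center can shrink a cached distance, so one pass is enough.
        min_dist = [min(old, dist(p, centers[-1]))
                    for old, p in zip(min_dist, points)]
    return centers


points = [[0, 0], [0, 1], [10, 10], [10, 11]]
print(init_centers_cached(points, 2))
```

Passing `rng=random.Random(seed)` makes a run reproducible, which helps when comparing clustering results across experiments.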

Implementation

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
    @author: xyl
    An example of kmeans++ center initialization
"""

import random
import math

def initCenters(points, pNum, cNum):
    """
    Select the initial centers
    points: data set to be clustered
    pNum: number of points in the data set
    cNum: number of centers to select
    """
    centers = []  # points selected as initial centers
    firstCenterIndex = random.randint(0, pNum - 1)
    centers.append(points[firstCenterIndex])
    for cIndex in range(1, cNum):
        # rebuilt every round: distance from each point to its nearest center
        minDists = []
        total = 0.0
        for pIndex in range(0, pNum):
            dist = nearest(points[pIndex], centers, cIndex)
            minDists.append(dist)
            total += dist
        r = random.uniform(0, total)
        # fallback if rounding keeps r from dropping below 0: the farthest point
        chosen = minDists.index(max(minDists))
        for pIndex in range(0, pNum):
            r -= minDists[pIndex]
            if r < 0:
                chosen = pIndex
                break
        centers.append(points[chosen])
    return centers


def nearest(point, centers, cIndex):
    """
    Return the distance from a point to its nearest selected center
    point: point in the data set
    centers: selected centers
    cIndex: number of centers already selected
    """
    minDist = float("inf")  # larger than any real distance
    for index in range(0, cIndex):
        dist = distance(point, centers[index])
        if minDist > dist:
            minDist = dist
    return minDist

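def distance(point, center):
    """
    Compute the cosine distance (1 - cosine similarity) between two points
    point: point in the data set
    center: point selected as a center
    """
    dim = len(point)
    if dim != len(center):
        raise ValueError("point and center must have the same dimension")
    dot = 0.0
    normP = 0.0
    normC = 0.0
    for index in range(0, dim):
        dot += point[index] * center[index]
        normP += math.pow(point[index], 2)
        normC += math.pow(center[index], 2)
    normP = math.sqrt(normP)
    normC = math.sqrt(normC)
    if normP == 0.0 or normC == 0.0:
        # assumption: the cosine is undefined for a zero vector, so treat it as distant
        return 1.0
    # the cosine itself is a similarity (1.0 for identical directions), so it is
    # subtracted from 1; clamp at 0 to absorb rounding error
    return max(0.0, 1.0 - dot / (normP * normC))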

def test():
    points = []
    points.append([1,2,1,2,3,4,5])
    points.append([1,2,1,3,1,4,5])
    points.append([1,2,3,2,2,4,5])
    points.append([2,2,1,2,2,4,1])
    points.append([1,2,1,1,3,1,5])
    points.append([1,2,4,2,3,1,1])
    points.append([1,3,1,2,3,1,2])
    points.append([1,4,1,1,3,2,1])
    points.append([1,1,1,2,3,4,1])
    points.append([1,1,1,1,3,4,1])
    print(initCenters(points, len(points), 4))

if __name__ == "__main__":
    test()

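Reservoir sampling

This algorithm is easy to find with a quick search and is a classic interview question. What I want to point out here are its two applications in ClouderaML: distributed reservoir sampling and weighted distributed reservoir sampling. Some algorithms look dull on paper, but applied to a concrete practical scenario they can still be eye-opening. For the original article, see Algorithms Every Data Scientist Should Know: Reservoir Sampling.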
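For readers who have not met the algorithm, here is a minimal single-machine sketch. It is not taken from ClouderaML; the function names are hypothetical. `reservoir_sample` is the classic uniform version (often called Algorithm R), and `weighted_reservoir_sample` is one common weighted variant that gives each item the key u^(1/w) for a uniform random u and keeps the k largest keys.

```python
import heapq
import random


def reservoir_sample(stream, k, rng=random):
    """Uniformly sample k items from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = item     # keep the new item with probability k/(i+1)
    return reservoir


def weighted_reservoir_sample(stream, k, rng=random):
    """Sample k items from (item, weight) pairs; larger weights are favored."""
    heap = []  # min-heap of (key, index, item); index breaks ties between keys
    for idx, (item, w) in enumerate(stream):
        if w <= 0:
            continue                    # zero-weight items are never selected
        key = rng.random() ** (1.0 / w)
        entry = (key, idx, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]


print(reservoir_sample(range(1000), 5))
print(weighted_reservoir_sample([("a", 1.0), ("b", 2.0), ("d", 3.0)], 2))
```

Each key depends only on its own item, so it seems natural to sample partitions separately and merge them by keeping the k largest keys overall. That is one plausible reading of the distributed variants mentioned above, not a description of how ClouderaML implements them.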

Original post: http://blog.csdn.net/hotallen/article/details/19247387