Monday, September 16, 2019

K-Means clustering - unsupervised learning


  • We start with a collection of data that we want to divide into meaningful groups, or clusters.
  • The method attempts to split the data into K groups. Each group consists of the points closest to one centroid, hence there are K centroids.
  • Can discover interesting groups of things - people, behaviors, etc.
  • How does the algorithm work?
  1. Randomly pick K centroids (the "means" in K-means).
  2. Assign each data point to the centroid it is closest to.
  3. Recompute each centroid as the average position of the points assigned to it.
  4. Repeat steps 2 and 3 until the centroids stop changing (or move less than a chosen threshold).
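
The four steps above can be sketched in plain NumPy. This is only a minimal illustration, not how sklearn implements it: the function name kmeans and the parameters max_iters, tol and seed are made up for this example, the initial centroids are assumed to be K randomly chosen data points, and an empty cluster is assumed to keep its previous centroid.

```python
import numpy as np


def kmeans(points, k, max_iters=100, tol=1e-6, seed=0):
    """Toy K-means; names and defaults are illustrative assumptions."""
    rng = np.random.default_rng(seed)

    # Step 1: randomly pick K distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(max_iters):
        # Step 2: assign each point to the centroid it is closest to.
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)

        # Step 3: recompute each centroid as the mean of its assigned points
        # (assumption: an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Step 4: stop once the centroids move less than the threshold.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break

    return centroids, labels


# Two made-up blobs of 2D points, just to exercise the function.
demo_rng = np.random.default_rng(1)
points = np.vstack([
    demo_rng.normal(0.0, 0.5, size=(50, 2)),
    demo_rng.normal(10.0, 0.5, size=(50, 2)),
])
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Because the starting centroids are random, different seeds can give different clusterings, and the loop may settle on a poor local solution. Running it from several starting points is a common way to reduce that risk.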


split data into 3 clusters



Python code example using KMeans from sklearn.cluster:

from sklearn.cluster import KMeans
from sklearn.preprocessing import scale
import numpy as np
import matplotlib.pyplot as plt

# create fake data of income/age for N people and k clusters
def createClusterdData(Npeople, k):
    np.random.seed(10)
    pointsPerCluster = float(Npeople)/k
    X=[]
    for i in range(k):
        # from 20000 to 200000
        incomeCentroid = np.random.uniform(20000, 200000)

        # from 20 to 70
        ageCentroid = np.random.uniform(20, 70)

        for j in range(int(pointsPerCluster)):
            X.append([np.random.normal(incomeCentroid, 10000.0), np.random.normal(ageCentroid, 2.0)])

    X = np.array(X)
    return X


# 100 people for 5 clusters
data = createClusterdData(100, 5)

# create a model with K=4 (note: the fake data above was generated around 5 centroids)
model = KMeans(n_clusters=4)

# scale the data to normalize it, because income values are far larger
# than age values - the scale function standardizes each column so that
# both features are on the same scale
model = model.fit(scale(data))

print(model.labels_)

plt.figure(figsize=(8, 6))
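# 2 columns: income (x axis) and age (y axis), colored by cluster label
# (np.float was removed from recent NumPy versions, so use the builtin float)
plt.scatter(data[:, 0], data[:, 1], c=model.labels_.astype(float))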
plt.show()


4-means cluster
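
To see why the data is passed through scale before fitting, here is a small sketch of what that function does to income/age columns. The three sample rows are made up for illustration. The manual formula reflects the default behavior of scale as I understand it: each column is standardized on its own to zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical income/age rows, only for illustration.
sample = np.array([
    [30000.0, 25.0],
    [60000.0, 40.0],
    [150000.0, 65.0],
])

scaled = scale(sample)

# Same result by hand: (value - column mean) / column standard deviation.
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

print(scaled)
print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```

Without this step, the distance between two people would be dominated by income, because a difference of a few thousand in income dwarfs any possible difference in age.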



