- A collection of data that we want to divide into meaningful groups, or clusters.
- The method attempts to split the data into K groups. Each data point belongs to the group whose centroid is closest to it, hence there are K centroid points.
- Can discover interesting groups of things: people, behaviors, etc.
- How does the algorithm work?
- Randomly pick K centroids (K-means).
- Assign each data point to the centroid it is closest to.
- Recompute each centroid as the average position of the points assigned to it.
- Iterate until the centroids stop changing, or move less than a chosen threshold.
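
The steps above can be sketched in plain Python, without any libraries, to make each stage visible. This is only a minimal illustration, not how sklearn implements it: the function name `kmeans`, its parameters `max_iters`, `tol` and `seed`, and the sample points are all assumptions made up for this sketch.

```python
import math
import random


def kmeans(points, k, max_iters=100, tol=1e-6, seed=0):
    """Plain-Python K-means sketch: returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: pick K distinct data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)

    for _ in range(max_iters):
        # Step 2: assign each point to the index of its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]

        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    [sum(axis) / len(members) for axis in zip(*members)]
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])

        # Step 4: stop once no centroid moved more than `tol`.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break

    return centroids, labels


# Two well-separated groups of 2-D points (made-up sample data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Which cluster gets label 0 and which gets label 1 depends on the random starting centroids, so only the grouping itself is meaningful, not the label numbers.
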
Figure: data split into 3 clusters
Python code example using sklearn.cluster KMeans:

    from sklearn.cluster import KMeans
    from sklearn.preprocessing import scale
    import numpy as np
    import matplotlib.pyplot as plt

    # create fake data of income/age for N people and k clusters
    def createClusterdData(Npeople, k):
        np.random.seed(10)
        pointsPerCluster = float(Npeople)/k
        X = []
        for i in range(k):
            # from 20000 to 200000
            incomeCentroid = np.random.uniform(20000, 200000)
            # from 20 to 70
            ageCentroid = np.random.uniform(20, 70)
            for j in range(int(pointsPerCluster)):
                X.append([np.random.normal(incomeCentroid, 10000.0),
                          np.random.normal(ageCentroid, 2.0)])
        X = np.array(X)
        return X

    # 100 people for 5 clusters
    data = createClusterdData(100, 5)

    # create a model that looks for 4 clusters
    model = KMeans(n_clusters=4)

    # scale the data to normalize it because there is a very large difference
    # between the magnitudes of age and income - the scale function fits the
    # values to the same scale
    model = model.fit(scale(data))
    print(model.labels_)

    plt.figure(figsize=(8, 6))
    # 2 columns: income, age
    # (np.float was removed from recent NumPy versions; the builtin float works)
    plt.scatter(data[:, 0], data[:, 1], c=model.labels_.astype(float))
    plt.show()
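
To see why scaling matters before clustering, here is a rough plain-Python sketch of what `scale` does with its default settings: each column is shifted to zero mean and divided by its (population) standard deviation. The helper name `standardize_columns` and the three sample rows are assumptions invented for this illustration.

```python
from statistics import fmean, pstdev


def standardize_columns(rows):
    """Rescale every column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Population standard deviation; a constant column is left unscaled.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical (income, age) rows, only for illustration.
people = [(30000.0, 25.0), (60000.0, 35.0), (90000.0, 45.0)]
scaled = standardize_columns(people)
for row in scaled:
    print(row)
```

Before scaling, a distance between two people is dominated almost entirely by income, because income differences are in the tens of thousands while age differences are in the tens. After scaling, both columns vary over a comparable range, so both influence which centroid is closest.
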

