# K-mean clustering algorithm

K-means is a center-seeking unsupervised algorithm. It is an iterative non-deterministic method. If you are looking to segment or group or cluster anything, you could use K-means algorithm. For example, you might want to give different users different experiences.

If the data existing in a five-dimensional space where each axis corresponds to one attribute. So there’s a gender axis, an income axis, and so on. You can also label the various possible buckets along the corresponding axes, and if you did so, the resulting grid would consist of every possible bin—a bin for each possible combination of attributes. It’s highly unlikely to build a different marketing campaign for each bin. To solve this problem, k-means clustering algorithm can be used where k is the number of bins.

✓ How K-mean algorithm works,
(1) Initially, you randomly pick k centroids in d-space. Try to make them near the data but different from one another.
(2) Then assign each data point to the closest centroid.
(3) Move the centroids to the average location of the data points.
(4) Repeat the preceding two steps until the assignments don’t change, or change very little.

How the K-mean clustering algorithm converges is illustrated in this figure,

There is no good way to select the value of k, it has to be determined by running the algorithm multiple times.

Choosing k is more an art than a science, although there are bounds: 1<=k <=n, where n is number of data points.

Silhouette coefficient is used to calculate the quality of k-mean clustering.

There are convergence issues — the solution can fail to exist, if the algorithm falls into a loop. Recently, a modification was made on this algorithms in order to avoid the convergence issues. It’s done by optimizing the initial seeds.

The presence of outliers in the data may yield poor results. A good practice is to do a thorough data exploration in order to identify the data characteristics before running k-means.