Statistical Clustering.k-Means .*;
k-means clustering is a method of classifying/grouping items into k groups (where k is the number of pre-chosen groups). The grouping is done by minimizing the sum of squared distances (Euclidean distances) between items and the corresponding centroid.
A centroid is "the center of mass of a geometric object of uniform density", though here, we'll consider mean vectors as centroids.
Figure 1. A clustered scatter plot. The black dots are data points. The red lines illustrate the partitions created by the k-means algorithm. The blue dots represent the centroids which define the partitions.
The initial partitioning can be done in a variety of ways.
- Dynamically Chosen: This method is good when the amount of data is expected to grow. The initial cluster means can simply be the first few items of data from the set. For instance, if the data will be grouped into 3 clusters, then the initial cluster means will be the first 3 items of data.
- Randomly Chosen: Almost self-explanatory, the initial cluster means are randomly chosen values within the same range as the highest and lowest of the data values.
- Choosing from Upper and Lower Bounds: Depending on the types of data in the set, the highest and lowest (or at least the extremities) of the data range are chosen as the initial cluster means. The example below uses this method.
This outline of the algorithm assumes two clusters, and each individual's scores include two variables (as in the example above). Adding more clusters is as easy as adding another step like Step 4 or Step 5. Adding another variable for each individual is as easy as adding calculations within the type of step like Step 4 or Step 5.
Choose the number of clusters.
Set the initial partition, and the initial mean vectors for each cluster.
For each remaining individual...
Get averages for comparison to the Cluster 1:
Add individual's A value to the sum of A values of the individuals in Cluster 1, then divide by the total number of scores that were summed.
Add individual's B value to the sum of B values of the individuals in Cluster 1, then divide by the total number of scores that were summed.
Get averages for comparison to the Cluster 2:
Add individual's A value to the sum of A values of the individuals in Cluster 2, then divide by the total number of scores that were summed.
Add individual's B value to the sum of B values of the individuals in Cluster 2, then divide by the total number of scores that were summed.
If the averages found in Step 4 are closer to the mean values of Cluster 1, then this individual belongs to Cluster 1, and the averages found now become the new mean vectors for Cluster 1.
If closer to Cluster 2, then it goes to Cluster 2, along with the averages as new mean vectors.
If there are more individual's to process, continue again with Step 4. Otherwise go to Step 8.
Now compare each individual’s distance to its own cluster's mean vector, and to that of the opposite cluster. The distance to its cluster's mean vector should be smaller than it distance to the other vector. If not, relocate the individual to the opposite cluster.
If any relocations occurred in Step 8, the algorithm must continue again with Step 3, using all individuals and the new mean vectors.
If no relocations occurred, stop. Clustering is complete.
Again, in case the algorithm never settles on a final solution, it may be a good idea to implement a maximum number of iterations check.
K-Means Clustering Weaknesses
- With fewer samples of data, initial grouping will determine the cluster significantly.
- The number of clusters, k, must be determined before hand.
- With fewer samples of data, inaccurate clustering can occur.
- We never know which variable contributes more to the clustering process since we assume that each has the same weight.
- The accuracy of mathematical averaging weakens because of outliers, which may pull the centroid away from its true position.
- The results are clusters with circular or spherical shapes because of the use of distance.
- Include as many samples of data as possible (the more data, the more accurate the results).
- To avoid distortions caused by excessive outliers, the median can be used instead of the mode.