Clustering

Tristan C
6 min read · Aug 9, 2021

Within the world of data mining, one of the most common tasks after collecting data is categorizing it in a specific way or looking for patterns within the dataset itself. There are a plethora of ways to do this, but one of the most popular starting points for grouping objects into classes is clustering. At its core, clustering in data mining can be simply defined as taking objects from the data and sorting them into “classes” made up of objects that are very similar to each other. The best way to look at it is that clustering organizes the data based on the characteristics and ideas the objects share. These similarities are extremely important, as the main goal of the entire approach is to group the most similar items together, which in turn can help you find patterns, possible algorithms, and more within your data.

While clustering is quite a popular choice for characterizing data, there are many different ways to define your clustering methods and techniques. The reason so many are available is described well by Saxena et al. in their review of clustering techniques: there is “no such precise definition to the notion of ‘cluster’” (Saxena et al., 2017). Despite the variety of techniques, one thing that has been clearly agreed on is the two main families of clustering techniques that have been in use for quite a while: hierarchical clustering techniques and partitioning techniques.

The hierarchical clustering technique forms the desired clusters by arranging the data in a hierarchical structure (usually top-to-bottom or bottom-to-top). Within the hierarchical family, the two most common approaches are agglomerative and divisive clustering. Agglomerative clustering takes the bottom-up approach, “building up clusters starting with single object and then merging these atomic clusters into larger and larger clusters…” while divisive clustering takes the top-down approach, “which breaks up clusters containing all objects into smaller clusters, until each object forms a cluster on its own…” (Saxena et al., 2017). With either method in place, the end result of the cluster analysis is usually a dendrogram, a hierarchical diagram that shows the relationships between individual objects and clusters.
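To make the bottom-up idea concrete, here is a minimal pure-Python sketch of agglomerative clustering with single linkage on a handful of made-up 1-D points (the data and the linkage choice are illustrative assumptions, not from the paper). Each object starts as its own cluster, the two closest clusters are merged repeatedly, and the recorded merge sequence is exactly what a dendrogram draws:

```python
# A minimal agglomerative (bottom-up) clustering sketch using
# single linkage on 1-D toy points. Every object starts as its own
# cluster; the closest pair of clusters is merged until one remains.

def single_link_distance(a, b):
    """Smallest pairwise distance between two clusters of 1-D points."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points):
    clusters = [[p] for p in points]   # each object is its own cluster
    merges = []                        # merge record = the dendrogram
    while len(clusters) > 1:
        best = None                    # find the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

merges = agglomerate([1.0, 1.5, 5.0, 5.2, 11.0])
for left, right, dist in merges:
    print(left, "+", right, "at distance", dist)
```

Reading the merge record bottom-up gives the dendrogram: the tightest pairs merge first, and the loneliest point (11.0 here) joins last, at the largest linkage distance.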

Conversely, the partitioning method takes a completely different approach to organizing and grouping the data. Here the data is divided into “k” clusters, and instead of building a hierarchy, the method splits the space and measures relationships between points using some distance function. There are many such functions, the most popular being Euclidean distance, which measures the straight-line distance between two points and is used to decide which cluster each point is closest to. Particular models and methods that use partition clustering include k-means, PAM, CLARA, DBSCAN, and others.
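As a small illustration of the partitioning idea, the sketch below (with made-up points and centres) assigns each point to whichever of the k centres minimizes the Euclidean distance, which is the comparison most partitioning methods are built on:

```python
import math

# Partitioning sketch: given k cluster centres, each point belongs
# to whichever centre minimises the Euclidean distance to it.

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign(points, centres):
    """Map each point to the index of its nearest centre."""
    return [min(range(len(centres)), key=lambda k: euclidean(p, centres[k]))
            for p in points]

centres = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 11.0), (0.5, -1.0)]
print(assign(points, centres))  # -> [0, 1, 0]
```

This assignment step is the common core; methods like k-means, PAM, and CLARA differ mainly in how they choose and update the centres.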

Some really great things have been done with these approaches, and they are still being used today in a variety of real-world applications and research. For one example of clustering techniques helping the world around us, we need look no further than the web itself. Many websites offer an API of some sort that lets people pull information from the site and call it when writing programs, bots for chat services, and so on. However, the summaries and usage examples for these APIs can get lost in translation or simply lack cohesion, and that is where the work of Katirtzis and colleagues comes into play. In their paper, they mine call sequences from APIs to provide snippets that show what the APIs are targeting and how they work, using a method they dubbed CLAMS (Clustering for API Mining of Snippets). While mining, the method will “cluster a large set of usage examples based on their API calls, generate summarized versions for the top snippets…and select the most representative snippet from each cluster” (Katirtzis et al., 2018). Through testing and examples, they found that this technique is a huge help in creating easy-to-read snippets of code that explain APIs, which is just one small example of what clustering techniques can do.

Another example I would like to showcase involves one of the most popular clustering methods, k-means. Briefly, k-means enables the user to define groups within a dataset by comparing points to cluster means (averages). The data is partitioned into “k” clusters so that each item belongs to the cluster with the nearest mean value. This method is considered an “unsupervised” algorithm, meaning that it can learn from datasets of untagged data, and it runs without you having to add any extra labels or variables once it begins.
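To show the assign-then-average loop in action, here is a minimal pure-Python k-means sketch (Lloyd's algorithm, the standard formulation) on toy 2-D points, with the initial centres fixed by hand for reproducibility; all data values are invented for illustration:

```python
import math

# Minimal k-means (Lloyd's algorithm): alternate two steps --
# assign each point to its nearest centre, then move each centre
# to the mean of the points assigned to it.

def kmeans(points, centres, iterations=10):
    for _ in range(iterations):
        # assignment step: nearest centre by Euclidean distance
        groups = [[] for _ in centres]
        for p in points:
            k = min(range(len(centres)),
                    key=lambda i: math.dist(p, centres[i]))
            groups[k].append(p)
        # update step: each centre becomes the mean of its group
        centres = [tuple(sum(v) / len(g) for v in zip(*g)) if g else c
                   for g, c in zip(groups, centres)]
    return centres, groups

points = [(1, 1), (1.5, 2), (8, 8), (9, 9), (8.5, 9.5)]
centres, groups = kmeans(points, centres=[(0, 0), (10, 10)])
print(centres)
```

On this toy data the loop converges almost immediately: the two low points form one cluster around their mean and the three high points form the other.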

While this method may be more popular than others, it does have some drawbacks that you should understand before diving in. Since the method draws heavily on the means of data values, you must always check your data, as k-means is susceptible to undue influence from outliers: any extremely low or high values can leave you with inconsistent results. Another issue is described in an article proposing a new k-means algorithm, which notes the “serious drawback that its performance heavily depends on the initial starting conditions.” (Likas et al., 2003). Because of this, people have tried to alleviate these issues with specific algorithms and fixes. No single fix has yet been widely adopted, but several are being explored. The article cited above proposes its own algorithm that attempts to solve the problem by adding some actual order and regulation to how starting points are chosen, instead of all of them being randomly selected.
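The initialization sensitivity that Likas et al. point out is easy to demonstrate. In the toy example below (data and starting centres invented for illustration), the same Lloyd iterations on the same four points converge to very different partitions, and very different total squared error, depending only on where the centres start:

```python
import math

# The same Lloyd iterations run from two different starting centres.
# Four points at the corners of a wide rectangle: the natural k = 2
# split is the left pair vs the right pair, but a bad start can get
# permanently stuck splitting bottom vs top instead.

def lloyd(points, centres, iterations=20):
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for p in points:
            k = min(range(len(centres)),
                    key=lambda i: math.dist(p, centres[i]))
            groups[k].append(p)
        centres = [tuple(sum(v) / len(g) for v in zip(*g)) if g else c
                   for g, c in zip(groups, centres)]
    # within-cluster sum of squared distances: the k-means objective
    return sum(math.dist(p, centres[k]) ** 2
               for k, g in enumerate(groups) for p in g)

data = [(0, 0), (0, 1), (4, 0), (4, 1)]

good = lloyd(data, centres=[(0, 0.5), (4, 0.5)])  # one centre per pair
bad = lloyd(data, centres=[(2, 0), (2, 1)])       # both centres mid-field
print(good, bad)  # the bad start converges to a far worse local optimum
```

The bad initialization is a fixed point of the algorithm, so no number of extra iterations will rescue it; this is exactly why smarter seeding schemes (such as the global k-means idea in the cited article) exist.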

However, despite the drawbacks listed above, k-means clustering is still a commonly used method for projects and reports around the world today. One real-world example comes from the Middle East, where k-means clustering is being used to help manage the cell towers and sites located across the region. The main issue, as the article describes it, is that many cell sites sit in extremely difficult locations and climates, and the operators wanted to make sure they could allocate a safe number of technicians to each area so that everything was being tended to properly. To figure out the best way to tackle this, the article's authors decided it was worth adapting the k-means clustering algorithm “to partition a set of data points into a number of clusters…” (Gbadoubissa et al., 2020). Utilizing this algorithm along with a few other assets such as geometric initialization, the authors were able to propose a way to assign technicians to hard-to-reach cell towers and sites that not only gives each site a well-rounded number of technicians but does so in a way that optimizes the maintenance operations and cost.

References

Gbadoubissa, J. E. Z., Ari, A. A. A., & Gueroui, A. M. (2020). Efficient k-means based clustering scheme for mobile networks cell sites management. Journal of King Saud University-Computer and Information Sciences, 32(9), 1063–1070.

Katirtzis, N., Diamantopoulos, T., & Sutton, C. (2018, April). Summarizing Software API Usage Examples Using Clustering Techniques. In FASE (pp. 189–206).

Likas, A., Vlassis, N., & Verbeek, J. J. (2003). The global k-means clustering algorithm. Pattern Recognition, 36(2), 451–461.

Saxena, A., Prasad, M., Gupta, A., Bharill, N., Patel, O. P., Tiwari, A., … & Lin, C. T. (2017). A review of clustering techniques and developments. Neurocomputing, 267, 664–681.

