AI clustering is the machine learning (ML) process of organizing data into subgroups with similar attributes or elements. Clustering algorithms tend to work well in environments where the answer does not need to be perfect; it just needs to be close enough to count as an acceptable match. AI clustering can be particularly effective at identifying patterns in unsupervised learning. Some common applications are in human resources, data analysis, recommendation systems and social science.
Data scientists, statisticians and AI scientists use clustering algorithms to find answers that are close to previously known answers. They first use a training dataset to define the problem and then look for potential solutions that are similar to those generated with the training data.
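As a minimal sketch of that workflow, the snippet below fits scikit-learn's KMeans on a small made-up customer dataset and then assigns a new record to the closest group; the features, values and cluster count are illustrative assumptions, not drawn from the article.

```python
# A minimal sketch of clustering with scikit-learn's KMeans.
# The data, feature meanings and number of clusters are purely illustrative.
import numpy as np
from sklearn.cluster import KMeans

# Each row is one customer: [age, average monthly spend]
training_data = np.array([
    [22, 40], [25, 35], [27, 50],    # younger, lower-spend shoppers
    [45, 300], [50, 280], [48, 320]  # older, higher-spend shoppers
])

# Fit two clusters on the training data, then assign a new data point
# to whichever cluster it sits closest to.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(training_data)

new_customer = np.array([[24, 45]])
print(model.predict(new_customer))  # index of the closest cluster
```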
One challenge is defining “closeness,” because the desired answer is usually generated with the training data. When the data has several dimensions, data scientists can also guide the algorithm by assigning weights to the different data columns in the equation used to define closeness. It is not uncommon to work with several different functions that define closeness.
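As a rough sketch of how weights can shape a closeness function, the snippet below defines a weighted Euclidean distance; the column weights are hypothetical and would normally be tuned to the problem at hand.

```python
# A sketch of a weighted "closeness" function. The column weights are
# hypothetical; in practice they are tuned to the specific problem.
import numpy as np

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two rows of data."""
    a, b, weights = np.asarray(a), np.asarray(b), np.asarray(weights)
    return np.sqrt(np.sum(weights * (a - b) ** 2))

# Give the first column (say, income) twice the influence of the others.
weights = [2.0, 1.0, 1.0]
print(weighted_distance([50, 3, 1], [60, 2, 1], weights))
```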
Once the closeness function, also called the similarity metric or distance measure, is defined, much of the work lies in storing the data so that it can be searched quickly. Some database designers create special layers to simplify that search. A key part of many algorithms is this distance metric, which defines how far apart two data points are.
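One common approach, sketched below under the assumption that a plain Euclidean distance is an acceptable closeness function, is to index the records in a k-d tree so that the nearest neighbors of a query can be found without scanning every record; the data here is random and purely illustrative.

```python
# A sketch of storing data so it can be searched quickly, using a k-d tree.
# SciPy's cKDTree indexes the points up front so that nearest-neighbor
# queries do not have to scan every record.
import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(10_000, 3)   # 10,000 records with 3 numeric fields
tree = cKDTree(points)

query = np.array([0.5, 0.5, 0.5])
distance, index = tree.query(query)  # closest stored point and its distance
print(index, distance)
```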
Another approach involves turning the problem on its head and deliberately searching for the worst possible match. This is suited to problems such as anomaly detection in security applications, where the goal is to identify data elements that don’t fit in with the others.
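A simple sketch of that idea, assuming a centroid-based model such as k-means and an arbitrary 99th-percentile cutoff, is to flag the points that sit farthest from every cluster center; the data and threshold are made up for illustration.

```python
# A sketch of anomaly detection as "searching for the worst match":
# fit clusters, then flag the points farthest from every cluster center.
import numpy as np
from sklearn.cluster import KMeans

data = np.vstack([
    np.random.normal(0, 1, size=(100, 2)),   # one normal group of traffic
    np.random.normal(10, 1, size=(100, 2)),  # another normal group
    np.array([[50.0, 50.0]])                 # an obvious outlier
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Distance from each point to its nearest cluster center.
distances = np.min(model.transform(data), axis=1)
threshold = np.percentile(distances, 99)
anomalies = np.where(distances > threshold)[0]
print(anomalies)  # indices of the points that fit worst
```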
Scientists and mathematicians have created different algorithms for detecting various types of clusters. Choosing the right solution for a specific problem is a common challenge.
The categories are not always clear-cut. Scientists may use methods that fall squarely into one classification, or they might employ hybrid algorithms that borrow techniques from multiple categories.
Categories of clustering algorithms include centroid-based methods such as k-means, hierarchical methods that build a tree of nested clusters, density-based methods such as DBSCAN, and distribution-based methods such as Gaussian mixture models.
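As a rough sketch, the snippet below runs one scikit-learn algorithm from each of those families on the same toy dataset; the data and parameter values are illustrative.

```python
# One algorithm from each common category, applied to the same toy data.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(0, 0.5, size=(100, 2)),   # one tight group
    rng.normal(5, 0.5, size=(100, 2)),   # another tight group
])

print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data))  # centroid-based
print(AgglomerativeClustering(n_clusters=2).fit_predict(data))            # hierarchical
print(DBSCAN(eps=0.8, min_samples=5).fit_predict(data))                   # density-based
print(GaussianMixture(n_components=2, random_state=0).fit_predict(data))  # distribution-based
```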
Note: Many database companies use the word “clustering” in a different way. The word can also describe a group of machines that work together to store data and answer queries. In that context, clustering algorithms make decisions about which machines will handle the workload. To make matters more confusing, sometimes these data systems will also apply AI clustering algorithms to classify data elements.
Clustering algorithms are deployed as part of a wide array of technologies. Data scientists rely on them to help with classification and sorting.
For instance, many applications that work with people can be more successful with better clustering algorithms. Schools may want to place students in class sections based on their talents and abilities; clustering algorithms can group students with similar interests and needs.
Some businesses want to separate their potential customers into different categories so that they can give those customers more appropriate service. Neophyte buyers can be offered extensive help so they can understand the products and the options. Experienced customers can be taken immediately to the offerings, and perhaps given special pricing that has worked for similar buyers.
There are many other examples from a diverse range of industries, like manufacturing, banking and shipping. All rely on the algorithms to separate the workload into smaller subsets that can get similar treatment. All of these options depend heavily on data collection.
How do distance metrics define the clustering algorithms? If a cluster is defined by the distances between data elements, the measurement of the distance is an essential part of the process. Many algorithms rely on standard ways to calculate the distance, but some rely on different formulas with different advantages.
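For a concrete sense of those differences, the sketch below compares three standard formulas on the same pair of points using SciPy; the points themselves are arbitrary.

```python
# A sketch comparing a few standard distance formulas on the same pair
# of points. Each has different advantages: Euclidean is the usual
# straight-line distance, Manhattan is less sensitive to a single large
# difference, and cosine distance ignores overall magnitude.
from scipy.spatial.distance import euclidean, cityblock, cosine

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

print("euclidean:", euclidean(a, b))
print("manhattan:", cityblock(a, b))
print("cosine:   ", cosine(a, b))  # 0 here, because b points the same way as a
```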
Many find the idea of a “distance” itself confusing. We use the term so often to measure how far we must travel in a room or around the globe that it can feel odd to consider two data points — like describing a user’s preferences for ice cream or paint color — as being separated by any distance. But the word is a natural way to describe a number that measures how close the elements may be to each other.
Scientists and mathematicians generally rely on formulas that satisfy what they call the “triangle inequality”: the distance between points A and B plus the distance between B and C is greater than or equal to the distance between A and C. When the formula guarantees this, the process gains consistency. Some also rely on more rigorous definitions, like “ultrametrics,” that offer stronger guarantees. Strictly speaking, clustering algorithms do not need to insist on this rule, because any formula that returns a number might do, but the results are generally better when it holds.
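A quick sketch of the rule for the ordinary Euclidean distance, with points chosen only to illustrate the check:

```python
# Check the triangle inequality for Euclidean distance:
# d(A, C) <= d(A, B) + d(B, C). The points are arbitrary.
import numpy as np

def dist(p, q):
    return np.linalg.norm(np.asarray(p) - np.asarray(q))

A, B, C = [0, 0], [3, 4], [6, 0]
assert dist(A, C) <= dist(A, B) + dist(B, C)
print(dist(A, B), dist(B, C), dist(A, C))  # 5.0 5.0 6.0
```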
The statistics, data science and AI services offered by leading tech vendors include many of the most common clustering algorithms. The algorithms are implemented in the languages that form the foundation of these platforms, often Python. Vendors include Amazon Web Services, Google Cloud, Microsoft Azure, IBM and Oracle, all of which bundle clustering algorithms with their machine learning and analytics services.
Established data specialists and a raft of startups are challenging the major vendors by offering clustering algorithms as part of broader data analysis packages and AI tools.
Teradata, Snowflake and Databricks are leading niche companies focused on helping enterprises manage the often relentless flows of data by building data lakes or data warehouses. Their machine learning tools support some of the standard clustering algorithms so data analysts can begin classification work as soon as the data enters the system.
Startups such as the Chinese firm Zilliz, with its Milvus open-source vector database, and Pinecone, with its SaaS vector database, are gaining traction; their products offer efficient ways to search for close matches, which can be very useful in clustering applications.
Some are also bundling algorithms with tools focused on particular vertical segments. They pre-tune the models and algorithms to work well with the types of problems common in that segment. Zest.ai and Affirm are two examples of startups that are building models to guide lending. They don’t sell algorithms directly but rely on the algorithms’ decisions to guide their products.
A number of companies use clustering algorithms to segment their customers and provide more direct and personalized solutions. You.com is a search engine company that relies on customized algorithms to provide users with personalized recommendations and search results. Observe AI aims to improve call centers by helping companies recognize the opportunities in offering more personalized options.
As with all AI, the success of clustering algorithms often depends on the quality and suitability of the data used. If the numbers yield tight clusters with large gaps in between, the clustering algorithm will find them and use them to classify new data with relative success.
The problems occur when there are no tight clusters, or when data elements land in a gap where they are roughly equidistant from several clusters. The solutions are often unsatisfactory because there’s no easy way to choose one cluster over another. One may be slightly closer according to the distance metric, but that may not be the answer that people want.
In many cases, the algorithms aren’t smart enough or flexible enough to accept a partial answer or one that assigns multiple classifications. While there are many real-world examples of people or things that can’t be easily classified, computer algorithms often have a single field that can accept only one answer.
The biggest problems arise, though, when the data is too spread out and there are no clearly defined clusters. The algorithms may still run and generate results, but the answers will seem random and the findings will lack cohesion.
Sometimes it is possible to enhance the clusters or make them more distinct by adjusting the distance metric. Adding different weights for some fields or using a different formula may emphasize some parts of the data enough to make the clusters more clearly defined. But if these distinctions are artificial, the users may not be satisfied with the results.
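As a rough sketch of that idea, the snippet below boosts the weight of the informative field in a synthetic dataset and compares silhouette scores, a standard measure of how well separated the resulting clusters are; the data and weights are made up for illustration.

```python
# A sketch of how re-weighting one field can make clusters more distinct.
# The silhouette score (higher is better, up to 1.0) gives a rough sense
# of cluster separation. Data and weights are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Column 0 separates the two groups; column 1 is mostly noise.
group_a = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 5, 200)])
group_b = np.column_stack([rng.normal(4, 1, 200), rng.normal(0, 5, 200)])
data = np.vstack([group_a, group_b])

def score(weights):
    weighted = data * np.asarray(weights)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(weighted)
    return silhouette_score(weighted, labels)

print("equal weights:            ", score([1.0, 1.0]))
print("informative field boosted:", score([5.0, 1.0]))
```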