Tableau 10.0 comes with k-means clustering as a built-in function so it is worthwhile talking about the use cases for clustering, how the algorithm works and why we chose to make it work the way it is.
K-means procedure splits the data into K segments. Each segment has a centroid that corresponds to the mean value for the members in that segment. The objective of the algorithm is to place the centroids such that the total of the sum of distances between centroids and members in respective segments is as small as possible.
A simple example
To demonstrate, in the toy example below, algorithm finds two groups with means 49 and 130 and assigns each point to the nearest mean. For example, a data point valued 85 would belong to the cluster on the left since it is closer to 49 than it is to 130.
We can clearly see that there are two groups by looking at the two humps in the histogram. There is no clear gap that could be treated to visually separate the data points however. K-means helps with the decision on where to split. Below is an example of what this might look like in 2 dimensions.
As simple as it may seem, even clustering on a single variable has many applications. For example it is commonly advised to not have more than 7 colors on a choropleth map. If you have a continuous measure and want to convert to a discrete color palette with a particular number of colors, you can achieve this via clustering. In fact, Jenks Natural Breaks method which is the common default on most mapping software is application of k-means on a single field.
But in many real life scenarios, data is much more complex than a single variable. Age, income, number of kids, average transaction amount, weight, blood pressure… can all tell you something about different ways to group your customers, patients, students…
Making sense of the results
Sometimes groupings in data make immediate sense. When clustering by income and age, one could come across a group that can be labeled as “young professionals”. Clustering feature comes with a Describe dialog that gives you summary statistics for each cluster to help with this process.
In UN’s development indicators dataset, using the Describe dialog, one can clearly see that Cluster 1, Cluster 2 and Cluster 3 correspond to Underdeveloped, Developing and Highly Developed countries respectively. By doing so we’re using k-means to compress the information that is contained in 3 columns and 180+ rows to just three labels.
Clustering can sometimes also find patterns your dataset may not be able to sufficiently explain by itself.
For example as you’re clustering health records, you may find two distinct groups and “why?” is not immediately clear and describable with the available data, which may lead you to ask more questions and maybe later realize that difference was because one group exercised regularly while the other didn’t, or one had an immunity to a certain disease or may even indicate things like fraudulent activity/drug abuse which otherwise you may not have noticed. Given it is hard to anticipate and collect all relevant data, such hidden patterns are not uncommon in real life.
A great real life example of this is from Bank of America. In the 90s Bank of America’s National Consumer Assets Group started exploring clustering to segment their customers. One cluster they found had a high portion of customers who either had a home equity line of credit or had a high propensity score for that product. Cluster included only 7 percent of the bank’s customers but it had more than a quarter of the customers with the highest equity propensity scores. The customers tended to be middle aged, married home owners with teenage children. 30% of them had both personal and business-oriented banking products meaning many were small business owners. They couldn’t figure out what this meant by just looking at the data. Survey of branch managers confirmed that these customers were borrowing against their home equity to fund the startup of a small business! None of the bank’s marketing material had been oriented that way. A new campaign around this idea got much better response from customers.
You will see many cases where clustering is used as a coarse summary of the structure in the data that helps move the data exploration in a direction worth exploring and being a tool for slicing the data (forcing a structure on the data) as opposed to serving an exact answer on a silver platter. Save and Reuse section provides yet another example that underlines this point.
Back to the algorithm…
Like any procedure there are many ways to implement k-means. Typical k-means implementations use random initializations (either random seed or random partitioning) which means two things 1) Subsequent runs over the same data will return different results 2) With a single random initialization, the results will most likely be suboptimal (stuck at a local minima) because it is random guessing. Common solution for (2) is to start over thousands of times and pick the best result which increases the chances of finding a good result at the expense of long wait times.
Our goal was to deliver great results that are repeatable with good performance. This is a tough balancing act. Quality and repeatability are must haves for trusting the results while good performance encourages experimentation with different filters, levels of detail, input variables etc. which is the path to more insights.
After testing a number of initialization methods we found the sweet spot with Howard-Harris method. Howard-Harris is a hierarchical divisive method that finds the variable of highest variance and splits the data in half (at the mean of that variable) and uses the means of each half as initial centroids to find 2 clusters, in our case, using Lloyd’s algorithm. Then it continues by splitting the cluster with highest variance (out of the first 2 clusters) using the same method to get to a 3 cluster solution and repeats this process until the desired number of clusters is reached.
Lloyd’s algorithm takes the initial centers from Howard-Harris and clusters all points around the nearest center then computes a new center for each cluster by averaging the points in the cluster. If the new center matches the center from the previous step, it returns the result, if not, assigns points to the new center (the mean) and repeats the process.
Distance between data points
In the previous sections, I used the phrases “sum of distances” and “to the nearest centroid” several times. But how is this distance measured?
There are many ways to measure distance. Manhattan, Mahalanobis, Chebyshev, Minkowski, Euclidian…
Tableau uses squared Euclidian distance for clustering. A good example for understanding Euclidian distance is to start with the simple, 2 dimensional case; the hypotenuse of a triangle. For a right triangle where legs/catheti are 3 and 4, Euclidian distance between these two corners would be SQRT(3^2 + 4^2)=5. This can be extended to more than 2 variables e.g. SQRT(x^2 + y^2 + z^2…).
Tableau doesn’t take the square root which makes it squared Euclidian distance. It is common to use squared Euclidian distance for k-means clustering.
But how could one compute distances with categorical fields?
Tableau uses Multiple Correspondence Analysis (MCA) to convert categories into numeric distances before clustering so you can use both numeric and categorical fields as inputs to clustering.
Assume you have a single column with 3 categories. Shoes, Dresses and Hats. The three categories don’t contain any true measureable distance information. They are just 3 different strings. Given this is the only available information, the assumption would be that they are at equal distance from each other. If you like thinking in pictures, you can imagine them to be the 3 corners of an equilateral triangle.
Occurrence count and if there is more than 1 categorical column, co-occurrences also impact the distances. For example if you have hospital admission form data, you would likely have some people who checked both female and pregnant boxes, some female and not pregnant, and potentially some, by mistake, marked themselves as male and pregnant. Male and Pregnant would be a very rare occurrence in the database so it would be further away from other points.
Tableau currently imposes a limit of max 25 unique categories per field and a total of 1000 unique categories. So you can use a maximum of 1000/25=40 categorical columns with 25 unique categories in each in a given clustering procedure.
Note that clustering accepts categorical fields as measures since they are meant to be used as attributes of the entities being clustered e.g. max education level attained for a person. When you use categorical fields that are finer than what’s being clustered e.g. list of products sold at a store while store is what’s being clustered ATTR([Field]) may return * for some rows. * is treated like NULL values which means those rows will end up in the “Not Clustered” group.
Wouldn’t some measures dominate the results?
If you have a table about country statistics, you might have a column for GDP and one for birth rate. Using the distance formula above, GDP would dominate the clustering results as it is orders of magnitude greater than birth rate. To avoid this problem, Tableau automatically scales all inputs. Of course, there are many ways to do scaling. Tableau uses min-max scaling (also known as normalization) that subtracts the minimum of the column from each value then divide the result by the difference between max and min of the column. As a result of this transformation each column get transformed to a range between 0 and 1. E.g. country with highest GDP would end up having a GDP of 1, lowest would be 0. Same is done for highest/lowest birth rates hence all variables get equally weighted. Standardization (z-score) is another common method used by many software packages that does k-means however min-max scaling is generally considered to be better for clustering per Stoddard (1979), Milligan & Cooper (1988), Dillon, Mulani & Frederick (1989) and Steinley (2004)’s research comparing different scaling methods.
How is optimal number of clusters determined?
There are many ways to find the optimal number of clusters. Some such as the elbow method involve eye-balling while others rely on picking the maximum or minimum value of a metric. You can find R packages that offer 50 different metrics and when you use them realize that all disagree with each other. This is because each method makes assumptions about the shape of the data and what makes a good cluster. There is abundant literature that compare these metrics in search for the methods that are most widely applicable and give good results and performance. We started with these studies and expanded upon them by running our own tests and decided to proceed with Calinski-Harabasz index based on the combination of its performance and how well it works with k-means.
Between group sum of squares (SSB) is the sum of squared distances between centroids of clusters and the centroid of the entire dataset. Larger values can be used as a sign of good separation between clusters.
Within group sum of squares (SSW) is the sum of squared distances between centroid of each cluster and the points in that cluster. Smaller values can be used as a sign of compactness and uniformity of individual clusters.
N is number of rows in the table and k is the number of clusters.
Tableau looks for the maximum value of Calinski-Harabasz index which happens when between group sum of squares is high and within group sum of squares is low. If equation only consisted of this ratio of course, the way to maximize it would be creating as many clusters as possible. This is where the second part (right side) of the equation comes in that one can think of as sort of a penalty for splitting into more clusters. This prevents further splits from happening unless a split is justified by enough reduction in error.
Tableau does k-means with different values of k ranging from 2 to 25 and compares each result with the previous. If current result is less than the previous, returns the previous result. Since it is looking for a local maximum, it will terminate early which means better performance and conservative estimates of cluster count.
When/why to manually set k?
It is very common to use clustering to segment data that has no visible “clumps”. Clothing manufacturers use it to slice data into Small, Medium, Large etc. clusters. Similarly it can help with product bundling decisions e.g. how many minutes, # SMS and GBs of data should be offered in a deal for “moderate users” segment?
Save and Reuse
Once you’re satisfied with the results, you can persist them by dragging the “Clusters” pill into the data pane. Resulting field can be used just like a group with rest of Tableau. You can rename/alias cluster groups even manually move items around.
Overriding results from an algorithm manually might sound weird at first but clustering is an exploratory process and it is perfectly OK to enhance results using external knowledge.
A good example of such use is the way Boston Globe divided 200 towns in eastern Massachusetts and southern New Hampshire into editorial zones. In 2003, The Boston Globe introduced geographically distinct versions of the paper with specialized content. Two days a week, readers would get local coverage for their area. The newspaper clustered towns into 4 groups based on demographic information but then applied external rules before coming up with the final result. For example, the zones needed to be geographically contiguous to optimize delivery routes and contain sufficient population to justify specialized content as well as being aligned with how the advertising deals were made which resulted in manually moving towns between groups such that it satisfies these constraints while also having similar demographics identified through clustering.
When does k-means not work well?
Like any statistical method k-means works well in some cases but not all. You can find many examples with a quick Google search showing some patterns that are very easy to detect visually but k-means doesn’t “find” them. Here are two of the typical examples used for these demonstrations.
Why is this happening?
K-means seeks the way to dividing the data into areas where each area contains similar items that can potentially be summarized/represented by its mean, such that you can look at a cluster’s centroid and give it a meaningful name e.g. enterprise customers.
If you look at the example with two circles, and just assume for a moment that one axis is weight, the other axis is height, what do the points in outer circle have common with each other? Outer circle contains both the tallest and the shortest person as well as heaviest and lightest. The members of outer circle don’t share any common attributes. One can say the same about the 3 spirals. In neither of these cases visually detected groupings are similar.
If one is looking to discover such patterns, k-means is not the right method. Patterns like this could be seen on maps (events following along a river or a road) or in image processing etc. and would require algorithms that are significantly slower than k-means that typically look at pairwise distances between points searching for continuity, dense areas vs. sparse areas etc. It is extremely unlikely that your business data will look anything like these.
When the goal is finding groups that share similar properties, k-means often does a good job of detecting them even if the areas have very different densities as shown in the example below as long as there is sufficient data.
K-means will give you convex/globular clusters since each point is assigned to the nearest centroid and if this works for your data and type of analysis (and in a lot of cases when data is being sliced/segmented into desired number of buckets it will), it can be safely applied to enhance your data exploration experience.
I can’t emphasize the importance of the word exploration enough. First of all, when looking for clusters, you will most likely want to write calculations, apply filters, aggregate your data… to convert your e.g. transaction level data to metrics that might potentially uncover clusters such as customer’s average spending or days since last visit. This is often referred to as feature building and is the step where many analyses prematurely fail. We built our clustering user experience to be very dynamic and interactive to encourage more experimentation to increase the likelihood of relevant metrics to be distilled. Secondly, examining results of clustering (e.g. using Describe dialog) is an integral part of the analysis. Of course this can be said for any statistics or data mining method. But with clustering, given you have the ability to augment the results with your domain knowledge, such exploration means you can tackle cases where algorithm doesn’t give you a perfect answer right out of the bat. For example if you have an elongated cluster where k-means split it in half to 20-25 year-old with spending 0-5K and 20-25 year-old with spending 5-10K but you believe it makes sense for it to be single cluster that is just 20-25 year-olds regardless of spending, then you can save the cluster pill and merge the two clusters, in just 5 clicks.
And that’s a wrap. I hope you find this information on the details of the algorithm and the use case examples helpful.