Correspondence Analysis in Tableau with R

Correspondence analysis is an exploratory data analysis method for discovering relationships between two or more categorical variables. It is very often used for visualizing survey data: once the table gets large (whether because there are many variables or because a few variables have high cardinality), visual inspection of the tabulated data or simple statistical analysis can no longer sufficiently explain its structure. Correspondence analysis can remarkably simplify the representation of such data by projecting both row and column variables into a lower-dimensional space that can often be visualized as a scatter plot at a small loss of fidelity.

Let’s take a look at an example. Below is the data from the 2014 Auto Brand Perception survey by Consumer Reports, in which 1578 randomly selected adults were asked which attributes they considered exemplary of different car brands. Respondents picked all that applied from a list consisting of Style, Performance, Quality, Safety, Innovation, Value and Fuel Economy.

We can convert this data into a contingency table in R and run a chi-square test, which tells us that there is a statistically significant association between car brands and their perceived attributes.

chisq.test(table(yourDataFrameGoesHere))
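
For example, if the responses were in a data frame called survey with Brand and Attribute columns (hypothetical names), the test might look like this:

# assuming a data frame called survey with columns Brand and Attribute (hypothetical names)
tab <- table(survey$Brand, survey$Attribute)  # contingency table of brands vs. perceived attributes
chisq.test(tab)                               # a small p-value indicates a significant association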

But often this is not sufficient, since my goal is to understand how different car makers are perceived: how people see my brand, how I compare with the competition, and how to competitively position an existing product or bring a new product to market to fill a gap.

Let’s visualize this as a cross-tab in Tableau.

2014 Auto Brand Perception Survey Results

Even though there are only 7 choices and a single question in the survey, this table is hard to interpret.

Let’s apply correspondence analysis and see what our scatter plot looks like. Here the blue dots are cars; blue points closer to each other are more similar than points farther apart. Red items (e.g. Style, being hovered over in the screenshot) are the attributes. The axes themselves do not correspond to independent, nameable dimensions, so the attributes are what orient you when reading the chart and help assign names to different areas of the scatter plot. If you imagine a line extending from the center of the plot toward each of the red points, the distance of the blue points to that line indicates how strongly they are associated with that particular attribute. For Volvo, for example, safety is the perception that dominates; the same can be said for Kia and Value. Subaru is considered safe and is also associated with good quality and value, while Porsche and Ferrari are mostly associated with Style and Performance, in roughly equal amounts.

Correspondence Analysis of Brand Perception Survey

This scatter plot explains 70% of the variance in the data. While it doesn’t capture everything, it is a lot easier to consume than the cross-tabulation.
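
The coordinates behind such a plot can be computed with the ca package (the same package referenced below for supplementary points). A minimal sketch using the same hypothetical survey data frame, not the workbook’s exact script:

library(ca)
fit <- ca(table(survey$Brand, survey$Attribute))
summary(fit)           # shows how much inertia ("variance") the first two dimensions explain
fit$rowcoord[, 1:2]    # brand coordinates for the scatter plot (blue points)
fit$colcoord[, 1:2]    # attribute coordinates (red points)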

The rows and columns used in computing the principal axes of the low-dimensional representation are called active points. Passive (supplementary) points/variables are projected onto the plot but not taken into account when computing the structure of the plot itself. For example, if there are two new cars in the market and you want to see their relative positioning in an existing plot, you can add them as supplementary points. If there are outliers, you can also choose to make them supplementary points so they don’t skew the results. Supplementary variables, on the other hand, are typically exogenous variables, e.g. the age group or education level of the survey participant. In some cases you may prefer generating multiple plots instead, e.g. one per gender. You can mark a column or row as supplementary using the supcol and suprow arguments in the ca function call, e.g. ca(mydata, supcol=c(1,6)) makes the 1st and 6th columns in the table supplementary.

You can add more to this chart to explore further. For example, you can put the price of the car or its safety rating on color and see whether they align with the perceived value or safety. Tesla, Ford and Fiat are all associated with value, yet Tesla is not a budget car. Similarly, Volvo and Tesla both have a 5-star safety rating, but consumers associate Volvo with safety much more than any other brand. If you have multiple years of data, you can put the years on the Pages Shelf and watch how perception changed over time and whether your marketing campaigns were effective in moving it in the direction you wanted.

Correspondence analysis use cases are not limited to social sciences and consumer research. In genetics, for example, microarray studies use MCA to identify potential relationships between genes. Let’s pick our next example from a different domain.

If there are multiple questions in your survey, you can use Multiple Correspondence Analysis (MCA) instead. Our data for this example contains categorical information about different organisms. Whether they fly, photosynthesize, have a spine….

Categorical attributes of different organisms

For a moment, imagine the first column doesn’t exist so you have no knowledge about what organism each row is. How easy would it be to understand if there are groups in the data based on these attributes?

Let’s apply MCA to this dataset. In this case I put the attributes on the secondary axis, hid their marks and made their labels larger. I also applied some jitter to deal with overlapping marks.

I can clearly see groups like birds, mammals, plants, fungi and shellfish. If the data weren’t labeled, I would still be able to group the points by looking at the chart and, by examining the common attributes of adjacent points, start developing an understanding of what type of organisms they might be.

Multiple correspondence analysis applied to organisms dataset
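
For those curious about the computation itself, the MCA coordinates can be produced with the mjca function from the same ca package. A minimal sketch, assuming the attributes are factor columns in a data frame called organisms with the organism name in the first column (a hypothetical layout):

library(ca)
fit <- mjca(organisms[, -1], nd = 2)   # drop the name column, keep two dimensions
fit$rowcoord[, 1:2]                    # one point per organism
fit$colcoord[, 1:2]                    # one point per attribute level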

You can download the sample workbook from HERE.

Quick Tip: Overlaying curves on Tableau scatter plots with R

Tableau provides a good set of trend line, reference line and band options, but sometimes you want to overlay curves based on a custom equation: logistic regression curves, sine curves, quantile regression curves… and you want these overlay curves to be smooth.

This is very easy to do by taking advantage of the technique I shared when building Coxcomb charts and radial treemaps. If you know the equation (or know how to get to it) and it can be described in Tableau’s calculation language, you can do it with a table calculation. But doing the fit dynamically involves R, and when you’re passing the data to R you need to do some basic NULL handling. Here are two examples showing what the results might look like. You can see that despite very few points in my dataset and large gaps between them, the curves look very smooth.

A sine curve and logistic regression curve overlay in Tableau

The key component is the bin field created on the variable that’s on the X axis. In Tableau, bins can be used to trigger densification by turning on the “Show Missing Values” option on the bin. Doing so adds NULL rows to the data backing the visualization, which you can fill with values from table calculations (including SCRIPT_ functions). In your R script, you need to remove these artificially generated NULLs so they don’t confuse the curve fitting procedure you’re applying.
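
A minimal sketch of such a calculation is shown below. It is not the exact formula in the workbook; the field names are hypothetical, .arg1 is assumed to carry the x position for every bin (e.g. via a calculation on the bin field) and .arg2 carries the measure, which is NULL for the densified rows. A cubic polynomial stands in for whatever curve you actually want to fit:

SCRIPT_REAL("d <- data.frame(x = .arg1, y = .arg2);
observed <- d[!is.na(d$y), ];                # drop the NULL rows created by densification
fit <- lm(y ~ poly(x, 3), data = observed);  # fit only on the real observations
predict(fit, newdata = d)                    # return a smooth value for every bin
", AVG([X Along Bin]), AVG([Y]))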

I tied the bin size to a parameter so you can try different values to make the curves coarser or smoother.

If you want to take this technique a bit further, you could use one bin for each axis, which will allow you to create a grid. Then you can treat each cell like a pixel in a raster and draw shaded areas such as contours.

Below you can see two examples of this. The first one estimates the bivariate kernel density of the data points in the viz on the fly using R and draws the contours using the secondary axes of this dual (technically quadruple) axis chart.

Displaying kernel density as overlay in Tableau

The second chart uses the same data to fit a one-class SVM (support vector machine) with a Gaussian radial basis function, a method often used for outlier/novelty detection, then draws the decision boundary using the secondary axes (blue ellipse). Suspected outliers are shown in red while inliers are shown in white.

Displaying the boundaries of a one-class SVM as overlay in Tableau
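
If you want to experiment with the model itself outside Tableau, one-class SVMs are available in R’s e1071 package. A rough sketch with hypothetical column names and untuned parameters:

library(e1071)
# points is assumed to be a data frame with numeric columns x and y
fit <- svm(points, type = "one-classification", kernel = "radial", nu = 0.05)
inlier <- predict(fit, points)   # TRUE for inliers, FALSE for suspected outliers
grid <- expand.grid(x = seq(min(points$x), max(points$x), length.out = 50),
                    y = seq(min(points$y), max(points$y), length.out = 50))
boundary <- predict(fit, grid, decision.values = TRUE)  # decision values can be contoured to draw the boundary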

You can download the sample Tableau workbook from HERE.

Quick Tip: Using Access crosstab queries in Tableau

Access databases are still commonly used by many businesses. A question I keep hearing is how to use crosstabs created in Access as data sources in Tableau, since such queries are not displayed in the list of views & tables in the connection experience, nor can they be used via the Custom SQL option in Tableau.

Let’s use the old Coffee Chain sample data source in this example with the following crosstab query and save it with the name “CoffeeChain Query”.

TRANSFORM Sum(factTable.Sales) AS SumOfSales
SELECT Product.[Product Type] AS ProductType
FROM (Product INNER JOIN factTable ON Product.ProductID = factTable.ProductID)
INNER JOIN Location ON factTable.[Area Code] = Location.[Area Code]
GROUP BY Product.[Product Type]
PIVOT Location.Market;

This query makes new columns from the unique values of Market, returning a table in the following form in Access:

Access crosstab query results

Copy-pasting this query into Tableau as Custom SQL will result in errors. Even though it is saved in the Access file, “CoffeeChain Query” won’t display in the list of views/tables either.

Making this table accessible from Tableau takes one little step that involves creating another query on top of this crosstab.

Creating a new Access query/view

You can follow Create > Query (wizard or designer, depending on your preference) and do a SELECT * from the “CoffeeChain Query” listed in the Queries tab.

Configuring the new Access query

Let’s save this as “Cross-Tab Query”.

Saving the Access query/view

And now, if you connect to the Access file from Tableau, you will see that Cross-Tab Query is shown in the table list.

Access crosstab is now accessible from Tableau

Using your R models for in-database scoring

Let’s say you trained a predictive model using R integration in Tableau or your favorite R authoring environment.

If your data is very large, you probably used a subset of your data for training and want to see how well it will do over the entire database. Or maybe you’re satisfied with the model but you want it to return only the interesting results from the database, e.g. the customers who are most likely to churn.

In both scenarios, being able to push the model to the database means being able to rely on the database’s scalability and avoiding delays caused by data movement.

You may be asking, “But what if our database systems can’t run predictive models?” Luckily, once trained, some of the very commonly used predictive methods can be expressed in SQL, and hence also in Tableau’s calculation language. All you need is a way to convert the model into a Tableau calculated field.

If you created a decision tree in R, you can use the following function to convert it to a Tableau formula:

# Converts an rpart decision tree into a Tableau IF/ELSEIF formula.
# Requires the rpart package (for path.rpart).
decisionTreeToFormula <- function (tree) {

  if (class(tree) != "rpart") {
    warning(class(tree), " is not supported. Please use an rpart model.")
    stop()
  }

  # Extract the rules that lead to each leaf node
  capture.output({rpartRules <- path.rpart(tree, rownames(tree$frame)[tree$frame$var == "<leaf>"])})
  ruleTexts <- "IF "
  operators <- c("<=", ">=", "<", ">", "=")
  i <- 1

  # Map numeric class codes back to their category labels
  levelToCategory <- data.frame(labels = attr(tree, "ylevels"),
                                catkey = rep(1:length(attr(tree, "ylevels"))))
  nodeToLevel <- data.frame(catkey = tree$frame$yval, node = rownames(tree$frame))
  mapping_table <- merge(levelToCategory, nodeToLevel, by = 'catkey')[, 2:3]

  for (rule in rpartRules) {
    ruleText <- character(0)
    for (component in rule) {
      # Find which comparison operator this rule component uses
      whichSeparator <- lapply(operators, function(x) length(unlist(strsplit(component, x)))) > 1
      parts <- unlist(strsplit(component, (operators[whichSeparator])[1]))
      if (!(parts[1] == "root")) {
        if (is.finite(suppressWarnings(as.numeric(parts[2])))) {
          # Numeric split, e.g. [modelyear]<79
          ruleText <- c(ruleText, paste("[", parts[1], "]", (operators[whichSeparator])[1], parts[2], sep = ""))
        } else {
          # Categorical split, e.g. ([maker]='Asia' OR [maker]='Europe')
          ruleText <- c(ruleText, paste0(" (", paste0("[", parts[1], "]", "='", unlist(strsplit(parts[2], ",")), "'", collapse = " OR "), ")"))
        }
      }
    }

    # Join the components into an IF/ELSEIF branch ending with the predicted label
    ruleTexts <- c(ruleTexts, paste0(if (length(ruleTexts) > 1) {"\nELSEIF "},
                                     paste(ruleText, collapse = " AND "), " THEN '",
                                     mapping_table[mapping_table$node == as.numeric(names(rpartRules)[i]), ]$labels, "'"))
    if (i == length(rpartRules)) ruleTexts <- c(ruleTexts, "\nEND")
    i <- i + 1
  }
  tableauFormula <- paste(ruleTexts, collapse = " ")
  return(tableauFormula)
}

You can add this to your Rserve.cfg so that you can call the decisionTreeToFormula function in your SCRIPT_ calls from Tableau, or you can use it in your R environment. Either way it will give you some text which you can copy-paste into Tableau and use as a formula that can run in the database.

In the example below, I fit a decision tree on a database of cars, trying to predict the gas mileage (Bad|OK|Good). I used only a small portion of the variables (just the maker and model year) to get a compact result back. Applying the function gives me a nice IF..THEN..ELSE formula that handles both numeric and categorical items. Now I can use this classifier on any database, as a dimension, as a filter… just by copy-pasting the formula into a calculated field in Tableau.

> myTree<-rpart(mpg~modelyear+maker,data=cars)
> decisionTreeToFormula(myTree)
IF ([maker]='America') AND [modelyear] < 79 THEN 'Bad'
ELSEIF ([maker]='America') AND [modelyear] >= 79 THEN 'Good'
ELSEIF ([maker]='Asia' OR [maker]='Europe') THEN 'OK'
END

What about a regression model? This post has that covered, too. You can achieve similar results with the following function:

# Converts a linear model (lm) or logistic regression (binomial glm) into a Tableau formula.
regressionModelToFormula <- function(model, decimalPlaces) {
  if ((class(model)[1] != "glm" && model$family$family != "binomial") & class(model)[1] != "lm") {
    warning("Function is not applicable to this model type.")
    stop()
  }

  # Build a lookup of categorical variables, their levels and the matching coefficients
  factors <- c("", gsub("[-^0-9]", "", names(unlist(model$xlevels))))[-1]
  levels <- c("", as.character(unlist(model$xlevels)))[-1]
  coeff <- coefficients(model)
  names <- data.frame(factors, levels, varNames = paste(factors, levels, sep = ""))
  varCoeffMap <- data.frame(varNames = names(coeff)[-1], coefficients = as.numeric(coeff[-1]))
  catVars <- merge(names, varCoeffMap, all.x = TRUE)

  # Remaining coefficients belong to continuous variables (plus the intercept)
  if (nrow(catVars) != 0) {
    contVars <- round(coeff[-which(names(coeff) %in% catVars$varNames)], decimalPlaces)
  } else {
    contVars <- round(coeff, decimalPlaces)
  }
  catFormula <- " "

  # Each categorical variable becomes a CASE ... WHEN ... THEN ... END block
  for (catVar in unique(catVars$factors)) {
    curVar <- catVars[catVars$factors == catVar & !is.na(catVars$coefficients), ]
    catFormula <- paste("WHEN", curVar$levels, "THEN", round(curVar$coefficients, decimalPlaces), sep = " ", collapse = "\n")
    catFormula <- paste("+ CASE [", catVar, "] \n", catFormula, " ELSE 0 \n END \n", sep = "")
  }

  # Continuous variables become coefficient * [field] terms
  numericFormula <- paste(contVars[-1], "*", paste("[", names(contVars[-1]), "]", sep = ""), "+", collapse = " ")
  if (class(model)[1] == "lm") {
    combinedFormula <- paste(numericFormula, contVars[1], catFormula)
  } else {
    # Logistic regression: wrap the linear predictor in the logistic function
    combinedFormula <- paste("1 / (1 + exp(", numericFormula, contVars[1], catFormula, "))")
  }
  # Interaction terms use ':' in R but '*' in Tableau
  tableauFormula <- gsub(pattern = ":", replacement = "*", x = combinedFormula)
  return(tableauFormula)
}

If you call this function from R on the logistic regression model I used in one of my earlier blog posts, below is what you will get. Copy-paste this into a Tableau calculated field and you’re ready to make predictions with your logistic regression model in your database.

> regressionModelToFormula(lrmodel,5)
1 / (1 + exp( 0.00628 * [gre] + 1.60393 * [gpa] -8.82164 +
CASE [rank]
WHEN 2 THEN -1.10904
WHEN 3 THEN -1.31259
WHEN 4 THEN -2.01912
ELSE 0
END))

And if you apply this function on the multiple linear regression model in the last sheet of the example workbook from my last blog post, what you get will resemble the following:

> regressionModelToFormula(mrmodel, 4)
1.6224 * [Economic Indicator X] + 0.597 * [Economic Indicator Y] -37.1103

I hope you find this useful.

Multivariate Forecasting in Tableau with R

Since version 8.0 it is very easy to generate forecasts in Tableau using exponential smoothing. But in some cases you may want to enrich your forecasts with external variables. For example you may have the government’s forecast for population growth, your own hiring plans, upcoming holidays*, planned marketing activities… which could all have varying levels of impact on your forecasts. Or sometimes you may just want to simulate different scenarios and ask “what if” questions e.g. what would my sales look like if I hired 10 more sales representatives? If interest rates went up by X percent, how would that impact my profits?

Let’s see how we can tackle both use cases with the help of AutoRegressive Integrated Moving Average with eXogenous variables (ARIMAX) models from R’s forecast package.

*In some cases seasonality may be sufficient to capture weekly cycles but not for moving events like Easter, Chinese New Year, Ramadan, Thanksgiving, Labor day etc.

Handling special events

In the image below the observed/historical demand is shown in blue. At the bottom you can see the holidays as green bars with a height of 1 and non-holidays as bars of height 0. This representation is often referred to as dummy encoding.

Our data contains 2 holidays that happened in the past and 1 upcoming holiday. One can clearly see that the holidays cause noticeable spikes in demand. For a store owner who doesn’t want to miss the next holiday opportunity by running out of stock early, it is very valuable to incorporate this important piece of information into the demand forecast.

Accounting for holidays and special events in forecasts

The formula for the forecast shown with the red line (which doesn’t take holidays into account) looks like the following:

SCRIPT_REAL("library(forecast);
append(.arg1[1:99],forecast(ts(.arg1[1:99], frequency=12), h=21)$mean)",AVG([Demand]))

The first 99 points cover the historical data while the last 21 are what’s being predicted. The data follows a 12-period cycle. The second line of R code appends the predicted values to the reported values to generate the full series.

The formula for the forecast shown with the green line (which incorporates the holidays) looks like the following:

SCRIPT_REAL("library(forecast);
data <- data.frame(demand=.arg1,holiday=.arg2);
training <-data[1:99,];
model<-auto.arima(training[,1],xreg=training[,2]);
append(training[,1],forecast(model,xreg=.arg2[100:120])$mean)", AVG([Demand]), ATTR([Holiday]))

In this script, the exogenous regressors are passed to the function using the xreg argument. It can be clearly seen that both the spike due to the holiday and the regular seasonality in demand are captured by this model. If there are multiple events with potentially different effects (positive vs. negative, different magnitudes etc.), they can be handled using the same method by adding them as separate variables.

What-If Analysis

Now that we have covered handling events as additional regressors, let’s talk about how we can apply the same methodology to do what-if analysis.

Below you can see three time series: Sales and 2 economic indicators. Assume that the two indicators contain the current best estimates of behavior for the next few days, but that Sales for the same period is unknown and has to be estimated using the two economic indicators.

Sales forecast based on the two economic indicators

What makes this visualization more interesting is that you can also adjust the value of the economic indicators and the time frame these overrides apply to. For example, in the image below, indicator X has been increased by 15 units (dark blue peak) for the period between April 13th and April 25th, while indicator Y has been reduced by 20 units (dark orange dip) for the period between April 20th and May 1st. The impact can be clearly seen in the dark green portion of the line in the first chart. By adjusting the parameters in the dashboard one can perform what-if analysis and understand the impact of likely future events, best/worst case scenarios etc.

What if analysis with time series forecasting

Let’s take a look at the R script that is used to create this view.

SCRIPT_REAL("library(forecast);
training <- ts(data.frame(sales=.arg1,indX=.arg2,indY=.arg3)[1:100,],frequency=7);
model<-auto.arima(training[,1],xreg=training[,-1]);
whatifdata <- data.frame(indX=.arg2,indY=.arg3)[101:120,];
append(training[,1],forecast(model,xreg=whatifdata)$mean)",
AVG([Sales]),AVG([Economic Indicator X]),AVG([Economic Indicator Y]))

As you can see, the formula is very similar to the earlier examples. In this demo dataset, the first 100 rows are used for model fitting while the last 20 contain the sales forecast as well as its inputs, namely the what-if values defined in the Economic Indicator X and Y fields as a function of the parameter entries.

You can find all of these examples in the Tableau workbook published HERE.

In the sample workbook, I also provided a sheet that compares the ARIMAX result to multiple linear regression to give a better sense of what you’re getting out of applying this particular method.

Understanding Clustering in Tableau 10

Tableau 10.0 comes with k-means clustering as a built-in feature, so it is worthwhile talking about the use cases for clustering, how the algorithm works and why we chose to make it work the way it does.

The k-means procedure splits the data into K segments. Each segment has a centroid that corresponds to the mean value for the members of that segment. The objective of the algorithm is to place the centroids such that the total sum of distances between centroids and the members of their respective segments is as small as possible.

A simple example

To demonstrate, in the toy example below the algorithm finds two groups with means 49 and 130 and assigns each point to the nearest mean. For example, a data point valued 85 would belong to the cluster on the left since it is closer to 49 than to 130.

A bimodal distribution

We can clearly see that there are two groups by looking at the two humps in the histogram. However, there is no clear gap that could be used to visually separate the data points. K-means helps with the decision on where to split. Below is an example of what this might look like in 2 dimensions.

Two clusters

As simple as it may seem, even clustering on a single variable has many applications. For example, it is commonly advised not to use more than 7 colors on a choropleth map. If you have a continuous measure and want to convert it to a discrete color palette with a particular number of colors, you can achieve this via clustering. In fact, the Jenks Natural Breaks method, which is the common default in most mapping software, is an application of k-means to a single field.
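
As an illustration, here is roughly what that single-variable case looks like in R (the data frame and column names are made up):

# split a continuous measure into 5 color bins with 1-D k-means
fit <- kmeans(counties$medianIncome, centers = 5, nstart = 25)
counties$colorBin <- fit$cluster   # a discrete field you can put on color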

But in many real life scenarios, data is much more complex than a single variable. Age, income, number of kids, average transaction amount, weight, blood pressure… can all tell you something about different ways to group your customers, patients, students…

Making sense of the results

Sometimes groupings in data make immediate sense. When clustering by income and age, one could come across a group that can be labeled as “young professionals”. The clustering feature comes with a Describe dialog that gives you summary statistics for each cluster to help with this process.

In the UN’s development indicators dataset, using the Describe dialog, one can clearly see that Cluster 1, Cluster 2 and Cluster 3 correspond to Underdeveloped, Developing and Highly Developed countries respectively. By doing so, we are using k-means to compress the information contained in 3 columns and 180+ rows into just three labels.

Centroids for clusters in the UN development indicators dataset

Clustering can sometimes also find patterns your dataset may not be able to sufficiently explain by itself. 

For example, as you’re clustering health records, you may find two distinct groups whose “why?” is not immediately clear and describable with the available data. This may lead you to ask more questions and perhaps later realize that the difference was because one group exercised regularly while the other didn’t, or one had immunity to a certain disease; it may even point to things like fraudulent activity or drug abuse which you otherwise may not have noticed. Given how hard it is to anticipate and collect all relevant data, such hidden patterns are not uncommon in real life.

A great real life example of this comes from Bank of America. In the 90s, Bank of America’s National Consumer Assets Group started exploring clustering to segment their customers. One cluster they found had a high portion of customers who either had a home equity line of credit or had a high propensity score for that product. The cluster included only 7 percent of the bank’s customers, but it contained more than a quarter of the customers with the highest equity propensity scores. The customers tended to be middle-aged, married home owners with teenage children. 30% of them had both personal and business-oriented banking products, meaning many were small business owners. They couldn’t figure out what this meant just by looking at the data. A survey of branch managers confirmed that these customers were borrowing against their home equity to fund the startup of a small business! None of the bank’s marketing material had been oriented that way. A new campaign around this idea got a much better response from customers.

You will see many cases where clustering is used as a coarse summary of the structure in the data that moves the exploration in a direction worth pursuing, and as a tool for slicing the data (forcing a structure onto it), as opposed to serving an exact answer on a silver platter. The Save and Reuse section below provides yet another example that underlines this point.

Back to the algorithm…

Like any procedure, there are many ways to implement k-means. Typical k-means implementations use random initialization (either a random seed or random partitioning), which means two things: 1) subsequent runs over the same data will return different results; 2) with a single random initialization, the results will most likely be suboptimal (stuck at a local minimum) because it is random guessing. The common solution for (2) is to start over thousands of times and pick the best result, which increases the chances of finding a good result at the expense of long wait times.

Our goal was to deliver great results that are repeatable with good performance. This is a tough balancing act. Quality and repeatability are must haves for trusting the results while good performance encourages experimentation with different filters, levels of detail, input variables etc. which is the path to more insights.

After testing a number of initialization methods, we found the sweet spot with the Howard-Harris method. Howard-Harris is a hierarchical divisive method that finds the variable of highest variance, splits the data in half (at the mean of that variable) and uses the means of each half as initial centroids to find 2 clusters, in our case using Lloyd’s algorithm. It then continues by splitting the cluster with the highest variance (out of the first 2 clusters) using the same method to get to a 3-cluster solution, and repeats this process until the desired number of clusters is reached.

Lloyd’s algorithm takes the initial centers from Howard-Harris and clusters all points around the nearest center, then computes a new center for each cluster by averaging the points in the cluster. If the new center matches the center from the previous step, it returns the result; if not, it assigns points to the new centers (the means) and repeats the process.
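
To make that concrete, here is a rough sketch of the first Howard-Harris split feeding Lloyd’s algorithm in R. This is an illustration of the idea, not Tableau’s implementation; x is assumed to be a numeric matrix of already-scaled inputs:

# x: numeric matrix of scaled inputs (assumed)
v <- which.max(apply(x, 2, var))                    # variable with the highest variance
half <- x[, v] > mean(x[, v])                       # split the data at that variable's mean
init <- rbind(colMeans(x[!half, , drop = FALSE]),   # the means of the two halves
              colMeans(x[half, , drop = FALSE]))    # become the initial centroids
fit <- kmeans(x, centers = init, algorithm = "Lloyd")
# the higher-variance cluster of the two would then be split the same way,
# and so on, until the desired number of clusters is reached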

Distance between data points

In the previous sections, I used the phrases “sum of distances” and “to the nearest centroid” several times.  But how is this distance measured?

There are many ways to measure distance: Manhattan, Mahalanobis, Chebyshev, Minkowski, Euclidean…

Tableau uses squared Euclidean distance for clustering. A good way to understand Euclidean distance is to start with the simple, 2-dimensional case: the hypotenuse of a triangle. For a right triangle whose legs (catheti) are 3 and 4, the Euclidean distance between the two ends of the hypotenuse would be SQRT(3^2 + 4^2) = 5. This extends to more than 2 variables, e.g. SQRT(x^2 + y^2 + z^2…).

Tableau doesn’t take the square root, which makes it squared Euclidean distance. It is common to use squared Euclidean distance for k-means clustering.

But how could one compute distances with categorical fields?

Tableau uses Multiple Correspondence Analysis (MCA) to convert categories into numeric distances before clustering so you can use both numeric and categorical fields as inputs to clustering.

Assume you have a single column with 3 categories: Shoes, Dresses and Hats. The three categories don’t contain any truly measurable distance information; they are just 3 different strings. Given this is the only available information, the assumption would be that they are at equal distance from each other. If you like thinking in pictures, you can imagine them as the 3 corners of an equilateral triangle.

Occurrence counts and, if there is more than one categorical column, co-occurrences also impact the distances. For example, if you have hospital admission form data, you would likely have some people who checked both the female and pregnant boxes, some female and not pregnant, and potentially a few who, by mistake, marked themselves as male and pregnant. Male and pregnant would be a very rare occurrence in the database, so it would be farther away from other points.

Tableau currently imposes a limit of max 25 unique categories per field and a total of 1000 unique categories. So you can use a maximum of 1000/25=40 categorical columns with 25 unique categories in each in a given clustering procedure.

Note that clustering accepts categorical fields as measures, since they are meant to be used as attributes of the entities being clustered, e.g. the maximum education level attained by a person. When you use categorical fields that are finer than what’s being clustered, e.g. the list of products sold at a store when the store is what’s being clustered, ATTR([Field]) may return * for some rows. * is treated like a NULL value, which means those rows will end up in the “Not Clustered” group.

Wouldn’t some measures dominate the results?

If you have a table of country statistics, you might have a column for GDP and one for birth rate. Using the distance formula above, GDP would dominate the clustering results as it is orders of magnitude greater than birth rate. To avoid this problem, Tableau automatically scales all inputs. Of course, there are many ways to do scaling. Tableau uses min-max scaling (also known as normalization), which subtracts the minimum of the column from each value and then divides the result by the difference between the max and min of the column. As a result of this transformation, each column gets mapped to a range between 0 and 1; e.g. the country with the highest GDP ends up with a GDP of 1 and the lowest with 0. The same is done for the highest/lowest birth rates, hence all variables are equally weighted. Standardization (z-score) is another common method used by many software packages that do k-means; however, min-max scaling is generally considered better for clustering per research comparing different scaling methods by Stoddard (1979), Milligan & Cooper (1988), Dillon, Mulani & Frederick (1989) and Steinley (2004).
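
For illustration, the normalization step amounts to something like this (a sketch, not Tableau’s internal code):

minMaxScale <- function(df) {
  as.data.frame(lapply(df, function(col) (col - min(col)) / (max(col) - min(col))))
}
# both columns end up in the 0-1 range despite very different magnitudes
minMaxScale(data.frame(gdp = c(500, 18000, 75000), birthRate = c(8, 14, 45)))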

How is optimal number of clusters determined?

There are many ways to find the optimal number of clusters. Some, such as the elbow method, involve eye-balling, while others rely on picking the maximum or minimum value of a metric. You can find R packages that offer 50 different metrics, and when you use them you realize that they all disagree with each other. This is because each method makes assumptions about the shape of the data and what makes a good cluster. There is abundant literature comparing these metrics in search of the methods that are most widely applicable and give good results and performance. We started with these studies and expanded upon them by running our own tests, and decided to proceed with the Calinski-Harabasz index based on the combination of its performance and how well it works with k-means.

Calinski-Harabasz criterion
Between group sum of squares (SSB) is the sum of squared distances between the centroids of the clusters and the centroid of the entire dataset. Larger values can be used as a sign of good separation between clusters.

Within group sum of squares (SSW) is the sum of squared distances between the centroid of each cluster and the points in that cluster. Smaller values can be used as a sign of compactness and uniformity of individual clusters.

N is the number of rows in the table and k is the number of clusters.
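
In its standard form, the Calinski-Harabasz index combines these as:

CH = (SSB / SSW) * ((N - k) / (k - 1))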
 
Tableau looks for the maximum value of the Calinski-Harabasz index, which happens when the between group sum of squares is high and the within group sum of squares is low. If the equation consisted only of this ratio, of course, the way to maximize it would be to create as many clusters as possible. This is where the second part (right side) of the equation comes in, which one can think of as a sort of penalty for splitting into more clusters. This prevents further splits from happening unless a split is justified by enough reduction in error.
 
Tableau runs k-means with different values of k ranging from 2 to 25 and compares each result with the previous one. If the current result scores lower than the previous one, it returns the previous result. Since it is looking for a local maximum, it can terminate early, which means better performance and conservative estimates of the cluster count.

When/why to manually set k?

It is very common to use clustering to segment data that has no visible “clumps”. Clothing manufacturers use it to slice data into Small, Medium, Large etc. clusters. Similarly it can help with product bundling decisions e.g. how many minutes, # SMS and GBs of data should be offered in a deal for “moderate users” segment?

Save and Reuse

Once you’re satisfied with the results, you can persist them by dragging the “Clusters” pill into the data pane. The resulting field can be used just like a group in the rest of Tableau. You can rename/alias the cluster groups and even manually move items around.

Overriding results from an algorithm manually might sound weird at first but clustering is an exploratory process and it is perfectly OK to enhance results using external knowledge.
 
A good example of such use is the way the Boston Globe divided 200 towns in eastern Massachusetts and southern New Hampshire into editorial zones. In 2003, The Boston Globe introduced geographically distinct versions of the paper with specialized content. Two days a week, readers would get local coverage for their area. The newspaper clustered the towns into 4 groups based on demographic information but then applied external rules before coming up with the final result. For example, the zones needed to be geographically contiguous to optimize delivery routes, contain sufficient population to justify specialized content, and align with how the advertising deals were made. This meant manually moving towns between groups so that the zones satisfied these constraints while still having the similar demographics identified through clustering.

When does k-means not work well?

Like any statistical method, k-means works well in some cases but not all. A quick Google search will turn up many examples of patterns that are very easy to detect visually but that k-means doesn’t “find”. Here are two of the typical examples used for these demonstrations.

Donut and 3 spirals examples

Why is this happening?

K-means seeks to divide the data into areas where each area contains similar items that can be summarized/represented by their mean, such that you can look at a cluster’s centroid and give it a meaningful name, e.g. enterprise customers.

If you look at the example with the two circles and assume for a moment that one axis is weight and the other is height, what do the points in the outer circle have in common with each other? The outer circle contains both the tallest and the shortest person as well as the heaviest and the lightest. The members of the outer circle don’t share any common attributes. One can say the same about the 3 spirals. In neither of these cases do the visually detected groupings consist of similar members.

If one is looking to discover such patterns, k-means is not the right method. Patterns like this could be seen on maps (events following a river or a road) or in image processing etc., and would require algorithms that are significantly slower than k-means, which typically look at pairwise distances between points searching for continuity, dense areas vs. sparse areas etc. It is highly unlikely that your business data will look anything like these. If it does, it has potential for an M. Night Shyamalan movie. Spoiler: aliens are not trying to communicate with the analysts; it is the database admin trying to be funny.

When the goal is finding groups that share similar properties, k-means often does a good job of detecting them, even if the areas have very different densities as shown in the example below, as long as there is sufficient data.

3 clusters with varying densities

K-means will give you convex/globular clusters, since each point is assigned to the nearest centroid. If this works for your data and type of analysis (and in a lot of cases where data is being sliced/segmented into a desired number of buckets, it will), it can be safely applied to enhance your data exploration experience.

I can’t emphasize the importance of the word exploration enough. First of all, when looking for clusters, you will most likely want to write calculations, apply filters and aggregate your data to convert, say, transaction-level data into metrics that might uncover clusters, such as a customer’s average spending or days since last visit. This is often referred to as feature building and is the step where many analyses prematurely fail. We built our clustering user experience to be very dynamic and interactive to encourage more experimentation and increase the likelihood of relevant metrics being distilled. Secondly, examining the results of clustering (e.g. using the Describe dialog) is an integral part of the analysis. Of course, this can be said of any statistics or data mining method, but with clustering you have the ability to augment the results with your domain knowledge, so such exploration means you can tackle cases where the algorithm doesn’t give you a perfect answer right off the bat. For example, if k-means splits an elongated cluster in half into 20-25 year-olds spending 0-5K and 20-25 year-olds spending 5-10K, but you believe it makes sense for it to be a single cluster of 20-25 year-olds regardless of spending, you can save the cluster pill and merge the two clusters in just 5 clicks.

And that’s a wrap. I hope you find this information on the details of the algorithm and the use case examples helpful.

Happy clustering!

Cause & Effect: A different way to explore temporal data in Tableau with R

Whether it is forecasting your quarterly sales or comparing historical data, working with time series data is a big part of business analytics. Sometimes patterns are easy to see, while in other cases they might be elusive, especially when comparing different time series with short/frequent cyclical patterns.

A common question could be understanding the impact of a marketing campaign. Did it make a difference at all? What’s the return?

A true randomized treatment-control experiment would be the ideal way to run such a test, but it is not always practical, if possible at all. So let’s see how we can try to answer this question using Bayesian structural time series models. For this post, I generated a dataset with 3 columns. In the image below, Sales is at the top, with the 2 economic indicators below it. Let’s assume these two economic indicators are “Consumer Sentiment” and “Spending on Durable Goods”, that they have been good predictors of my sales, and that there are no other unaccounted-for variables that impact my sales.

Time series data

I know the date I started my marketing campaign. Since time flows in one direction, if it has any impact, it would be after this date. If I can use the data from days prior to the campaign to train my model, I can forecast what would have happened if I had not run the campaign. Then by comparing it with the actual sales data, I can understand what difference it made.

Predicted vs Actual sales

While it was initially hard to visually detect this change, now that I have added the “expected” forecast (red line), I can clearly see the bump the marketing campaign gave me by looking at the gap between the green and red lines. In the workbook, I also added to the tooltip a textual summary describing the statistical significance of the results, which the CausalImpact package conveniently generates.

Statistical test results
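
For reference, outside Tableau the same analysis comes down to a few lines of R with the CausalImpact package; the column names and dates below are hypothetical, not the workbook’s exact script:

library(CausalImpact)
library(zoo)
# the response (sales) must be the first column; the indicators act as controls
series <- zoo(cbind(sales, consumerSentiment, durableSpending), order.by = dates)
pre.period  <- as.Date(c("2015-01-01", "2015-04-12"))   # before the campaign started
post.period <- as.Date(c("2015-04-13", "2015-06-30"))   # the campaign period
impact <- CausalImpact(series, pre.period, post.period)
plot(impact)                # pointwise and cumulative effect charts
summary(impact, "report")   # the textual summary used in the tooltip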

Let’s also look at the cumulative effect of the marketing campaign.

Cumulative value of the marketing campaign over time

It looks like we made an extra 748 thousand dollars. By comparing this with the amount we spent on the marketing campaign, we can decide whether this was a good investment.

You can download the sample workbook from HERE. The workbook uses the CausalImpact package, which can be installed from GitHub. You can find more info on the installation steps in the notes section at the end of the blog post. One thing to keep in mind: the CausalImpact package doesn’t work well with lags, e.g. if Consumer Sentiment in January is linked to Sales in July, you’ll have to shift your dates first.

Another common type of time series analysis involves understanding dependency relationships between time series with different lags e.g. could Consumer Sentiment from 3 months ago help me make a better prediction of my Sales next month? A frequently used method for this type of analysis is Granger causality.

Granger causality is not necessarily true causality hence the term is often used as “X Granger-causes Y”. It means the patterns in X are approximately repeated in Y after some lag e.g. 4 months in the example I gave above. As a result of this, past values of X can be used to predict future values of Y.

You might be thinking: if X happened 4 months before Y and there is a correlation, how is that not a causal relationship? If both X and Y are driven by a third, unknown process (Z) with different lags, it might look like Granger causality exists between X and Y while in reality manipulating X would not change Y; only changes to Z would. Roosters crow before dawn, but they don’t cause the sun to rise.

That being said, lack of true causality does not mean one thing can’t be useful in predicting another. Humans relied on roosters’ built in timers for hundreds of years.

Let’s explore Granger causality with an example. Below you can see two time series: the University of Michigan “Consumer Sentiment Index” and “New One Family Houses Sold” from the Federal Reserve Economic Data (FRED) website.

Can consumer sentiment in previous months help better predict future home sales?

Input data for Granger-causality test

There are several packages in R to do this test, of course. For this post, I used the lmtest package and a few methods from the forecast package to estimate the number of differences required to make the sample time series data stationary. You can download the sample workbook from HERE. Based on this analysis, with 4 lags, it appears that Consumer Sentiment unidirectionally Granger-causes Home Sales.
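
A sketch of the underlying R steps, with hypothetical variable names (the sample workbook’s exact calculation may differ):

library(lmtest)
library(forecast)
# sentiment and homeSales are assumed to be numeric vectors ordered by month
d <- max(ndiffs(sentiment), ndiffs(homeSales))   # differences needed for stationarity
if (d > 0) {
  sentiment <- diff(sentiment, differences = d)
  homeSales <- diff(homeSales, differences = d)
}
grangertest(homeSales ~ sentiment, order = 4)    # does sentiment Granger-cause home sales?
grangertest(sentiment ~ homeSales, order = 4)    # check the reverse direction too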

Thurman and Fisher’s 1988 paper “Chickens, Eggs and Causality, or Which Came First?” applies Granger causality to this age-old question concluding that the egg came first. It is a fun, short read with tongue-in-cheek future research directions like potential applications of Granger causality to “He who laughs last, laughs best” and “Pride goeth before destruction and an haughty spirit before a fall.” and proof that journal papers and statistics can be fun. As for more serious reading, Judea Pearl’s “Causality: Models, Reasoning and Inference” is a great resource on the topic of causality.

NOTES

Installing CausalImpact (bsts is a dependency but at the time of this post the default version you’d get from CRAN still had problems):

install.packages("https://cran.r-project.org/bin/windows/contrib/3.2/bsts_0.6.2.zip",repos = NULL, type = "local")
library(devtools)
devtools::install_github("google/CausalImpact")