Creating Coxcomb Charts in Tableau

Nightingale’s rose is probably one of the most influential visualizations of all time. It is the chart Florence Nightingale used to convince Queen Victoria to improve hygiene at military hospitals, saving the lives of many thousands of soldiers.

Commonly referred to as coxcomb or polar area charts, roses resemble both pie and bar charts in some respects. The layout resembles a pie chart, but unlike a pie, every slice in a coxcomb has the same angle. The radius of each slice is driven by the quantity being displayed (much like the height of a bar in a bar chart), but it is the area of the slice, not the radius, that encodes the value.

Coxcombs make seasonal patterns visible and de-emphasize small differences while providing a more striking image, all of which worked well for Nightingale’s use case given her data and her audience. This may not apply to every situation, so a bar chart is often the safer route.

I was curious whether anyone had built a coxcomb chart in Tableau before, but I didn’t come across any, so I decided to make one myself. The following chart shows monthly sales for different departments. You can click on the image to bring up the interactive Tableau Public viz and download it to take a closer look at how it is set up.

Nightingale's rose in Tableau

Since each slice represents sales for a particular month and department, the entire slice gets highlighted when you make a selection (rather than just the band visible in the chart), as you’d expect.

Nightingale's rose in Tableau

Each slice is a polygon made up of 102 points (which seemed to give a good, smooth curve). These points are not present in the data but are generated by taking advantage of densification in Tableau.
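The workbook builds these points with Tableau calculated fields over a densified bin field, but as a rough illustration of the underlying geometry, here is what generating the points for a single slice might look like in R (the function and column names are mine, not the workbook’s):

# Polygon for one slice: the center, 100 points along the arc, and the center
# again -- 102 points in total. The radius is the square root of the value so
# that the slice AREA, not the radius, is proportional to the value.
slice_points <- function(month, value, points_per_arc = 100) {
  theta <- seq((month - 1) * 2 * pi / 12, month * 2 * pi / 12,
               length.out = points_per_arc)
  r <- sqrt(value)
  data.frame(x = c(0, r * cos(theta), 0),
             y = c(0, r * sin(theta), 0))
}

Enjoy!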

Recency, Frequency, Monetary analysis in Tableau

RFM is commonly used in marketing, retail and professional services industries to assess customer value. The general idea behind the analysis can be summarized as

Recency: People who have purchased from you recently are much more likely to respond to a new offer than someone you haven’t sold to in a long time.

Frequency: People who shop frequently at your store are more likely to respond to new offers than less frequent buyers.

Monetary: People who spend more money at your store are more likely to show interest in new offers.

in descending order of importance. There are a few different ways to calculate this metric but I will use the method outlined HERE with the sample Superstore database that comes with Tableau.

Quintiles needed for the analysis can be calculated using Tableau’s percentile rank function. For recency the formula would look like the following:

Percentile rank for recency
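The screenshot isn’t reproduced here, but a plausible version of that calculation, using the Superstore field names (my guess at the exact formula), is:

// Percentile rank of each customer's most recent order date
RANK_PERCENTILE(MAX([Order Date]))

computed per customer (e.g. with the table calculation running along Customer Name).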

For frequency, we can either use Number of Records (since each purchase is a record in the Superstore dataset) or count Order IDs.

Percentile rank for frequency
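A sketch of the frequency version, again assuming Superstore field names, could use either of the options mentioned above:

// Percentile rank of how many purchases each customer has made
RANK_PERCENTILE(SUM([Number of Records]))
// or, counting distinct orders instead
RANK_PERCENTILE(COUNTD([Order ID]))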

And monetary is the simplest of all

Percentile rank for monetary
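A sketch of the monetary version might simply be:

// Percentile rank of each customer's total spend
RANK_PERCENTILE(SUM([Sales]))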

Now let’s convert them to quintiles. I will just show the calculation for recency here, but it is the same for all of the above.

Converting to quintiles
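A guess at what the quintile conversion could look like for recency, splitting the percentile rank into five equal bands (field names are illustrative; the same pattern applies to frequency and monetary):

IF [Recency Percentile] <= 0.2 THEN 1
ELSEIF [Recency Percentile] <= 0.4 THEN 2
ELSEIF [Recency Percentile] <= 0.6 THEN 3
ELSEIF [Recency Percentile] <= 0.8 THEN 4
ELSE 5
END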

And finally, combine the results into a single score to get the RFM metric:

Adding up to RFM
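One common way to combine the three quintiles into a single score (the exact weighting in the linked method may differ) is:

[Recency Quintile] * 100 + [Frequency Quintile] * 10 + [Monetary Quintile]

which yields scores from 111 to 555, with the recency digit carrying the most weight.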

Attack of the mesh plots

Although it was the first thing that came to mind, this is not MathWorks’ new marketing campaign. Nor is our planet under attack by giant jellyfish from outer space.

The “3D mesh plots” hovering in the sky over Sydney, Amsterdam, Tampa, Denver, Vancouver… are the artwork of Janet Echelman, made from fishing nets.

3D mesh plots in the night sky

Prescriptive analytics in Tableau with R

In 2010, Lustig et al. from IBM classified business analytics workloads into three major categories in their paper “The Analytics Journey”:

Descriptive Analytics: A set of technologies and processes that use data to understand and analyze business performance.
Predictive Analytics: The extensive use of data and mathematical techniques to uncover explanatory and predictive models of business performance representing the inherent relationship between data inputs and outputs/outcomes.
Prescriptive Analytics: A set of mathematical techniques that computationally determine a set of high-value alternative actions or decisions given a complex set of objectives, requirements, and constraints, with the goal of improving business performance.

You can find the full paper here. For those who don’t have time to read the whole paper: in short, the journey starts with the question “what is going on?”, moves on to “what is likely to happen in the future?”, and finishes with “what can we do to make a certain outcome more likely?”.

When you start talking about constraints and improving performance, it starts sounding awfully similar to mathematical/computational optimization, don’t you think? It should, because optimization is a major part of it.

Imagine that, in your business, you discovered a relationship between profit, frequency of promotions and promotion amount. Given your customer base growth, you can forecast your profit for the next year, and even run what-if scenarios by changing these parameters manually to see how they impact the forecast. The more promotions, the more people buy, so volume increases your profit, but after a certain point you probably get diminishing returns. What is the sweet spot? Prescriptive analytics tools can tell you what combination of these parameters maximizes your profit.

Let’s look at an example of how to do this in Tableau. Here, I have annual stock returns for a number of companies over 20 years. We will be using a metric called the Sharpe ratio, which characterizes how well the return of an asset compensates the investor for the risk taken. Our goal is to find the portfolio that maximizes the Sharpe ratio.

We will use R’s optim function, which comes with base R, so there is no need to install or load any libraries.

R script for portfolio optimization

Our data is in long format; that is, we have one column for all returns, a column that associates each row with a particular stock, and a column with the year. Our calculation needs a matrix, so we first split the long table into chunks and then add each chunk as a new column (cbind) next to the others.

The SharpeRatio function contains the logic for calculating the Sharpe ratio, and it takes the percentages of each stock in your portfolio as input. You may have noticed the constant 0.05, which is my assumption for the risk-free interest rate: the theoretical rate of return an investor would expect from an absolutely risk-free investment. US treasury bonds, for example, are often considered risk-free.

Some of the optimization constraints are also captured in the function. Since each stock is associated with a percentage of the total portfolio value, these percentages have to be non-negative and add up to 1. We enforce the range of each percentage being between 0 and 1 using an if condition, and when the condition is not met we introduce a high penalty (1e7) so our optimizer avoids those regions. Adding up to 1 is enforced by the line p<-c(x,1-sum(x)). The function takes one fewer number than the total number of stocks selected by the user, and the missing value is computed by subtracting the sum of the others from 1, so that everything sums up to 1.

Next we call our optim function with a vector of zeros as initial values. I used L-BFGS-B as the optimization method and provided upper/lower limits for the ratios once again. You can try a different method like Nelder-Mead, skip the upper/lower limits, and still get very similar results. Note that by default optim tries to minimize the result of the provided function. You can pass an argument (control = list(fnscale = -1)) to have it maximize the goal, or simply negate the result of your function. I did the latter, hence my penalty for stepping out of bounds is a large positive number while my Sharpe ratio is multiplied by –1.
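Putting the pieces described above together, here is a rough reconstruction of the R logic. The variable names and the final mapping of weights back onto the rows are my guesses; .arg1 and .arg2 stand for the returns and stock columns passed in from Tableau.

riskFreeRate <- 0.05                          # assumed risk-free interest rate

# Reshape the long table into a matrix with one column of returns per stock
returnsMatrix <- do.call(cbind, split(.arg1, .arg2))

SharpeRatio <- function(x) {
  p <- c(x, 1 - sum(x))                       # last weight makes the portfolio sum to 1
  if (any(p < 0) | any(p > 1)) return(1e7)    # penalty for stepping out of bounds
  portfolioReturns <- as.vector(returnsMatrix %*% p)
  # negated so that optim(), which minimizes by default, maximizes the Sharpe ratio
  -(mean(portfolioReturns) - riskFreeRate) / sd(portfolioReturns)
}

result <- optim(rep(0, ncol(returnsMatrix) - 1), SharpeRatio,
                method = "L-BFGS-B", lower = 0, upper = 1)
weights <- c(result$par, 1 - sum(result$par))

# map the optimal weight for each stock back onto the incoming rows
weights[match(.arg2, colnames(returnsMatrix))]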

Below is a screenshot showing an optimized portfolio for these 4 particular stocks based on 20 years of historical data.

Example portfolio

You can download the workbook HERE. Enjoy!

Decision trees in Tableau using R

When the data has a lot of features that interact in complicated, non-linear ways, it is hard to find a global regression model, i.e. a single predictive formula that holds over the entire dataset. An alternative approach is to partition the space into smaller regions, then into sub-partitions (recursive partitioning), until each chunk can be explained with a simple model.

There are two main types of decision trees:
Classification trees: The predicted outcome is the class to which the data belongs.
Regression trees: The predicted outcome is a continuous variable, e.g. a real number such as the price of a commodity.

There are many ensemble machine learning methods that take advantage of decision trees. Perhaps the best known is the Random Forest classifier that constructs multiple decision trees and outputs the class which corresponds to the mode of the classes output by individual trees.

Let’s start with a classification tree. For decision trees I will use the package rpart, but maptree, tree and party are some other packages that can be used to the same effect.

All the data used in the examples below are retrieved from UC Irvine Machine Learning Repository. In the example workbook you can find links to documentation and credits for each dataset.

The first dataset contains the composition (sodium, calcium, magnesium… content) and physical properties (RI – refractive index) of samples for different types of glass. If you were to find broken glass at a crime scene, would you be able to tell what it is?

Data for the first example, composition and physical properties of different types of glass

Here is how you can create a classification tree to answer this question:

Fitting a decision tree and using it to make predictions in a Tableau calculated field

This calculation will just work as long as you have Rserve running and the rpart package installed, since it is self-contained: it loads the library, trains the model, and then uses the model to make predictions within the same calculation. The model fit happens in the part underlined in orange, the tree is pruned to the level where the minimum error occurs in the part underlined in blue, and finally the prediction happens in the part underlined in green.
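The screenshot isn’t reproduced here, but a rough sketch of such a self-contained calculated field, with the pruning step written the standard rpart way (the exact code in the workbook may differ slightly), could look like this:

SCRIPT_STR('
library(rpart);
dat <- data.frame(Type = .arg1, Al = .arg2, Ba = .arg3, Ca = .arg4, Fe = .arg5,
                  K = .arg6, Mg = .arg7, Na = .arg8, RI = .arg9, Si = .arg10);
fit <- rpart(Type ~ ., data = dat, method = "class");
# prune to the complexity parameter with the lowest cross-validated error
pruned <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]);
as.character(predict(pruned, newdata = dat, type = "class"));',
ATTR([Type]), AVG([Al]), AVG([Ba]), AVG([Ca]), AVG([Fe]),
AVG([K]), AVG([Mg]), AVG([Na]), AVG([RI]), AVG([Si]))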

Alternatively, you could use the control setting in the rpart function to impose some rules on splits. In the example below, control=rpart.control(minsplit=10, cp=0.001) requires at least 10 items in a node before attempting a split, and requires that a split decrease the overall cost complexity factor by 0.001.

Specifying rules before growing the tree
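In R terms, the call described above would look something like this, where glassData stands in for the data frame built from the Tableau arguments:

fit <- rpart(Type ~ ., data = glassData, method = "class",
             control = rpart.control(minsplit = 10, cp = 0.001))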

You may have noticed that in this example we are using the same dataset for training and prediction, not to mention that we are not looking at any diagnostic measures to gauge the quality of the fit. I did this firstly because I wanted to provide a simple, self-contained example that just works, and secondly because the point of the blog post is to show how a particular technique can be used from within Tableau, as opposed to providing guidelines for model fitting and selection.

In real-life scenarios, you would train the model using one “tagged” dataset, verify its quality, save the trained model and use it to make predictions with different datasets serving as input. The save-and-reuse workflow has been explained in detail with an example in my earlier blog post about logistic regression with Tableau and R. But I haven’t talked about how you can use the same approach to iterate on the model by going back and forth between your favorite R development environment and Tableau as needed, before you deploy the final model for everyone to use via a Tableau dashboard.

You will notice a sheet in the example workbook named SaveTree. This sheet contains a calculated field that has the following line:

save(fit, file = "C:/Users/bberan/R/myclassificationtree.rda")

which saves the fitted model to a file on my computer. Once you update the file path to a valid path on your computer and it successfully saves, you can easily open the model from your R environment and examine the tree structure and relevant statistics, prune the tree etc.

Examining the model in R console

Once you’re done, you can run the same save line from your R console to overwrite the file, then read it from Tableau to use for predictions.

SCRIPT_STR('library(rpart);
load("C:/Users/bberan/R/myclassificationtree.rda");
t(data.frame(predict(fit,
    newdata = data.frame(Type = .arg1, Al = .arg2, Ba = .arg3, Ca = .arg4, Fe = .arg5,
                         K = .arg6, Mg = .arg7, Na = .arg8, RI = .arg9, Si = .arg10),
    type = "class")))[1,];',
ATTR([Type]), AVG([Al]), AVG([Ba]), AVG([Ca]), AVG([Fe]),
AVG([K]), AVG([Mg]), AVG([Na]), AVG([RI]), AVG([Si]))

There is a dedicated sheet in the example workbook that uses the random forest approach on the same dataset. It uses the package randomForest, which you will need to install. However, the syntax is almost identical to the classification tree example above, so I will skip over it and move on to regression trees right away.

For the regression tree example, I will use a separate dataset that contains make, model year, displacement, number of cylinders, etc. for different cars, based on which we will try to predict MPG.

Data for the regression tree example, specs for different cars

Once again using the rpart package, the calculation can be written as follows. It is similar to the classification tree script; the minor differences are highlighted. Note that discrete variables such as make, origin and number of cylinders are used as factors.

Fitting a regression tree and using it to make predictions in a Tableau calculated field
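Since the screenshot isn’t reproduced here, below is a rough sketch of such a regression tree calculation. The exact field list is my guess based on the dataset description; the workbook’s version may differ.

SCRIPT_REAL('
library(rpart);
dat <- data.frame(MPG = .arg1, Make = factor(.arg2), Cylinders = factor(.arg3),
                  ModelYear = factor(.arg4), Origin = factor(.arg5),
                  Displacement = .arg6, Horsepower = .arg7, Weight = .arg8);
fit <- rpart(MPG ~ ., data = dat, method = "anova");   # "anova" grows a regression tree
predict(fit, newdata = dat);',
AVG([MPG]), ATTR([Make]), ATTR([Cylinders]), ATTR([Model Year]), ATTR([Origin]),
AVG([Displacement]), AVG([Horsepower]), AVG([Weight]))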

After some pruning, the regression tree looks like the following when drawn in R, where the leaf nodes show the MPG values for the cars classified under that group.

Regression tree visualized in R

You can find the Tableau workbook with the examples HERE.

Using R forecasting packages from Tableau

A common question about the R integration feature in Tableau 8.1 is how to use it with the forecasting packages available in R. This is an interesting question for two reasons:

  1. SCRIPT_ functions are calculated fields, and calculated fields don’t add rows to the table, while that is exactly what forecasting is all about.
  2. Forecast packages normally return just the forecast results, which have fewer rows than the input data and, at the same time, don’t make sense to join back in.

Here is a relevant thread from Tableau forums.

http://community.tableausoftware.com/thread/137551

Before we move any further, I would like to emphasize that Tableau has built-in forecasting functionality that makes most of what we will cover here possible with the click of a button, and comes with many more convenience features. The intent of this blog post is to show an alternative approach that, while much more complicated, can be useful for some.

The example in the forums uses the fpp package. For variety, in this blog post I will be using the forecast package. It is worth noting that we worked very closely with Prof. Hyndman, the author of R’s forecast package, when developing the forecasting feature in Tableau 8.

Let’s start with visualizing the fit and residuals for your in-sample forecast. You will find this in the first sheet of the workbook linked at the end of the post.

Model fit and residuals calculated in R, visualized in Tableau

We have two time series in this chart. To reduce the network traffic, I retrieved both columns in a single call to Rserve. Let’s take a look at the calculation.

Calculated field returning the fit and residuals in a single call
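Since the original screenshot isn’t reproduced here, below is a sketch of what that calculated field might look like. The auto.arima model choice is my assumption; the workbook may fit the model differently.

SCRIPT_STR('
library(forecast);
timeSeries <- ts(.arg1, deltat = 1/4, start = 1960);   # quarterly data starting in 1960
fit <- auto.arima(timeSeries);                          # model choice is an assumption
paste(fitted(fit), residuals(fit), sep = "~");',        # both columns in one result, separated by ~
SUM([Earnings]))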

First, we load the library. For this to work, please make sure the library and its dependencies are installed. You can find more about how to preload libraries for better performance in my posts here and here. Then we create a time series object (ts) from the [Earnings] column (.arg1); since the data is quarterly, our deltat is 1/4, and our time series starts at the beginning of 1960. You will notice that the paste function takes two columns from the resulting forecast (fit and residuals) and concatenates them using ~ as the separator, which allows us to retrieve both columns at once. Then, in the calculated fields named Fit and Residuals, we tokenize the results. Only one of the calculated fields is shown below for brevity.

Tokenizing concatenated results
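For reference, a sketch of the tokenizing calculations, assuming the combined field is called [FitAndResiduals] (a name I made up):

// Fit: the part before the ~ separator
FLOAT(LEFT([FitAndResiduals], FIND([FitAndResiduals], "~") - 1))

// Residuals: the part after the ~ separator
FLOAT(MID([FitAndResiduals], FIND([FitAndResiduals], "~") + 1))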

So far we’ve been dealing with the in-sample forecast, so we haven’t yet touched the core question about how to retrieve the future predictions into Tableau. Without further ado, let’s jump to the more exciting part. I will demonstrate two approaches. The first one is domain shifting which is shown in the second sheet.

Retrieving the predicted data points from R
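As a rough sketch of what that calculation might look like (the auto.arima call is again my assumption; the workbook may differ):

SCRIPT_REAL('
library(forecast);
n <- .arg2[1];                                  # number of quarters to forecast
timeSeries <- ts(.arg1, deltat = 1/4, start = 1960);
fc <- forecast(auto.arima(timeSeries), h = n);  # model choice is an assumption
append(.arg1[-(1:n)], as.numeric(fc$mean));',   # drop n rows from the top, append n forecasts
SUM([Earnings]), [PeriodsToForecast])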

Our calculation, as shown above, is very similar to the one we looked at earlier. But this time we are asking for a forecast (mean) whose length (h=) is set by the Tableau parameter named [PeriodsToForecast], and using the append function to add the predicted rows to the end of the [Earnings] column (which answers question 2 above) while dropping just as many rows from the top. The result is shown below. Since we forecast 10 quarters, our time series now extends 2.5 years beyond 1980, but our chart doesn’t start at 1960 anymore.

Forecast using domain shifting, note the starting year of 1963

We achieve this by also shifting the date axis by the same amount, as shown below.

Shifting the date axis
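A minimal guess at such an axis-shifting calculation, assuming [Quarter] is the date field on the axis:

DATEADD('quarter', [PeriodsToForecast], [Quarter])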

But what if you don’t want to trim rows from the beginning of your data? That’s where densification comes to our aid.

The second approach relies on the fact that Tableau can fill in missing values given a range. What we will do is shift the last data point to the date where we want our forecast to end. This creates a gap between the last value and the previous one, and Tableau’s “Show Missing Values” option will automatically add rows to fill that gap.

Show Missing Values option

The last date is moved out to provide the full date extent for forecasting using the following calculated field. Here [MaxDate] is created at the data source level by joining the main table with a sub-query that returns the MAX(Quarter) in the input data.

Shifting the last date in the time series
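A guess at what that calculated field might look like, using the [Quarter] and [MaxDate] fields just described:

// Move only the last data point out by the number of quarters we want to forecast
IF [Quarter] = [MaxDate]
THEN DATEADD('quarter', [PeriodsToForecast], [Quarter])
ELSE [Quarter]
END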

Our calculated field looks very similar to the previous calculation. The important changes are highlighted: removing the blank rows that densification added from the data passed to R’s forecast function as input, and pulling the shifted last data point back to its correct date for the forecasting function. The results then get merged back in using the same append function, but this time without anything getting trimmed.

Running the R forecast code with densification
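A sketch of what the densification-friendly version of the R calculation might look like (auto.arima is again my assumption, and the sketch assumes the number of padded rows equals the forecast horizon):

SCRIPT_REAL('
library(forecast);
n <- .arg2[1];                                   # forecast horizon, equals the number of padded rows
actuals <- .arg1[!is.na(.arg1)];                 # drop the blank rows densification added
timeSeries <- ts(actuals, deltat = 1/4, start = 1960);
fc <- forecast(auto.arima(timeSeries), h = n);   # model choice is an assumption
append(actuals, as.numeric(fc$mean));',          # nothing gets trimmed this time
SUM([Earnings]), [PeriodsToForecast])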

The result is shown below. For comparison, we are forecasting 10 quarters (same as above). Now our time series extends 2.5 years beyond 1980 while the chart still starts at 1960.

Forecast using domain completion, note the starting year of 1960

You can download the example workbook from HERE.