Scaling RServe Deployments

Tableau runs R scripts using RServe, a free, open-source R package. But if you have a large number of users on Tableau Server and use R scripts heavily, pointing Tableau to a single RServe instance may not be sufficient.

Luckily you can use a load-balancer to distribute the load across multiple RServe instances without having to invest in a commercial R distribution. In this blog post, I will show you, how you can achieve this using another open source project called HAProxy.

Let’s start by installing HAProxy.

On Mac you can do this by running the following commands in the terminal

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Followed by

brew install haproxy

Create the config file that contains pointers to the Rserve instances.

In this case I created in the folder ‘/usr/local/Cellar/haproxy/’ but it could have been some other folder.

    maxconn 256

    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

listen stats
    bind :8080
    mode http
    stats enable
    stats realm Haproxy\ Statistics
    stats uri /haproxy_stats
    stats hide-version
    stats auth admin:admin@rserve

frontend rserve_frontend
    bind *:80
    mode tcp
    timeout client  1m
    default_backend rserve_backend

backend rserve_backend
    mode tcp
    option log-health-checks
    option redispatch
    balance roundrobin
    timeout connect 10s
    timeout server 1m
    server rserve1 localhost:6311 check maxconn 32
    server rserve2 anotherserver.abc.lan:6311 check maxconn 32

The highlights in the config file are the timeouts, max connections allowed for each Rserve instance, host:port for Rserve instances, load balancer listening on port 80, balancing being done using roundrobin method, server stats page configured on port 8080 and username and password for accessing the stats page. I used a very basic configuration but HAProxy documentation has detailed info on all the options.

Let’s check if config file is valid and we don’t have any typos etc.

BBERAN-MAC:~ bberan$ haproxy -f /usr/local/Cellar/haproxy/haproxy.cfg -c
Configuration file is valid

Now you can start HAproxy by passing a pointer to the config file as shown below:

sudo haproxy -f /usr/local/Cellar/haproxy/haproxy.cfg

Let’s launch Tableau and enter the host and port number for the load balancer instead of an actual RServe instance.

Connection information for the load balancer

Success!! I can see the results from R’s forecasting package in Tableau through the load balancer we just configured.

Results of R script evaluated through the load balancer

Let’s run the calculation one more time.

Now let’s look at the stats page for our HAProxy instance. In this case per our configuration file by navigating to http://localhost:8080/haproxy_stats.

Server statistics for the load balancer for Rserve instances

I can see the two requests I made and that they ended up being evaluated on different RServe instances as expected since round-robin load balancing forwards a client request to each server in turn.

Now let’s install it on a server that is more likely to be used in production and have it start up automatically etc.

I used a Linux machine (Ubuntu 14.04 specifically) for this. There are only a few small differences in the configuration steps. To install HAProxy, in a terminal window enter :

apt-get install haproxy

Now edit the haproxy file under the directory /etc/default/ and set ENABLED=1. This is by default 0. Setting to 1 will run HAProxy automatically when the machine starts.

Now let’s edit the config file which can be found here /etc/haproxy/haproxy.cfg to match the config example above.

And we’re ready to start the load balancer:

sudo service haproxy start

Now you can serve many more visualizations containing R scripts to a larger number of Tableau users. Depending on the amount of load you’re dealing with, you can start with running multiple RServe processes on different ports of the same machine or you can add more machines to scale out further.

Time to put advanced analytics dashboards on more screens 🙂


Quick Tip : RServe tricks that can make your life easier

RServe offers a number of configuration options that can come in handy when working with R inside Tableau but they are not captured in detail in RServe’s documentation. Let’s talk about a few.

How can I start RServe with a configuration file in a custom location?

RServe has a default place for the placement of configuration file e.g. /etc/Rserv.conf on Linux/Mac but you may not want to use that location or even have multiple config files that you switch sometimes so it is much more convenient to explicitly set it.

You can do this in the following way when starting RServe from R:

On Windows

Rserve(args="--RS-conf C:\\PROGRA~1\\R\\R-215~1.2\\library\\Rserve\\Rserv.cfg")

On Linux/Mac

Rserve(args=" --no-save --RS-conf ~/Documents/Rserv.cfg")

You can also do this from command line/terminal window. Assuming Rserve.exe is in the path or you’re in the folder that contains Rserve.exe so Rserve would be recognized:

Rserve –-RS-conf C:\\Users\\bberan\\Rserv.cfg

You probably noticed that on Mac there is an extra argument “--no-save”. On Mac starting RServe requires using one of --save,--no-save or --vanilla but what do they mean? The answer is in R help

BBERAN-MAC:~ bberan$ R --help
Usage: R [options] [< infile] [> outfile]
or: R CMD command [arguments]
Start R, a system for statistical computation and graphics, with the specified options, or invoke an R tool via the 'R CMD' interface.
--save                Do save workspace at the end of the session
--no-save             Don't save it
--no-environ          Don't read the site and user environment files
--no-site-file        Don't read the site-wide Rprofile
--no-init-file        Don't read the user R profile
--restore             Do restore previously saved objects at startup
--no-restore-data     Don't restore previously saved objects
--no-restore-history  Don't restore the R history file
--no-restore          Don't restore anything
--vanilla             Combine --no-save, --no-restore, --no-site-file,--no-init-file and --no-environ

Config files are very useful since they provide a centralized place to pre-load all the libraries you need so you don’t have to load them as part of each request which results in better performance, they allow evaluating any R code as part of RServe startup allowing you to load trained models or R script files, even data that you may want to use as part of your analysis when running  R code from Tableau.

How can I debug R scripts I have in Tableau by taking advantage of my R development environment?

As you’re writing scripts in Tableau, you may want to understand different steps data is going through especially if you’re getting errors when script is being evaluated. There are many ways to debug. Using RServe in debug mode (Rserve_d.exe) outputs all the exchanges into the command line and can be a a bit verbose. You can also insert statements like write.csv in your script to create output of different data structures but by favorite option is to do this using an environment like the basic R GUI or RStudio.

On Linux/Mac

If you started Rserve from your R console the way described in the beginning, you can insert print statements to print into the console and even your plot statements will create visuals in R (if you have X11 installed) which can come in handy when debugging large and complex chunks of R code.

On Windows

You can achieve the same by starting R using:


This takes over the current R session and makes it into an Rserve session as opposed to starting it by typing RServe() which starts a new Rserve process by calling Rserve.exe.

I hope you find this useful.

R, Text analytics, Visualization

Trump vs. Clinton in N-grams

Presidential election campaigns are heating up and you all know what that means.

Word clouds 🙂

I still remember visualizations summarizing Romney-Obama debates in 2012 but one thing that slightly bothered me back then was that almost everything I saw were just counting single words so even the one of the most memorable phrases in Obama’s campaign such as “Four more years” were lost.

Surely, there will be some interesting phrases candidates will use this year. How can you make sure that your analysis doesn’t miss them?

I am planning on doing a series of posts on text analytics but tokenization is an important component in text analysis so let’s start with answering this question first.

This analysis relies on two R packages.


I wrote a number of helper functions to make analysis steps easier to follow.

# Helper functions
removeStopWords <- function(string, words) {
    stopifnot(is.character(string), is.character(words))
    splt <- strsplit(string, " ", fixed = TRUE)
    vapply(splt, function(x) paste(x[!tolower(x) %in% words], collapse = " "),character(1))

countWords <- function(y) { sapply(gregexpr(" ", y), function(x) { 1+sum(x>=0) } ) }

# Find N-grams within certain edit distance to avoid multiple subsets of same phrase
setToClosestMatch<-function(text) {sapply(seq_along(names(text)),function(x){
         ll <- agrep(pattern=names(text)[x],        

# To remove an, the, is etc. after finding n-grams 
reduceToRealNGrams <- function(ngrams,n){
ngrams[countWords(removeStopWords(ngrams, append(tm::stopwords("en"),c('going','can','will'))))>=n]}

# Tokenize 2 to 4 grams 
NgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 4,delimiters = " \\r\\n\\t.,;:\"()?!"))

Let’s start the analysis. I had one file for each candidate, containing transcripts for 5 speeches they gave between June 2015 and April 2016.

# Load the documents from specified folder
docs <- Corpus(DirSource("C:/Users/bberan/Downloads/Transcripts/"))
# Convert to lower case
docs <- tm_map(docs, tolower) 
# Remove common words : locations, candidates’ references to each other etc.
docs <- tm_map(docs, removeWords, c("des moines","new york","new hampshire","south carolina","united states","twin cities","san bernardino","hillary clinton","donald trump"))

This is an optional step that allows you to find phrases that differentiate presidential candidates by putting small weights on phrases used by both candidates.

# Create term document matrix with tf-idf weighting
tdm <- TermDocumentMatrix(docs, control = list(tokenize = NgramTokenizer,weighting = weightTfIdf))

m <- as.matrix(tdm)

# separate Hillary and Trump content and sort by frequency
hillary <- sort(m[,1], decreasing=TRUE)
trump <- sort(m[,2], decreasing=TRUE)

# Get top 250 most common N-grams for each candidate
hillaryTopN<-hillary[order(-nchar(names(hillary[1:250])), hillary[1:250])]
trumpTopN<-trump[order(-nchar(names(trump[1:250])), trump[1:250])]

Since we are looking for 2 to 4 grams, R will find “Make America”, “Make America great”, “Make America great again” as separate n-grams. This step consolidates all of them to Make America great again. It also gets rid of N-grams if after removal of stop words e.g. (a, and, the, or..) they become smaller than a 2-gram. For example a 3-gram like “this or that” would be dropped a part of this step. Removing stop words this late makes sure our phrase structure is not broken e.g. “to keep us safe” does not become “keep safe”.

# get rid of junk and overlapping n-grams
hillaryTopNConsolidated<- reduceToRealNGrams(unique(cbind(hillaryTopN,setToClosestMatch(hillaryTopN))[,2]),2)
trumpTopNConsolidated<- reduceToRealNGrams(unique(cbind(trumpTopN,setToClosestMatch(trumpTopN))[,2]),2)

Now that we completed the “key phrase extraction” process, we can write this data to CSV and read in Tableau to build a word cloud like this.

 N-grams frequently used by Hillary Clinton

It gives a much better idea than individual words but can we build something that gives even more context?

Of course by that I don’t mean a bar chart version of this. Despite all the criticism in the visualization community, word clouds do a decent job of conveying the information contained in a table with words and their respective counts for a number of reasons:

  • Counts in word clouds are not exact or reliable quantities. A speaker can repeat a word several times in a row trying to remember what to say next or when they are interrupted by the audience. Even if that doesn’t happen, whether someone said something 10 vs. 11 times hardly tells you anything meaningful. So when reading a word cloud, what people look for is whether things are roughly the same or vastly different. To use an analogy, for a thermometer reading, there is no reason to display the temperature to 6th decimal place if thermometer is only accurate to one decimal place. 6 decimal places give a false sense of accuracy and takes up unnecessary space. Large numbers with one decimal place is much more useful in conveying the information.
  • Applying transformations like TF-IDF can change the word count values by orders of magnitude which makes accuracy of bar chart even less useful.
  • If corpus is large enough, word frequencies follow a power law pattern. Power law distributions apply to many aspects of human systems. The best known example is economist Vilfredo Pareto’s observation  that wealth follows a “predictable imbalance”, with 20% of the population holding 80% of the wealth.The linguist George Zipf observed that word frequency also falls in a power law pattern, with a small number of high frequency words, a moderate number of common words and a very large number of low frequency words. Later Jacob Nielsen observed power law distributions also in web site page views which is why word clouds often work well highlighting the popular content on news aggregators or forums.

Maybe a good way of providing more context is finding the sentences that contains these common phrases that distinguish one candidate from another.

First I broke the text into sentences in R.

splitToSentences <- function(text) { 
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = "en")
text <- as.String(text)
sentence.boundaries <- annotate(text, sentence_token_annotator)
sentences <- text[sentence.boundaries]

Docs[[2]] in my script holds Trump’s speeches so to get the sentences for it:


Then I wrote the results to CSV, imported the CSV into Tableau and wrote some custom SQL to do a join between sentences and n-grams using CONTAINS as join criteria.

A little formatting to give it newspaper styling and here is the final result.

 Trump, Clinton speech analysis on Tableau Public

You can click on the image to see it live on Tableau Public and download the workbook if you like. Selecting topics from the right will bring up all the sentences containing that phrase.

R, Visualization

Correspondence Analysis in Tableau with R

Correspondence analysis is an exploratory data analysis method for discovering relationships between two or more categorical variables. It is very often used for visualizing survey data since if the matrix is large enough (which could be due to large number of variables but also possible with small number of variables with high cardinality) visual inspection of tabulated data or simple statistical analysis cannot sufficiently explain its structure. Correspondence analysis can remarkably simplify representation of such data by projecting both row and column variables into lower dimensional space that can often be visualized as a scatter plot at a small loss of fidelity.

Let’s take a look at an example. Below is the data from 2014 Auto Brand Perception survey by Consumer Reports where 1578 randomly selected adults were asked what they considered exemplar attributes for different car brands. Respondents picked all that apply from among the list that consisted of : Style, Performance, Quality, Safety, Innovation, Value  and Fuel Economy.

We can convert this data into a contingency table in R and do a chi-square test which tells us that there is statistically significant association between car brands and their perceived attributes.


But often this is not sufficient since my goal is to understand how different car makers are perceived to learn how people see my brand, how I compare with the competition, how to competitively position an existing product or bring a new product in the market to fill a gap.

Let’s visualize this as a cross-tab in Tableau.

2014 Auto Brand Perception Survey Results

Even though there are only 7 choices and a single question in the survey, this table is hard to interpret.

Let’s apply correspondence analysis and see what our scatter plot looks like. Here blue dots are cars. Blue points closer to each other are more similar than points farther away. Red items (e.g. Style being hovered over in the screenshot) are the attributes. The axes themselves do not distinguish independent dimensions for discriminating categories so attributes are useful in orienting yourself when looking at the chart and help assign names to different areas of the scatter plot. If you imagine a line extending from the center of the plot towards each of the red points, the distance of blue points to the lines indicate how related they are to the particular attribute. For example for Volvo, safety is the the perception that dominates. Same can be said for Kia and Value. But Subaru is considered safe, have good quality and value while Porsche and Ferrari are mostly associated with attributes Style and Performance, roughly the same amount.

Correspondence Analysis of Brand Perception Survey

This scatter plot explains 70% of the variance in the data. While it doesn’t capture everything, it is a lot easier to consume than cross-tabulation.

The rows and columns used in computing the principal axes of the low-dimensional representation are called active points. Passive (supplementary) points/variables are projected onto the plot but not taken into account when computing the structure of the plot itself. For example if there are two new cars in the market and you want to see their relative positioning in an existing plot, you can add them as supplementary points. If there are outliers, you can also choose to make them into supplementary points not to skew the results. Supplementary variables on the other hand are typically exogenous variables e.g. the age group or education level of the survey participant. In some cases you may prefer generating multiple plots instead e.g. one per gender. You can mark a column or row as supplementary using supcol and support arguments in ca function call e.g. ca(mydata,supcol=c(1,6)) makes 1st and 6th columns in the table supplementary.

You can add more to this chart to explore more. For example, you can put price of the car or safety rating on color and see whether they align with the perceived value or safety.  For example Tesla, Ford and Fiat are all associated with value while Tesla is not a budget car. Similarly Volvo and Tesla both have a 5 star safety rating but consumers associate Volvo much more with safety than any other brand. If you have multiple years of data, you can put years on the Pages Shelf and watch how perception changed over time, whether your marketing campaigns were effective in moving it in a direction you wanted.

Correspondence analysis use cases are not limited to social sciences and consumer research. In genetics for example microarray studies use MCA to identify potential relationships between genes. Let’s pick our next example from a different domain.

If there are multiple questions in your survey, you can use Multiple Correspondence Analysis (MCA) instead. Our data for this example contains categorical information about different organisms. Whether they fly, photosynthesize, have a spine….

Categorica attributes of different organisms

For a moment, imagine the first column doesn’t exist so you have no knowledge about what organism each row is. How easy would it be to understand if there are groups in the data based on these attributes?

Let’s apply MCA to this dataset. In this case I put the attributes in the secondary axis, hid their marks and made their labels larger. I also applied some jitter to deal with overlapping marks.

I can clearly see groups like birds, mammals, plants, fungi and shellfish. If the data wasn’t labeled, I would be able to associate them looking at the chart and by examining the common attributes of adjacent points start developing an understanding of what type of organisms they might be.

Multiple correspondence analysis applied to organisms dataset

You can download the sample workbook from HERE.

R, Visualization

Quick Tip : Overlaying curves on Tableau scatter plots with R

Tableau provides a good set of trend line, reference line and band options but sometimes you want to overlay curves based on a custom equation. Logistic regression curves, sine curves, quantile regression curves…. And want these overlay curves to be smooth…

This is very easy to do by taking advantage of the technique I shared when building Coxcomb charts and radial treemaps. If you know the equation (or know how to get to it) and it can be described in Tableau’s calculation language you can do so using a table calculation. But doing the fit dynamically involves R and when you’re passing the data to R you need to do some basic NULL handling. Here are two examples showing what the results might look like. You can see that despite very few points in my dataset and large gaps between them, the curves look very smooth.

A sine curve and logistic regression curve overlay in Tableau

The key component is the bin field created on the variable that’s on the X axis. In Tableau bins can be used to trigger densification by turning on “Show Missing Values” option on the bin. Doing so adds NULL rows into the data backing up the visualization, which you can fill with values from table calculations (including SCRIPT_ functions). In your R script, you need to remove these artificially generated NULLs, in order not to confuse the curve fitting procedure you’re applying. 

I tied the bin size to a parameter so you can try different values to make the curves coarser or smoother.

If you want to take this technique it a bit further you could use one bin for each axis which will allow you to create a grid. Then you an treat each cell like a pixel in a raster and draw shaded areas such as contours.

Below you can see two examples of this. First one estimates the bivariate kernel density of the data points in the viz on-the-fly using R on the fly and draws the contours using the secondary axes of this dual (technically quadruple) axis chart.

Displaying kernel density as overlay in Tableau

The second chart uses the same data to fit a one-class SVM (support vector machine) which is often used for outlier/novelty detection with a Gaussian radial basis function then draws the decision boundaries using the secondary axes (blue ellipse). Suspected outliers are shown in red while inliers are shown in white.

Displaying the boundaries of a one-class SVM as overlay in Tableau

You can download the sample Tableau workbook from HERE


Using your R models for in-database scoring

Let’s say you trained a predictive model using R integration in Tableau or your favorite R authoring environment.

If your data is very large you probably used a subset of your data for training and want to see how well it will do over the entire database.Or maybe you’re satisfied with the model but you want it to return you only the interesting results from the database e.g. customers who are most likely to churn.

In both scenarios, being able to push the model to the database means being able to rely on the database’s scalability and avoiding delays caused by data movement.

You may be asking, “but what if our database systems can’t run predictive models?” Luckily once trained, some of the very commonly used predictive methods can be expressed in SQL hence also in Tableau’s calculation language. All you need is a way to convert the model into a Tableau calculated field.

If you created a decision tree in R, you can use the following function to convert it to a Tableau formula:

decisionTreeToFormula <- function (tree) {

if (class(tree) != "rpart") {
warning(class(tree), " is not supported.Please use an rpart model.")

capture.output({rpartRules <- path.rpart(tree,rownames(tree$frame)[tree$frame$var=="<leaf>"])})
ruleTexts <- "IF "
operators <- c("<=",">=","<",">","=")
i <- 1


for (rule in rpartRules) {
ruleText <- character(0)
for (component in rule) {
whichSeparator <- lapply(operators, function(x) length(unlist(strsplit(component,x)))) > 1
parts <- unlist(strsplit(component,(operators[whichSeparator])[1]))
if(!(parts[1]=="root")) {
if (is.finite(suppressWarnings(as.numeric(parts[2])))) {
ruleText <- c(ruleText,paste("[",parts[1],"]",(operators[whichSeparator])[1],parts[2],sep=""))
} else {
ruleText <- c(ruleText,paste0(" (",paste0("[",parts[1],"]","='",unlist(strsplit(parts[2],",")),"'",collapse=" OR "),")"))

ruleTexts <- c(ruleTexts, paste0(if(length(ruleTexts)>1){"\nELSEIF "}, paste(ruleText,collapse=" AND ")," THEN '" ,mapping_table[mapping_table$node==as.numeric(names
if(i==length(rpartRules)) ruleTexts <- c(ruleTexts,"\nEND")
i <- i +1
tableauFormula <- paste(ruleTexts,collapse=" ")

You can add this to your Rserve.cfg so you can call the decisionTreeToFormula function  in your SCRIPT_ calls from Tableau or you can use it in your R environment. Either way it will give you some text which you can copy-paste into Tableau and use it as a formula that can run on the database.

In the example below, I fit a decision tree on a database of cars trying to predict the gas mileage (Bad|OK|Good). I used only a small portion of variables (just maker and year) to get a compact result back. Applying the function gives me a nice IF..THEN..ELSE formula that handles both numeric and categorical items. Now I can use this classifier on any database. As a dimension, as a filter… by just copy-pasting the formula into a calculated field in Tableau.

> myTree<-rpart(mpg~modelyear+maker,data=cars)
> decisionTreeToFormula(myTree)
IF ([maker]='America') AND [modelyear] < 79 THEN 'Bad'
ELSEIF ([maker]='America') AND [modelyear] >= 79 THEN 'Good'
ELSEIF ([maker]='Asia' OR [maker]='Europe') THEN 'OK'

What about a regression model? This post has that covered, too. You can achieve similar results with the following function :

regressionModelToFormula <- function(model, decimalPlaces) {
if ((class(model)[1] != "glm" && model$family$family != "binomial") & class(model)[1] != "lm") {
warning("Function is not applicable to this model type.")
factors=c("",gsub("[-^0-9]", "", names(unlist(model$xlevels))))[-1]
coeff <- coefficients(model);
catVars <-merge(names,varCoeffMap,all.x=TRUE)

if (nrow(catVars) !=0) {
contVars <- round(coeff[-which(names(coeff) %in% catVars$varNames)],decimalPlaces)
} else {
contVars <- round(coeff,decimalPlaces)
catFormula <- " "

for(catVar in unique(catVars$factors)){
curVar <- catVars[catVars$factors==catVar & !is.na(catVars$coefficients) ,]
catFormula <- paste("WHEN",curVar$levels, "THEN",round(curVar$coefficients,decimalPlaces), sep=" ", collapse="\n")
catFormula <- paste("+ CASE [",catVar,"] \n",catFormula," ELSE 0 \n END \n", sep="")

numericFormula <- paste(contVars[-1], "*" ,paste("[",names(contVars[-1]),"]",sep=""), "+", collapse = " ")
if (class(model)[1]=="lm") {
combinedFormula <- paste(numericFormula,contVars[1],catFormula)
combinedFormula <- paste("1 / (1 + exp(", numericFormula, contVars[1],catFormula, "))")
tableauFormula <- gsub(pattern=":", replacement="*", x=combinedFormula)

If you call this function from R on the logistic regression model I used in one of my earlier blog posts, below is what you will get. Copy-paste this into a Tableau calculated field and you’re ready to make predictions with your logistic regression model in your database.

> regressionModelToFormula(lrmodel,5)
1 / (1 + exp( 0.00628 * [gre] + 1.60393 * [gpa] -8.82164 +
CASE [rank]
WHEN 2 THEN -1.10904
WHEN 3 THEN -1.31259
WHEN 4 THEN -2.01912

And if you apply this function on the multiple linear regression model in the last sheet of the example workbook from my last blog post, what you get will resemble the following:

> regressionModelToFormula(mrmodel, 4)
1.6224 * [Economic Indicator X] + 0.597 * [Economic Indicator Y] -37.1103

I hope you find this useful.


Multivariate Forecasting in Tableau with R

Since version 8.0 it is very easy to generate forecasts in Tableau using exponential smoothing. But in some cases you may want to enrich your forecasts with external variables. For example you may have the government’s forecast for population growth, your own hiring plans, upcoming holidays*, planned marketing activities… which could all have varying levels of impact on your forecasts. Or sometimes you may just want to simulate different scenarios and ask “what if” questions e.g. what would my sales look like if I hired 10 more sales representatives? If interest rates went up by X percent, how would that impact my profits?

Let’s see how we can tackle both uses cases with the help of Autoregressive Integrated Moving Average with eXogenous variables (ARIMAX) models in R’s forecast package.

*In some cases seasonality may be sufficient to capture weekly cycles but not for moving events like Easter, Chinese New Year, Ramadan, Thanksgiving, Labor day etc.

Handling special events

In the image below the observed/historical demand is shown in blue. At the bottom you can see the holidays as green bars with height of 1 and non-holidays as bars of height 0. This representation is often referred to as dummy encoding.

Our data contains 2 holidays that happened in the past and 1 upcoming holiday. One can clearly see that the holidays are causing noticeable spikes in demand. For a store owner who doesn’t want to miss the next holiday opportunity by running out of stock early, it is very valuable to incorporate this important piece of information in the demand forecast.

Accounting for holidays and special events in forecasts

The formula for the forecast shown with the red line (which doesn’t take holidays into account) looks like the following:

append(.arg1[1:99],forecast(ts(.arg1[1:99], frequency=12), h=21)$mean)",AVG([Demand]))

First 99 points cover the historical data while last 21 are what’s being predicted. The data follows a 12 period cycle. Second line of R code appends the predicted values to the reported values to generate the full series.

The formula for the forecast shown with the green line (which incorporates the holidays) looks like the following:

data <- data.frame(demand=.arg1,holiday=.arg2);
training <-data[1:99,];
append(training[,1],forecast(model,xreg=.arg2[100:120])$mean)", AVG([Demand]), ATTR([Holiday]))

In this script, exogenous regressors are passed to the function using the xreg argument. It can be clearly seen that both the spike due to holiday and the regular seasonality in demand are represented by this model. If there are multiple events with potentially different effects (positive vs. negative, different magnitudes etc.), they can be handled using the same method if added as separate variables.

What-If Analysis

Now that we covered handling events as additional regressors, let’s talk about we can apply the same methodologies to do what if analysis.

Below you can see three time series; Sales and 2 economic indicators. Assume that the two indicators contain the current best estimates of behavior for next few days but Sales for the same period is unknown however can be estimated using the two economic indicators.

Sales forecast based on the two economic indicators

What makes this visualization more interesting is that you can also adjust the value of economic indicators and the time frame these overrides apply. For example in the image below, indicator X has been increased by 15 units (dark blue peak) for the period April 13th and April 25th while indicatory Y has been reduced by 20 units (dark orange dip) for the period April 20th and May 1st. The impact can be clearly seen in the dark green portion of the line in the first chart. By adjusting the parameters in the dashboard one can perform what-if analysis and understand impact of likely future events, best/worst case scenarios etc.

What if analysis with time series forecasting

Let’s take a look at the R script that is used to create this view.

training <- ts(data.frame(sales=.arg1,indX=.arg2,indY=.arg3)[1:100,],frequency=7);
whatifdata <- data.frame(indX=.arg2,indY=.arg3)[101:120,];
AVG([Sales]),AVG([Economic Indicator X]),AVG([Economic Indicator Y]))

As you can see, the formula is very similar to earlier examples. In this demo dataset, the first 100 rows are used for model fitting while the last 20 contain the sales forecast as well as the inputs for the sales forecast that are the what-if values defined in Economic indicator X and Y fields as a function of parameter entries.

You can find all of these examples in the Tableau workbook published HERE.

In the sample workbook, I also provided a sheet that compares the ARIMAX result to multiple linear regression to give a better sense of what you’re getting out of applying this particular method.


Cause & Effect : A different way to explore temporal data in Tableau with R

Whether it is forecasting your quarterly sales or comparing historical data, working with time series data is big part of business analytics. Sometimes patterns are easy to see, while in others they might be elusive especially when comparing different time series with short/frequent cyclical patterns.

A common question could be understanding the impact of a marketing campaign. Did it make a difference at all? What’s the return?

A true randomized treatment-control experiment would be the ideal case for running tests but it is not always practical, if at all possible. So let’s see how we can try to answer this question using Bayesian Structural Time Series models. For this post, I generated a dataset with 3 columns. In the image below, Sales is at the top, with 2 economic indicators below it. Let’s assume these two economic indicators are “Consumer Sentiment” and “Spending on Durable Goods” and that they have been good predictors of my sales and there are no other unaccounted for variables that impact my sales.

Time series data

I know the date I started my marketing campaign. Since time flows in one direction, if it has any impact, it would be after this date. If I can use the data from days prior to the campaign to train my model, I can forecast what would have happened if I had not run the campaign. Then by comparing it with the actual sales data, I can understand what difference it made.

Predicted vs Actual sales

While it was initially hard to visually detect this change, now that I added the “expected” forecast (red line), I can clearly see the bump marketing campaign gave me by looking at the gap between green and red lines. In the workbook, I also added to the tooltip a textual summary describing the statistical significance of the results which the CausalImpact package conveniently generates.

Statistical test results

Let’s also look at the cumulative effect of the marketing campaign.

Cumulative value of the marketing campaign over time

It looks like, we made an extra 748 thousand dollars. By comparing this with the amount we spent on the marketing campaign, we can decide whether this was a good investment.

You can download the sample workbook from HERE. The workbook uses the CausalImpact package which can be installed from GitHub. You can find more info on the installation steps in the notes section at the end of the blog post. One thing to keep in mind, CausalImpact package doesn’t work well with lags e.g. if Consumer Sentiment in January is linked to Sales in July, you’ll have to shift your dates first.

Another common type of time series analysis involves understanding dependency relationships between time series with different lags e.g. could Consumer Sentiment from 3 months ago help me make a better prediction of my Sales next month? A frequently used method for this type of analysis is Granger causality.

Granger causality is not necessarily true causality hence the term is often used as “X Granger-causes Y”. It means the patterns in X are approximately repeated in Y after some lag e.g. 4 months in the example I gave above. As a result of this, past values of X can be used to predict future values of Y.

You might be thinking if X happened 4 months before Y and there is a correlation, how is that not a causal relationship? If both X and Y are driven by a 3rd, unknown process (Z) with different lag, it might look like Granger-causality exists between X and Y while in reality manipulation of X would not change Y, only changes to Z would. The roosters crow before dawn, but they don’t cause the sun to rise.

That being said, lack of true causality does not mean one thing can’t be useful in predicting another. Humans relied on roosters’ built in timers for hundreds of years.

Let’s explore Granger causality with an example. Below you can see two time series : The University of Michigan “Consumer Sentiment Index” and “New One Family Houses Sold” from Federal Reserve Economic Data (FRED) website.

Can consumer sentiment in previous months help better predict future home sales?

Input data for Granger-causality test

There are several packages in R to do this test of course. For this post, I used lmtest package and a few methods from forecast package to estimate the number of differences required to make the sample time series data stationary. You can download the sample workbook from HERE. Based on this analysis, with 4 lags, it appears that Consumer Sentiment unidirectionally Granger-causes Home Sales.

Thurman and Fisher’s 1988 paper “Chickens, Eggs and Causality, or Which Came First?” applies Granger causality to this age-old question concluding that the egg came first. It is a fun, short read with tongue-in-cheek future research directions like potential applications of Granger causality to “He who laughs last, laughs best” and “Pride goeth before destruction and an haughty spirit before a fall.” and proof that journal papers and statistics can be fun. As for more serious reading, Judea Pearl’s “Causality: Models, Reasoning and Inference” is a great resource on the topic of causality.


Installing CausalImpact (bsts is a dependency but at the time of this post the default version you’d get from CRAN still had problems):

install.packages("https://cran.r-project.org/bin/windows/contrib/3.2/bsts_0.6.2.zip",repos = NULL, type = "local")

R, Visualization

Time Series Clustering in Tableau using R

Clustering is a very common data mining task and has a wide variety of applications from customer segmentation to grouping of text documents. K-means clustering was one of the examples I used on my blog post introducing R integration back in Tableau 8.1. Many others in Tableau community wrote similar articles explaining how different clustering techniques can be used in Tableau via R integration.

One thing I didn’t see getting much attention was time series clustering and using hierarchical clustering algorithms. So I thought it might be good to cover both in single post.

Let’s say you have multiple time series curves (stock prices, social media activity, temperature readings…) and want to group similar curves together. Series don’t necessarily align perfectly, in fact they could be even about events that are happening at different pace.

Time series clustering in Tableau using R

There are of course many algorithms to achieve this task but R conveniently offers a package for Dynamic Time Warping. Below is what my calculated field in Tableau looks like.

Calculation for dynamic time warping in Tableau

I started by loading the dtw package, then converted my data from long table format (all time series are in the same column) to wide table format (each series is a separate column).

I then computed distance matrix using dtw as my method and applied hierarchical clustering with average linkage. This gives a tree which I then prune to get the desired number of clusters.

And finally, data is converted back into a long table before being pulled back into Tableau.

Screenshot is from Tableau 10, but don’t worry if you’re not on the Beta program. You can download Tableau 8.1 version of the workbook from HERE. Enjoy!

R, Visualization

Two new great blog posts from Tableau community

Two of my favorite new Tableau blogs are “Data * Science + R” and “Data + Science”. Former is a great resource if you want to learn how to leverage R inside Tableau by walking through sample use cases. Examples are interesting and very thoroughly explained. This year’s first post was about change-point analysis in Tableau. This was a topic I heard from several of our customers and had been planning to blog about but haven’t had the chance.

Change-point analysis in Tableau

“Data + Science” is another awesome blog. Jeff has been doing some great work with various types of undirected graphs. Yesterday he published a blog post about tree diagrams as a follow up to his beautiful Sankey diagram post from a few months earlier.

Tree diagram and Sankey diagram in Tableau

What I love about Tableau is the expressive power and flexibility of the product which combined with such a great user community leads to something new and beautiful almost everyday not to mention countless learning opportunities and cross-pollination. Take the graph below for example. Olivier Catherin took Jeff’s Sankey Diagram idea as an inspiration and further expanded upon it to create the following graph.

Tree Diagram in Tableau