I will be one of the panelists at Washington Technology Industry Association’s event on Predictive Analytics on March 27th with Christian Kelly of Accenture and Stephen Purpura of Context Relevant.
In 2010, Lustig et al. of IBM classified business analytics workloads into three major categories in their paper “The Analytics Journey”:
Descriptive Analytics: A set of technologies and processes that use data to understand and analyze business performance
Predictive Analytics: The extensive use of data and mathematical techniques to uncover explanatory and predictive models of business performance representing the inherent relationship between data inputs and outputs/outcomes.
Prescriptive Analytics: A set of mathematical techniques that computationally determine a set of high-value alternative actions or decisions given a complex set of objectives, requirements, and constraints, with the goal of improving business performance.
You can find the full paper here. For those who don’t have time to read the whole paper: in short, the journey starts with the question “What is going on?”, moves on to “What is likely to happen in the future?”, and finishes with “What can we do to make a certain outcome more likely?”
When you start talking about constraints and improving performance, it starts sounding awfully similar to mathematical/computational optimization, don’t you think? It should. Because optimization is a major part of it.
Imagine that, in your business, you discovered a relationship between profit, the frequency of promotions, and the promotion amount. Given your customer base growth, you can forecast your profit for the next year, and even run what-if scenarios by changing these parameters manually to see how they impact the forecast. The more promotions you run, the more people buy, so you see increased profit because of the volume; but after a certain point, you probably get diminishing returns. What is the sweet spot? Prescriptive analytics tools can tell you what combination of these parameters maximizes your profit.
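As a toy illustration of that sweet spot, here is a sketch in R using optim on a made-up profit model; the volume curve, margin, and bounds below are invented for the example, not taken from any real business:

```r
# Hypothetical model: volume grows with promotion frequency and discount
# amount, but margin shrinks as the discount grows
profit <- function(p) {
  freq   <- p[1]  # promotions per year
  amount <- p[2]  # discount as a fraction of the price
  volume <- 1000 * log1p(freq) * (1 + 2 * amount)  # diminishing returns
  margin <- 20 * (1 - amount)
  volume * margin
}

# optim minimizes by default, so negate the function to maximize profit
best <- optim(c(1, 0.1), function(p) -profit(p), method = "L-BFGS-B",
              lower = c(0, 0), upper = c(52, 0.9))
best$par  # the frequency/amount combination that maximizes modeled profit
```

With this particular (fabricated) model, the optimizer pushes frequency to its upper bound and settles the discount where the volume gain stops paying for the margin loss.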
Let’s look at an example of how to do this in Tableau. Here, I have annual stock returns for a number of companies over 20 years. We will be using a metric called the Sharpe ratio, which characterizes how well the return of an asset compensates the investor for the risk taken. Our goal is to find the portfolio that maximizes the Sharpe ratio.
We will use R’s optim function, which comes with the base R package, so there is no need to install or load any libraries.
Our data is in long format; that is, we have one column for all returns, a column that associates each row with a particular stock, and a column with the year. Our calculation needs a matrix, so we will first split the long table into chunks and then bind each chunk as a new column (cbind) next to the others. The SharpeRatio function contains the logic for calculating the Sharpe ratio, and it takes the percentages of each stock in your portfolio as input. You may have noticed the constant 0.05, which is my assumption for the risk-free interest rate: the theoretical rate of return an investor would expect from an absolutely risk-free investment. For example, US treasury bonds are often considered risk-free. Some of the optimization constraints are also captured in the function. Since each stock is associated with a percentage of the total portfolio value, these percentages have to be non-negative and add up to 1. We enforce the range of each percentage being between 0 and 1 using an if condition, and when the condition is not met we introduce a high penalty (1e7) so our optimizer avoids those regions. Adding up to 1 is enforced by the line p<-c(x,1-sum(x)). The function takes one fewer number than the total number of stocks selected by the user, and the missing value is computed by subtracting the sum of the others from 1, making sure everything sums up to 1.
Next we call our optim function with a vector of zeros as initial values. I used L-BFGS-B as the optimization method and once again provided upper/lower limits for the ratios. You can try a different method like Nelder-Mead, skip the upper/lower limits, and still get very similar results. Note that by default optim tries to minimize the result of the provided function. You can pass an argument (control = list(fnscale = -1)) to have it maximize the goal, or simply negate the result of your function. I did the latter; hence my penalty for stepping out of bounds is a large positive number, while my Sharpe ratio is multiplied by -1.
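Since the actual calculation lives in a screenshot, here is a self-contained sketch of the same logic in plain R; the simulated returns (the seed, the 8% mean, the 15% volatility, and the four made-up tickers) are assumptions standing in for the stock data that Tableau would pass in as .arg columns:

```r
# Simulated stand-in for the workbook's data: annual returns for 4 stocks
# over 20 years, in long format (one return column, plus stock and year)
set.seed(1)
long <- data.frame(
  Return = rnorm(80, mean = 0.08, sd = 0.15),
  Stock  = rep(c("A", "B", "C", "D"), each = 20),
  Year   = rep(1:20, times = 4)
)

# split the long table into per-stock chunks, then cbind them into a matrix
returns <- do.call(cbind, split(long$Return, long$Stock))

SharpeRatio <- function(x) {
  p <- c(x, 1 - sum(x))          # missing weight makes the total exactly 1
  if (any(p < 0) || any(p > 1))  # weights must stay between 0 and 1
    return(1e7)                  # high penalty so the optimizer avoids this
  portfolio <- returns %*% p
  # 0.05 is the assumed risk-free rate; optim minimizes, hence the minus sign
  -(mean(portfolio) - 0.05) / sd(portfolio)
}

# one fewer initial value than the number of stocks; bounds as in the post
opt <- optim(rep(0.1, 3), SharpeRatio, method = "L-BFGS-B",
             lower = 0, upper = 1)
weights <- c(opt$par, 1 - sum(opt$par))  # optimized portfolio allocation
```

In the workbook, the equivalent logic sits inside a SCRIPT_ calculated field, with the Tableau columns arriving as .arg1, .arg2, and so on.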
Below is a screenshot showing an optimized portfolio for these 4 particular stocks based on 20 years of historical data.
You can download the workbook HERE. Enjoy!
When the data has a lot of features that interact in complicated, non-linear ways, it is hard to find a global regression model, i.e. a single predictive formula that holds over the entire dataset. An alternative approach is to partition the space into smaller regions, then into sub-partitions (recursive partitioning), until each chunk can be explained with a simple model.
There are two main types of decision trees:
Classification trees: The predicted outcome is the class the data belongs to.
Regression trees: The predicted outcome is a continuous variable, e.g. a real number such as the price of a commodity.
There are many ensemble machine learning methods that take advantage of decision trees. Perhaps the best known is the Random Forest classifier that constructs multiple decision trees and outputs the class which corresponds to the mode of the classes output by individual trees.
All the data used in the examples below are retrieved from UC Irvine Machine Learning Repository. In the example workbook you can find links to documentation and credits for each dataset.
The first dataset contains the composition (sodium, calcium, magnesium… content) and physical properties (RI, the refractive index) of samples of different types of glass. If you were to find broken glass at a crime scene, would you be able to tell what it is?
Here is how you can create a classification tree to answer this question:
This calculation will just work as long as you have Rserve running and the rpart package installed, since it is self-contained: it loads the library, trains the model, and then uses the model to make predictions within the same calculation. The model fit happens in the part underlined in orange, the tree is pruned to the level where the minimum error occurs in the part underlined in blue, and finally the prediction happens in the part underlined in green.
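Since the calculation itself appears as an image, here is a rough, self-contained approximation of the same fit-prune-predict sequence in plain R, using the built-in iris data as a stand-in for the glass dataset:

```r
library(rpart)  # ships with R as a recommended package

# Fit a classification tree (stand-in data: predict iris species)
fit <- rpart(Species ~ ., data = iris, method = "class")

# Prune to the complexity value with the minimum cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)

# Predict classes for the same rows used for training, as in the example
pred <- predict(pruned, iris, type = "class")
```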
Alternatively, you could use the control setting in the rpart function to impose some rules on splits. In the example below, control=rpart.control(minsplit=10, cp=0.001) requires at least 10 items in a node before attempting a split, and requires that a split decrease the overall cost complexity factor by at least 0.001.
You may have noticed that in this example we are using the same dataset for training and prediction, not to mention that we are not looking at any diagnostic measures to gauge the quality of the fit. I did this firstly because I wanted to provide a simple, self-contained example that just works, and secondly because the point of this blog post is to show how a particular technique can be used from within Tableau, as opposed to providing guidelines for model fitting and selection. In real-life scenarios, you would train the model using one “tagged” dataset, verify its quality, save the trained model, and use it to predict with different datasets serving as input. The save-and-reuse workflow has been explained in detail, with an example, in my earlier blog post about logistic regression with Tableau and R. But I haven’t talked about how you can use the same approach to iterate on the model, going back and forth between your favorite R development environment and Tableau as needed, before you deploy the final model for everyone to use via a Tableau dashboard.
You will notice a sheet in the example workbook named SaveTree. This sheet contains a calculated field that has the following line:
which saves the fitted model to a file on my computer. Once you update the file path to a valid path on your computer and it successfully saves, you can easily open the model from your R environment and examine the tree structure and relevant statistics, prune the tree etc.
Once you’re done you can run the same line from your R console and overwrite the file, then read it from Tableau to use for predictions.
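A sketch of that round trip in plain R (the saveRDS/readRDS pair and the temporary path are my illustrative choices; the workbook's actual line may use save/load and a fixed path on your machine):

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")

# What the SaveTree calculated field does: write the fitted model to disk
path <- file.path(tempdir(), "mytree.rds")
saveRDS(fit, path)

# ...later, from your R console or another calculated field: read it back,
# examine or prune the tree, and use it for predictions
reloaded <- readRDS(path)
pred <- predict(reloaded, iris, type = "class")
```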
There is a dedicated sheet in the example workbook that uses the random forest approach on the same dataset. It uses the randomForest package, which you will need to install. However, the syntax is almost identical to the classification tree example above, so I will skip over it and move on to regression trees right away.
For the regression tree example, I will use a separate dataset that contains the make, model year, displacement, number of cylinders, etc. for different cars, based on which we will try to predict MPG.
Once again using the rpart package, the calculation can be written as follows. It is similar to the classification tree script; the minor differences are highlighted. Note that discrete variables such as make, origin, and number of cylinders are used as factors.
After some pruning, the regression tree looks like the following when you draw it in R where leaf nodes show the MPG values for cars classified under that group.
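A minimal stand-in for the regression tree calculation, using R's built-in mtcars data in place of the cars dataset (the choice of predictors here is mine, for illustration); note the anova method and the factor conversion for the discrete variables:

```r
library(rpart)

cars <- mtcars
cars$cyl <- factor(cars$cyl)  # discrete variables become factors
cars$am  <- factor(cars$am)

# method = "anova" gives a regression tree; the outcome (mpg) is continuous
fit <- rpart(mpg ~ cyl + disp + hp + wt + am, data = cars, method = "anova")

# each prediction is the mean MPG of the leaf node the car falls into
pred <- predict(fit, cars)
```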
You can find the Tableau workbook with the examples HERE.
A common question about the R integration feature in Tableau 8.1 is how to use it with the forecasting packages available in R. This is an interesting question for two reasons:
SCRIPT_ functions are calculated fields, and calculated fields don’t add rows to the table, while adding rows is exactly what forecasting is all about.
Forecast packages normally return just the forecast results, which contain fewer rows than the input data and, at the same time, don’t make sense to join back in.
Here is a relevant thread from Tableau forums.
Before we move any further, I would like to emphasize that Tableau has built-in forecast functionality which makes most of what we will cover here possible with the click of a button, and which comes with many more convenience features. The intent of this blog post is to show an alternative approach which, while more complicated, can be useful in some scenarios.
The example in the forums uses the fpp package. For variety, in this blog post I will be using the forecast package. It is worth noting that we worked very closely with Prof. Hyndman, the author of R’s forecast package, when developing the forecasting feature in Tableau 8.
Let’s start with visualizing the fit and residuals for your in-sample forecast. You will find this in the first sheet of the workbook linked at the end of the post.
We have two time series in this chart. To reduce the network traffic, I retrieved both columns in a single call to Rserve. Let’s take a look at the calculation.
First, we load the library. For this to work, please make sure the library and its dependencies are installed. You can find more about how to preload libraries for better performance in my posts here and here. Then we create a time series object (ts) using the [Earnings] column (.arg1); since it is quarterly, our deltat is 1/4, and our time series starts at the beginning of 1960. You will notice the paste function taking two columns from the resulting forecast (fit and residuals) and concatenating them using ~ as the separator, which allows us to retrieve both columns at once. Then, in the calculated fields named Fit and Residuals, we tokenize the results. Only one of the calculated fields is shown below for brevity.
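A rough sketch of this calculation outside Tableau; I substitute base R's arima for the forecast package, and the built-in JohnsonJohnson series (quarterly earnings per share, starting in 1960) for the [Earnings] column, so the model choice below is an assumption:

```r
# .arg1 mimics the [Earnings] column Tableau would pass to the script
.arg1 <- as.numeric(datasets::JohnsonJohnson)

# quarterly series starting at the beginning of 1960, so deltat = 1/4
earnings <- ts(.arg1, start = 1960, deltat = 1/4)
fit <- arima(earnings, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1)))

fitted_vals <- earnings - residuals(fit)  # in-sample fit

# concatenate fit and residuals with "~" so one Rserve call returns both
combined <- paste(fitted_vals, residuals(fit), sep = "~")

# a calculated field like Fit then tokenizes its piece back out
fit_col <- as.numeric(sapply(strsplit(combined, "~"), `[`, 1))
```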
So far we’ve been dealing with the in-sample forecast, so we haven’t yet touched the core question about how to retrieve the future predictions into Tableau. Without further ado, let’s jump to the more exciting part. I will demonstrate two approaches. The first one is domain shifting which is shown in the second sheet.
Our calculation, shown above, is very similar to the one we looked at earlier. But this time we are asking for a forecast (mean) of length (h=) set by the Tableau parameter named [PeriodsToForecast], and using the append function to add the predicted rows to the end of the [Earnings] column (which answers question 2 above) while dropping just as many rows from the top. The result is shown below. Since we forecast 10 quarters, our time series now extends 2.5 years beyond 1980, but our chart doesn’t start at 1960 anymore.
We achieve this by also shifting the date axis by the same amount, as shown below.
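The shifting itself can be sketched in base R like this (again with arima/predict standing in for the forecast package's functions, and a plain h in place of the [PeriodsToForecast] parameter):

```r
h <- 10  # plays the role of the [PeriodsToForecast] parameter

earnings <- datasets::JohnsonJohnson  # quarterly earnings, 1960-1980
fit <- arima(earnings, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1)))

# point forecasts (what the forecast package exposes as "mean")
fc <- predict(fit, n.ahead = h)$pred

# append the predicted rows at the end and drop as many rows from the top,
# so the number of rows Tableau sees stays the same
shifted <- append(as.numeric(earnings), as.numeric(fc))[-seq_len(h)]
```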
But what if you don’t want to trim rows from the beginning of your data? That’s where densification comes to our aid.
The second approach relies on the fact that Tableau can fill in the missing values given a range. What we will do is to shift the last data point to the date we want our forecast to end. This will create a gap between the last value and the previous and Tableau’s “Show Missing Values” option will automatically add rows to fill that gap.
The last date is moved out to provide the full date extent for forecasting using the following calculated field. Here [MaxDate] is created at the data source level by joining the main table with a sub-query that returns the MAX(Quarter) in the input data.
Our calculated field looks very similar to the previous calculation. The important changes are highlighted: they involve removing the blank rows that densification added from the data passed to R’s forecast function as input, and pulling the shifted last data point back to the correct date before forecasting. Then the results get merged back in using the same append function, but this time without anything getting trimmed.
The result is shown below. For comparison, we are again forecasting 10 quarters (same as above). Now our time series extends 2.5 years beyond 1980 while the chart still starts at 1960.
You can download the example workbook from HERE.
With the increasing amount of user content on the web, text analytics is gaining more mainstream adoption. Sentiment analysis, keyword and named entity extraction are the most common tasks since they allow quickly classifying, filtering and turning text into easily consumable metrics. What do the customers like most about your product, what do they not like? Do people perceive your brand more positively or negatively compared to last year?
With the new R integration feature in Tableau 8.1, it is very easy to add this functionality to your dashboards. There are currently two packages in R that can be used for this purpose: sentiment and qdap. In this post we will use sentiment. This package requires the tm and Rstem packages, so first you’ll need to install those. You can do this by typing the commands below into your R console (or RStudio, if that’s the IDE of your choice).
It may be difficult to find the right versions of Rstem and sentiment. If you already have these packages you can skip to the next step.
Let’s take a first stab using the classify_polarity function. The Comment Text column contains reviews for a hypothetical product. We are using our calculated field Sentiment for both the text and the color coding, as it returns one of three classifications: negative, neutral, or positive.
You will notice that the results are not perfect. The second row from the bottom is in fact a negative comment about delayed delivery, but it is classified as a positive comment. More on that later. Now let’s have a look at what the calculated field looks like.
As you can see, the R script is very simple. We are calling the function and retrieving the column corresponding to best_fit. Another method in this package is classify_emotion, which classifies text into emotions such as anger, joy, fear… The function call is very similar, but we retrieve a different dimension from the results this time. In particular, the two lines associated with the emotion “fear” look far off. But how does this work, and how can it be made better?
Sentiment analysis techniques can be classified into two high level categories:
Lexicon based: This technique relies on dictionaries of words annotated with their orientation, described as polarity and strength (e.g. negative and strong), based on which a polarity score for the text is calculated. This method gives high-precision results as long as the lexicon used has good coverage of the words encountered in the text being analyzed.
Learning based: These techniques require training a classifier with examples of known polarity, presented as text classified into positive, negative, and neutral classes.
R’s sentiment package follows a lexicon-based approach, hence we were able to get right into the action, given that it comes with a lexicon for English. In your R package library, under the \sentiment\data folder, you can find the lexicon as a file named subjectivity.csv.gz.
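To see how a lexicon-based scorer works in principle, here is a toy version in base R; the six-word lexicon and the sign-of-the-sum rule are deliberate oversimplifications of what the sentiment package actually does:

```r
# A tiny polarity lexicon: +1 for positive words, -1 for negative ones
lexicon <- c(great = 1, love = 1, easy = 1,
             awful = -1, delayed = -1, scam = -1)

classify_polarity_toy <- function(text) {
  words <- tolower(unlist(strsplit(text, "[^A-Za-z]+")))
  score <- sum(lexicon[words], na.rm = TRUE)  # unknown words score NA
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

classify_polarity_toy("What a scam.")           # → "negative"
classify_polarity_toy("Easy to use, love it!")  # → "positive"
```

Words missing from the lexicon contribute nothing to the score, which is why lexicon coverage matters so much.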
The text that was incorrectly classified as having positive polarity is the following: “Took 4 weeks to receive it even though I paid for 2 day delivery. What a scam.” If you open the file, as you probably suspected, you will find that scam is not a word in the lexicon. Let’s add the following line to the file,
then save, zip the file, restart RServe and refresh our workbook.
Now you can see that the text is correctly classified as expressing negative sentiment. When using lexicon-based systems, adding new words to the lexicon or switching to a completely new lexicon are potential paths to follow if you are not getting good results. Incorrect classifications are more likely if slang, jargon, or colloquial words are used in the text you’re analyzing, since these are not covered extensively in common lexica.
You can download the workbook containing the example HERE.
In my post on the Tableau website earlier this year, I included an example of multiple linear regression analysis to demonstrate taking advantage of parameters to create a dashboard that can be used for what-if analysis. In that example, the model fitting was done inside the same calculation as the predictions, in order to provide a self-contained example that just works. However, in real life, once created, models are reused for predictions many times until (if at all) they get updated. Let’s take logistic regression as an example to examine reusability.
In this sample, we will use (generated) admissions data that contains GPA, GRE score, rank of the institution (1 being top-tier schools, 4 the lowest tier), and whether the student was admitted or not. We will use a logistic regression model to predict the probability of admission for a given GPA, GRE, and rank combination.
If we were to do fitting and prediction inside the same calculated field, it would look like:
Here “binomial” indicates that we want a logistic regression model.
There are many ways to save your model for future use. You can go through the model fitting inside R (using your favorite IDE e.g. R GUI or RStudio etc.) and save it, then just load the model into Tableau or you can save it to a file directly from within Tableau. If you were doing this in R (assuming you have the data as a CSV), it would look like
This is probably what most people would do. And of course evaluate the goodness-of-fit for the model before using it to make predictions.
But if you don’t want to switch between R and Tableau and instead want to do this inside Tableau, it is just a matter of adding the last line to your calculated field so it looks like:
You would probably do this once to save the model and after the calculation re-executes, remove it from the calculated field. Your calculation reading a previously saved model would look like the following:
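Here is the whole save-once, reuse-later workflow as a self-contained sketch in plain R; the simulated admissions data (the column names, coefficients, and seed) is an assumption standing in for the post's generated dataset, and saveRDS/readRDS stand in for whichever save mechanism the workbook uses:

```r
set.seed(42)
n <- 400
admissions <- data.frame(
  gre  = round(rnorm(n, 580, 110)),
  gpa  = round(runif(n, 2.5, 4.0), 2),
  rank = factor(sample(1:4, n, replace = TRUE))
)
# simulate admissions that get likelier with GRE/GPA and a better rank
odds <- -10 + 0.004 * admissions$gre + 2 * admissions$gpa -
  0.5 * as.numeric(admissions$rank)
admissions$admit <- rbinom(n, 1, plogis(odds))

# family = "binomial" is what makes glm fit a logistic regression
mylogit <- glm(admit ~ gre + gpa + rank, data = admissions,
               family = "binomial")

# save once; afterwards a calculated field only needs the read + predict
path <- file.path(tempdir(), "mylogit.rds")
saveRDS(mylogit, path)

saved <- readRDS(path)
p <- predict(saved,
             newdata = data.frame(gre = 700, gpa = 3.8,
                                  rank = factor(2, levels = 1:4)),
             type = "response")  # probability of admission
```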
Of course once the model is written, it can be used from R as well as Tableau. An even nicer way to do this is to have the model preloaded on Rserve, readily available for use. Doing this involves making some minor edits to Rserve’s configuration file: Rserv.cfg.
There are two relevant options you can use in Rserv.cfg to achieve this. The “source” option runs the provided R script file. Any objects created in the sourced file become available to all users with access to Rserve. These could be new functions (suppose you wrote a custom function that is rather lengthy and you don’t want to write the whole script inside a Tableau calculated field; you just want to call it by name), or any object that holds data (e.g. read an XDF file in R and have access to its contents from Tableau).
The other option is “eval” which executes the R code to follow. For example if you had
it would create an object that holds the value 5 and load the mvoutlier library.
So your Rserv.cfg file may contain the following lines
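As an illustration (the file paths and the model object name here are hypothetical), the file might contain a source line followed by an eval line, like this:

```
source C:/Users/me/Documents/init.R
eval mymodel <- readRDS("C:/Users/me/Documents/mylogit.rds")
```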
In this case the second line is loading the model we created, so the person using the model for predictions doesn’t have to know where it is saved. If they know the name of the model, they can call it from Tableau. Note that the config file is read when Rserve starts, so if you make changes to the file, you need to restart Rserve for them to take effect. Also, by default Rserve looks for the config file in R’s working directory. In R, you can find out what your working directory is using the getwd() command.
So I put my Rserv.cfg file in the Documents folder, restarted Rserve and voila! I am able to call the predict function directly as shown below, without having to fit the model again or read it from disk.
You can download the example workbook from HERE. The parts that rely on reading models from files and on objects in the Rserve session will not work until you make the necessary changes listed above. Each calculated field also contains comments about the changes needed.
Correlation matrices offer a good way of visualizing similarities between members of your dataset. The data points used in calculating the pair-wise correlations could be of many different kinds. For example, you could be creating a correlation matrix between different commodities and use the price of each commodity over a period of time to calculate the correlations, or you could be comparing products based on their varying qualities. In this example I will use the R sample dataset mtcars to visualize correlations between different makes and models of cars based on their MPG, horsepower, displacement, number of cylinders, weight, etc. Before we move on, here is what the results look like. You can click on the image below to open a live version of it posted on Tableau Public, implemented using table calculations.
The easiest way to get this layout with your data is to create a cross join, which will result in paired combinations of the values of different variables for different cars. But Tableau Zen Master Jonathan Drummey has a brilliant alternative that relies on domain completion. It makes the calculations slightly more complicated, but by not doing a cross join you cut the size of the data significantly, which, while not noticeable in this particular example, will make a difference for large datasets. He was kind enough to share his solution, which you can find in the third sheet of the workbook provided through the link below.
Since we want to treat the values of different variables as parts of a series based on which we will compare two cars, it helps to use a long table, i.e. to put all values into one column and have another column identify what each value is (e.g. MPG vs. horsepower), as opposed to a wide table where each variable is a column of its own. The table calculation’s addressing and partitioning settings are very important for getting the right results.
Given that it is not hard to do this using Tableau’s own table calculations, you may think it is overkill to do it in R. However, what I would like to demonstrate here is something that can be applied more widely; the correlation matrix just happens to be the example that came in handy. If you look at the sheet that was built using R, you will see a pattern that can be useful in calculations where you need to create a matrix from a tabular data source for your analysis, which I also used in my blog post here for the multi-dimensional scaling example. The pattern consists of breaking the long table into the parts that you want to have as separate columns (in this case using R’s split function) and then merging them into a matrix (in this case using cbind and do.call). But for the correlation matrix, it gets a bit more interesting due to the cross join.
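A condensed version of that pattern in plain R against mtcars (z-scoring each metric first is my addition, so that metrics measured in different units contribute comparably to the correlations):

```r
m <- scale(mtcars)  # z-score each metric (mpg, hp, wt, ...)

# long format: one value column plus car and metric identifier columns
long <- data.frame(
  value  = as.vector(t(m)),
  car    = rep(rownames(mtcars), each = ncol(mtcars)),
  metric = rep(colnames(mtcars), times = nrow(mtcars))
)

# the pattern: split the long table into per-car chunks, then merge the
# chunks into a metrics-by-cars matrix with cbind via do.call
wide <- do.call(cbind, split(long$value, long$car))

cors <- cor(wide)  # pair-wise correlation between every two cars
```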
By looking at the colors in the matrix, you can see that Maserati Bora is most similar to Ferrari Dino while Lotus Europa and Cadillac Fleetwood are very different from each other.
You can download the workbook from HERE which contains correlation matrix built both using R integration and purely in Tableau using table calculations.
If you download the workbook, you can use the correlation coefficient as a quick filter to limit what you see to highly or poorly correlated cars, which dynamically resizes your matrix. Or you can put the Variable dimension on a filter to change the number of properties used in calculating the correlation. Would our matrix look different if we were using just the number of carburetors, weight, and MPG instead of all 11 metrics?