When the data has a lot of features that interact in complicated non-linear ways, it is hard to find a global regression model, i.e. a single predictive formula that holds over the entire dataset. An alternative approach is to partition the space into smaller regions, then into sub-partitions (recursive partitioning), until each chunk can be explained with a simple model.
There are two main types of decision trees:
Classification trees: The predicted outcome is the class to which the data belongs.
Regression trees: The predicted outcome is a continuous variable, e.g. a real number such as the price of a commodity.
There are many ensemble machine learning methods that take advantage of decision trees. Perhaps the best known is the Random Forest classifier, which constructs multiple decision trees and outputs the class that corresponds to the mode of the classes output by the individual trees.
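To make the "mode of the classes" idea concrete, here is a minimal sketch in R. It is plain bagging of rpart trees with a majority vote, not a full random forest (which would also sample a random subset of features at each split). The number of trees and the use of the fgl forensic glass data from the MASS package as a stand-in dataset are my assumptions for illustration.

```r
# Sketch only: bagged rpart trees combined by majority vote.
# Assumes the rpart and MASS packages (both ship with standard R installs).
library(rpart)
library(MASS)   # provides the fgl forensic glass data used as a stand-in

set.seed(42)
n_trees <- 25   # illustrative choice

# Fit each tree on a bootstrap sample (rows drawn with replacement)
trees <- lapply(seq_len(n_trees), function(i) {
  boot <- fgl[sample(nrow(fgl), replace = TRUE), ]
  rpart(type ~ ., data = boot, method = "class")
})

# One column of predicted classes per tree, one row per observation
votes <- sapply(trees, function(tr) {
  as.character(predict(tr, newdata = fgl, type = "class"))
})

# The ensemble prediction is the most frequent class in each row
# (ties go to the class that comes first alphabetically)
majority <- apply(votes, 1, function(v) names(which.max(table(v))))

# Share of training rows where the vote matches the recorded type
mean(majority == as.character(fgl$type))
```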
All the data used in the examples below are retrieved from the UC Irvine Machine Learning Repository. In the example workbook you can find links to documentation and credits for each dataset.
The first dataset contains the composition (Sodium, Calcium, Magnesium… content) and physical properties (RI – Refractive Index) of samples of different types of glass. If you were to find broken glass at a crime scene, would you be able to tell what type of glass it is?
Here is how you can create a classification tree to answer this question:
This calculation will just work as long as you have Rserve running and the rpart package installed. It is self-contained: it loads the library, trains the model and then uses the model to make predictions, all within the same calculation. The model fit happens in the part underlined in orange, the tree is pruned to the level where the minimum error occurs in the part underlined in blue, and finally the prediction happens in the part underlined in green.
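The three steps described above (fit, prune at the minimum error, predict) can be sketched in plain R as follows. This is not the exact calculation from the workbook. I am assuming the fgl data from the MASS package as a stand-in for the glass dataset, and that "minimum error" means the cross-validated error (xerror) in rpart's cptable. Inside Tableau the script would be passed as a string to a SCRIPT_STR calculation, with the fields supplied as .arg1, .arg2 and so on, instead of a ready-made data frame.

```r
# Sketch only: fit, prune and predict with rpart on a stand-in dataset.
library(rpart)
library(MASS)   # provides fgl: RI, Na, Mg, Al, Si, K, Ca, Ba, Fe, type

set.seed(1)     # rpart's internal cross-validation is random

# 1. Fit a classification tree
fit <- rpart(type ~ RI + Na + Mg + Al + Si + K + Ca + Ba + Fe,
             data = fgl, method = "class")

# 2. Prune to the complexity value with the lowest cross-validated error
best_row <- which.min(fit$cptable[, "xerror"])
best_cp  <- fit$cptable[best_row, "CP"]
pruned   <- prune(fit, cp = best_cp)

# 3. Predict the class for each row of the training data
pred <- predict(pruned, type = "class")

# Cross-tabulate predictions against the recorded glass type
table(predicted = pred, actual = fgl$type)
```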
Alternatively, you could use the control setting in the rpart function to impose some rules on splits. In the example below, control=rpart.control(minsplit=10, cp=0.001) requires at least 10 items in a node before attempting a split, and requires that a split decrease the overall lack of fit by a factor of at least 0.001 (the complexity parameter, cp).
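As a rough illustration of what these settings do, the sketch below fits the same tree with rpart's defaults (minsplit=20, cp=0.01) and with the looser settings mentioned above, then counts the leaves in each. The fgl data from the MASS package is again only a stand-in for the glass dataset.

```r
# Sketch only: compare default split rules with looser ones.
library(rpart)
library(MASS)   # provides the fgl stand-in dataset

ctrl <- rpart.control(minsplit = 10, cp = 0.001)

fit_default <- rpart(type ~ ., data = fgl, method = "class")
fit_loose   <- rpart(type ~ ., data = fgl, method = "class", control = ctrl)

# Leaves are the rows of the frame whose split variable is "<leaf>"
count_leaves <- function(tree) sum(tree$frame$var == "<leaf>")

count_leaves(fit_default)
count_leaves(fit_loose)   # looser rules typically yield a larger tree
```

A larger tree is not necessarily a better one, which is why pruning or a holdout check is still worthwhile after relaxing these settings.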
You may have noticed that in this example we are using the same dataset for training and prediction, not to mention that we are not looking at any diagnostic measures to gauge the quality of fit. I did this firstly because I wanted to provide a simple, self-contained example that just works, and secondly because the purpose of this blog post is to show how a particular technique can be used from within Tableau, as opposed to providing guidelines for model fitting and selection. In real-life scenarios, you would train the model using one “tagged” dataset, verify the quality, save the trained model and use it to predict with different datasets serving as input. The save and reuse workflow has been explained in detail with an example in my earlier blog post about logistic regression with Tableau and R. But I haven’t talked about how you can use the same approach to iterate on the model by going back and forth between your favorite R development environment and Tableau as needed, before you deploy the final model for everyone to use via a Tableau dashboard.
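For reference, the "train on one dataset, verify on another" step might look like the sketch below. The 70/30 split, the seed and the fgl stand-in data from the MASS package are my assumptions for illustration, and accuracy is only one of several diagnostics you could look at.

```r
# Sketch only: hold out part of the data to check the quality of the fit.
library(rpart)
library(MASS)   # provides the fgl stand-in dataset

set.seed(7)
train_idx <- sample(nrow(fgl), size = round(0.7 * nrow(fgl)))
train <- fgl[train_idx, ]    # "tagged" data used to fit the model
test  <- fgl[-train_idx, ]   # data the model has not seen

fit  <- rpart(type ~ ., data = train, method = "class")
pred <- predict(fit, newdata = test, type = "class")

# Confusion matrix and overall accuracy on the held-out rows
conf_mat <- table(predicted = pred, actual = test$type)
accuracy <- mean(as.character(pred) == as.character(test$type))

conf_mat
accuracy
```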
You will notice a sheet in the example workbook named SaveTree. This sheet contains a calculated field that has the following line:
which saves the fitted model to a file on my computer. Once you update the file path to a valid path on your computer and the model saves successfully, you can easily open it from your R environment to examine the tree structure and relevant statistics, prune the tree, etc.
Once you’re done, you can run the same line from your R console to overwrite the file, then read it from Tableau to use for predictions.
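A minimal sketch of this round trip is shown below. The file name, the use of a temporary directory and the pruning threshold are hypothetical, and I am assuming save() and load() here. The line in the workbook may differ.

```r
# Sketch only: save a fitted tree, reload it, refine it, save it again.
library(rpart)
library(MASS)   # provides the fgl stand-in dataset

fit <- rpart(type ~ ., data = fgl, method = "class")

# Hypothetical location; replace with a valid path on your computer
model_path <- file.path(tempdir(), "glass_tree.rda")
save(fit, file = model_path)

# Later, from your R environment: load() restores the object named `fit`
rm(fit)
load(model_path)
printcp(fit)    # complexity table with cross-validated errors

# Refine the model (cp = 0.05 is an arbitrary illustrative value) and
# overwrite the file so that Tableau picks up the updated tree
fit <- prune(fit, cp = 0.05)
save(fit, file = model_path)
```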
There is a dedicated sheet in the example workbook that uses the random forest approach on the same dataset. It uses the randomForest package, which you will need to install. However, the syntax is almost identical to the classification tree example above, so I will skip over it and move on to regression trees right away.
For the regression tree example, I will use a separate dataset which contains make, model year, displacement, # cylinders etc. for different cars, based on which we will try to predict MPG.
Once again using the rpart package, the calculation can be written as follows. It is similar to the classification tree script, and the minor differences are highlighted. Note that discrete variables such as make, origin and # cylinders are used as factors.
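In plain R, the regression version might look like the sketch below. I am using the built-in mtcars data as a stand-in because the car dataset from the workbook is not bundled with R, so the column names (mpg, cyl, disp, hp, wt) are those of mtcars rather than of the original data. The minsplit value is an arbitrary choice for this small dataset.

```r
# Sketch only: regression tree on a stand-in car dataset.
library(rpart)

cars <- mtcars
cars$cyl <- factor(cars$cyl)   # treat number of cylinders as a discrete factor

# method = "anova" requests a regression tree instead of a classification tree
fit <- rpart(mpg ~ cyl + disp + hp + wt, data = cars, method = "anova",
             control = rpart.control(minsplit = 10))

# Each prediction is the mean mpg of the leaf the car falls into
pred <- predict(fit)

head(data.frame(actual = cars$mpg, predicted = pred))
```

The main differences from the classification sketch are the numeric response and method = "anova"; predict() then returns numbers rather than class labels.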
After some pruning, the regression tree looks like the following when you draw it in R, where each leaf node shows the predicted MPG value for the cars that fall into that group.
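If you want to draw a tree like this yourself, a minimal sketch follows. It uses the built-in mtcars data as a stand-in for the car dataset, and the pruning threshold and plot settings are arbitrary illustrative choices.

```r
# Sketch only: prune a regression tree and draw it.
library(rpart)

cars <- mtcars
cars$cyl <- factor(cars$cyl)

fit <- rpart(mpg ~ cyl + disp + hp + wt, data = cars, method = "anova",
             control = rpart.control(minsplit = 10, cp = 0.001))

# Remove splits that improve the fit by less than the chosen threshold
pruned <- prune(fit, cp = 0.05)

# Draw the tree; each leaf is labelled with its mean mpg and row count
plot(pruned, uniform = TRUE, margin = 0.1)
text(pruned, use.n = TRUE)
```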
You can find the Tableau workbook with the examples HERE.