Using your R models for in-database scoring

Let’s say you trained a predictive model using R integration in Tableau or your favorite R authoring environment.

If your data is very large you probably used a subset of your data for training and want to see how well it will do over the entire database.Or maybe you’re satisfied with the model but you want it to return you only the interesting results from the database e.g. customers who are most likely to churn.

In both scenarios, being able to push the model to the database means being able to rely on the database’s scalability and avoiding delays caused by data movement.

You may be asking, “but what if our database systems can’t run predictive models?” Luckily once trained, some of the very commonly used predictive methods can be expressed in SQL hence also in Tableau’s calculation language. All you need is a way to convert the model into a Tableau calculated field.

If you created a decision tree in R, you can use the following function to convert it to a Tableau formula:

decisionTreeToFormula <- function (tree) {

if (class(tree) != "rpart") {
warning(class(tree), " is not supported.Please use an rpart model.")

capture.output({rpartRules <- path.rpart(tree,rownames(tree$frame)[tree$frame$var=="<leaf>"])})
ruleTexts <- "IF "
operators <- c("<=",">=","<",">","=")
i <- 1


for (rule in rpartRules) {
ruleText <- character(0)
for (component in rule) {
whichSeparator <- lapply(operators, function(x) length(unlist(strsplit(component,x)))) > 1
parts <- unlist(strsplit(component,(operators[whichSeparator])[1]))
if(!(parts[1]=="root")) {
if (is.finite(suppressWarnings(as.numeric(parts[2])))) {
ruleText <- c(ruleText,paste("[",parts[1],"]",(operators[whichSeparator])[1],parts[2],sep=""))
} else {
ruleText <- c(ruleText,paste0(" (",paste0("[",parts[1],"]","='",unlist(strsplit(parts[2],",")),"'",collapse=" OR "),")"))

ruleTexts <- c(ruleTexts, paste0(if(length(ruleTexts)>1){"\nELSEIF "}, paste(ruleText,collapse=" AND ")," THEN '" ,mapping_table[mapping_table$node==as.numeric(names
if(i==length(rpartRules)) ruleTexts <- c(ruleTexts,"\nEND")
i <- i +1
tableauFormula <- paste(ruleTexts,collapse=" ")

You can add this to your Rserve.cfg so you can call the decisionTreeToFormula function  in your SCRIPT_ calls from Tableau or you can use it in your R environment. Either way it will give you some text which you can copy-paste into Tableau and use it as a formula that can run on the database.

In the example below, I fit a decision tree on a database of cars trying to predict the gas mileage (Bad|OK|Good). I used only a small portion of variables (just maker and year) to get a compact result back. Applying the function gives me a nice IF..THEN..ELSE formula that handles both numeric and categorical items. Now I can use this classifier on any database. As a dimension, as a filter… by just copy-pasting the formula into a calculated field in Tableau.

> myTree<-rpart(mpg~modelyear+maker,data=cars)
> decisionTreeToFormula(myTree)
IF ([maker]='America') AND [modelyear] < 79 THEN 'Bad'
ELSEIF ([maker]='America') AND [modelyear] >= 79 THEN 'Good'
ELSEIF ([maker]='Asia' OR [maker]='Europe') THEN 'OK'

What about a regression model? This post has that covered, too. You can achieve similar results with the following function :

regressionModelToFormula <- function(model, decimalPlaces) {
if ((class(model)[1] != "glm" && model$family$family != "binomial") & class(model)[1] != "lm") {
warning("Function is not applicable to this model type.")
factors=c("",gsub("[-^0-9]", "", names(unlist(model$xlevels))))[-1]
coeff <- coefficients(model);
catVars <-merge(names,varCoeffMap,all.x=TRUE)

if (nrow(catVars) !=0) {
contVars <- round(coeff[-which(names(coeff) %in% catVars$varNames)],decimalPlaces)
} else {
contVars <- round(coeff,decimalPlaces)
catFormula <- " "

for(catVar in unique(catVars$factors)){
curVar <- catVars[catVars$factors==catVar & !is.na(catVars$coefficients) ,]
catFormula <- paste("WHEN",curVar$levels, "THEN",round(curVar$coefficients,decimalPlaces), sep=" ", collapse="\n")
catFormula <- paste("+ CASE [",catVar,"] \n",catFormula," ELSE 0 \n END \n", sep="")

numericFormula <- paste(contVars[-1], "*" ,paste("[",names(contVars[-1]),"]",sep=""), "+", collapse = " ")
if (class(model)[1]=="lm") {
combinedFormula <- paste(numericFormula,contVars[1],catFormula)
combinedFormula <- paste("1 / (1 + exp(", numericFormula, contVars[1],catFormula, "))")
tableauFormula <- gsub(pattern=":", replacement="*", x=combinedFormula)

If you call this function from R on the logistic regression model I used in one of my earlier blog posts, below is what you will get. Copy-paste this into a Tableau calculated field and you’re ready to make predictions with your logistic regression model in your database.

> regressionModelToFormula(lrmodel,5)
1 / (1 + exp( 0.00628 * [gre] + 1.60393 * [gpa] -8.82164 +
CASE [rank]
WHEN 2 THEN -1.10904
WHEN 3 THEN -1.31259
WHEN 4 THEN -2.01912

And if you apply this function on the multiple linear regression model in the last sheet of the example workbook from my last blog post, what you get will resemble the following:

> regressionModelToFormula(mrmodel, 4)
1.6224 * [Economic Indicator X] + 0.597 * [Economic Indicator Y] -37.1103

I hope you find this useful.


One thought on “Using your R models for in-database scoring

  1. Pingback: Quick Tip : Overlaying curves on Tableau scatter plots with R « Bora Beran

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s