Let’s say you trained a predictive model using R integration in Tableau or your favorite R authoring environment.
If your data is very large you probably used a subset of your data for training and want to see how well it will do over the entire database.Or maybe you’re satisfied with the model but you want it to return you only the interesting results from the database e.g. customers who are most likely to churn.
In both scenarios, being able to push the model to the database means being able to rely on the database’s scalability and avoiding delays caused by data movement.
You may be asking, “but what if our database systems can’t run predictive models?” Luckily once trained, some of the very commonly used predictive methods can be expressed in SQL hence also in Tableau’s calculation language. All you need is a way to convert the model into a Tableau calculated field.
If you created a decision tree in R, you can use the following function to convert it to a Tableau formula:
decisionTreeToFormula <- function (tree) { if (class(tree) != "rpart") { warning(class(tree), " is not supported.Please use an rpart model.") stop() } capture.output({rpartRules <- path.rpart(tree,rownames(tree$frame)[tree$frame$var=="<leaf>"])}) ruleTexts <- "IF " operators <- c("<=",">=","<",">","=") i <- 1 levelToCategory<-data.frame(labels=attr(tree,"ylevels"),catkey=rep(1:length(attr(tree,"ylevels")))) nodeToLevel<-data.frame(catkey=tree$frame$yval,node=rownames(tree$frame)) mapping_table<-merge(levelToCategory,nodeToLevel,by='catkey')[,2:3] for (rule in rpartRules) { ruleText <- character(0) for (component in rule) { whichSeparator <- lapply(operators, function(x) length(unlist(strsplit(component,x)))) > 1 parts <- unlist(strsplit(component,(operators[whichSeparator])[1])) if(!(parts[1]=="root")) { if (is.finite(suppressWarnings(as.numeric(parts[2])))) { ruleText <- c(ruleText,paste("[",parts[1],"]",(operators[whichSeparator])[1],parts[2],sep="")) } else { ruleText <- c(ruleText,paste0(" (",paste0("[",parts[1],"]","='",unlist(strsplit(parts[2],",")),"'",collapse=" OR "),")")) } } } ruleTexts <- c(ruleTexts, paste0(if(length(ruleTexts)>1){"\nELSEIF "}, paste(ruleText,collapse=" AND ")," THEN '" ,mapping_table[mapping_table$node==as.numeric(names (rpartRules)[i]),]$labels,"'")) if(i==length(rpartRules)) ruleTexts <- c(ruleTexts,"\nEND") i <- i +1 } tableauFormula <- paste(ruleTexts,collapse=" ") return(tableauFormula) }
You can add this to your Rserve.cfg so you can call the decisionTreeToFormula function in your SCRIPT_ calls from Tableau or you can use it in your R environment. Either way it will give you some text which you can copy-paste into Tableau and use it as a formula that can run on the database.
In the example below, I fit a decision tree on a database of cars trying to predict the gas mileage (Bad|OK|Good). I used only a small portion of variables (just maker and year) to get a compact result back. Applying the function gives me a nice IF..THEN..ELSE formula that handles both numeric and categorical items. Now I can use this classifier on any database. As a dimension, as a filter… by just copy-pasting the formula into a calculated field in Tableau.
> myTree<-rpart(mpg~modelyear+maker,data=cars) > decisionTreeToFormula(myTree) IF ([maker]='America') AND [modelyear] < 79 THEN 'Bad' ELSEIF ([maker]='America') AND [modelyear] >= 79 THEN 'Good' ELSEIF ([maker]='Asia' OR [maker]='Europe') THEN 'OK' END
What about a regression model? This post has that covered, too. You can achieve similar results with the following function :
regressionModelToFormula <- function(model, decimalPlaces) { if ((class(model)[1] != "glm" && model$family$family != "binomial") & class(model)[1] != "lm") { warning("Function is not applicable to this model type.") stop() } factors=c("",gsub("[-^0-9]", "", names(unlist(model$xlevels))))[-1] levels=c("",as.character(unlist(model$xlevels)))[-1] coeff <- coefficients(model); names=data.frame(factors,levels,varNames=paste(factors,levels,sep="")) varCoeffMap=data.frame(varNames=names(coeff)[-1],coefficients=as.numeric(coeff[-1])) catVars <-merge(names,varCoeffMap,all.x=TRUE) if (nrow(catVars) !=0) { contVars <- round(coeff[-which(names(coeff) %in% catVars$varNames)],decimalPlaces) } else { contVars <- round(coeff,decimalPlaces) } catFormula <- " " for(catVar in unique(catVars$factors)){ curVar <- catVars[catVars$factors==catVar & !is.na(catVars$coefficients) ,] catFormula <- paste("WHEN",curVar$levels, "THEN",round(curVar$coefficients,decimalPlaces), sep=" ", collapse="\n") catFormula <- paste("+ CASE [",catVar,"] \n",catFormula," ELSE 0 \n END \n", sep="") } numericFormula <- paste(contVars[-1], "*" ,paste("[",names(contVars[-1]),"]",sep=""), "+", collapse = " ") if (class(model)[1]=="lm") { combinedFormula <- paste(numericFormula,contVars[1],catFormula) }else{ combinedFormula <- paste("1 / (1 + exp(", numericFormula, contVars[1],catFormula, "))") } tableauFormula <- gsub(pattern=":", replacement="*", x=combinedFormula) return(tableauFormula) }
If you call this function from R on the logistic regression model I used in one of my earlier blog posts, below is what you will get. Copy-paste this into a Tableau calculated field and you’re ready to make predictions with your logistic regression model in your database.
> regressionModelToFormula(lrmodel,5) 1 / (1 + exp( 0.00628 * [gre] + 1.60393 * [gpa] -8.82164 + CASE [rank] WHEN 2 THEN -1.10904 WHEN 3 THEN -1.31259 WHEN 4 THEN -2.01912 ELSE 0 END))
And if you apply this function on the multiple linear regression model in the last sheet of the example workbook from my last blog post, what you get will resemble the following:
> regressionModelToFormula(mrmodel, 4) 1.6224 * [Economic Indicator X] + 0.597 * [Economic Indicator Y] -37.1103
I hope you find this useful.
Pingback: Quick Tip : Overlaying curves on Tableau scatter plots with R « Bora Beran
Can you show how to do this without converting to a Tableau formula?
You mean how to fit the model? Or how to extract coefficients? Or…?
I was searching google for information to clarify what I saw on rsdutio: https://www.rstudio.com/resources/webinars/working-with-big-data-in-r/ (starting at 25 min) where they build a model and pushed it to a SQL database to score the data. I know how to build a model, get coefficients/ variables of importance, but need more information on how to use the models to score the data in the database.
So you’re trying to convert to SQL as opposed to a Tableau calculation. You need to wrap what my example generates in a select statement e.g. SELECT [a]+[b] as score from MyTable
Hi Bora. I love your code it was just what I needed for making score code in SQL. However I experienced some minor problems – if the variable names contain numbers and if there is more than one category (or factor) variable it failed for me. However your code didn’t need much adjustment – i also adjusted it for use in SQL:
Remove numbers only in the end o the variable name:
factors=c(“”,gsub(“[0-9]*$”, “”, names(unlist(model$xlevels))))[-1]
Add the catFormula name to itself:
catFormula <- paste(" WHEN",paste("'",curVar$levels,"'",sep=""), "THEN",round(curVar$coefficients,decimalPlaces),catFormula, sep=" ", collapse="")
I hope this is correct…