In my post on Tableau website earlier this year, I included an example of multiple linear regression analysis to demonstrate taking advantage of parameters to create a dashboard that can be used for What-If analysis. In that example, the model fitting was done inside the same calculation as the predictions, in order to provide a self-contained example that just works. However in real-life, once created, the models are reused for predictions many times until (if at all) they get updated. Let’s take logistic regression as an example to examine reusability.

In this sample, we will use admissions^{(generated)} data that contains, GPA, GRE score, rank of the institution (1 being top tier schools, 4 the lowest tier) and whether the student was admitted or not. We will use regression model to predict the probability of admission for a given GPA, GRE, Rank combination.

If we were to do fitting and prediction inside the same calculated field, it would look like:

`SCRIPT_REAL('mydata <- data.frame(admit=.arg1, gpa=.arg2, gre=.arg3, rank=.arg4);` |

Here “binomial” indicates that we want a logistic regression model.

There are many ways to save your model for future use. You can go through the model fitting inside R (using your favorite IDE e.g. R GUI or RStudio etc.) and save it, then just load the model into Tableau or you can save it to a file directly from within Tableau. If you were doing this in R (assuming you have the data as a CSV), it would look like

`mydata <- read.csv("C:/Users/bberan/R/admission.csv")` |

This is probably what most people would do. And of course evaluate the goodness-of-fit for the model before using it to make predictions.

But if you don’t want to switch between R and Tableau, instead wanted to do this inside Tableau, it is just adding the last line to your calculated field so it looks like:

`SCRIPT_REAL('mydata <- data.frame(admit=.arg1, gpa=.arg2, gre=.arg3, rank=.arg4);` |

You would probably do this once to save the model and after the calculation re-executes, remove it from the calculated field. Your calculation reading a previously saved model would look like the following:

`SCRIPT_REAL('mydata <- data.frame(admit=.arg1, gpa=.arg2, gre=.arg3, rank=.arg4);` |

Of course once the model is written, it can be used from R as well as Tableau. An even nicer way to do this is to have the model preloaded on Rserve, readily available for use. Doing this involves making some minor edits to Rserve’s configuration file: Rserv.cfg.

There are two relevant options you can use in Rserv.cfg to achieve this. “Source” option runs the provided R script file. Any objects created in the source file becomes available to all users with access to Rserve. These could be new functions (assume there is a custom function you wrote, that’s rather lengthy and you don’t want to write all the script inside a Tableau calculated field, you just want to call it by name and use it), or any object that holds any kind of data (e.g. read an XDF file in R and have access to its contents from Tableau).

The other option is “eval” which executes the R code to follow. For example if you had

`eval x = 5; library("mvoutlier")` |

it would create an object that holds the value 5 and load the mvoutlier library.

So your Rserv.cfg file may contain the following lines

`source C:/Users/bberan/R/myRFunctionCode.R` |

In this case the second line is loading the model we created. So the person who will use the model for predictions doesn’t have to know where it is saved. If they know the name of the model, they can call it from Tableau. Note that config file read when Rserve starts so if you made changes to the file, you need to start Rserve for them to take effect. Also by default Rserve looks for the config file in R’s working directory. In R, you can find out what your working directory is using the getwd() command.

So I put my Rserv.cfg file in the documents folder. Restarted Rserve and Volia! I am able to call the predict function directly as shown below, without having to fit the model again or read it from the disk.

`SCRIPT_REAL('mydata <- data.frame(admit=.arg1, gpa=.arg2, gre=.arg3, rank=.arg4);` |

You can download the example workbook from HERE. The parts that rely on reading models from files and relying objects in Rserve session will not work until you make the necessary changes listed above. Each calculated field also contains comments about changes needed.

Enjoy!

This is great … I’m wondering how I can add categorical variables to the model (ie gender or state of residence are factors in models I use). I note that you use as.factor in your model but you are still using a numerical value as the input …

I also face this problem. I think it may not be possible to include categorical data.

Can you share your SCRIPT, the error messages and data types for your fields? You can use categorical data. I do this myself all the time. There must be something else going on.

Hello!

I managed to solve it. For Categorical Data in Tableau, you need to use the ATTR() aggregate function.

You can also use MIN or MAX.

I am just wondering how can we use categorical variables directly into regression? Do we not need the dummy variables here? I see you have used as.factor() while doing logistic regression. Can we also use as.factor() in linear regression as well?

Thanks in advance!

Hi Arushi,

Yes. You can use as.factor() for linear regression as well. In the model output you will see separate coefficients for categories (for those that are not 0) as opposed to a single categorical column.

~Bora

Why we are taking average of each columns?

This is because SCRIPT_ functions are table calculations which require their arguments to be aggregates.

However the fact that it is AVG([something]) doesn’t mean it has to be aggregated. It only means it is the function to be used if aggregation is necessary. If you uncheck the option “Aggregate Measures” under Analysis menu, results won’t be aggregated under any circumstances.

Even with Aggregate Measures option turned on, if your visualization has a dimension that corresponds to row level data (e.g. each row is a student and you have a unique studentID column that you’re using in your visualization as a dimension as in the example workbook) then aggregating vs. not aggregating won’t make a difference since your aggregation will run with single input row.

Is there a way to use dimension filters on the view without changing the probabilities returned by the model? Let’s say there was a region variable in your example workbook that you don’t want to include in the model, but you do want to make available to users. E.g., a school might have staff assigned to the Northwest who only need to see probabilities for students in that region.

In our situation users are accustomed to having at least a couple filters, but adding them to the view seems to change the probabilities (I’m assuming because the R script is a table calc). I’m hoping there’s a way to avoid simply presenting a long list of people and scores.

Following up because I might have conflated two things:

1. I want to add a filter to remove rows from the view that displays probabilities but I don’t want to exclude these cases from the model.

2. I want to add a field or fields to the view to segment a table view of student probabilities (e.g., Region & State before student).

Thanks.

You can use another table calc as a filter. When you use a table calc as a filter, it doesn’t filter the actual data, it filters/hides the results.

E.g. if you have Region and Student and you want to hide an entire region without affecting the results, you can create a table calc like WINDOW_MAX(MAX([Region])) and set compute using Student. Then drag this calculation onto filter shelf and select show as quick filter.

Thanks, Bora, that was very helpful.

On a completely unrelated note, thought I’d share that I’m using Tableau to visualize some of my dissertation data on the Diyanet’s Friday sermons (cuma hutbeleri). I hope to have a public viz up someday.

I am glad it helped. Best of luck with your dissertation. Please share the viz when you get the chance. I am curious to see if and how the content changed in the past 12 years compared to the governments prior.

Hi Bora,

What’s the difference between a .arg argument and a .arg[] argument in your t test calculations?

Thanks!

Hi Helen,

Assume your .arg1 is a column that contains 3 values e.g. {10,20,30}. if you use .arg1 you are passing all those values to the function you’re calling in R. If you use .arg1[1] then you are passing the first value i.e. 10, if you do .arg1[2] you’re passing the second values etc.

There are two ways to pass Tableau parameters to R. One is to pass it as an argument, in which case parameter is passed as if it is a column of repeated values e.g. {10,10,10} but your function probably needs 1 number so you pick one .arg1[1]. The other way to do it is to pass it as part of the formula string by injecting it into the R script (my network graph blog post has an example that does this).

I hope this helps.

Bora

Hi Bora Beran

I cant find your multiple regression post, i need to do the same thing and as I am new to tableau but have some experience in R trying to integrate both so please help me to find and check out how to do multiple regression using R and tableau.

Regards

kush

You can find it here

http://www.tableausoftware.com/about/blog/2013/10/tableau-81-and-r-25327

Thank you Bora

Hi again, Bora,

Is it possible to use arguments from multiple data sources provided the relationships are set properly and the fields are aggregated? I’ve run several models successfully with a single data source, but when I “reach across” to pull fields from secondary sources I get the following error:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels

Hi Bora, feel free to delete my question about using arguments from multiple sources. As far as I can tell, the R error was produced because I was testing with a subset of data where some factor variables had only one level.

Hi Bora,

Is there any way to get coefficient values of a linear regression result in tableau, when i tried i found that tableau cant handle when R returns unequal number of rows. I need to get those coefficient estimates as a column and use them to build a dash board. Do i have a solution to do this?

You can retrieve pretty much anything even if the rows don’t match. I posted an example showing the trick to do it on my OneDrive. http://1drv.ms/1xaSKy8 It is in the LogisticRegression_oddsratio.twbx

thank you Bora

Hi Bora, I am trying to apply association algorithm in tableau using R scripts.

This is what i did in R

wil_dtc<-read.table(file.choose(),sep=",")

wil_dtc_rules <- apriori(arg1,parameter=list(supp=0.0,conf=0.0,target="rules",minlen=2))

write(wil_dtc_rules, file = "wil_dtc_rules_check.csv", quote=TRUE, sep = ",", col.names = NA)

Need to apply the same in tableau. Kindly help.

I will post an association rule example later

ok thanks…but please can you make that fast..actually I’am stck in this and there is an urgent requirement here…hope you understand…thanks

it is not high in my TODO list. If you need something urgently, please take a look at this http://things-about-r.tumblr.com/post/98652993554/deep-down-below-using-in-database-analytics-from

Hi Bora, Thanks for the great post here. I’m trying to do this for a dataset. Below is my code:

SCRIPT_REAL(‘mydata <- data.frame(Delayed=.arg1, Deliq.Amount=.arg2, Countrycomp1=.arg3,

EDI=.arg4, Australia=.arg5, AGP=.arg6);

mydata$Delayed <- as.factor(mydata$Delayed)

mylogit <- glm(Delayed~Deliq.Amount+Countrycomp1+EDI+ AGP +Australia, data = mydata, family = "binomial");

prob <- predict(mylogit, newdata = mydata, type = "response")',

SUM([Deliq.Amount]),SUM([Countrycomp1]),SUM([Treaty reinsurance]),SUM([EDI]), SUM([AGP]),

SUM([Australia]))

I'm triying to visualize or put in a table but I'm getting the same value for all the rows. What does this mean? I tried by unchecking the "aggregate" as well. Where am I going wrong?

Variable information:

"Delayed" is a Binary variable

"Deliq.amount" is numeric variable

All other are dummy variables coded as 1 or 0

I tried to change all the dummy variables into "as.factor" but still I got the same values.

Your script says Delayed=.arg1 but then you pass SUM([Deliq.Amount]) as the first argument. Also you need to make sure you pass all the relevant rows. E.g. if each row is a transaction and has an identifier like TransactionID, add that ID to the viz and use it as the compute using setting for the table calculation. If data is not at the right level, your dummy variables will be passed as sum of those at the level of the visualization. E.g. if what you intend to pass is EDI value for transaction but you only have dimension State in your sheet, then it will pass sum(EDI) to R for each State instead.

Hi Bora,

I’m a student and new to tableau and R. I’m working on a project and need some assistance.

I have a data set(456 rows) containing details of bike rental and factors affecting the same. I would like to create 3 What if parameters and run a multiple regression with 3 independent variables and predict the bike rental for the last day based on the 3 variables . Tableau accepted the code but throws an error.

SCRIPT_REAL(”

mydata <- data.frame(cbind(Total users=.arg1, Temp=.arg2, Humidity=.arg3, Windspeed=.arg4);

fit <- lm(Total users[-456]~ Temp[-456] + Humidity[-456]+ Windspeed[-456], data=mydata);

mydata$Total users[456]<-predict(fit,list(Temp= .arg7[1], Humidity = .arg6[1], Windspeed = .arg5[1] ));

mydata$Total users"

,

SUM([Total users]), AVG([Temp]), AVG([Humidity]),AVG([Windspeed]),[What If – Temperature], [What If – Humidity], [What If – Windspeed]))

Error Message:

Error in base::parse(text = .cmd) : :2:34: unexpected symbol 1: 2: mydata <- data.frame(cbind(Total users ^

Pls help!

Hi Bora,

Your post is excellent! I am interesting would it be possible to get the numbers of the simple linear regression not using R – just with the help of table calculations? two numbers from the formula Y= Intercept + X*k:

Y – dependant variable

k – slope

to be able to manipulate with it using a parameter for X(predictor).

Please let me know if I wasn’t clear for you )

Eugene

Hi Eugene,

I had done this earlier in this post https://boraberan.wordpress.com/2013/03/15/trellis-charts-with-tableau-8/ which also has an accompanying Tableau Public viz you can download.

I hope this helps.

~ Bora

Thanks for the great post. but i’m some trouble running your first script. I set up my Rserve and connection tested successfully, but when i create calculated field with your script. it’s giving me all “undefined values” for all rows. any idea what i might have done wrong? i’m testing the tableau 9.2 with trial license. Thanks.

One thing to check is your addressing/partitioning settings for the calculation. SCRIPT_ functions are table calculations so how they work depends on these settings which define in what chunks data is sent to R. E.g. by changing the setting you could be sending one row at a time, the whole columns at a time, each year of data in a separate call to R etc. Can you right click on the pill and select edit table calculation… and make sure the settings matches the example workbook I provided?

Thanks. I was able to get it to work. However, it only displays the fitted value when i add a row ID column in the view. i want to plot a histogram binning the fitted value and plot against the average actual response rate on the y-axis. any suggestion? i’m not sure why the fitted value only shows correctly when the ID is present.

Cliff, I’m having the same problem. Did you find a solution?

If this doesn’t have to be dynamic, you can add all the fields you need to the sheet (you can put measures even on Detail shelf) do Analysis > View Data and select all, copy, then hit Ctrl+V which will paste all the columns, including results from R as a new data source into Tableau. Then you can build a histogram etc. like you would on any data source.

Tableau passes the data at the same level of detail as the view so if you want to pass one row to R per ID, then ID has to be present in the view.

If you would like to build a dynamic bar chart out of it you can “bin” the values in your R calculated field e.g. divide by the bin size you’d like then use floor to get the bin number, and rely on mark stacking that bar charts will automatically do for you.

Then R can return bin numbers, you can create a bar chart and use sum(number of records) on the other axis which will set the height of each bar.

Dude thanks for the great post and awesome answers. Probably some of the first moments understanding about aggregates and table calcs. You are doing some class work sir. Looking forward to reading more between tableau and r.

Cheers

Hi Bora, I want to do linear regression on the data set with mixed (numeric and categorical) independent variables. I have exactly been following the steps mentioned in your post except for the linear regression command. As soon as I introduce categorical variable in my script, I get the following error:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels

I am using the following script:

SCRIPT_REAL('

EnrollmentPd <- .arg1

Sub <- .arg2

TA<- .arg3

fit<-lm(EnrollmentPd ~ Sub + as.factor(TA))

fit$fitted.values

', AVG([Enrollment Period1]) , AVG([# of Subjects1]), ATTR([TA]))

Do you mind sharing the rest of the error message and what your R script looks like?

Thanks for the prompt response Bora. I would really appreciate you help here 🙂

Below is a sample set of the data I am using (20 rows and 3 columns):

TA # of Subjects Enrollment Period

Cardiovascular 100 7

Cardiovascular 200 19

Spine 400 35

Spine 303 25

Spine 403 30

Cardiovascular 266 20

Spine 376 31

Spine 422 34

Spine 456 46

Cardiovascular 290 24

Hematology 522 50

Cardiovascular 298 29

Hematology 782 65

Cardiovascular 221 20

Hematology 600 59

Spine 498 45

Hematology 569 52

Cardiovascular 298 18

Spine 306 23

My script in Tableau:

SCRIPT_REAL( ”

EnrollmentPd <- .arg1

Sub <- .arg2

TA<- .arg3

fit<-lm(EnrollmentPd ~ Sub + as.factor(TA))

fit$fitted.values

", AVG([Enrollment Period1]) , AVG([# of Subjects1]), ATTR([TA]))

Error msg in Tableau:

"Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels"

Just FYI- The below equivalent command runs completely fine in R:

model.lm<-lm(Enrollment.Period ~ X..of..Subjects + TA)

Bora, any luck here? Thanks in adevance

I meant: model.lm<-lm(Enrollment.Period ~ X..of..Subjects + TA)

Looks like the problem is that ATTR(TA) is resulting in a *.

If you turn off aggregation from under Analysis > Aggregate Measures the error should go away.

I tried that, it still is the same

Pingback: All about Rasch and IRT with R – Wencheng's Data Lab