R, Visualization

Logistic Regression in Tableau using R

In my post on Tableau website earlier this year, I included an example of multiple linear regression analysis to demonstrate taking advantage of parameters to create a dashboard that can be used for What-If analysis. In that example, the model fitting was done inside the same calculation as the predictions, in order to provide a self-contained example that just works. However in real-life, once created, the models are reused for predictions many times until (if at all) they get updated. Let’s take logistic regression as an example to examine reusability.

In this sample, we will use admissions(generated) data that contains, GPA, GRE score, rank of the institution (1 being top tier schools, 4 the lowest tier) and whether the student was admitted or not. We will use regression model to predict the probability of admission for a given GPA, GRE, Rank combination. 

If we were to do fitting and prediction inside the same calculated field, it would look like:

SCRIPT_REAL('mydata <- data.frame(admit=.arg1, gpa=.arg2, gre=.arg3, rank=.arg4);
lrmodel <- glm(admit ~ gre + gpa + as.factor(rank), data = mydata, family = "binomial");
prob <- predict(lrmodel, newdata = mydata, type = "response")',
AVG([admit]),AVG([gpa]),AVG([gre]),AVG([rank]))

Here “binomial” indicates that we want a logistic regression model.

There are many ways to save your model for future use. You can go through the model fitting inside R (using your favorite IDE e.g. R GUI or RStudio etc.)  and save it, then just load the model into Tableau or you can save it to a file directly from within Tableau. If you were doing this in R (assuming you have the data as a CSV), it would look like

mydata <- read.csv("C:/Users/bberan/R/admission.csv")
mydata$rank <- factor(mydata$rank)
lrmodel <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
save(lrmodel, file = "C:/Users/bberan/R/mymodel.rda")

This is probably what most people would do. And of course evaluate the goodness-of-fit for the model before using it to make predictions.

But if you don’t want to switch between R and Tableau, instead wanted to do this inside Tableau, it is just adding the last line to your calculated field so it looks like:

SCRIPT_REAL('mydata <- data.frame(admit=.arg1, gpa=.arg2, gre=.arg3, rank=.arg4);
lrmodel <- glm(admit ~ gre + gpa + as.factor(rank), data = mydata, family = "binomial");

save(lrmodel, file = "C:/Users/bberan/R/mymodel.rda")
prob <- predict(lrmodel, newdata = mydata, type = "response")',
AVG([admit]),AVG([gpa]),AVG([gre]),AVG([rank]))

You would probably do this once to save the model and after the calculation re-executes, remove it from the calculated field. Your calculation reading a previously saved model would look like the following:

SCRIPT_REAL('mydata <- data.frame(admit=.arg1, gpa=.arg2, gre=.arg3, rank=.arg4);
load("C:/Users/bberan/R/mymodel.rda")
prob <- predict(lrmodel, newdata = mydata, type = "response")',
AVG([admit]),AVG([gpa]),AVG([gre]),AVG([rank]))

Of course once the model is written, it can be used from R as well as Tableau. An even nicer way to do this is to have the model preloaded on Rserve, readily available for use. Doing this involves making some minor edits to Rserve’s configuration file: Rserv.cfg.

There are two relevant options you can use in Rserv.cfg to achieve this. “Source” option runs the provided R script file. Any objects created in the source file becomes available to all users with access to Rserve.  These could be new functions (assume there is a custom function you wrote, that’s rather lengthy and you don’t want to write all the script inside a Tableau calculated field, you just want to call it by name and use it), or any object that holds any kind of data (e.g. read an XDF file in R and have access to its contents from Tableau).

The other option is “eval” which executes the R code to follow. For example if you had

eval x = 5; library("mvoutlier")

it would create an object that holds the value 5 and load the mvoutlier library.

So your Rserv.cfg file may contain the following lines

source C:/Users/bberan/R/myRFunctionCode.R
eval load("C:/Users/bberan/R/mymodel.rda");

In this case the second line is loading the model we created. So the person who will use the model for predictions doesn’t have to know where it is saved. If they know the name of the model, they can call it from Tableau. Note that config file read when Rserve starts so if you made changes to the file, you need to start Rserve for them to take effect. Also by default Rserve looks for the config file in R’s working directory. In R, you can find out what your working directory is using the getwd() command.

R working directory

So I put my Rserv.cfg file in the documents folder. Restarted Rserve and Volia! I am able to call the predict function directly as shown below, without having to fit the model again or read it from the disk.

SCRIPT_REAL('mydata <- data.frame(admit=.arg1, gpa=.arg2, gre=.arg3, rank=.arg4);
prob <- predict(lrmodel, newdata = mydata, type = "response")',
AVG([admit]),AVG([gpa]),AVG([gre]),AVG([rank]))

You can download the example workbook from HERE. The parts that rely on reading models from files and relying objects in Rserve session will not work until you make the necessary changes listed above. Each calculated field also contains comments about changes needed.

Enjoy!

Advertisements
Standard

55 thoughts on “Logistic Regression in Tableau using R

  1. This is great … I’m wondering how I can add categorical variables to the model (ie gender or state of residence are factors in models I use). I note that you use as.factor in your model but you are still using a numerical value as the input …

    • This is because SCRIPT_ functions are table calculations which require their arguments to be aggregates.

      However the fact that it is AVG([something]) doesn’t mean it has to be aggregated. It only means it is the function to be used if aggregation is necessary. If you uncheck the option “Aggregate Measures” under Analysis menu, results won’t be aggregated under any circumstances.

      Even with Aggregate Measures option turned on, if your visualization has a dimension that corresponds to row level data (e.g. each row is a student and you have a unique studentID column that you’re using in your visualization as a dimension as in the example workbook) then aggregating vs. not aggregating won’t make a difference since your aggregation will run with single input row.

  2. Is there a way to use dimension filters on the view without changing the probabilities returned by the model? Let’s say there was a region variable in your example workbook that you don’t want to include in the model, but you do want to make available to users. E.g., a school might have staff assigned to the Northwest who only need to see probabilities for students in that region.

    In our situation users are accustomed to having at least a couple filters, but adding them to the view seems to change the probabilities (I’m assuming because the R script is a table calc). I’m hoping there’s a way to avoid simply presenting a long list of people and scores.

  3. Following up because I might have conflated two things:

    1. I want to add a filter to remove rows from the view that displays probabilities but I don’t want to exclude these cases from the model.
    2. I want to add a field or fields to the view to segment a table view of student probabilities (e.g., Region & State before student).

    Thanks.

    • You can use another table calc as a filter. When you use a table calc as a filter, it doesn’t filter the actual data, it filters/hides the results.

      E.g. if you have Region and Student and you want to hide an entire region without affecting the results, you can create a table calc like WINDOW_MAX(MAX([Region])) and set compute using Student. Then drag this calculation onto filter shelf and select show as quick filter.

      • Jim says:

        Thanks, Bora, that was very helpful.

        On a completely unrelated note, thought I’d share that I’m using Tableau to visualize some of my dissertation data on the Diyanet’s Friday sermons (cuma hutbeleri). I hope to have a public viz up someday.

      • I am glad it helped. Best of luck with your dissertation. Please share the viz when you get the chance. I am curious to see if and how the content changed in the past 12 years compared to the governments prior.

    • Hi Helen,
      Assume your .arg1 is a column that contains 3 values e.g. {10,20,30}. if you use .arg1 you are passing all those values to the function you’re calling in R. If you use .arg1[1] then you are passing the first value i.e. 10, if you do .arg1[2] you’re passing the second values etc.

      There are two ways to pass Tableau parameters to R. One is to pass it as an argument, in which case parameter is passed as if it is a column of repeated values e.g. {10,10,10} but your function probably needs 1 number so you pick one .arg1[1]. The other way to do it is to pass it as part of the formula string by injecting it into the R script (my network graph blog post has an example that does this).

      I hope this helps.

      Bora

  4. kush says:

    Hi Bora Beran
    I cant find your multiple regression post, i need to do the same thing and as I am new to tableau but have some experience in R trying to integrate both so please help me to find and check out how to do multiple regression using R and tableau.

    Regards
    kush

  5. Hi again, Bora,
    Is it possible to use arguments from multiple data sources provided the relationships are set properly and the fields are aggregated? I’ve run several models successfully with a single data source, but when I “reach across” to pull fields from secondary sources I get the following error:

    Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels

    • Hi Bora, feel free to delete my question about using arguments from multiple sources. As far as I can tell, the R error was produced because I was testing with a subset of data where some factor variables had only one level.

  6. kush says:

    Hi Bora,

    Is there any way to get coefficient values of a linear regression result in tableau, when i tried i found that tableau cant handle when R returns unequal number of rows. I need to get those coefficient estimates as a column and use them to build a dash board. Do i have a solution to do this?

  7. Hi Bora, I am trying to apply association algorithm in tableau using R scripts.

    This is what i did in R

    wil_dtc<-read.table(file.choose(),sep=",")
    wil_dtc_rules <- apriori(arg1,parameter=list(supp=0.0,conf=0.0,target="rules",minlen=2))
    write(wil_dtc_rules, file = "wil_dtc_rules_check.csv", quote=TRUE, sep = ",", col.names = NA)

    Need to apply the same in tableau. Kindly help.

  8. Hi Bora, Thanks for the great post here. I’m trying to do this for a dataset. Below is my code:

    SCRIPT_REAL(‘mydata <- data.frame(Delayed=.arg1, Deliq.Amount=.arg2, Countrycomp1=.arg3,
    EDI=.arg4, Australia=.arg5, AGP=.arg6);
    mydata$Delayed <- as.factor(mydata$Delayed)
    mylogit <- glm(Delayed~Deliq.Amount+Countrycomp1+EDI+ AGP +Australia, data = mydata, family = "binomial");

    prob <- predict(mylogit, newdata = mydata, type = "response")',
    SUM([Deliq.Amount]),SUM([Countrycomp1]),SUM([Treaty reinsurance]),SUM([EDI]), SUM([AGP]),
    SUM([Australia]))

    I'm triying to visualize or put in a table but I'm getting the same value for all the rows. What does this mean? I tried by unchecking the "aggregate" as well. Where am I going wrong?

    Variable information:

    "Delayed" is a Binary variable
    "Deliq.amount" is numeric variable
    All other are dummy variables coded as 1 or 0

    I tried to change all the dummy variables into "as.factor" but still I got the same values.

    • Your script says Delayed=.arg1 but then you pass SUM([Deliq.Amount]) as the first argument. Also you need to make sure you pass all the relevant rows. E.g. if each row is a transaction and has an identifier like TransactionID, add that ID to the viz and use it as the compute using setting for the table calculation. If data is not at the right level, your dummy variables will be passed as sum of those at the level of the visualization. E.g. if what you intend to pass is EDI value for transaction but you only have dimension State in your sheet, then it will pass sum(EDI) to R for each State instead.

  9. shyama says:

    Hi Bora,

    I’m a student and new to tableau and R. I’m working on a project and need some assistance.
    I have a data set(456 rows) containing details of bike rental and factors affecting the same. I would like to create 3 What if parameters and run a multiple regression with 3 independent variables and predict the bike rental for the last day based on the 3 variables . Tableau accepted the code but throws an error.

    SCRIPT_REAL(”
    mydata <- data.frame(cbind(Total users=.arg1, Temp=.arg2, Humidity=.arg3, Windspeed=.arg4);
    fit <- lm(Total users[-456]~ Temp[-456] + Humidity[-456]+ Windspeed[-456], data=mydata);
    mydata$Total users[456]<-predict(fit,list(Temp= .arg7[1], Humidity = .arg6[1], Windspeed = .arg5[1] ));
    mydata$Total users"
    ,
    SUM([Total users]), AVG([Temp]), AVG([Humidity]),AVG([Windspeed]),[What If – Temperature], [What If – Humidity], [What If – Windspeed]))

    Error Message:
    Error in base::parse(text = .cmd) : :2:34: unexpected symbol 1: 2: mydata <- data.frame(cbind(Total users ^

    Pls help!

  10. Hi Bora,

    Your post is excellent! I am interesting would it be possible to get the numbers of the simple linear regression not using R – just with the help of table calculations? two numbers from the formula Y= Intercept + X*k:

    Y – dependant variable
    k – slope

    to be able to manipulate with it using a parameter for X(predictor).

    Please let me know if I wasn’t clear for you )

    Eugene

  11. Thanks for the great post. but i’m some trouble running your first script. I set up my Rserve and connection tested successfully, but when i create calculated field with your script. it’s giving me all “undefined values” for all rows. any idea what i might have done wrong? i’m testing the tableau 9.2 with trial license. Thanks.

    • One thing to check is your addressing/partitioning settings for the calculation. SCRIPT_ functions are table calculations so how they work depends on these settings which define in what chunks data is sent to R. E.g. by changing the setting you could be sending one row at a time, the whole columns at a time, each year of data in a separate call to R etc. Can you right click on the pill and select edit table calculation… and make sure the settings matches the example workbook I provided?

      • Thanks. I was able to get it to work. However, it only displays the fitted value when i add a row ID column in the view. i want to plot a histogram binning the fitted value and plot against the average actual response rate on the y-axis. any suggestion? i’m not sure why the fitted value only shows correctly when the ID is present.

      • If this doesn’t have to be dynamic, you can add all the fields you need to the sheet (you can put measures even on Detail shelf) do Analysis > View Data and select all, copy, then hit Ctrl+V which will paste all the columns, including results from R as a new data source into Tableau. Then you can build a histogram etc. like you would on any data source.

        Tableau passes the data at the same level of detail as the view so if you want to pass one row to R per ID, then ID has to be present in the view.

        If you would like to build a dynamic bar chart out of it you can “bin” the values in your R calculated field e.g. divide by the bin size you’d like then use floor to get the bin number, and rely on mark stacking that bar charts will automatically do for you.

        Then R can return bin numbers, you can create a bar chart and use sum(number of records) on the other axis which will set the height of each bar.

  12. Dude thanks for the great post and awesome answers. Probably some of the first moments understanding about aggregates and table calcs. You are doing some class work sir. Looking forward to reading more between tableau and r.

    Cheers

  13. Arushi says:

    Hi Bora, I want to do linear regression on the data set with mixed (numeric and categorical) independent variables. I have exactly been following the steps mentioned in your post except for the linear regression command. As soon as I introduce categorical variable in my script, I get the following error:
    Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels
    I am using the following script:
    SCRIPT_REAL('
    EnrollmentPd <- .arg1
    Sub <- .arg2
    TA<- .arg3

    fit<-lm(EnrollmentPd ~ Sub + as.factor(TA))
    fit$fitted.values
    ', AVG([Enrollment Period1]) , AVG([# of Subjects1]), ATTR([TA]))

      • Arushi says:

        Thanks for the prompt response Bora. I would really appreciate you help here 🙂

        Below is a sample set of the data I am using (20 rows and 3 columns):

        TA # of Subjects Enrollment Period
        Cardiovascular 100 7
        Cardiovascular 200 19
        Spine 400 35
        Spine 303 25
        Spine 403 30
        Cardiovascular 266 20
        Spine 376 31
        Spine 422 34
        Spine 456 46
        Cardiovascular 290 24
        Hematology 522 50
        Cardiovascular 298 29
        Hematology 782 65
        Cardiovascular 221 20
        Hematology 600 59
        Spine 498 45
        Hematology 569 52
        Cardiovascular 298 18
        Spine 306 23

        My script in Tableau:
        SCRIPT_REAL( ”
        EnrollmentPd <- .arg1
        Sub <- .arg2
        TA<- .arg3
        fit<-lm(EnrollmentPd ~ Sub + as.factor(TA))
        fit$fitted.values
        ", AVG([Enrollment Period1]) , AVG([# of Subjects1]), ATTR([TA]))

        Error msg in Tableau:
        "Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels"

        Just FYI- The below equivalent command runs completely fine in R:
        model.lm<-lm(Enrollment.Period ~ X..of..Subjects + TA)

  14. Pingback: All about Rasch and IRT with R – Wencheng's Data Lab

  15. Hi Bora,
    Thank you for writing detailed blog posts on R implementation within Tableau! I’ve used several of your posts to come up with a script that returns a linear model coefficient, however I am having trouble showing the significance.

    the current script is running against a discrete dimension with 16 members:

    SCRIPT_STR(

    model <-lm(.arg1~.arg2);
    suma<-summary(model);
    paste(suma, sep='~')
    "
    ,
    AVG([Erwerbstätikeit Frauen]),AVG([GII Total])

    )

    ERROR: Unexpected number of results returned by SCRIPT function. Function expected 16 values; 11 values were returned. Any ideas where I am going wrong?

    Thanks in advance!

    P.S. also tried…but got "unexpected" errors

    SCRIPT_REAL(
    "
    model <-lm(.arg1~.arg2);
    model$p.value
    "
    ,
    AVG([Erwerbstätikeit Frauen]),AVG([GII Total])

    )

  16. Inam says:

    Hi Bora,

    What approach can I take to bring in the regression summary and the coefficients into tableau. Only way I can currently get to the summary, coefficients etc., is by opening the .rda file in r-Studio.

    Thank you in advance for your reply.

  17. George Guan says:

    Hi Bora. Your Post is excellent! I am struggling an issue about Script_STR. I was trying to extract top 1 keywords in News’s title based on their occurrence frequency. My query is

    SCRIPT_STR(“library(‘jiebaR’);
    wk=worker();
    a<-wk[.arg1[1]];
    b<-names(sort(summary(as.factor(c(a))), decreasing=T)[1:1]); paste(b[1])",attr([Title])))

    However, I keep receiving the following error message:
    "Unexpected number of results returned by SCRIPT function. Function expected 1 values; 0 values were returned." Please help me out. Thanks a lot, Bora.

  18. Thirumaniraja says:

    We are trying to achieve linear regression in tableau using R Prog. we have done regression in both R and tableau for validation. Tableau predicted values are not matched exactly with the R output.

    Please find the code below

    For Model Creation

    SCRIPT_REAL(‘mydata <- data.frame(ODFlag=.arg1, Tenor=.arg2, LTV=.arg3,IRR=.arg4,LoanAmount=.arg5, bankbalance=.arg6, tat=.arg7, bounce=.arg8, week4 =.arg9,
    region=.arg10,dealertype=.arg11, profilegroup =.arg12,segment=.arg13,rcverified=.arg14);

    library(caret)
    library(pROC)

    set.seed(142)
    samp_vector <- createDataPartition(mydata$ODFlag, p = 0.7, list = FALSE )
    train_data <- mydata[samp_vector,]
    test_data <- mydata[-samp_vector,]

    lrmodel <- glm(ODFlag ~ region +Tenor + dealertype + LTV + IRR + LoanAmount + profilegroup +segment + bankbalance + tat +rcverified + bounce + week4, data = train_data, family = "binomial");

    save(lrmodel, file = "D:/R/OD Prediction/model.rda")

    prob <- predict(lrmodel, newdata = mydata, type = "response")',

    AVG([OD Flag]),
    AVG([Tenor]),
    AVG([LTV]),
    AVG([IRR]),
    AVG([Loan Amount]),
    AVG([Avg bank balance norms not met]),
    AVG([E-E Tat]),
    AVG([Bounce/MOB]),
    AVG([Week4&NotPaid Flag]),
    ATTR([Region]),
    ATTR([Dealer Type]),
    ATTR([Profile Grp]),
    ATTR([Segment]),
    ATTR([Rc Verified]))

    For Prediction

    SCRIPT_REAL('mydata <- data.frame(ODFlag=.arg1, Tenor=.arg2, LTV=.arg3,IRR=.arg4,LoanAmount=.arg5, bankbalance=.arg6, tat=.arg7, bounce=.arg8, week4 =.arg9,
    region=.arg10,dealertype=.arg11, profilegroup =.arg12,segment=.arg13,rcverified=.arg14);

    load("D:/R/OD Prediction/model.rda")

    prob <- predict(lrmodel, newdata = mydata, type = "response")',

    AVG([OD Flag]),
    AVG([Tenor]),
    AVG([LTV]),
    AVG([IRR]),
    AVG([Loan Amount]),
    AVG([Avg bank balance norms not met]),
    AVG([E-E Tat]),
    AVG([Bounce/MOB]),
    AVG([Week4&NotPaid Flag]),
    ATTR([Region]),
    ATTR([Dealer Type]),
    ATTR([Profile Grp]),
    ATTR([Segment]),
    ATTR([Rc Verified]))

    Please check the same and do the needful

    • Please make sure your level of detail and addressing/partitioning settings are correct. If you initiated Rserve from R console or RStudio you can insert print statements in your Tableau SCRIPT function and have it output intermediate results into R command line. This will help you compare what happens at different stages and see if the data being passed to R looks correct or values look different due to aggregation.

  19. Aman says:

    Hi Bora,

    I want to call VBA Macros in tableau. Is there a way we can do that? I know how to trigger VBA Macro using R but I am unable to apply the same in tableau. Any advice would be really helpful

    Thanks,
    Aman

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s