Scaling RServe Deployments

Tableau runs R scripts using RServe, a free, open-source R package. But if you have a large number of users on Tableau Server and use R scripts heavily, pointing Tableau to a single RServe instance may not be sufficient.

Luckily you can use a load-balancer to distribute the load across multiple RServe instances without having to invest in a commercial R distribution. In this blog post, I will show you, how you can achieve this using another open source project called HAProxy.

Let’s start by installing HAProxy.

On Mac you can do this by running the following commands in the terminal

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Followed by

brew install haproxy

Create the config file that contains pointers to the Rserve instances.

In this case I created in the folder ‘/usr/local/Cellar/haproxy/’ but it could have been some other folder.

global
    daemon
    maxconn 256

defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

listen stats
    bind :8080
    mode http
    stats enable
    stats realm Haproxy\ Statistics
    stats uri /haproxy_stats
    stats hide-version
    stats auth admin:admin@rserve

frontend rserve_frontend
    bind *:80
    mode tcp
    timeout client  1m
    default_backend rserve_backend

backend rserve_backend
    mode tcp
    option log-health-checks
    option redispatch
    balance roundrobin
    timeout connect 10s
    timeout server 1m
    server rserve1 localhost:6311 check maxconn 32
    server rserve2 anotherserver.abc.lan:6311 check maxconn 32

The highlights in the config file are the timeouts, max connections allowed for each Rserve instance, host:port for Rserve instances, load balancer listening on port 80, balancing being done using roundrobin method, server stats page configured on port 8080 and username and password for accessing the stats page. I used a very basic configuration but HAProxy documentation has detailed info on all the options.

Let’s check if config file is valid and we don’t have any typos etc.

BBERAN-MAC:~ bberan$ haproxy -f /usr/local/Cellar/haproxy/haproxy.cfg -c
Configuration file is valid

Now you can start HAproxy by passing a pointer to the config file as shown below:

sudo haproxy -f /usr/local/Cellar/haproxy/haproxy.cfg

Let’s launch Tableau and enter the host and port number for the load balancer instead of an actual RServe instance.

Connection information for the load balancer

Success!! I can see the results from R’s forecasting package in Tableau through the load balancer we just configured.

Results of R script evaluated through the load balancer

Let’s run the calculation one more time.

Now let’s look at the stats page for our HAProxy instance. In this case per our configuration file by navigating to http://localhost:8080/haproxy_stats.

Server statistics for the load balancer for Rserve instances

I can see the two requests I made and that they ended up being evaluated on different RServe instances as expected since round-robin load balancing forwards a client request to each server in turn.

Now let’s install it on a server that is more likely to be used in production and have it start up automatically etc.

I used a Linux machine (Ubuntu 14.04 specifically) for this. There are only a few small differences in the configuration steps. To install HAProxy, in a terminal window enter :

apt-get install haproxy

Now edit the haproxy file under the directory /etc/default/ and set ENABLED=1. This is by default 0. Setting to 1 will run HAProxy automatically when the machine starts.

Now let’s edit the config file which can be found here /etc/haproxy/haproxy.cfg to match the config example above.

And we’re ready to start the load balancer:

sudo service haproxy start

Now you can serve many more visualizations containing R scripts to a larger number of Tableau users. Depending on the amount of load you’re dealing with, you can start with running multiple RServe processes on different ports of the same machine or you can add more machines to scale out further.

Time to put advanced analytics dashboards on more screens 🙂

10 thoughts on “Scaling RServe Deployments”

Mélanie says:

Hello Bora,
Thank you for your post. I want to know if we can use a load balancer with windows, HAProxy is not available on windows.

May 14, 2017 at 10:41 am Reply
- Bora Beran says:
  
  Hi Melanie,
  You may want to try nginx or rinetd. If I can find time, I want to put together an example using nginx.
  
  ~Bora
  
  May 14, 2017 at 9:09 pm Reply
John Petroda says:

Great article. Any general high level guidance on how much load an RServe can take?
Thank you for your help

May 22, 2017 at 5:43 pm Reply
- Bora Beran says:
  
  It all depends on how complex the algorithm is and how much data there is. If there is more traffic than it can handle, it will queue up the requests. In terms of how much each request can be, it will give up where R would give up in terms of memory limits.
  
  May 22, 2017 at 9:18 pm Reply
Sam says:

Loved your article. Thank you…
When you have multiple users connecting, is there a way to kill a request in RServ that is running for several mins without affecting other requests? Possible?
Thx

May 22, 2017 at 5:49 pm Reply
- Bora Beran says:
  
  Unfortunately, there is no signal you can send to Rserve to halt processing. You need to find the processid and kill it through standard operating system methods.
  
  May 22, 2017 at 9:15 pm Reply
  - Sam says:
    
    Thank you for the response. Per your suggestion, when we kill the process, does it kill the one that is taking the long time without affecting other R process that is running?
    Thx
    
    May 27, 2017 at 2:07 pm
Tom says:

Thank you for the good article. Does R (Rserv) run jobs in parallel. For example, if there are 3 users running clustering request, does all 3 run in parallel or does it get queued up and one job gets executed and completed before the next job starts?
Thank you…

May 27, 2017 at 2:14 pm Reply
- Bora Beran says:
  
  It is a matter of how many Rserve processes you have running. If you have only one, it will queue up. On Windows you’ll have 1 process unless you explicitly start multiple. On Linux/Unix/Mac system will automatically fork a new process when new requests come so you can handle things in parallel.
  
  May 28, 2017 at 1:11 pm Reply
Brittany says:

Hi Bora,
I’m hoping you can assist me. I am looking to do some sentiment analysis in Tableau using R. However, I have my own list of keywords that I want to use to perform this sentiment analysis. I noticed that the Rstem packages don’t seem to be available in newer versions of R. Is there an older version I should use? Is it possible to use my own list of words for this analysis? Thanks in advance for your help

June 28, 2017 at 7:49 am Reply