Data Preparation, Python, Text analytics

Build Your Own Data Pipelines with Tableau Command Line Utilities & Scheduled Tasks

During one of my 2016 Tableau Conference talks, I shared an example data pipeline that retrieved tweets containing the session’s hash tag, tokenized them and appended to an extract on Tableau server periodically, paired with an auto-refreshing dashboard.

My “magic trick” started by showing a sorry looking word cloud with only two tweets, which slowly filled up as the session progressed with contents of tweets from the audience.

Tableau Conference 2016 - Accelerate Advanced Analytics

While hitting Twitter every few minutes worked well for a short demo, typically hourly or daily updates make more sense in real life scenarios such as text analytics over social media data or geocoding street addresses of newly acquired customers.

I got a lot of requests for making this into a blog post so I repurposed the demo to do sentiment analysis every night over tweets from the day prior.

It has 3 core components:

  1. A Python script that contains the logic to retrieve and analyze Twitter data and write the results to a CSV
  2. A batch file that runs the Python script, takes its outputs and uses Tableau command line utilities to append the contents of the CSV to an extract on Tableau Server
  3. A scheduled task that triggers the batch file once a day and runs this pipeline

The Python Script

You can download the Python scripts from HERE. The zip archive contains  analyzetweets.py shown below and the config.py which will contain your Twitter credentials. You can embed all of them into one Python file but if you’re going to share your screen for a demo, it might be safer to keep it separate 🙂

Python code snippet for Twitter sentiment analysis

For this sample code to work you will need to install two Python packages which you can easily get via pip. VaderSentiment is a lexicon and rule-based sentiment analysis tool. Twitter package is used to query Twitter.

pip install twitter
pip install VaderSentiment

You also need your PATH variables set correctly so your operating system can find Python.exe and these libraries, not to mention a Twitter developer account so you can access Twitter’s data. Here is a good tutorial on how to set one up.

Note that if you use this as a template to run your own code that doesn’t do sentiment analysis and use Twitter data, you won’t be needing any of these packages.

The Batch File

Batch file navigates into the folder containing the Python script, executes it, then takes its output (sentimentscores.csv) and uses “tableau addfiletoextract” to append its contents to an existing extract (with the same set of columns as the CSV file) on the server. You can copy-paste the content below into a text file and save with .bat extension.

@CD C:\Users\bberan\Documents\twitterDemo
@CALL python analyzetweets.py
for %%I in (“C:\Program Files\Tableau\Tableau 10.0\bin”) do set TableauFolder=%%~sI
@CALL %TableauFolder%\tableau addfiletoextract –server https://your-Tableau-server –username yourUserName –password “yourPassword” –project “TheProjectName” –datasource “TheNameofTheExtractDataSourceToAppendTo” –file “C:\Users\bberan\Documents\twitterDemo\sentimentscores.csv”

The Scheduled Task

Windows Task Scheduler is a handy and relatively unknown tool that comes preinstalled on every Windows computer (Unix variant also have similar utilities like cron).

Launch it from your Start Menu and simply create a task from Actions tab that points to the batch file.

Creating an action to start a program with Windows Task Scheduler

Then using the Triggers tab, set the frequency you’d like to run your batch file.

Setting a refresh schedule using Windows Task Scheduler

Now you have a data pipeline that will run nightly, retrieve recent tweets from Twitter, run sentiment analysis on them and add the results to a Tableau extract on the server.

BONUS: Auto-refresh dashboard

If you plan show results on a dashboard that’s on public display, you’d probably like the dashboard also to refresh at a similar frequency to reflect the latest data. For this all you need to do is to embed the dashboard inside a web page with an HTML meta refresh tag.

This was a rather simple, proof of concept but following this example as a template, you can create multi-stage, scheduled pipelines for many ETL tasks and deliver answers to questions that are much more complex. Enjoy!

Advertisements
Standard

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s