
Build Your Own Data Pipelines with Tableau Command Line Utilities & Scheduled Tasks

During one of my 2016 Tableau Conference talks, I shared an example data pipeline that periodically retrieved tweets containing the session’s hashtag, tokenized them and appended the results to an extract on Tableau Server, paired with an auto-refreshing dashboard.

My “magic trick” started by showing a sorry-looking word cloud built from only two tweets, which slowly filled up with the contents of tweets from the audience as the session progressed.

Tableau Conference 2016 - Accelerate Advanced Analytics

While hitting Twitter every few minutes worked well for a short demo, hourly or daily updates typically make more sense in real-life scenarios such as text analytics over social media data or geocoding street addresses of newly acquired customers.

I got a lot of requests to turn this into a blog post, so I repurposed the demo to run sentiment analysis every night over tweets from the day prior.

It has three core components:

  1. A Python script that contains the logic to retrieve and analyze Twitter data and write the results to a CSV file
  2. A batch file that runs the Python script, takes its output and uses the Tableau command line utilities to append the contents of the CSV to an extract on Tableau Server
  3. A scheduled task that triggers the batch file once a day and runs the pipeline

The Python Script

You can download the Python scripts from HERE. The zip archive contains analyzetweets.py, shown below, and config.py, which holds your Twitter credentials. You could embed everything into one Python file, but if you’re going to share your screen for a demo, it is safer to keep the credentials separate 🙂

Python code snippet for Twitter sentiment analysis
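
The original snippet was shared as a screenshot, so here is a minimal sketch of what analyzetweets.py might look like. The credential names assumed to live in config.py (CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET), the #data16 hashtag and the output column layout are illustrative assumptions, not the exact code from the talk.

import csv

from twitter import Twitter, OAuth
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# config.py is assumed to define the four Twitter credential strings below
import config

# Authenticate against the Twitter REST API
t = Twitter(auth=OAuth(config.ACCESS_TOKEN, config.ACCESS_TOKEN_SECRET,
                       config.CONSUMER_KEY, config.CONSUMER_SECRET))

# Grab recent tweets containing the hashtag
results = t.search.tweets(q="#data16", count=100)

# Score each tweet with VADER and write the results to the CSV
# that the batch file later appends to the published extract
analyzer = SentimentIntensityAnalyzer()
with open("sentimentscores.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["created_at", "text", "sentiment_score"])
    for tweet in results["statuses"]:
        score = analyzer.polarity_scores(tweet["text"])["compound"]
        writer.writerow([tweet["created_at"], tweet["text"], score])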

For this sample code to work, you will need to install two Python packages, both easily available via pip. VaderSentiment is a lexicon- and rule-based sentiment analysis tool; the twitter package is used to query Twitter.

pip install twitter
pip install VaderSentiment

You also need your PATH environment variable set correctly so your operating system can find python.exe and these libraries, not to mention a Twitter developer account so you can access Twitter’s data. Here is a good tutorial on how to set one up.

Note that if you use this as a template to run your own code that doesn’t do sentiment analysis on Twitter data, you won’t need either of these packages.

The Batch File

The batch file navigates into the folder containing the Python script, executes it, then takes its output (sentimentscores.csv) and uses “tableau addfiletoextract” to append its contents to an existing extract on the server (one with the same set of columns as the CSV file). You can copy-paste the content below into a text file and save it with a .bat extension.

@REM Navigate to the folder containing the Python script and run it
@CD C:\Users\bberan\Documents\twitterDemo
@CALL python analyzetweets.py
@REM Convert the Tableau bin path to its short (8.3) form so the call below is safe
for %%I in ("C:\Program Files\Tableau\Tableau 10.0\bin") do set TableauFolder=%%~sI
@REM Append the CSV output to the published extract on the server
@CALL %TableauFolder%\tableau addfiletoextract --server https://your-Tableau-server --username yourUserName --password "yourPassword" --project "TheProjectName" --datasource "TheNameofTheExtractDataSourceToAppendTo" --file "C:\Users\bberan\Documents\twitterDemo\sentimentscores.csv"

The Scheduled Task

Windows Task Scheduler is a handy and relatively unknown tool that comes preinstalled on every Windows computer (Unix variants also have similar utilities, such as cron).
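
On a Unix-like system, for example, a crontab entry along these lines would run a shell-script version of the pipeline every night at 2 AM (the script path here is hypothetical):

# minute hour day-of-month month day-of-week  command
0 2 * * * /home/bberan/twitterDemo/run_pipeline.sh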

Launch it from your Start Menu and simply create a task from the Actions tab that points to the batch file.

Creating an action to start a program with Windows Task Scheduler

Then, using the Triggers tab, set the frequency at which you’d like to run your batch file.

Setting a refresh schedule using Windows Task Scheduler

Now you have a data pipeline that will run nightly, retrieve recent tweets from Twitter, run sentiment analysis on them and add the results to a Tableau extract on the server.

BONUS: Auto-refresh dashboard

If you plan to show results on a dashboard that’s on public display, you’d probably like the dashboard to refresh at a similar frequency so it reflects the latest data. All you need for this is to embed the dashboard inside a web page with an HTML meta refresh tag.
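
A minimal host page could look like the sketch below; the refresh interval and the dashboard URL are placeholders you’d replace with your own:

<!DOCTYPE html>
<html>
  <head>
    <!-- Reload the page every hour (3600 seconds) to pick up fresh data -->
    <meta http-equiv="refresh" content="3600">
  </head>
  <body>
    <!-- Embed the published dashboard; replace with your server and view URL -->
    <iframe src="https://your-Tableau-server/views/YourWorkbook/YourDashboard?:embed=y"
            width="1200" height="800" frameborder="0"></iframe>
  </body>
</html>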

This was a rather simple proof of concept, but using this example as a template you can create multi-stage, scheduled pipelines for many ETL tasks and deliver answers to much more complex questions. Enjoy!


Quick Tip: Using Access crosstab queries in Tableau

Access databases are still commonly used by many businesses. A question I keep hearing is how to use crosstabs created in Access as data sources in Tableau, as such queries are neither displayed in the list of views & tables in the connection experience nor usable via the Custom SQL option in Tableau.

Let’s use the old Coffee Chain sample data source in this example, with the following crosstab query saved under the name “CoffeeChain Query”.

TRANSFORM Sum(factTable.Sales) AS SumOfSales
SELECT Product.[Product Type] AS ProductType
FROM (Product INNER JOIN factTable ON Product.ProductID = factTable.ProductID)
INNER JOIN Location ON factTable.[Area Code] = Location.[Area Code]
GROUP BY Product.[Product Type]
PIVOT Location.Market;

This query turns the unique values of Market into new columns, returning a table of the following form in Access:

Access crosstab query results
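
With the standard Coffee Chain data, that means one row per product type and one column per market, along the lines of the sketch below (values elided):

Product Type | Central | East | South | West
Coffee       |    …    |  …   |   …   |  …
Espresso     |    …    |  …   |   …   |  …
Herbal Tea   |    …    |  …   |   …   |  …
Tea          |    …    |  …   |   …   |  …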

Copy-pasting this query into Tableau as Custom SQL will result in errors. Even though it is saved in the Access file, “CoffeeChain Query” won’t display in the list of views/tables either.

Making this table accessible from Tableau takes one small step: creating another query on top of this crosstab.

Creating a new Access query/view

You can follow Create > Query (wizard or designer, depending on your preference) and do a SELECT * from the “CoffeeChain Query” listed in the Queries tab.

Configuring the new Access query
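
The wrapper query itself is a single statement (the brackets are needed because the saved query’s name contains a space):

SELECT * FROM [CoffeeChain Query];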

Let’s save this as “Cross-Tab Query”.

Saving the Access query/view

And now, if you connect to the Access file from Tableau, you will see that Cross-Tab Query is shown in the table list.

Access crosstab is now accessible from Tableau


What’s new in Tableau 9.0? Part 1 – Data Preparation

Tableau 9.0 is loaded with features that will make it much easier to get your data ready for analysis. In this first post of my “What’s New?” series for Tableau 9.0, I will start with data preparation.

Reading Loosely Structured Excel files

If you’re used to dealing with messy Excel spreadsheets, the first thing you’ll notice is the “data interpreter”. The data interpreter is a nifty tool; if your spreadsheets are perfectly clean, you probably won’t even notice it is there. As your data loads to populate the preview grid, Tableau does a first pass over it with a lightweight version of the interpreter to see if the spreadsheet contains anything it can fix. If all looks good, it won’t make a peep.

If your data types and column lengths don’t seem to make sense, or you appear to have multi-level headers as in the example below,

A loosely structured Excel spreadsheet

it will offer you help without getting in the way. Look for a gray bar at the bottom of the join diagram that spans its entire width.

Tableau's Excel structure analyzer

Once you turn it on, it starts looking for areas of the sheet that can be converted into well-structured tables. Many properties of the sheet are used in this process, from formatting elements such as color and borders to whether cells are empty and, if not, what their data types are. Based on this information, it will automatically discard header/footer text, flatten multi-level table headers, remove inline comments and so on. For the example above, you’d get something like this.

The structure table Tableau was able to extract from the Excel spreadsheet

For comparison, this is what it would look like with the option turned off. Big improvement!

Results if Excel workbook were to be read as is into Tableau

Of course, since it is trying to detect a pattern, there is a chance it won’t get everything right, even though it was trained on 6,000+ Excel workbooks we gathered from public reference data repositories such as Data.gov, the UK Office for National Statistics, OECD, WHO, UNESCO, the World Bank and others. If any non-text values are dropped, you will promptly see a warning so you don’t miss anything you intended to keep. You can always preview the results in Excel (or another MS Office alternative on your machine that can open Excel files) by clicking the preview button. The preview also includes markup indicating which cells were kept (and classified as header vs. data) and which were removed.

Lost data warning

In 9.0, this functionality is only supported for Excel files opened through the new Excel connector (it is not supported on the legacy Jet connector).

Working with crosstabs

Now that your data looks like a table, another common problem is that most reference data comes as crosstabs, since it is intended to be consumed as reports rather than as input for further analysis. With 9.0, fear not! Now you can pivot them.

Unpivoting in Tableau

If your data changes after you create the workbook and you realize you have a new column with one more year of data, you can simply add it to the existing pivot.

Adding new columns to an existing unpivot transform

To avoid potential confusion, it is worth mentioning that even though in the data analysis world this operation is often referred to as pivot, in database terminology it is called unpivot.
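
As a quick illustration (with made-up values), a crosstab with one column per year collapses into two generated columns, which Tableau names along the lines of “Pivot field names” and “Pivot field values”:

Before:  Country | 2013 | 2014
         Norway  |  10  |  20

After:   Country | Pivot field names | Pivot field values
         Norway  | 2013              | 10
         Norway  | 2014              | 20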

In 9.0, pivot is only supported when using the new Excel and Text connectors (not with Jet); calculated fields cannot be pivoted, and pivoted fields cannot be used for joining tables (though they can be used when blending data sources).

Splitting delimited text

I am sure you have had to deal with composite strings in the past, such as e-mails, phone numbers and URLs. Extracting the relevant parts of a string can be a hassle, but it doesn’t have to be if you have Tableau 9.0. Smart split analyzes a sample of your data to determine whether it can be broken into representative parts. Some well-known patterns such as e-mails and URLs are recognized along with their types, and each part is named appropriately; an e-mail, for example, will be broken into username and domain.

Splitting E-mail addresses using the categorical field headers in the sheet view

If the data cannot be associated with a known type, Tableau will still break the string into its parts, but each part gets a generic name based on its token number. You can initiate this action in many different places in Tableau.

Splitting an unrecognized domain using the preview grid in the Data tab

As you can see, this is not just a split operation. Tableau actually removes what it determines to be constants (the repeating characters) from the string, so when you apply smart split to a string like an RGB color code in the form (33,66,99), you not only get the 3 numbers but also get rid of the parentheses.

If you have an irregular pattern and smart split doesn’t succeed, you can use “Custom Split” to specify a delimiter, whether to split from the left or the right, and how many times to split. Both split options automatically create calculated fields for you; these rely heavily on a new function in the calculation language called SPLIT.

The SPLIT function takes 3 arguments: the input string, the delimiter and the token number. For example, if you have customer names in the form “John C. Doe”, you can write the following calculation to get the first name

SPLIT([Customer Name], ' ', 1)

which gives you the first token from the left after splitting on space.

Assuming some people have middle names and some don’t, if you’d like to get the last name you can write the following

SPLIT([Customer Name], ' ', -1)

which gives you the first token from the right after splitting on space.
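
Putting the two together for the example name (expected results shown as comments):

SPLIT('John C. Doe', ' ', 1)   // returns 'John' (first token from the left)
SPLIT('John C. Doe', ' ', -1)  // returns 'Doe' (first token from the right)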

The Tableau Data Engine (extracts, the new Excel and Text connectors, Salesforce, Google Analytics, OData, Azure DataMarket, SPSS, SAS and R data files), MySQL 4.0+, PostgreSQL 9.0+ and Microsoft SQL Server 2008+ support splitting from both the left and the right. HP Vertica, Oracle, Teradata, Aster nCluster, Google BigQuery, Amazon Redshift and Hadoop Hive (Cloudera, Hortonworks, MapR) support splitting only from the left. If a database has a built-in split function, Tableau takes advantage of it; if no split function is available but regex functions are, Tableau uses those. If neither is available, Tableau falls back on simple string manipulation functions in the database to achieve the same result. As the number of splits grows, however, the resulting SQL queries can get large and perform badly, so for all sources except the Tableau Data Engine the number of splits is capped at 10.

Regular Expressions

Tableau 9.0 also expands regex support. REGEXP_EXTRACT, REGEXP_EXTRACT_NTH, REGEXP_MATCH and REGEXP_REPLACE are now supported on PostgreSQL, Oracle (10g+), Hadoop Hive & Impala (no _MATCH), Google BigQuery and the Tableau Data Engine. The Data Engine supports regex expressions in conformance with Unicode Technical Standard #18, Unicode Regular Expressions level 1, and additionally includes Default Word boundaries and Name Properties from level 2. Since the new Excel/Text connectors rely on extracts behind the scenes, they get this for free. The same applies to other extract-based sources such as Salesforce, Google Analytics, OData, Azure DataMarket, SPSS, SAS and R data files. And of course any database you explicitly create an extract from will have it as well.
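
As a quick illustration (the field name and sample value here are hypothetical, not from the product documentation), a calculation like the following pulls the first run of digits out of a string:

REGEXP_EXTRACT([Phone], '([0-9]+)')   // returns '800' for '(800) 555-1212'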

There are many other improvements in the data preparation experience in Tableau 9.0, such as the metadata grid, which allows you to quickly review your table metadata, and the newly designed connection experience, which allows searching the ever-growing list of connectivity options and contains a list of recently used data sources.

The new connection experience

Check out the Tableau blog for a high-level overview of the new data preparation features in 9.0 and improvements in other feature areas.
