Data Preparation, Python, Text analytics

Build Your Own Data Pipelines with Tableau Command Line Utilities & Scheduled Tasks

During one of my 2016 Tableau Conference talks, I shared an example data pipeline that periodically retrieved tweets containing the session’s hashtag, tokenized them, and appended the results to an extract on Tableau Server, paired with an auto-refreshing dashboard.

My “magic trick” started by showing a sorry-looking word cloud with only two tweets, which slowly filled up as the session progressed with the contents of tweets from the audience.

Tableau Conference 2016 - Accelerate Advanced Analytics

While hitting Twitter every few minutes worked well for a short demo, hourly or daily updates typically make more sense in real-life scenarios such as text analytics over social media data or geocoding the street addresses of newly acquired customers.

I got a lot of requests to turn this into a blog post, so I repurposed the demo to run sentiment analysis every night over tweets from the day prior.

It has 3 core components:

  1. A Python script that contains the logic to retrieve and analyze Twitter data and write the results to a CSV
  2. A batch file that runs the Python script, takes its output and uses the Tableau command line utilities to append the contents of the CSV to an extract on Tableau Server
  3. A scheduled task that triggers the batch file once a day and runs this pipeline

The Python Script

You can download the Python scripts from HERE. The zip archive contains analyzetweets.py, shown below, and config.py, which holds your Twitter credentials. You could embed everything into one Python file, but if you’re going to share your screen for a demo, it might be safer to keep the credentials separate 🙂

Python code snippet for Twitter sentiment analysis
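
Since the script above is only a screenshot, here is a minimal sketch of what analyzetweets.py could look like. It is an illustration rather than the exact code from the demo: the hashtag, the credential names in config.py and the CSV column names are all assumptions, so prefer the downloadable originals if the link works for you.

# analyzetweets.py -- illustrative sketch, not the exact script from the demo
import csv
from datetime import date, timedelta

from twitter import Twitter, OAuth
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

import config  # assumed to define TOKEN, TOKEN_SECRET, CONSUMER_KEY, CONSUMER_SECRET

# Authenticate against the Twitter API and pull tweets from the day prior
api = Twitter(auth=OAuth(config.TOKEN, config.TOKEN_SECRET,
                         config.CONSUMER_KEY, config.CONSUMER_SECRET))
yesterday = (date.today() - timedelta(days=1)).isoformat()
results = api.search.tweets(q="#data16 since:" + yesterday, count=100)  # hashtag is a placeholder

# Score each tweet with VADER and write the rows the batch file will append to the extract
analyzer = SentimentIntensityAnalyzer()
with open("sentimentscores.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["created_at", "text", "neg", "neu", "pos", "compound"])
    for tweet in results["statuses"]:
        scores = analyzer.polarity_scores(tweet["text"])
        writer.writerow([tweet["created_at"], tweet["text"],
                         scores["neg"], scores["neu"], scores["pos"], scores["compound"]])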

For this sample code to work, you will need to install two Python packages, both of which you can easily get via pip: vaderSentiment, a lexicon- and rule-based sentiment analysis tool, and twitter, which is used to query Twitter’s API.

pip install twitter
pip install VaderSentiment

You also need your PATH variable set correctly so your operating system can find python.exe and these libraries, not to mention a Twitter developer account so you can access Twitter’s data. Here is a good tutorial on how to set one up.

Note that if you use this as a template to run your own code that doesn’t do sentiment analysis on Twitter data, you won’t need either of these packages.

The Batch File

The batch file navigates into the folder containing the Python script, executes it, then takes its output (sentimentscores.csv) and uses “tableau addfiletoextract” to append its contents to an existing extract on the server (one with the same set of columns as the CSV file). You can copy-paste the content below into a text file and save it with a .bat extension.

@CD C:\Users\bberan\Documents\twitterDemo
@CALL python analyzetweets.py
for %%I in ("C:\Program Files\Tableau\Tableau 10.0\bin") do set TableauFolder=%%~sI
@CALL %TableauFolder%\tableau addfiletoextract --server https://your-Tableau-server --username yourUserName --password "yourPassword" --project "TheProjectName" --datasource "TheNameofTheExtractDataSourceToAppendTo" --file "C:\Users\bberan\Documents\twitterDemo\sentimentscores.csv"

The Scheduled Task

Windows Task Scheduler is a handy but relatively unknown tool that comes preinstalled on every Windows computer (Unix variants have similar utilities such as cron).

Launch it from your Start Menu and simply create a task from the Actions tab that points to the batch file.

Creating an action to start a program with Windows Task Scheduler

Then, using the Triggers tab, set the frequency at which you’d like to run your batch file.

Setting a refresh schedule using Windows Task Scheduler

Now you have a data pipeline that will run nightly, retrieve recent tweets from Twitter, run sentiment analysis on them and add the results to a Tableau extract on the server.

BONUS: Auto-refresh dashboard

If you plan to show the results on a dashboard that’s on public display, you’d probably like the dashboard to refresh at a similar frequency so it reflects the latest data. For this, all you need to do is embed the dashboard inside a web page with an HTML meta refresh tag.
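
For example, a bare-bones wrapper page could look like the snippet below. This is just a sketch: the view URL is a placeholder and the 300-second interval is an arbitrary choice you would match to your extract refresh schedule.

<!-- Hypothetical wrapper page: reloads itself (and the embedded dashboard) every 300 seconds -->
<html>
  <head>
    <meta http-equiv="refresh" content="300">
  </head>
  <body>
    <iframe src="https://your-Tableau-server/views/YourWorkbook/YourDashboard?:embed=y"
            width="1200" height="800" frameborder="0"></iframe>
  </body>
</html>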

This was a rather simple proof of concept, but using this example as a template, you can create multi-stage, scheduled pipelines for many ETL tasks and deliver answers to much more complex questions. Enjoy!

R, Text analytics, Visualization

Trump vs. Clinton in N-grams

Presidential election campaigns are heating up and you all know what that means.

Word clouds 🙂

I still remember the visualizations summarizing the Romney-Obama debates in 2012, but one thing that slightly bothered me back then was that almost everything I saw just counted single words, so even some of the most memorable phrases from Obama’s campaign, such as “Four more years”, were lost.

Surely, there will be some interesting phrases candidates will use this year. How can you make sure that your analysis doesn’t miss them?

I am planning a series of posts on text analytics, and since tokenization is an important component of text analysis, let’s start by answering this question first.

This analysis relies on two R packages.

library(tm)
library(RWeka)

I wrote a number of helper functions to make analysis steps easier to follow.

# Helper functions
removeStopWords <- function(string, words) {
  stopifnot(is.character(string), is.character(words))
  splt <- strsplit(string, " ", fixed = TRUE)
  vapply(splt, function(x) paste(x[!tolower(x) %in% words], collapse = " "), character(1))
}

countWords <- function(y) { sapply(gregexpr(" ", y), function(x) { 1 + sum(x >= 0) }) }

# Find N-grams within a certain edit distance to avoid multiple subsets of the same phrase
setToClosestMatch <- function(text) {
  sapply(seq_along(names(text)), function(x) {
    ll <- agrep(pattern = names(text)[x],
                names(text)[-x],
                value = TRUE, max = list(ins = 1, del = 1, sub = 1))
    ifelse(!is.na(ll[which.max(nchar(names(text)))]),
           ll[which.max(nchar(names(text)))],
           names(text)[x])
  })
}

# To remove an, the, is etc. after finding n-grams
reduceToRealNGrams <- function(ngrams, n) {
  ngrams[countWords(removeStopWords(ngrams, append(tm::stopwords("en"), c('going', 'can', 'will')))) >= n]
}

# Tokenize 2- to 4-grams
NgramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 4, delimiters = " \\r\\n\\t.,;:\"()?!"))
}

Let’s start the analysis. I had one file for each candidate, containing transcripts for 5 speeches they gave between June 2015 and April 2016.

# Load the documents from specified folder
docs <- Corpus(DirSource("C:/Users/bberan/Downloads/Transcripts/"))
# Convert to lower case
docs <- tm_map(docs, tolower) 
# Remove common words : locations, candidates’ references to each other etc.
docs <- tm_map(docs, removeWords, c("des moines","new york","new hampshire","south carolina","united states","twin cities","san bernardino","hillary clinton","donald trump"))

This is an optional step that allows you to find phrases that differentiate presidential candidates by putting small weights on phrases used by both candidates.

# Create term document matrix with tf-idf weighting
tdm <- TermDocumentMatrix(docs, control = list(tokenize = NgramTokenizer,weighting = weightTfIdf))

m <- as.matrix(tdm)

# separate Hillary and Trump content and sort by frequency
hillary <- sort(m[,1], decreasing=TRUE)
trump <- sort(m[,2], decreasing=TRUE)

# Get top 250 most common N-grams for each candidate
hillaryTopN<-hillary[order(-nchar(names(hillary[1:250])), hillary[1:250])]
trumpTopN<-trump[order(-nchar(names(trump[1:250])), trump[1:250])]

Since we are looking for 2- to 4-grams, R will find “Make America”, “Make America great” and “Make America great again” as separate n-grams. This step consolidates all of them into “Make America great again”. It also gets rid of n-grams that become shorter than a 2-gram once stop words (a, and, the, or…) are removed. For example, a 3-gram like “this or that” would be dropped as part of this step. Removing stop words this late makes sure our phrase structure is not broken, e.g. “to keep us safe” does not become “keep safe”.

# get rid of junk and overlapping n-grams
hillaryTopNConsolidated<- reduceToRealNGrams(unique(cbind(hillaryTopN,setToClosestMatch(hillaryTopN))[,2]),2)
trumpTopNConsolidated<- reduceToRealNGrams(unique(cbind(trumpTopN,setToClosestMatch(trumpTopN))[,2]),2)

Now that we have completed the “key phrase extraction” process, we can write this data to CSV (a sketch of that export step follows the image below) and read it in Tableau to build a word cloud like this.

 N-grams frequently used by Hillary Clinton
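
The export step itself isn’t shown above; a minimal sketch, with hypothetical file names and column labels, could look like this (the tf-idf weights are looked up from the full vectors by n-gram name):

# Hypothetical export: write each candidate's consolidated n-grams and their
# tf-idf weights to CSV files that Tableau can read
write.csv(data.frame(candidate = "Clinton",
                     ngram = hillaryTopNConsolidated,
                     weight = hillary[hillaryTopNConsolidated]),
          "hillary_ngrams.csv", row.names = FALSE)
write.csv(data.frame(candidate = "Trump",
                     ngram = trumpTopNConsolidated,
                     weight = trump[trumpTopNConsolidated]),
          "trump_ngrams.csv", row.names = FALSE)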

This word cloud gives a much better idea than individual words, but can we build something that gives even more context?

Of course, by that I don’t mean a bar chart version of this. Despite all the criticism in the visualization community, word clouds do a decent job of conveying the information contained in a table of words and their respective counts, for a number of reasons:

  • Counts in word clouds are not exact or reliable quantities. A speaker can repeat a word several times in a row while trying to remember what to say next, or when they are interrupted by the audience. Even if that doesn’t happen, whether someone said something 10 vs. 11 times hardly tells you anything meaningful. So when reading a word cloud, what people look for is whether things are roughly the same or vastly different. To use an analogy: for a thermometer reading, there is no reason to display the temperature to the 6th decimal place if the thermometer is only accurate to one decimal place. Six decimal places give a false sense of accuracy and take up unnecessary space; a large number with one decimal place is much more useful in conveying the information.
  • Applying transformations like TF-IDF can change the word count values by orders of magnitude, which makes the precision of a bar chart even less useful.
  • If the corpus is large enough, word frequencies follow a power law pattern. Power law distributions apply to many aspects of human systems. The best known example is the economist Vilfredo Pareto’s observation that wealth follows a “predictable imbalance”, with 20% of the population holding 80% of the wealth. The linguist George Zipf observed that word frequency also falls in a power law pattern, with a small number of high frequency words, a moderate number of common words and a very large number of low frequency words. Later, Jakob Nielsen observed power law distributions in website page views as well, which is why word clouds often work well for highlighting the popular content on news aggregators and forums.

Maybe a good way of providing more context is finding the sentences that contain these common phrases that distinguish one candidate from the other.

First I broke the text into sentences in R.

library(openNLP)
library(NLP)
library(tm)

splitToSentences <- function(text) {
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = "en")
  text <- as.String(text)
  sentence.boundaries <- annotate(text, sentence_token_annotator)
  sentences <- text[sentence.boundaries]
  return(sentences)
}

docs[[2]] in my script holds Trump’s speeches, so to get the sentences for it:

splitToSentences(docs[[2]])

Then I wrote the results to CSV, imported the CSV into Tableau and wrote some custom SQL to join sentences to n-grams, using CONTAINS as the join criterion.

A little formatting to give it newspaper styling, and here is the final result.

 Trump, Clinton speech analysis on Tableau Public

You can click on the image to see it live on Tableau Public and download the workbook if you like. Selecting topics from the right will bring up all the sentences containing that phrase.
