Data Preparation, Python, Text analytics

Build Your Own Data Pipelines with Tableau Command Line Utilities & Scheduled Tasks

During one of my 2016 Tableau Conference talks, I shared an example data pipeline that retrieved tweets containing the session’s hash tag, tokenized them and appended to an extract on Tableau server periodically, paired with an auto-refreshing dashboard.

My “magic trick” started by showing a sorry looking word cloud with only two tweets, which slowly filled up as the session progressed with contents of tweets from the audience.

Tableau Conference 2016 - Accelerate Advanced Analytics

While hitting Twitter every few minutes worked well for a short demo, typically hourly or daily updates make more sense in real life scenarios such as text analytics over social media data or geocoding street addresses of newly acquired customers.

I got a lot of requests for making this into a blog post so I repurposed the demo to do sentiment analysis every night over tweets from the day prior.

It has 3 core components:

  1. A Python script that contains the logic to retrieve and analyze Twitter data and write the results to a CSV
  2. A batch file that runs the Python script, takes its outputs and uses Tableau command line utilities to append the contents of the CSV to an extract on Tableau Server
  3. A scheduled task that triggers the batch file once a day and runs this pipeline

The Python Script

You can download the Python scripts from HERE. The zip archive contains  analyzetweets.py shown below and the config.py which will contain your Twitter credentials. You can embed all of them into one Python file but if you’re going to share your screen for a demo, it might be safer to keep it separate 🙂

Python code snippet for Twitter sentiment analysis

For this sample code to work you will need to install two Python packages which you can easily get via pip. VaderSentiment is a lexicon and rule-based sentiment analysis tool. Twitter package is used to query Twitter.

pip install twitter
pip install VaderSentiment

You also need your PATH variables set correctly so your operating system can find Python.exe and these libraries, not to mention a Twitter developer account so you can access Twitter’s data. Here is a good tutorial on how to set one up.

Note that if you use this as a template to run your own code that doesn’t do sentiment analysis and use Twitter data, you won’t be needing any of these packages.

The Batch File

Batch file navigates into the folder containing the Python script, executes it, then takes its output (sentimentscores.csv) and uses “tableau addfiletoextract” to append its contents to an existing extract (with the same set of columns as the CSV file) on the server. You can copy-paste the content below into a text file and save with .bat extension.

@CD C:\Users\bberan\Documents\twitterDemo
@CALL python analyzetweets.py
for %%I in (“C:\Program Files\Tableau\Tableau 10.0\bin”) do set TableauFolder=%%~sI
@CALL %TableauFolder%\tableau addfiletoextract –server https://your-Tableau-server –username yourUserName –password “yourPassword” –project “TheProjectName” –datasource “TheNameofTheExtractDataSourceToAppendTo” –file “C:\Users\bberan\Documents\twitterDemo\sentimentscores.csv”

The Scheduled Task

Windows Task Scheduler is a handy and relatively unknown tool that comes preinstalled on every Windows computer (Unix variant also have similar utilities like cron).

Launch it from your Start Menu and simply create a task from Actions tab that points to the batch file.

Creating an action to start a program with Windows Task Scheduler

Then using the Triggers tab, set the frequency you’d like to run your batch file.

Setting a refresh schedule using Windows Task Scheduler

Now you have a data pipeline that will run nightly, retrieve recent tweets from Twitter, run sentiment analysis on them and add the results to a Tableau extract on the server.

BONUS: Auto-refresh dashboard

If you plan show results on a dashboard that’s on public display, you’d probably like the dashboard also to refresh at a similar frequency to reflect the latest data. For this all you need to do is to embed the dashboard inside a web page with an HTML meta refresh tag.

This was a rather simple, proof of concept but following this example as a template, you can create multi-stage, scheduled pipelines for many ETL tasks and deliver answers to questions that are much more complex. Enjoy!

Advertisements
Standard
Python, Visualization

The Knight’s Tour

I saw an online game around the knight’s tour problem the other day and it made me think of mathematics of chess puzzles. You know… Domination, independence (eight queens) etc. I thought it would be a good way to go back to writing some Python code after a long break.

The “Knight’s Tour” is a problem in which the objective is to move a knight, starting from any square on a chessboard to every other square landing on each square only once. Euler is said to be the first one to study this problem in 1759 on a 8 x 8 board. A knight’s tour is  a Hamiltonian path. A closed tour  is a Hamiltonian cycle i.e. the last square visited is also reachable from the first square by a single knight’s move. For all even n >= 6 there exists a closed knight’s tour on an n x n chessboard and an open knight’s tour if n >= 5.

We can create a matrix of size n x n with not visited cells filled with zeroes and every other cell to be filled with the order visited in the tour.

board = []
board_size = -1

def initiateBoard (board_dimensions):
    global board
    global board_size
    global knights_moves
    board_size = board_dimensions
    for i in range(0, board_size):
        board.append(board_size*[0]) #untouched board

Given knight’s behavior the possible moves relative to knight’s location are

knights_moves = ((-2,-1),(1,2),(2,-1),(-2,1),(2,1),(-1,2),(1,-2),(-1,-2))

But depending on where knight is on the board, not all moves will work so we should check whether a certain point is on the board and not visited.

def isAvailable(x, y):
    if x < len(board) and y < len(board[0]) and \
       x >= 0 and y >= 0 and board[x][y]==0:
        return True
    else:
        return False

Naïve backtracking is a possible solution but for an 8×8 board, there exist 33,439,123,484,294 un-oriented paths so instead I will use Warnsdorff’’s rule to get to the solution a bit quicker. According to this heuristic method, knight always proceeds to the square from which it will have the fewest onward moves.

def getPossibleMoves(x, y):
     possible_moves = []
     for move in knights_moves:
        cx,cy = x+move[0], y+move[1]
        if isAvailable(cx,cy):
            possible_moves.append((move[0],move[1]))
     return possible_moves  

def getNumMoves(x, y):
     return len(getPossibleMoves(x, y))
 
def getNextMove (numMoves):
    smallestIndex = 0
    if not numMoves:    #Nowhere to go
        drawBoard()    #Show the results                     
        sys.exit(1)
    smallest = numMoves[0]
    for i in range(len(numMoves)):
        if numMoves[i] < smallest:
            smallest = numMoves[i]
            smallestIndex = i

    return smallestIndex

And finally the recursion to find the complete path

def solve (x,y,num_move):
    assert board[x][y] == 0
    board[x][y] = num_move                                 
    possible_moves = getPossibleMoves(x,y)
    numOfMoves = []     
    for move in possible_moves:
        numOfMoves.append(getNumMoves(x+move[0],y+move[1]))
    nextMove = possible_moves[getNextMove(numOfMoves)]
    solve(x+nextMove[0],y+nextMove[1],num_move+1)   

def getKnightsPath (board_dimensions):
    initiateBoard (board_dimensions)
    solve(0,0,1)

I’m leaving drawBoard() function to your imagination, Smile but here is what I got using matplotlib.path.

image

For scaling to very large boards a divide-and-conquer solution would be necessary. If I can find some time, I’d also like to share a parallel implementation here.

Standard