R, Text analytics, Visualization

Trump vs. Clinton in N-grams

Presidential election campaigns are heating up and you all know what that means.

Word clouds 🙂

I still remember the visualizations summarizing the Romney-Obama debates in 2012, but one thing that slightly bothered me back then was that almost everything I saw just counted single words, so even some of the most memorable phrases of Obama's campaign, such as "Four more years", were lost.

Surely, there will be some interesting phrases candidates will use this year. How can you make sure that your analysis doesn’t miss them?

I am planning on doing a series of posts on text analytics, and since tokenization is an important component of text analysis, let's start by answering this question first.

This analysis relies on two R packages.

library(tm)
library(RWeka)

I wrote a number of helper functions to make analysis steps easier to follow.

# Helper functions
removeStopWords <- function(string, words) {
  stopifnot(is.character(string), is.character(words))
  splt <- strsplit(string, " ", fixed = TRUE)
  vapply(splt, function(x) paste(x[!tolower(x) %in% words], collapse = " "), character(1))
}

# Count words in a string by counting the spaces between them
countWords <- function(y) {
  sapply(gregexpr(" ", y), function(x) 1 + sum(x >= 0))
}

# Find N-grams within a small edit distance of each other to avoid
# keeping multiple subsets of the same phrase
setToClosestMatch <- function(text) {
  sapply(seq_along(names(text)), function(x) {
    ll <- agrep(pattern = names(text)[x],
                names(text)[-x],
                value = TRUE, max = list(ins = 1, del = 1, sub = 1))
    ifelse(!is.na(ll[which.max(nchar(names(text)))]),
           ll[which.max(nchar(names(text)))],
           names(text)[x])
  })
}

# Remove an, the, is etc. after finding n-grams and drop what falls below n words
reduceToRealNGrams <- function(ngrams, n) {
  ngrams[countWords(removeStopWords(ngrams,
    append(tm::stopwords("en"), c("going", "can", "will")))) >= n]
}

# Tokenize 2 to 4 grams
NgramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 4,
    delimiters = " \\r\\n\\t.,;:\"()?!"))
}

Let’s start the analysis. I had one file for each candidate, containing transcripts for 5 speeches they gave between June 2015 and April 2016.

# Load the documents from specified folder
docs <- Corpus(DirSource("C:/Users/bberan/Downloads/Transcripts/"))
# Convert to lower case
docs <- tm_map(docs, tolower) 
# Remove common words : locations, candidates’ references to each other etc.
docs <- tm_map(docs, removeWords, c("des moines","new york","new hampshire","south carolina","united states","twin cities","san bernardino","hillary clinton","donald trump"))

This is an optional step that allows you to find phrases that differentiate presidential candidates by putting small weights on phrases used by both candidates.

# Create term document matrix with tf-idf weighting
tdm <- TermDocumentMatrix(docs, control = list(tokenize = NgramTokenizer,weighting = weightTfIdf))

m <- as.matrix(tdm)

# separate Hillary and Trump content and sort by frequency
hillary <- sort(m[,1], decreasing=TRUE)
trump <- sort(m[,2], decreasing=TRUE)

# Get top 250 most common N-grams for each candidate
hillaryTopN<-hillary[order(-nchar(names(hillary[1:250])), hillary[1:250])]
trumpTopN<-trump[order(-nchar(names(trump[1:250])), trump[1:250])]

Since we are looking for 2- to 4-grams, R will find "Make America", "Make America great" and "Make America great again" as separate n-grams. This step consolidates all of them into "Make America great again". It also gets rid of n-grams that, after the removal of stop words (a, and, the, or…), become shorter than a 2-gram. For example, a 3-gram like "this or that" would be dropped as part of this step. Removing stop words this late makes sure our phrase structure is not broken, e.g. "to keep us safe" does not become "keep safe".

# get rid of junk and overlapping n-grams
hillaryTopNConsolidated<- reduceToRealNGrams(unique(cbind(hillaryTopN,setToClosestMatch(hillaryTopN))[,2]),2)
trumpTopNConsolidated<- reduceToRealNGrams(unique(cbind(trumpTopN,setToClosestMatch(trumpTopN))[,2]),2)

Now that we have completed the "key phrase extraction" process, we can write this data to CSV and read it in Tableau to build a word cloud like the one below.
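The export step itself is straightforward in R. Here is a sketch (the file name and columns are illustrative; the tf-idf weight for each consolidated phrase is looked back up from the top-N vector):

hillaryOut <- data.frame(ngram = hillaryTopNConsolidated,
                         weight = hillaryTopN[hillaryTopNConsolidated])
write.csv(hillaryOut, "hillary_ngrams.csv", row.names = FALSE)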

 N-grams frequently used by Hillary Clinton

It gives a much better idea than individual words but can we build something that gives even more context?

Of course by that I don’t mean a bar chart version of this. Despite all the criticism in the visualization community, word clouds do a decent job of conveying the information contained in a table with words and their respective counts for a number of reasons:

  • Counts in word clouds are not exact or reliable quantities. A speaker can repeat a word several times in a row while trying to remember what to say next or when interrupted by the audience. Even if that doesn't happen, whether someone said something 10 vs. 11 times hardly tells you anything meaningful. So when reading a word cloud, what people look for is whether things are roughly the same or vastly different. To use an analogy: for a thermometer that is only accurate to one decimal place, there is no reason to display the temperature to the 6th decimal place. Six decimal places give a false sense of accuracy and take up unnecessary space; a big, readable number with one decimal place conveys the information much better.
  • Applying transformations like TF-IDF can change word count values by orders of magnitude, which makes the precision of a bar chart even less useful.
  • If the corpus is large enough, word frequencies follow a power law pattern. Power law distributions apply to many aspects of human systems. The best known example is economist Vilfredo Pareto's observation that wealth follows a "predictable imbalance", with 20% of the population holding 80% of the wealth. The linguist George Zipf observed that word frequency also falls in a power law pattern, with a small number of high frequency words, a moderate number of common words and a very large number of low frequency words. Later, Jakob Nielsen observed power law distributions in web site page views as well, which is why word clouds often work well for highlighting the popular content on news aggregators or forums.

Maybe a good way of providing more context is finding the sentences that contain these common phrases that distinguish one candidate from another.

First I broke the text into sentences in R.

library(openNLP)
library(NLP)
library(tm)
splitToSentences <- function(text) {
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = "en")
  text <- as.String(text)
  sentence.boundaries <- annotate(text, sentence_token_annotator)
  sentences <- text[sentence.boundaries]
  return(sentences)
}

docs[[2]] in my script holds Trump's speeches, so to get the sentences for it:

splitToSentences(docs[[2]])

Then I wrote the results to CSV, imported the CSV into Tableau and wrote some custom SQL to do a join between sentences and n-grams using CONTAINS as join criteria.

A little formatting to give it newspaper styling and here is the final result.

 Trump, Clinton speech analysis on Tableau Public

You can click on the image to see it live on Tableau Public and download the workbook if you like. Selecting topics from the right will bring up all the sentences containing that phrase.

R, Visualization

Correspondence Analysis in Tableau with R

Correspondence analysis is an exploratory data analysis method for discovering relationships between two or more categorical variables. It is very often used for visualizing survey data because, when the matrix is large enough (which could be due to a large number of variables, but is also possible with a small number of variables with high cardinality), visual inspection of the tabulated data or simple statistical analysis cannot sufficiently explain its structure. Correspondence analysis can remarkably simplify the representation of such data by projecting both row and column variables into a lower dimensional space that can often be visualized as a scatter plot, at a small loss of fidelity.

Let's take a look at an example. Below is the data from the 2014 Auto Brand Perception survey by Consumer Reports, where 1578 randomly selected adults were asked what they considered exemplar attributes for different car brands. Respondents picked all that applied from a list consisting of: Style, Performance, Quality, Safety, Innovation, Value and Fuel Economy.

We can convert this data into a contingency table in R and do a chi-square test which tells us that there is statistically significant association between car brands and their perceived attributes.

chisq.test(table(yourDataFrameGoesHere))

But often this is not sufficient, since my goal is to understand how different car makers are perceived: how people see my brand, how I compare with the competition, and how to competitively position an existing product or bring a new product to market to fill a gap.

Let’s visualize this as a cross-tab in Tableau.

2014 Auto Brand Perception Survey Results

Even though there are only 7 choices and a single question in the survey, this table is hard to interpret.

Let's apply correspondence analysis and see what our scatter plot looks like. Here the blue dots are cars. Blue points closer to each other are more similar than points farther away. Red items (e.g. Style, being hovered over in the screenshot) are the attributes. The axes themselves do not distinguish independent dimensions for discriminating categories, so the attributes are useful for orienting yourself when looking at the chart and help assign names to different areas of the scatter plot. If you imagine a line extending from the center of the plot towards each of the red points, the distance of the blue points to that line indicates how related they are to the particular attribute. For Volvo, for example, safety is the perception that dominates. The same can be said for Kia and Value. Subaru, on the other hand, is considered to offer safety, quality and value, while Porsche and Ferrari are associated mostly with Style and Performance, roughly in equal amounts.

Correspondence Analysis of Brand Perception Survey

This scatter plot explains 70% of the variance in the data. While it doesn’t capture everything, it is a lot easier to consume than cross-tabulation.
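I haven't reproduced the full script here, but as a minimal sketch the underlying coordinates can be computed with the ca package in R (reusing the placeholder data frame name from the chi-square test above; column names are illustrative) and then exported for Tableau:

library(ca)
fit <- ca(table(yourDataFrameGoesHere))
summary(fit)   # reports how much inertia (variance) each dimension explains
coords <- rbind(data.frame(fit$rowcoord[, 1:2], type = "brand"),
                data.frame(fit$colcoord[, 1:2], type = "attribute"))
write.csv(coords, "ca_coordinates.csv")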

The rows and columns used in computing the principal axes of the low-dimensional representation are called active points. Passive (supplementary) points/variables are projected onto the plot but not taken into account when computing the structure of the plot itself. For example, if there are two new cars in the market and you want to see their relative positioning in an existing plot, you can add them as supplementary points. If there are outliers, you can also choose to make them supplementary points so they do not skew the results. Supplementary variables, on the other hand, are typically exogenous variables, e.g. the age group or education level of the survey participant. In some cases you may prefer generating multiple plots instead, e.g. one per gender. You can mark a column or row as supplementary using the supcol and suprow arguments in the ca function call, e.g. ca(mydata, supcol=c(1,6)) makes the 1st and 6th columns in the table supplementary.

You can add more to this chart to explore further. For example, you can put the price of the car or its safety rating on color and see whether they align with the perceived value or safety. Tesla, Ford and Fiat are all associated with value, for instance, even though Tesla is not a budget car. Similarly, Volvo and Tesla both have a 5-star safety rating, but consumers associate Volvo with safety much more than any other brand. If you have multiple years of data, you can put years on the Pages shelf and watch how perception changed over time, and whether your marketing campaigns were effective in moving it in the direction you wanted.

Correspondence analysis use cases are not limited to social sciences and consumer research. In genetics for example microarray studies use MCA to identify potential relationships between genes. Let’s pick our next example from a different domain.

If there are multiple questions in your survey, you can use Multiple Correspondence Analysis (MCA) instead. Our data for this example contains categorical information about different organisms. Whether they fly, photosynthesize, have a spine….

Categorical attributes of different organisms

For a moment, imagine the first column doesn’t exist so you have no knowledge about what organism each row is. How easy would it be to understand if there are groups in the data based on these attributes?

Let’s apply MCA to this dataset. In this case I put the attributes in the secondary axis, hid their marks and made their labels larger. I also applied some jitter to deal with overlapping marks.
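As a rough sketch of the MCA step in R, again with the ca package (organisms is a hypothetical data frame name whose first column holds the organism name and whose remaining columns are the categorical attributes):

library(ca)
attrs <- data.frame(lapply(organisms[, -1], factor))   # attributes only, as factors
fit <- mjca(attrs)
orgPoints  <- setNames(data.frame(fit$rowcoord[, 1:2]), c("Dim1", "Dim2"))   # one point per organism
attrPoints <- setNames(data.frame(fit$colcoord[, 1:2]), c("Dim1", "Dim2"))   # one point per attribute level
write.csv(rbind(cbind(orgPoints, type = "organism"),
                cbind(attrPoints, type = "attribute")),
          "mca_coordinates.csv")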

I can clearly see groups like birds, mammals, plants, fungi and shellfish. If the data wasn’t labeled, I would be able to associate them looking at the chart and by examining the common attributes of adjacent points start developing an understanding of what type of organisms they might be.

Multiple correspondence analysis applied to organisms dataset

You can download the sample workbook from HERE.

R, Visualization

Quick Tip : Overlaying curves on Tableau scatter plots with R

Tableau provides a good set of trend line, reference line and band options, but sometimes you want to overlay curves based on a custom equation. Logistic regression curves, sine curves, quantile regression curves… And you want these overlay curves to be smooth…

This is very easy to do by taking advantage of the technique I shared when building Coxcomb charts and radial treemaps. If you know the equation (or know how to get to it) and it can be described in Tableau’s calculation language you can do so using a table calculation. But doing the fit dynamically involves R and when you’re passing the data to R you need to do some basic NULL handling. Here are two examples showing what the results might look like. You can see that despite very few points in my dataset and large gaps between them, the curves look very smooth.

A sine curve and logistic regression curve overlay in Tableau

The key component is the bin field created on the variable that's on the X axis. In Tableau, bins can be used to trigger densification by turning on the "Show Missing Values" option on the bin. Doing so adds NULL rows to the data backing the visualization, which you can fill with values from table calculations (including SCRIPT_ functions). In your R script, you need to remove these artificially generated NULLs so they don't confuse the curve fitting procedure you're applying.
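As a sketch of that NULL handling, the R body passed to a SCRIPT_ function could look like the following. Here .arg1 is assumed to be an x value defined for every row (e.g. derived from the bin), .arg2 is the measure that comes through as NA on the densified rows, and loess simply stands in for whatever fitting procedure you actually want to apply:

x <- .arg1
y <- .arg2
ok <- !is.na(y)                                    # drop the artificial NULL rows before fitting
fit <- loess(y ~ x, data = data.frame(x = x[ok], y = y[ok]))
predict(fit, newdata = data.frame(x = x))          # one smooth value per row, densified rows included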

I tied the bin size to a parameter so you can try different values to make the curves coarser or smoother.

If you want to take this technique a bit further, you could use one bin for each axis, which will allow you to create a grid. Then you can treat each cell like a pixel in a raster and draw shaded areas such as contours.

Below you can see two examples of this. The first one estimates the bivariate kernel density of the data points in the viz on the fly using R and draws the contours using the secondary axes of this dual (technically quadruple) axis chart.

Displaying kernel density as overlay in Tableau

The second chart uses the same data to fit a one-class SVM (support vector machine) which is often used for outlier/novelty detection with a Gaussian radial basis function then draws the decision boundaries using the secondary axes (blue ellipse). Suspected outliers are shown in red while inliers are shown in white.

Displaying the boundaries of a one-class SVM as overlay in Tableau

You can download the sample Tableau workbook from HERE.

Visualization

Understanding Clustering in Tableau 10

Tableau 10.0 comes with k-means clustering as a built-in function, so it is worthwhile talking about the use cases for clustering, how the algorithm works and why we chose to make it work the way it does.

K-means procedure splits the data into K segments.  Each segment has a centroid that corresponds to the mean value for the members in that segment. The objective of the algorithm is to place the centroids such that the total of the sum of distances between centroids and members in respective segments is as small as possible.

A simple example

To demonstrate: in the toy example below, the algorithm finds two groups with means 49 and 130 and assigns each point to the nearest mean. For example, a data point valued 85 would belong to the cluster on the left since it is closer to 49 than it is to 130.

A bimodal distribution

We can clearly see that there are two groups by looking at the two humps in the histogram; however, there is no clear gap that would visually separate the data points. K-means helps with the decision of where to split. Below is an example of what this might look like in 2 dimensions.

Two clusters

As simple as it may seem, even clustering on a single variable has many applications. For example, it is commonly advised not to have more than 7 colors on a choropleth map. If you have a continuous measure and want to convert it to a discrete color palette with a particular number of colors, you can achieve this via clustering. In fact, the Jenks Natural Breaks method, which is the common default in most mapping software, is an application of k-means to a single field.

But in many real life scenarios, data is much more complex than a single variable. Age, income, number of kids, average transaction amount, weight, blood pressure… can all tell you something about different ways to group your customers, patients, students…

Making sense of the results

Sometimes groupings in data make immediate sense. When clustering by income and age, one could come across a group that can be labeled as “young professionals”. Clustering feature comes with a Describe dialog that gives you summary statistics for each cluster to help with this process.

In UN’s development indicators dataset, using the Describe dialog, one can clearly see that Cluster 1, Cluster 2 and Cluster 3 correspond to Underdeveloped, Developing and Highly Developed countries respectively. By doing so we’re using k-means to compress the information that is contained in 3 columns and 180+ rows to just three labels.

Centroids for clusters in the UN development indicators dataset

Clustering can sometimes also find patterns your dataset may not be able to sufficiently explain by itself.

For example as you’re clustering health records, you may find two distinct groups and “why?” is not immediately clear and describable with the available data, which may lead you to ask more questions and maybe later realize that difference was because one group exercised regularly while the other didn’t, or one had an immunity to a certain disease or may even indicate things like fraudulent activity/drug abuse which otherwise you may not have noticed. Given it is hard to anticipate and collect all relevant data, such hidden patterns are not uncommon in real life.

A great real life example of this is from Bank of America. In the 90s, Bank of America's National Consumer Assets Group started exploring clustering to segment their customers. One cluster they found had a high portion of customers who either had a home equity line of credit or had a high propensity score for that product. The cluster included only 7 percent of the bank's customers, but it contained more than a quarter of the customers with the highest equity propensity scores. These customers tended to be middle-aged, married homeowners with teenage children. 30% of them had both personal and business-oriented banking products, meaning many were small business owners. The bank couldn't figure out what this meant by just looking at the data. A survey of branch managers confirmed that these customers were borrowing against their home equity to fund the startup of a small business! None of the bank's marketing material had been oriented that way. A new campaign around this idea got a much better response from customers.

You will see many cases where clustering is used as a coarse summary of the structure in the data, one that helps move the exploration in a direction worth pursuing and serves as a tool for slicing the data (forcing a structure onto it) rather than serving an exact answer on a silver platter. The Save and Reuse section provides yet another example that underlines this point.

Back to the algorithm…

Like any procedure, there are many ways to implement k-means. Typical k-means implementations use random initialization (either a random seed or random partitioning), which means two things: 1) subsequent runs over the same data will return different results, and 2) with a single random initialization, the results will most likely be suboptimal (stuck at a local minimum) because it is random guessing. The common solution for (2) is to start over thousands of times and pick the best result, which increases the chances of finding a good result at the expense of long wait times.

Our goal was to deliver great results that are repeatable with good performance. This is a tough balancing act. Quality and repeatability are must haves for trusting the results while good performance encourages experimentation with different filters, levels of detail, input variables etc. which is the path to more insights.

After testing a number of initialization methods we found the sweet spot with Howard-Harris method. Howard-Harris is a hierarchical divisive method that finds the variable of highest variance and splits the data in half (at the mean of that variable) and uses the means of each half as initial centroids to find 2 clusters, in our case, using Lloyd’s algorithm.  Then it continues by splitting the cluster with highest variance (out of the first 2 clusters) using the same method to get to a 3 cluster solution and repeats this process until the desired number of clusters is reached.

Lloyd’s algorithm takes the initial centers from Howard-Harris and clusters all points around the nearest center then computes a new center for each cluster by averaging the points in the cluster. If the new center matches the center from the previous step, it returns the result, if not, assigns points to the new center (the mean) and repeats the process.
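Below is a minimal R sketch of that iteration (not Tableau's actual implementation), assuming data is a numeric matrix and centers holds the initial centroids coming out of the Howard-Harris splits:

lloyd <- function(data, centers, max_iter = 100) {
  for (i in seq_len(max_iter)) {
    # assign every point to its nearest centroid using squared Euclidean distance
    d <- sapply(seq_len(nrow(centers)),
                function(j) rowSums(sweep(data, 2, centers[j, ])^2))
    cluster <- max.col(-d)
    # recompute each centroid as the mean of its members (empty clusters not handled here)
    newCenters <- t(sapply(seq_len(nrow(centers)),
                           function(j) colMeans(data[cluster == j, , drop = FALSE])))
    if (isTRUE(all.equal(newCenters, centers, check.attributes = FALSE))) break  # centers stopped moving
    centers <- newCenters
  }
  list(cluster = cluster, centers = centers)
}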

Distance between data points

In the previous sections, I used the phrases “sum of distances” and “to the nearest centroid” several times.  But how is this distance measured?

There are many ways to measure distance: Manhattan, Mahalanobis, Chebyshev, Minkowski, Euclidean…

Tableau uses squared Euclidean distance for clustering. A good way to understand Euclidean distance is to start with the simple, 2-dimensional case: the hypotenuse of a triangle. For a right triangle whose legs/catheti are 3 and 4, the Euclidean distance between their endpoints (the length of the hypotenuse) would be SQRT(3^2 + 4^2) = 5. This can be extended to more than 2 variables, e.g. SQRT(x^2 + y^2 + z^2 …).

Tableau doesn't take the square root, which makes it squared Euclidean distance. It is common to use squared Euclidean distance for k-means clustering.

But how could one compute distances with categorical fields?

Tableau uses Multiple Correspondence Analysis (MCA) to convert categories into numeric distances before clustering so you can use both numeric and categorical fields as inputs to clustering.

Assume you have a single column with 3 categories: Shoes, Dresses and Hats. The three categories don't contain any truly measurable distance information; they are just 3 different strings. Given this is the only available information, the assumption would be that they are at equal distance from each other. If you like thinking in pictures, you can imagine them as the 3 corners of an equilateral triangle.

Occurrence counts and, if there is more than one categorical column, co-occurrences also impact the distances. For example, if you have hospital admission form data, you would likely have some people who checked both the female and pregnant boxes, some female and not pregnant, and potentially some who, by mistake, marked themselves as male and pregnant. Male and pregnant would be a very rare occurrence in the database, so it would be further away from the other points.

Tableau currently imposes a limit of max 25 unique categories per field and a total of 1000 unique categories. So you can use a maximum of 1000/25=40 categorical columns with 25 unique categories in each in a given clustering procedure.

Note that clustering accepts categorical fields as measures since they are meant to be used as attributes of the entities being clustered, e.g. the maximum education level attained by a person. When you use categorical fields that are finer than what's being clustered, e.g. the list of products sold at a store while the store is what's being clustered, ATTR([Field]) may return * for some rows. * is treated like a NULL value, which means those rows will end up in the "Not Clustered" group.

Wouldn’t some measures dominate the results?

If you have a table of country statistics, you might have a column for GDP and one for birth rate. Using the distance formula above, GDP would dominate the clustering results as it is orders of magnitude greater than birth rate. To avoid this problem, Tableau automatically scales all inputs. Of course, there are many ways to do scaling. Tableau uses min-max scaling (also known as normalization), which subtracts the minimum of the column from each value and then divides the result by the difference between the max and min of the column. As a result of this transformation, each column gets mapped to the range between 0 and 1: the country with the highest GDP ends up with a GDP of 1, the lowest with 0, and the same is done for the highest/lowest birth rates, so all variables are equally weighted. Standardization (z-score) is another common method used by many software packages that do k-means; however, min-max scaling is generally considered better for clustering per the research comparing different scaling methods by Stoddard (1979), Milligan & Cooper (1988), Dillon, Mulani & Frederick (1989) and Steinley (2004).
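In R terms, the min-max scaling described above amounts to something like this one-liner (m is a hypothetical numeric matrix of clustering inputs; a constant column would need special handling to avoid division by zero):

scaled <- apply(m, 2, function(col) (col - min(col)) / (max(col) - min(col)))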

How is optimal number of clusters determined?

There are many ways to find the optimal number of clusters. Some, such as the elbow method, involve eye-balling, while others rely on picking the maximum or minimum value of a metric. You can find R packages that offer 50 different metrics, and when you use them you realize that they all disagree with each other. This is because each method makes assumptions about the shape of the data and what makes a good cluster. There is abundant literature comparing these metrics in search of the methods that are most widely applicable and give good results and performance. We started with these studies, expanded upon them by running our own tests, and decided to proceed with the Calinski-Harabasz index based on the combination of its performance and how well it works with k-means.

Calinski-Harabasz criterion
Between group sum of squares (SSB) is the sum of squared distances between centroids of clusters and the centroid of the entire dataset.  Larger values can be used as a sign of good separation between clusters.

Within group sum of squares (SSW) is the sum of squared distances between centroid of each cluster and the points in that cluster. Smaller values can be used as a sign of compactness and uniformity of individual clusters.

N is the number of rows in the table and k is the number of clusters.
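Putting these pieces together, the index takes the standard form

Calinski-Harabasz index = (SSB / (k - 1)) / (SSW / (N - k)) = (SSB / SSW) * ((N - k) / (k - 1))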

Tableau looks for the maximum value of the Calinski-Harabasz index, which happens when the between-group sum of squares is high and the within-group sum of squares is low. If the index consisted only of this ratio, the way to maximize it would of course be to create as many clusters as possible. This is where the second part (the right side) of the equation comes in, which you can think of as a penalty for splitting into more clusters. It prevents further splits from happening unless a split is justified by enough reduction in error.

Tableau runs k-means with values of k ranging from 2 to 25 and compares each result with the previous one. If the current result is less than the previous one, it returns the previous result. Since it is looking for a local maximum, it will terminate early, which means better performance and conservative estimates of the cluster count.

When/why to manually set k?

It is very common to use clustering to segment data that has no visible “clumps”. Clothing manufacturers use it to slice data into Small, Medium, Large etc. clusters. Similarly it can help with product bundling decisions e.g. how many minutes, # SMS and GBs of data should be offered in a deal for “moderate users” segment?

Save and Reuse

Once you're satisfied with the results, you can persist them by dragging the "Clusters" pill into the data pane. The resulting field can be used just like a group in the rest of Tableau. You can rename/alias the cluster groups and even manually move items around.

Overriding results from an algorithm manually might sound weird at first but clustering is an exploratory process and it is perfectly OK to enhance results using external knowledge.

A good example of such use is the way Boston Globe divided 200 towns in eastern Massachusetts and southern New Hampshire into editorial zones. In 2003, The Boston Globe introduced geographically distinct versions of the paper with specialized content. Two days a week, readers would get local coverage for their area. The newspaper clustered towns into 4 groups based on demographic information but then applied external rules before coming up with the final result. For example, the zones needed to be geographically contiguous to optimize delivery routes and contain sufficient population to justify specialized content as well as being aligned with how the advertising deals were made which resulted in manually moving towns between groups such that it satisfies these constraints while also having similar demographics identified through clustering.

When does k-means not work well?

Like any statistical method k-means works well in some cases but not all. You can find many examples with a quick Google search showing some patterns that are very easy to detect visually but k-means doesn’t “find” them. Here are two of the typical examples used for these demonstrations.

Donut and 3 spirals examples

Why is this happening?

K-means seeks to divide the data into areas where each area contains similar items that can potentially be summarized/represented by its mean, such that you can look at a cluster's centroid and give it a meaningful name, e.g. enterprise customers.

If you look at the example with two circles, and just assume for a moment that one axis is weight and the other axis is height, what do the points in the outer circle have in common with each other? The outer circle contains both the tallest and the shortest person as well as the heaviest and the lightest. The members of the outer circle don't share any common attributes. One can say the same about the 3 spirals. In neither of these cases are the visually detected groupings similar.

If one is looking to discover such patterns, k-means is not the right method. Patterns like this could be seen on maps (events following a river or a road) or in image processing etc., and would require algorithms that are significantly slower than k-means, ones that typically look at pairwise distances between points searching for continuity, dense areas vs. sparse areas and so on. It is extremely unlikely that your business data will look anything like these.

When the goal is finding groups that share similar properties, k-means often does a good job of detecting them even if the areas have very different densities as shown in the example below as long as there is sufficient data.

3 clusters with varying densities

K-means will give you convex/globular clusters since each point is assigned to the nearest centroid, and if this works for your data and type of analysis (and in a lot of cases, when data is being sliced/segmented into a desired number of buckets, it will), it can be safely applied to enhance your data exploration experience.

I can't emphasize the importance of the word exploration enough. First of all, when looking for clusters, you will most likely want to write calculations, apply filters and aggregate your data to convert, say, transaction-level data into metrics that might potentially uncover clusters, such as a customer's average spending or days since last visit. This is often referred to as feature building and is the step where many analyses prematurely fail. We built our clustering user experience to be very dynamic and interactive to encourage more experimentation and increase the likelihood that relevant metrics get distilled. Secondly, examining the results of clustering (e.g. using the Describe dialog) is an integral part of the analysis. Of course this can be said for any statistics or data mining method. But with clustering, given you have the ability to augment the results with your domain knowledge, such exploration means you can tackle cases where the algorithm doesn't give you a perfect answer right off the bat. For example, if you have an elongated cluster where k-means split it in half into 20-25 year-olds spending 0-5K and 20-25 year-olds spending 5-10K, but you believe it makes sense for it to be a single cluster of 20-25 year-olds regardless of spending, you can save the cluster pill and merge the two clusters, in just 5 clicks.

And that’s a wrap. I hope you find this information on the details of the algorithm and the use case examples helpful.

Happy clustering!

R, Visualization

Time Series Clustering in Tableau using R

Clustering is a very common data mining task and has a wide variety of applications from customer segmentation to grouping of text documents. K-means clustering was one of the examples I used on my blog post introducing R integration back in Tableau 8.1. Many others in Tableau community wrote similar articles explaining how different clustering techniques can be used in Tableau via R integration.

One thing I didn't see getting much attention was time series clustering and the use of hierarchical clustering algorithms, so I thought it might be good to cover both in a single post.

Let's say you have multiple time series curves (stock prices, social media activity, temperature readings…) and want to group similar curves together. The series don't necessarily align perfectly; in fact, they could even describe events that are happening at a different pace.

Time series clustering in Tableau using R

There are of course many algorithms to achieve this task but R conveniently offers a package for Dynamic Time Warping. Below is what my calculated field in Tableau looks like.

Calculation for dynamic time warping in Tableau

I started by loading the dtw package, then converted my data from long table format (all time series are in the same column) to wide table format (each series is a separate column).

I then computed distance matrix using dtw as my method and applied hierarchical clustering with average linkage. This gives a tree which I then prune to get the desired number of clusters.

And finally, data is converted back into a long table before being pulled back into Tableau.
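The calculated field itself is in the workbook, but a rough standalone R sketch of those steps (assuming long is a data frame with columns series, time and value, and an arbitrary choice of 4 clusters) would be:

library(dtw)        # provides the DTW distance and registers it with the proxy package
library(reshape2)
wide <- acast(long, time ~ series, value.var = "value")     # long to wide: one column per series
d  <- proxy::dist(t(wide), method = "DTW")                  # DTW distance between every pair of series
hc <- hclust(d, method = "average")                         # hierarchical clustering, average linkage
clusters <- cutree(hc, k = 4)                               # prune the tree to the desired number of clusters
long$cluster <- clusters[as.character(long$series)]         # back to long format for Tableau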

Screenshot is from Tableau 10, but don’t worry if you’re not on the Beta program. You can download Tableau 8.1 version of the workbook from HERE. Enjoy!

Visualization

Going 3D with Tableau

Even though they are not the type of visuals that first come to mind when you talk about business intelligence, if your background is in science or engineering like mine, at some point in your career you have probably made 3D charts. And by that I don't mean 3D bar or pie charts. I mean scatter plots, contours, mesh surfaces… where the 3rd dimension adds meaningful information.

If you wanted to do these types of charts in Tableau, how would you do them?

I thought it would be good to put together a few examples starting from the simplest and moving to more advanced visualizations. Let’s start with the scatter plot below.

Four clusters in a 3D scatter plot

Achieving this takes a background image (the cube image, to emphasize the feeling of depth) and 3 simple calculated fields that re-project the x, y, z coordinates into x_rotated and y_rotated fields. Wikipedia has a great summary if you're curious where these formulas come from. You can use the z_rotated field to set the size of the marks for a perspective effect and also to set the sort order of the marks so you get the z-order right. You can find this chart as a Tableau Public visualization HERE if you'd like to take a closer look or use it as an example for your own 3D scatter plots.
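For reference, a standard way to express that re-projection (rotating by angle a around the vertical axis, then by angle b around the horizontal axis) is the following; the workbook's calculated fields may parameterize it slightly differently:

x_rotated = x * cos(a) + z * sin(a)
z1        = z * cos(a) - x * sin(a)
y_rotated = y * cos(b) - z1 * sin(b)
z_rotated = y * sin(b) + z1 * cos(b)

Here z_rotated is the depth after both rotations, which is what drives mark size and sort order.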

The example above shows rotating in 2 directions (the dashboard exposes one control, but you can download and open the sheet to see both options), but what if you wanted to be able to rotate in 3 different directions? I saved this for my next example, as the incremental change makes it easier to follow.

Caffeine Molecule (click to view Tableau Public viz)

The caffeine molecule example above uses a dual axis chart. One axis draws the atoms while the other draws the bonds between them. It still relies purely on the background image, mark size and z-order to achieve the 3D look, just like the previous example, but you will notice that this time "size" has a multiplier to account for the size of the atom in addition to the distance in 3D space, so the relative sizes of different atoms are represented. Another thing you'll notice is that now that you're using a dual axis, z-order doesn't always work well, since Tableau sorts within each axis independently: in this setup atoms are always drawn above bonds, but each group is sorted within itself. You can find the Tableau Public viz HERE or by clicking on any of the screenshots.

You can extend this basic idea to many other types of 3D views. For example, a contour plot like the one below can go from the 2D contours on the left to the perspective view with 3 extremely simple calculated fields.

Contour Plot (click to view Tableau Public viz)

Or 3D meshes that let you play with different parameters and allow you to explore results of different equations.

3D Mesh in Tableau (click to view Tableau Public viz)

To look at waves and ripples…

3D Mesh in Tableau (click to view Tableau Public viz)

Or draw filled surfaces…

Filled surface

Or just making some viz art, like downloading radio telescope data to rebuild the cover of Joy Division's Unknown Pleasures in 3D in Tableau.

Joy Division - Unknown Pleasures (click to view Tableau Public viz)

How about also drawing axes, grids and even labels and having them rotate with the visualization, instead of using a static background image where only the marks rotate?

3D scatter plot with axes in Tableau (click to view Tableau Public viz)

Since you're custom drawing these objects, you will need data points for the axis extents, grid cell coordinates etc. I did this using Custom SQL. You can find the data used in all the example workbooks HERE, as well as the SQL. But the easiest way to understand how it works is to explore the visualization. Thanks to VizQL, you can simply drag the [Type] pill from the color shelf to the row shelf to see the visualization broken into its pieces, which will help you understand how lines (surfaces, grids etc.) are handled differently from points and how it is possible to turn the grid on/off using a parameter.

Breakdown of components of the visualization (click to view Tableau Public viz)

Besides the Custom SQL, there is very little change needed to go from a static background image to dynamic axes. You will see there is a minor change only in the x_rotated field's definition.

No 3D visualization post would be complete without an example that involves 3D glasses 🙂

To be honest, this post didn't originally have one, but as I was wrapping up the examples I showed them to some co-workers. Apparently two Tableau developers (Steven Case and Jeff Booth) had put together a stereoscopic sphere example; they thought it would be a nice addition to this post and were kind enough to share it with me. Time to put on some anaglyph glasses…

Anaglyph Stereoscopic 3D Sphere in Tableau (click to view in Tableau Public)

In this post I tried to share some examples of how to create various types of useful 3D charts in Tableau. I hope you find it useful.

Visualization

Hidden gems in Tableau 9.2

Tableau 9.2 has just been released. With a release full of exciting features like Mapbox integration, the ability to move totals, new permission settings and a new iPhone app, it is easy to miss some small but welcome improvements.

Here are a few little features that I am sure some of you will appreciate.

Filtering on discrete aggregates : Throwing that ATTR(State) or a discrete MIN(Date) on the filter shelf is not a problem anymore.

Putting more in your LOD dimensionality : Before 9.2, you wouldn’t be able to use Sets, Combined Fields and Bins as dimensionality in your LOD expressions. Now you can write that {fixed [Profit (bin)] : AVG([Sales])} calculation you always wanted 🙂

As usual, we are looking forward to your feedback on the new release!

Visualization

Quick tip : Creating your own dashed line styles in Tableau

You may be writing a paper that will be published in black and white where color is not an option, or you may just like using different patterns. Whatever the reason might be, sometimes you want to draw your line charts with dashed lines. The most common methods people use to do this in Tableau are 1) using a dual axis chart and 2) using the Pages shelf. Both are very simple, but (2) returns much more pleasant results.

Last week, I was talking to a Tableau customer and what they asked for during the call was not achievable with either of these methods since they wanted to be able to use multiple dashed line styles in the same visualization and also display a legend for it. So I had to improvise 🙂

After the call I thought it would be good to put together an example and share it with everyone in case anybody else wants to do the same. You can get to the Tableau Public visualization by clicking HERE or the image below and download a copy of the workbook to take a closer look if you like. I also added the other two options I mentioned into the sample workbook.

Dashed lines in Tableau (click to see in Tableau Public)

So how does it work? The solution is simple but has to be applied with caution. The best way is to look at before/after results to make sure sudden peaks or dips are not eliminated by the dashed line effect. The trick is to insert NULLs in the right places to get the line effect you like. For example:

IF INDEX()%5 <> 0 THEN
[Your Field Here]
END

In the example I published, I used a densified axis to get a very smooth curve. Densification is not required, but with a smooth curve and a dense set of points the result looks much better.

If you’re doing curve fitting using R integration, you can achieve something similar inside your R script using

for (i in 1:6) result[seq(i, length(result), 24)]<-NA;

where result is assumed to be the name of the vector you would be returning to Tableau. What this does is insert 6 consecutive rows of NULLs into your result for every 24 rows.

Visualization

Tableau Customer Conference 2015

Every year I keep asking myself: could the conference get any better? And it does. This year, 10 thousand customers and 1 thousand Tableau employees got together for the data event of the year.

I flew to Las Vegas two days before the first day of the conference as I was part of the backstage crew and a backup speaker. After two days of rehearsing on site at the MGM Grand Arena where Manny Pacquiao and Floyd Mayweather “took stage” a few months earlier, we were ready for the big day.

Backstage the day before
The calm before the storm

Of course two days is nothing compared to the overall time spent preparing for the keynote. Even though each speaker spends roughly 10 minutes on stage, a speaker and his/her backup spend at least 100 hours looking for interesting datasets, writing the demo scripts and rehearsing them. But it is definitely worth it. A conference of this scale and “data rock stars” attending it deserve no less.

Keynote backstage
Keynote backstage

Overall, my favorite part of the keynote demos was the visual analysis section which covered the support for embedding visualizations in tooltips and connecting to spatial data files (shape files etc.) among many other smaller improvements but the biggest feature in the keynote for me was data integration (cross-database joins and unions). Since my teams work on the advanced analytics features such as clustering and multivariate outlier detection, I was very happy to see the positive reaction from the audience.

Once in a while my Photoshop skills come in handy. In addition to working on the features and co-authoring the demo, I also designed the Tableau 2075 conference announcement image that concluded the analytics section of the keynote.

Tableau Conference 2075, Kepler 452-b

One of the most exciting parts of the conference for me is the opportunity to talk to lots of Tableau users face-to-face. This year with Tableau Community Appreciation Party, it got even better. I got to meet with lots of fellow Tableau users I see and interact with on the forums and bloggers whose work I enjoy.

Analytics roundtable was very helpful in hearing detailed reactions to keynote demos and what the needs were in other similar areas. Based on these discussions, I compiled a list of topics to cover here on the blog in the next few months and was convinced that I need to join this thing called Twitter.

Delivering my session “Understanding Level of Detail Expressions” (and hearing Joe Mako’s comments on the content afterwards), being part of Matt Francis and Emily Kund’s podcast were other highlights of the event for me.  And of course, meeting (and partying with) Tableau Zen Masters was the best way I could imagine to wrap up the event. I am already looking forward to next year’s conference!

Visualization

What’s new in Tableau 9.0? Part 2 – Level of Detail Expressions

Level of detail expressions extend Tableau’s calculation language by introducing the capability to define at what level aggregations should happen. It is a small syntax change but implications are HUGE!

I will focus on the technical details in this post with a few examples to clarify the concepts. Stay tuned for a series of posts on Tableau website from our Product Consultants which will cover a range of business questions you can answer using level of detail expressions with lots of example workbooks, perfect for some inspiration.

What is level of detail anyways?

In Tableau, the results for a calculation such as SUM([Sales]) depend on the context. The context is defined by the filters and level of detail. Dimensions on rows, columns, color, size, label, detail or path shelves define the level of detail for the sheet.

For example, if the [State] dimension is on the Rows shelf, SUM([Sales]) will give the sum of all transactions for each [State]. If [Product Type] is also on one of the shelves I mentioned above, then SUM([Sales]) will give the sum of all transactions within each [State] for each [Product Type]. The more dimensions in your sheet or the more unique members each dimension contains, the more granular your results will be. Since each result is drawn as a mark in your visualization, the finer the level of detail of your sheet, the more marks you will have.

Below is an example using the Coffee Chain dataset. The image on the left has Product Type as the only dimension in the sheet while the one on the right also has Product, giving it finer granularity hence more marks.

Product type vs Product, coarser and finer level of detail respectively

Filters, on the other hand, change the context by reducing the number of data points used in the calculation, e.g. removing a state by name, states where profit is below zero, or states that are not in the top 5. Using table calculations as filters is an exception to this, however, since they just hide marks without changing the data over which the calculations operate.

Level of Detail Expressions

Level of detail expressions allow you to specify the context for your calculations. For example the following expression

{fixed [State] : SUM([Sales])}

will always calculate sum of sales per state regardless of sheet’s level of detail. It also ignores the filters in effect except context filters, data source and extract filters.

Tableau has had several mechanisms for calculating at a coarser level of detail than the sheet or for performing multiple levels of aggregation for many releases. Reference lines, totals and table calculations have been commonly used for these tasks, and they work very well in some cases but not others. What level of detail expressions bring to the table is the ability to mix the results with row-level values, push the mixed and multi-level aggregate calculations to the database (as opposed to calculating locally with the other methods) and use the results as dimensions, for table calculation addressing/partitioning, in actions, or to bin them.

Level of detail expressions have the following structure

{keyword [dimension1],[dimension2]… : Aggregate expression}

The portion between the keyword and the colon is the dimensionality expression, which consists of 0 or more dimensions. If there is more than one dimension, they must be separated with commas. The aggregate expression does not necessarily have to be a simple aggregation like SUM([Sales]); you can insert other calculations here, even wrap an IF block inside a level of detail expression.
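For example, a (hypothetical) calculation like the one below sums only furniture sales per customer, with the conditional logic living inside the level of detail expression:

{fixed [Customer ID] : SUM(IF [Category] = "Furniture" THEN [Sales] END)}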

There are 3 different keywords

Fixed : Limits the level of detail of the aggregate expression to only the dimensions listed in the dimensionality expression. I will cover the interaction with filters towards the end of this post.
Include : Defines the level of detail for the aggregate expression by combining the dimensions listed in the dimensionality expression with the dimensions in the sheet. It is great when you want to calculate at a fine level of detail in the database, then re-aggregate and show at a coarser level of detail in your visualization, and/or when you want your calculation's level of detail to change when you drill down or add more dimensions to your sheet.
Exclude : Defines the level of detail for the aggregate expression by removing the dimensions listed in the dimensionality expression from the dimensions in the sheet. It is great for 'percent of total' or 'difference from overall average' scenarios, and can be considered the written/textual representation of features like Totals and Reference Lines.

Specifying no dimensions using fixed keyword e.g. {fixed : MIN([Order Date])} gives a single result for the entire table. Shorthand for {fixed : MIN([Order Date])} is {MIN([Order Date])}. This gives you the date your store made its first sale, when you earned that dollar bill you got framed 🙂

Since level of detail of the sheet determines the number of marks drawn in the visualization, when your calculation has a different level of detail, something needs to be done to reconcile the difference.

Here is what Tableau will do automatically for you.

If your calculation is of coarser level of detail, Tableau will replicate the results as needed  such that there is a value for each mark. In the example below

{fixed [Segment] : SUM([Sales])} is getting replicated for each Category, Segment.

Replication behavior when calculation has coarser level of detail than the sheet

This allows you to write the following to calculate the difference between a transaction amount and the average transaction amount for that customer.

[Transaction Amount] - {fixed [Customer ID] : AVG([Transaction Amount])}

Note that this calculation mixes an aggregate with a row level value but doesn’t give you the error you’re used to seeing when you have aggregates and non-aggregates in the same calculation. Yet another benefit of using level of detail expressions!

If your calculation has finer level of detail, Tableau will aggregate the results as needed  such that there is only one value for each mark. In the example below

{fixed [Segment], [Category] : SUM([Sales])} is getting AVG’ed up to one value per segment (assume that the pill in the sheet has AVG as aggregation).

Aggregation behavior when calculation has finer level of detail than the sheet

With the include keyword, the calculation will always have either the same or a finer level of detail than the sheet, which requires aggregation to match the sheet's level of detail. With fixed, it can be coarser or finer depending on the calculation and the dimensions in the sheet. Since the need to aggregate depends on what dimensions are in the sheet, it can change at any time. Having an aggregation defined is safe: if and when aggregation is needed, you have already told Tableau how to do it. Because of all this, calculations with a custom level of detail are always wrapped in an aggregate when used in the sheet, unless they're used as dimensions. If no aggregation is happening (e.g. the calculation has the same level of detail as the sheet, or is coarser and hence gets replicated instead), changing the aggregation will not make any difference in the results. You will notice that when you drag a pill that is an exclude calculation, Tableau defaults to ATTR (as opposed to SUM or AVG as you're used to seeing with measures) to hint that what's happening isn't aggregation.

How does it work?

Level of detail expressions rely on database queries to calculate the results. If you look at your Tableau logs, you will notice them as sub-queries with inner joins and sometimes with cross joins. If your database doesn’t have a CROSS JOIN operator, Tableau will join without an ON/WHERE clause or use one that always returns true to get the same effect as long as your database supports it.

Cross-joins happen when there is no common dimension to join on. For example {MIN([Order Date])} is a single value repeated for every row hence there is no need for a join key.  It is a cross-join between the main table (that defines the sheet) and a one-row table (resulting from the calculation).

Of course this is not the only case. For example you can have a calculation such as {exclude [State]  : AVG([Sales])} inside a Sheet where [State] is the only dimension. This would also result in “blank” dimensionality again hence a cross-join query will be generated.

Level of Detail Expressions and Filters

There are several different kinds of filters in Tableau and they get executed in the following order from top to bottom.

Filters in Tableau

I colored Extract Filters and Table Calculation filters differently since the former is only relevant if you're creating a Tableau extract from a data source, and the latter applies after the calculations are executed, hence it hides the marks without filtering out the underlying data used in the calculations.

If you’re familiar with SQL, you can think of measure filters as HAVING clause in a query and dimension filters as the WHERE clause.

Fixed calculations happen before dimension filters; as a result, unless you promote the pills on your filter shelf to "Context Filters", they will be ignored. There are many benefits to this. Imagine writing the following calculation

SUM([Sales]) / ATTR({SUM([Sales])})

and having the dimension [State] in your sheet. With this setup, the calculation will give you the ratio of a state's sales to total sales. You can add a filter to your sheet to hide some of the states, but the filter will only affect the numerator. Since the denominator is fixed, it will still give the sum for all states. When you also consider taking advantage of this with filter actions in dashboards, you can imagine how powerful it can be.

Include and Exclude calculations happen after Dimension filters like any other (except fixed) measure in Tableau. So if you want filters to apply to your level of detail expression but don’t want to use Context Filters, you can rewrite your calculation using Exclude or Include keywords.

Level of Detail Expressions and Totals

There were several questions about this on Tableau forums and Beta e-mail alias so I decided to add a section to the post to cover this topic as well.

First, let's talk about how Tableau computes totals. In Tableau there are two kinds of totals: Totals (aka single-pass totals) and Visual Totals (aka two-pass totals). The former applies the current aggregation at a coarser level of detail than the sheet and is the default setting when you turn on Totals. This option appears in the Total Using menu as "Automatic".

Below is an example of single-pass Totals where you can see that subtotals in the first image on the left match AVG(Sales) per Category  (upper right), while Grand Total matches the AVG(Sales) for the entire table (in the lower right) i.e. when no dimensions are in the Sheet.

 Totals (aka single-pass totals)

If you wanted the Total rows to show the AVG of the rows above them, e.g. the average of the values displayed for Bookcases, Chairs, Furnishings and Tables, then you would need to use Visual (aka two-pass) totals. You can switch to visual totals by changing the “Total Using” setting from “Automatic” to the desired aggregation to be used in the second pass.

Note that values for single pass and two pass will be the same when what you’re doing is sum (“Total Using” setting) of sum (pill’s aggregation), max of max or min of min.
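A quick way to see why: summing per-Category sums gives the same number as summing all rows directly, and the same holds for MAX of MAX and MIN of MIN. In illustrative SQL (made-up names):

SELECT SUM(sales) FROM orders;                      -- single pass
SELECT SUM(cat_total)                               -- two passes, same result
FROM (SELECT category, SUM(sales) AS cat_total
      FROM orders GROUP BY category) t;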

In the example below, “Total Using” is set to AVG, so subtotals show the AVG of the SUM of sales per Sub-category within each Category. The green box shows the result, while the green line indicates the values that were averaged to get that result. Similarly, the Grand Total line at the bottom (blue box) shows the AVG of all the rows marked with the blue line. I renamed the total cells in this case so they are more descriptive of what’s happening.

Visual (aka two-pass) totals

Now that we’ve clarified the basics, let’s move on to the main question of how totals and level of detail expressions play together.

Even though a single-pass grand total applies the aggregation at table granularity, i.e. as if no dimensions were in the table, this does not affect what happens inside the curly braces.

For example, if your visualization contains the dimension [Product Category] and your calculation is
AVG({include [Customer Name] : SUM([Sales])})

The LOD expression will still be calculated at the [Customer Name], [Product Category] level for the grand total. The total row only changes how the outer aggregate (AVG in this case) behaves, so the results of the LOD calculation get averaged up to [Product Category] for each mark in the Sheet, but up to table level (a single value) for the grand total.
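Sketched in SQL (illustrative names, not the exact queries Tableau generates), the inner sub-query is the same for the marks and for the single-pass grand total; only the outer grouping changes:

-- Marks in the Sheet: average customer-level sales per category
SELECT lod.category, AVG(lod.cust_sales)
FROM (SELECT category, customer, SUM(sales) AS cust_sales
      FROM orders GROUP BY category, customer) lod
GROUP BY lod.category;

-- Single-pass grand total: same sub-query, outer GROUP BY dropped
SELECT AVG(lod.cust_sales)
FROM (SELECT category, customer, SUM(sales) AS cust_sales
      FROM orders GROUP BY category, customer) lod;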

Single-pass Totals are not affected by the replication that is triggered by LOD expressions coarser than the Sheet. Keep in mind, however, that Visual (aka two-pass) totals are.

Understanding Nesting

You can nest several layers of level of detail calculations to answer even more elaborate business questions. One thing to keep in mind, though, is that the context for a calculation is defined by its parent(s), as opposed to the Sheet. For example:

{fixed [State] : AVG({include [Customer] : SUM([Sales])})}

will have the same effect as writing

{fixed  [State] : AVG({fixed [State], [Customer] : SUM([Sales])})}

since the nested calculation inherits its dimensionality from the outer calculation in the first case, resulting in [State] + [Customer] = [State], [Customer]. Also, since the outer calculation is fixed, the nested calculation will not be affected by the filters in the Sheet.

And that’s not all. The aggregation/replication behavior I explained earlier in this post also applies to nested calculations. Let’s walk through an example…

Assume you are trying to calculate the average customer spending in each State, but you realize that customers travelling and spending small amounts in multiple States are skewing the results. Instead of filtering those customers out, you decide to calculate the total spending for each Customer and use that value towards the State average. You could do this by writing the following calculation

{fixed [State], [Customer] : AVG({exclude [State] : SUM([Spending])})}

And set the pill’s aggregation to AVG when you drag it into your sheet.

In this example, the nested calculation gives the SUM of [Spending] per [Customer]. However, the outer aggregation has the dimensionality of State, Customer, which is finer. This results in each Customer’s total spending being replicated for every State in which that customer ever made a purchase.

Anatomy of a nested level of detail calculation

In this case, since what happens between the two levels of the calculation is replication, the aggregation you use here (AVG({exclude [State] : ….) doesn’t matter. The final step that averages customer totals up to the State level happens through the pill’s aggregation setting and the fact that State is the only dimension in your Sheet.

While this is a good example for explaining how nesting works, there is an even simpler way of writing this calculation:

{fixed [Customer] : SUM([Spending])}

You just need to drag it into your sheet with the dimension State and set the pill’s aggregation to AVG.

This is because when Tableau generates the INNER JOIN, it does so between the dimensionality of the level of detail (LOD) expression and the dimensionality that is the union of the Sheet’s and the LOD expression’s dimensionalities. In this case, the union is Customer (from the calculation) + State (from the Sheet), which results in the same replication behavior as in the nested case.
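Sketched in SQL with a hypothetical purchases table (state, customer, spending columns), the replication and the final roll-up look roughly like this; the real query will differ by database:

SELECT t.state, AVG(lod.cust_spend) AS avg_customer_spending
FROM (SELECT DISTINCT state, customer FROM purchases) t        -- union of Sheet + LOD dimensionality
INNER JOIN (SELECT customer, SUM(spending) AS cust_spend
            FROM purchases GROUP BY customer) lod              -- LOD dimensionality
  ON t.customer = lod.customer
GROUP BY t.state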

Working with disjoint LODs

When you think of all that’s happening to calculate the result, the syntax to do it is amazingly simple.

Limitations

  • Calculations that involve level of detail expressions won’t be materialized in Tableau Extracts in 9.0.
  • Sets, combined fields (use the dimensions that make up the combined field individually, separated by commas, instead), parameters and table calculations cannot be used in the dimensionality expression in 9.0.
  • Only field names can be used in the dimensionality expression (e.g. an expression like MAX([Order Date]), YEAR([Order Date]) or an IF statement cannot be used there). Note that you can create a calculated field with the formula YEAR([Order Date]) and use that instead. If you use {exclude [Order Date] : SUM([Sales])} and have [Order Date] in your sheet, Tableau will recognize the date hierarchy and automatically exclude the date parts from the calculation’s level of detail as you drill down.
  • Table calculations and ATTR are not allowed in the aggregate expression, e.g. {fixed [County] : ATTR([State])} and {fixed [County] : INDEX()} are not supported.
  • Dimensionality from one source cannot be mixed with a measure from another blended source in a level of detail calculation.
  • All dimensions in the dimensionality expression should come from the same data source
  • Level of detail expressions that use the exclude or include keywords cannot be used as dimensions. They also cannot be binned. This is because they have relative dimensionality. While this may look like an arbitrary limitation, it makes them much more understandable. Consider the case where two include calculations in the Sheet are used as dimensions: Calculation1 is {include x : min(i)} and Calculation2 is {include y : min(k)}. If there are no other dimensions in the Sheet, Calculation1 will behave like {fixed x, Calculation2 : min(i)} and Calculation2 will behave like {fixed y, Calculation1 : min(k)}. Circular reference! This is a solvable problem, but at the expense of understandability. We chose ease of use and simplicity instead.
  • Level of detail calculations from secondary sources cannot be used via data blending 2 in 9.0.
  • In 9.0, results from level of detail calculations are cached as a whole (not broken down per sub-query), so generating the same sub-queries in different sheets/calculations won’t result in cache hits.

Data source support

In Tableau 9.0, level of detail expressions are not supported for the following data sources: Cubes, Big Query, DataStax, Informatica Data Services, Microsoft Jet, Splunk, Actian Vectorwise.

The following data sources have higher version requirements for level of detail calculations than the minimum supported versions per the Tableau technical specifications here:

Teradata 12+, Netezza 7+, HIVE 0.13+, Impala 1.2.2+

The rest of the data sources support level of detail calculations with the same minimum version requirements as listed in the technical specs. The list also includes new connections not yet listed in the technical spec, such as Apache Spark, IBM BigInsights and SPSS, R and SAS data files, which adds up to a total of 36 connection types! Note that the new Excel and Text connectors are not Jet based, so they have full level of detail calculation support as well.

Generic ODBC has limited support since cross-join syntax varies from database to database. What Tableau issues through generic ODBC may not be what the database supports, or the database may not support cross-joins at all, so some scenarios may not work. It is advised that you use dedicated connection types where available for the best experience.

If you’re connecting through Tableau Data Server, it also has to be Tableau 9.0 or newer.

Performance
Level of detail expressions rely on database queries to calculate their results. Performance heavily depends on your database, the size of your data, whether the necessary indexes exist to perform the joins rapidly, and the complexity and level of nesting in your calculation. If your dataset is large and your database is fast, level of detail expressions can give you vast performance improvements over Table Calculations, since you can push finer-detail calculations directly into the database instead of moving a large dataset over the wire into Tableau on your desktop.

You can make your calculations run more efficiently on the database by avoiding unnecessary cross-joins. Also, if you have large tables, you can get performance gains by using MIN instead of ATTR when you know replication will happen. For ATTR, Tableau queries both MIN and MAX to decide whether it needs to show *. If you know your calculation will replicate, you already know MIN will be the same as MAX, so you can get the same results and shave off a sub-query.
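For example, wrapping a replicated value in ATTR means both a MIN and a MAX of it get queried so Tableau can decide whether to show *, while MIN alone does half the work. A hedged sketch with a made-up customer_totals table:

-- ATTR-style check: both aggregates are computed
SELECT state, MIN(cust_spend) AS min_v, MAX(cust_spend) AS max_v
FROM customer_totals GROUP BY state;

-- MIN instead of ATTR when you already know the value replicates
SELECT state, MIN(cust_spend) AS cust_spend
FROM customer_totals GROUP BY state;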

Other Notes

When using level of detail calculations as dimensions, be careful if your aggregation returns floats: floating-point arithmetic does not give results that are reliable for equality comparisons (joins check for equality), and this may lead to unexpected results.

Conclusion

Congratulations! You made it to the end of the post! I agree it was a long one, but when you think about it, it is pretty cool that most of the technical details fit into a single blog post. In the old world of business intelligence this would probably have been a three-volume hardcover book, and the calculations would only work on one platform. Being able to do this with 30+ different databases, with so little visible change in the calculation language, is very exciting.

Do you know what’s more impressive and excites me even more? All the cool things you will do with it… All the insights you will discover… Enjoy!
