With the increasing amount of user content on the web, text analytics is gaining more mainstream adoption. Sentiment analysis, keyword extraction and named entity extraction are the most common tasks, since they make it easy to quickly classify and filter text and turn it into easily consumable metrics. What do customers like most about your product, and what do they dislike? Do people perceive your brand more positively or more negatively compared to last year?
With the new R integration feature in Tableau 8.1 it is very easy to add this functionality to your dashboards. There are currently two packages in R that can be used for this purpose: sentiment and qdap. In this post we will use sentiment. This package depends on the tm and Rstem packages, so you'll need to install those first. You can do this by typing the commands below into your R console (or RStudio, if that's your IDE of choice).
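The install commands aren't reproduced here; a minimal sketch, assuming you download the archived source tarballs yourself first (Rstem and sentiment are not on current CRAN, so the file names and version numbers below are placeholders for whatever you fetch from the CRAN archive or Omegahat):

```r
# tm is still available from CRAN
install.packages("tm")

# Rstem and sentiment must be installed from source tarballs you have
# downloaded beforehand; these file names are examples, not guarantees
install.packages("Rstem_0.4-1.tar.gz", repos = NULL, type = "source")
install.packages("sentiment_0.2.tar.gz", repos = NULL, type = "source")
```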
It may be difficult to find the right versions of Rstem and sentiment. If you already have these packages, you can skip to the next step. Before you run the workbook, make sure you load the packages: either in the calculation itself, by adding library(sentiment); before the classify_ functions, or in the Rserve config, as covered in my previous blog post about Logistic Regression.
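For the Rserve route, the standard mechanism is an `eval` line in Rserv.conf that runs once at server startup (the location of the config file varies by platform):

```
# in Rserv.conf -- executed when Rserve starts
eval library(sentiment)
```

Either way, the classify_ functions are then available to the scripts Tableau sends over.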
NOTE: Rstem and sentiment packages are becoming more and more difficult to get to work in newer R versions. If you’re having trouble, please read the comments section where you will find information about alternative packages.
Let's take a first stab using the classify_polarity function. The Comment Text column contains reviews for a hypothetical product. We are using our calculated field, Sentiment, for both the text and the color encoding, as it returns one of three classifications: negative, neutral or positive.
You will notice that the results are not perfect. The second row from the bottom is in fact a negative comment about delayed delivery, yet it is classified as positive. More on that later. Now let's have a look at what the calculated field looks like.
As you can see, the R script is very simple. We are calling the function and retrieving the column corresponding to best_fit. Another method in this package is classify_emotion, which classifies text into emotions such as anger, joy and fear. The function call is very similar, but this time we pull a different dimension from the results. The two rows associated with the emotion "fear", in particular, look far off. But how does this work, and how can it be made better?
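The calculation itself isn't quoted in the text; a sketch of what such a calculated field could look like, assuming a string field named [Comment Text] (classify_polarity returns a matrix whose fourth column is BEST_FIT; for classify_emotion the BEST_FIT column is the seventh):

```
SCRIPT_STR("
  library(sentiment);
  classify_polarity(.arg1, algorithm = 'bayes')[, 4]
", ATTR([Comment Text]))
```

The argument must be aggregated (hence ATTR) because Tableau's SCRIPT_ functions are table calculations.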
Sentiment analysis techniques can be classified into two high-level categories:
Lexicon-based: This technique relies on dictionaries of words annotated with their orientation, described as polarity and strength (e.g. negative and strong), from which a polarity score for the text is calculated. This method gives high-precision results as long as the lexicon has good coverage of the words encountered in the text being analyzed.
Learning-based: These techniques require training a classifier on examples of known polarity, i.e. text labeled as positive, negative or neutral.
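As a toy illustration of the lexicon-based idea, here is a minimal scorer in R; the words and weights are invented for the example and have nothing to do with the package's actual lexicon:

```r
# A tiny hand-made lexicon: word -> polarity weight (illustrative only)
lexicon <- c(great = 1, love = 1, slow = -1, scam = -2)

score_text <- function(text) {
  # lowercase, then split on anything that is not a letter or apostrophe
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  # words missing from the lexicon look up as NA and contribute nothing
  sum(lexicon[words], na.rm = TRUE)
}

score_text("I love this great product")   # 2  -> positive
score_text("Slow delivery, what a scam")  # -3 -> negative
```

A learning-based system would instead fit such weights from labeled training examples rather than taking them from a dictionary.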
R's sentiment package follows a lexicon-based approach, which is why we were able to get right into the action: it comes with a lexicon for English. In your R package library, under the \sentiment\data folder, you can find the lexicon in a file named subjectivity.csv.gz.
The text that was incorrectly classified as having positive polarity is the following: "Took 4 weeks to receive it even though I paid for 2 day delivery. What a scam." If you open the file, as you probably suspected, you will find that "scam" is not a word in the lexicon. Let's add an entry for it to the file, then save and re-gzip it, restart Rserve and refresh our workbook.
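The edit can also be done entirely from R; the column layout below (word, strength, polarity) is an assumption about the file's format, so mirror an existing row from your own copy before writing anything back:

```r
# Locate the lexicon inside the installed sentiment package
lex_path <- file.path(find.package("sentiment"), "data", "subjectivity.csv.gz")

# read.csv decompresses .gz files transparently
lex <- read.csv(lex_path, header = FALSE, stringsAsFactors = FALSE)

# Append an entry for "scam" (column layout assumed; check your file)
lex <- rbind(lex, c("scam", "strongsubj", "negative"))

# Writing through a gzfile() connection re-compresses on the way out
write.table(lex, gzfile(lex_path), sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)
```

Restart Rserve afterwards so the updated lexicon is picked up.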
Now you can see that the text is correctly classified as expressing negative sentiment. With lexicon-based systems, adding new words to the lexicon, or switching to a different lexicon altogether, are the paths to follow if you are not getting good results. Incorrect classifications are more likely when the text you're analyzing contains slang, jargon or colloquialisms, since these are not covered extensively in common lexica.
You can download the workbook containing the example HERE.