Big data, Collaboration, Economy, retail

The Retail “Apocalypse”

Retail is changing.  

US online sales have been growing 15.6% year over year, roughly 4x the rate of the overall retail market, amounting to 15% of all retail sales in 2019. Depending on the time of year, location and category, the change is even more pronounced.

1010data’s report shows that while overall Black Friday sales held to similar levels year over year, representing almost 30 percent of the week’s sales, the share of the in-store channel has dropped from an average of 83 percent of spend in November 2014 to an average of 68 percent in November 2019.

For books, roughly 15% digital penetration led to consolidation and eventually the bankruptcy of physical bookstore chains. Today, for apparel and electronics, the share of online sales has already surpassed 30%. On average, the share of online grocery sales is still in the single digits, but it varies greatly by locale, from a high of 12% in New York City to 5% in San Francisco and 2% in Des Moines.

While improved delivery times and ease of returns have significantly changed consumer attitudes towards online shopping overall, growth in groceries has been slow due to thin margins: with an average item price of around $3 and a 30% gross margin, only $0.90 is left to cover all of handling, selling and delivery. With KPMG predicting that self-driving delivery vehicles will reduce the cost of delivery to between 4 and 7 cents per mile, financially viable online grocery businesses will be well within reach in the next few years.

And I’m not talking about the distant future here. A few months ago, UPS became the first company to receive FAA certification to operate a “drone airline”. The certification allows them to fly drones beyond the operator’s line of sight, during day or night, over people and with cargo weighing more than 55 pounds. Soon UPS trucks will become hubs for swarms of drones, while in densely populated areas drones will deliver directly from distribution centers to homes. While waiting for FAA approval of its own drone services business, Amazon of course wasn’t resting on its laurels, piloting package delivery with its Scout autonomous vehicles. Startups like Cleveron, Postmates and Starship are also building their own commercial robots for last-mile delivery. This all spells trouble for companies like Instacart and for retailers that plan to rely on them to solve their delivery problem in the long run.

Another, often complementary, approach to reducing last-mile distribution costs is micro-fulfillment centers for delivery and pickup orders, which make 1-day, even 1-hour, delivery feasible. California-based startup Farmstead has made over 100,000 deliveries in the San Francisco Bay Area in the past 3 years and recently announced that it is expanding its footprint into the Carolinas. Its micro-fulfillment centers cost 1% of an average supermarket to build and have much smaller footprints, allowing for a much larger number of locations, closer to customers. The UK’s Ocado, having shown with its $2+ billion annual revenue that it is possible to turn a profit as an online-only supermarket, is transforming itself into a technology company by licensing its automated fulfillment center tech. Truepill, Alto Pharmacy, Nimble RX and Capsule are applying the same formula to pharmacy, not only delivering your regular prescriptions but also offering same-day delivery at no extra cost so you don’t have to endure long pharmacy lines while you’re sick.

Many large US retailers are already in various stages of pilot implementations. Kroger, through a joint venture with Ocado, is building 20 fulfillment centers for online orders and has so far committed to facilities near Dallas, Cincinnati, Orlando and Atlanta. Meijer recently announced that it will begin testing micro-fulfillment with the logistics company Dematic. Albertsons is piloting a fully automated micro-fulfillment center with the robotics company Takeoff Technologies. Fabric claims to have built the world’s smallest fulfillment center, which can process up to 600 orders per day out of 6,000 square feet. That is roughly twice the number of orders per square foot an average brick-and-mortar store would generate, per Readex Research’s 2019 Annual Retailer Survey. Considering that the average online grocery basket is also roughly twice the size of a brick-and-mortar basket, and assuming a fulfillment center operating at full capacity, this translates to roughly 4x the revenue per square foot of a traditional brick-and-mortar store. When you factor in the cost savings of distribution facilities relative to commercial retail space, the online model becomes even more compelling.
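That 4x figure is just the product of the two ratios above; here is a back-of-the-envelope sketch in Python, where only the two 2x ratios come from the sources cited and the absolute numbers are made-up placeholders:

# Back-of-the-envelope check of the 4x revenue-per-square-foot claim.
store_orders_per_sqft = 0.05                      # hypothetical brick-and-mortar baseline (orders / sq ft / day)
store_avg_basket = 40.0                           # hypothetical average in-store basket ($)

mfc_orders_per_sqft = 2 * store_orders_per_sqft   # ~2x orders per square foot (600 orders / 6,000 sq ft)
mfc_avg_basket = 2 * store_avg_basket             # online baskets roughly 2x in-store

ratio = (mfc_orders_per_sqft * mfc_avg_basket) / (store_orders_per_sqft * store_avg_basket)
print(ratio)   # 4.0 -> roughly 4x revenue per square foot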

Finally, many staples we’re used to buying in stores are very well suited to planned, automatic replenishment. You can schedule deliveries of your cereal and peanut butter once a week and toilet paper once a month, and adjust to whatever cadence best suits your household. If you have a loyalty card and use the same retailers consistently, it is only a matter of time before this gets fully automated. Thanks to machine learning, it won’t be too long before Alexa asks whether you’d like your monthly pantry items delivered to your home or picked up from the nearest Amazon Locker on your drive back from work. Harry’s, Dollar Shave Club and Barkbox are a few companies that have been successful with this model. Not to mention Farmstead, with most of its customers enrolled in a weekly subscription program to save money on staples like milk, eggs, bacon and most packaged goods. Predictability allows Farmstead to better optimize its supply chain, reduce waste and pass on those savings to its customers. Meal kit vendors have also benefited from such predictability, resulting in a reduced carbon footprint.

What does all this mean for traditional brick-and-mortar retailers?  

UK-based Argos is a rare survivor from the catalog retail era that started in the early 1900s. While almost all of those retailers went bankrupt within the past 2 decades, Argos survived by transforming itself with e-commerce, click-and-collect and delivery options; the majority of its revenue is now generated through online sales. Retailers today face a similar choice, but it is easier said than done.

Transforming into an omni-channel retailer requires significant innovation and organizational change: a system-wide digital transformation that includes the supply chain:

1. Articulate your vision. Executing harmoniously across multiple channels requires a highly concerted effort that might be difficult to adopt in organizations used to operating in silos. A strong vision framed from the point of view of the customer and their changing needs and expectations, not by new technology or existing organizational boundaries, is key to aligning the various stakeholders.

2. Define your strategy. Retailers need to consider their market and make their own decisions about business and operations, as there is no single recipe for success. Different customer segments will value parts of the shopping experience differently and different products will align better with different distribution channels, but there are plenty of success stories to draw inspiration from.

  • Listen to your customers. The Internet gave customers a voice and they expect to be heard. This could be through customer support channels, social media, blogs, forums and indirect feedback from instrumented customer experiences. Brands that pay attention to customer feedback have more engaged customers, higher customer satisfaction scores and are better able to identify new product opportunities. Two great recent examples are Coca-Cola and Soylent. Soylent made a name for itself with its “open-source” meal replacement products: it enabled its customer community to come up with DIY recipes and share them with each other, some of which inspired recipes currently sold by the company. Coca-Cola introduced its new Orange-Vanilla flavor because it was one of the most popular pairings in the data from its Freestyle fountain dispensers. Accenture found that 91% of consumers prefer to buy from brands that remember their choices and provide relevant offers and recommendations, while 83% are willing to share their data to enable personalized experiences. While today customization typically means coupons in stores or product recommendations on e-commerce sites, as Coca-Cola and Stitch Fix have shown, there are many more ways to personalize.
  • Give customers a reason to come to your store. For most customers, grocery shopping is a chore they would avoid if they could. Successful retailers find ways to draw them into their stores. TJ Maxx and Lidl understand that people love the thrill of a treasure hunt. Lidl, in addition to the usual meat, fruits and vegetables, offers a rotation of specials, “Lidl Surprises”, released every Monday and Thursday; as soon as they’re gone, they’re gone. Replace groceries with apparel and you end up with TJ Maxx’s formula: new selections at least once a month, with deep discounts only available in store. Bonobos and Glossier took brick-and-mortar and turned it on its head with their successful showroom concepts, where customers visit the stores not to pick items off the shelves but for the experience, to try on products and get personalized fashion advice. This is an approach that can be generalized to fast-moving goods as well. Imagine yourself enjoying a wine flight, sampling food or even taking a cooking class in the store while robots in the automated warehouse at the back of the store get your order ready for pickup on your way out.
  • Meet your customers where they are. Today’s customer has high expectations with regard to convenience and flexibility. They could be ordering through your website but exchanging at a local store, comparing product specs online but buying in store, or ordering online for in-store pickup. To be successful in omni-channel retailing, you need a seamlessly integrated experience across all channels: physical stores, computers and mobile devices, apps, e-commerce sites and social media. It is important to understand which channels matter most to your customer base and start with those. Sometimes these are the usual suspects, like subscriptions, free returns or curbside pickup, but sometimes it requires thinking outside the box. For example, way ahead of its time, Tesco opened the world’s first virtual store in the Seoul subway in 2011 to help time-pressed commuters shop on the go using their smartphones, with same-day delivery.
  • Improve your supply chain. Grocery giant ALDI (owner of ALDI stores and Trader Joe’s) is known for its bargain prices. It owes this primarily to ~90% of its products being private label, which means lower unit costs, and to a reduced selection, which in turn means smaller store footprints. The British online grocery retailer Ocado operates no stores and does all home deliveries from its warehouses, with an industry-leading 0.02% waste. For Ocado, fully automated warehouses not only allow it to be price-competitive; they also mean a better customer experience, reduced waste and a smaller carbon footprint. Walmart spent $4 billion in 1991 to create Retail Link to better collaborate with its suppliers; today there are many off-the-shelf platforms retailers can use for this purpose at a fraction of the cost. For customers this means fewer out-of-stocks, and for retailers an additional revenue stream through data monetization. Retailers will need to redesign their supply chains based on the services they want to offer, which will often mean looking at the supply chain as having many possible starting and ending points rather than as a single flow to get products onto the shelf. It also means a shift in focus from on-shelf availability to dynamic trade-offs between availability, margins and delivery times, reallocating products across channels based on sell-through rates and even testing demand for a product online before moving it to store shelves. This is only possible with a shared inventory and end-to-end visibility across all distribution channels.

3. Assess infrastructure needs. Organizations must determine the technology capabilities needed to support their vision, from data management and machine learning to in-store sensors, warehouse management and delivery. Executing effectively on an omni-channel strategy requires a fully unified stack. Successful retailers use machine learning to watch consumer trends and customer feedback, personalize offers, manage product assortment, choose optimal distribution center locations, forecast demand and inventory, and understand the user journey and marketing channel effectiveness; in-store IoT devices to react to user actions and inventory updates in real time; robotic automation to increase warehouse efficiency; and data sharing with suppliers to manage inventory more effectively and enable timely direct-to-store shipments. Making all these components work together and integrating the different hardware and software pieces is often a multi-year project.

4. Identify necessary organizational changes. Retailers will need to restructure business processes and metrics, and define rules for shipping products and allocating revenue between channels. If a gift is ordered from a website but exchanged at a local store, where should the revenue go? What if the customer went to a store, saw a display model but the product was out of stock, then ordered it from her smartphone to be shipped to her home? With omni-channel retail blurring the lines, the right incentives need to be put in place for business success, with all parties focusing on delivering customer value.

Retail is changing but in a way, everything old is new again.

E-commerce sites are the new catalog stores, Alexa is not the name of the server who knows how you like your steak but a voice assistant who will know about almost all your buyer preferences, convenience stores will become vending machines you can walk into, and the milkman will be a robot.

All in all, it will be better for the consumer and less taxing on the environment.  

So why call it an apocalypse? It is the retail renaissance. 

Big data, GIS, Satellites

If you’re looking for a data prep challenge, look no further than satellite imagery

It has been almost 10 months since my last blog post. Probably time to write a little bit about what I’ve been up to.

As some of you might know, in January I joined Descartes Labs. One of our goals as a company is to make spatial data more readily available and to make it easier to go from observations to actionable insights. In a way, just like Tableau, we’d like people to see and understand their data, but our focus is on sensor data, whether remote, such as satellite or drone imagery and video feeds, or in-situ, such as weather station data. And when we talk about big data we mean many petabytes being processed using tens of thousands of CPU or GPU cores.

But at a high level, many of the common data problems you’d experience with databases or Excel spreadsheets apply just the same. For example, it is hard to find the right data, and there are inconsistencies and data quality issues that become more obvious when you want to integrate multiple data sources.

Sound familiar?

We built a platform that aims to automatically address many of these issues, what one might call a Master Data Management (MDM) system in enterprise data management circles, but focused on sensor data. For imagery, many use cases, from creating mosaics to change detection and various other deep learning applications, require these corrections for best results. And having an automated system shaves off what would otherwise be many hours of manual data preparation.

For example, to use more than one image in an analysis, the images have to be merged into a shared data space. The specific requirements of the normalization are application dependent, but it often requires that the data be orthorectified, coregistered and their spectral signatures normalized, while also accounting for interference by clouds and cloud shadows. We use machine learning to automatically detect clouds and their shadows and can therefore filter them out on demand; an example is shown below.

Optical image vs water/land/cloud/cloud shadow segmentation
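As a rough illustration of what filtering “on demand” means, here is a minimal numpy sketch that drops cloud and cloud-shadow pixels given a per-pixel segmentation like the one above. The class codes, array shapes and function name are assumptions for the example, not the Descartes Labs implementation.

import numpy as np

# Hypothetical class codes for a segmentation raster like the one shown above.
WATER, LAND, CLOUD, CLOUD_SHADOW = 0, 1, 2, 3

def mask_clouds(image, segmentation, masked_classes=(CLOUD, CLOUD_SHADOW)):
    # image:        (bands, rows, cols) array of reflectance values
    # segmentation: (rows, cols) array of per-pixel class codes
    # Returns a float copy of the image with cloud/shadow pixels set to NaN,
    # so they are ignored by downstream statistics and composites.
    masked = image.astype(float).copy()
    bad = np.isin(segmentation, masked_classes)  # True where a pixel is cloud or cloud shadow
    masked[:, bad] = np.nan                      # broadcast the 2D mask across all bands
    return masked

# Tiny example: a 4-band, 2x2 scene where the top-right pixel is cloudy.
scene = np.random.rand(4, 2, 2)
labels = np.array([[LAND, CLOUD], [WATER, LAND]])
clean = mask_clouds(scene, labels)
print(np.isnan(clean[:, 0, 1]).all())  # True: the cloudy pixel was filtered out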

However, to truly abstract satellite imagery into an information layer, analysts must also account for a variety of effects that distort the satellite-observed spectral signatures. These spectral distortions have various causes, including geographic region, time of year, differences in satellite hardware, and the atmosphere.

The largest of these effects is often the atmosphere. Satellites sit above the atmosphere looking down and therefore see the sunlight reflected from the surface mixed with light scattered by the atmosphere. The physical processes at play are similar to the reason the sky is blue when we look up.

The process of estimating and removing these effects from satellite imagery is referred to as atmospheric correction.  Once these effects are removed from the imagery, the data is said to be in terms of “surface reflectance”. This brings satellite imagery into a spectral space that is most similar to what humans see every day on the Earth’s surface.  

By putting imagery into this shared spectral data space, it becomes easier to integrate multiple sources of spectral information – whether those sources be imagery from different satellites, from ground based sensors, or laboratory measurements.

Top of atmosphere (what a satellite sees, left) vs surface reflectance (right)
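Our production correction is considerably more involved, but a classic, much simpler illustration of the idea is dark-object subtraction: assume the darkest pixels in each band should have near-zero reflectance, and treat whatever value they do have as the atmospheric contribution to remove. Below is a minimal numpy sketch of that textbook technique, not the Descartes Labs algorithm.

import numpy as np

def dark_object_subtraction(toa, percentile=0.5):
    # Very simplified atmospheric correction of top-of-atmosphere (TOA) reflectance.
    # toa: (bands, rows, cols) array with values in [0, 1].
    # For each band, the value of its darkest pixels (a low percentile) is used as
    # an estimate of the atmospheric contribution and subtracted out.
    corrected = np.empty_like(toa, dtype=float)
    for band in range(toa.shape[0]):
        haze = np.nanpercentile(toa[band], percentile)  # per-band dark-object estimate
        corrected[band] = np.clip(toa[band] - haze, 0.0, 1.0)
    return corrected

# toa_scene = ...  # e.g. a scene already converted to TOA reflectance
# surface_approx = dark_object_subtraction(toa_scene)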

We take a different tack than other approaches to surface reflectance in that our algorithms are designed to be a base correction applicable to any optical image. Other providers of surface reflectance data often focus on their own sensors and their own data, sometimes making it more difficult for users of multiple sensors to integrate the otherwise disparate observations.

We have already preprocessed staple data sources such as NASA’s Landsat 8 and the European Space Agency’s (ESA) Sentinel-2 data. This includes all global observations for the lifespan of the respective satellites. We also generate scenes for other optical sensors, including previous Landsat missions, on request. In addition to our own algorithms, we support USGS’s (LaSRC) and ESA’s (Sen2Cor) surface reflectance data.

If you’re into serious geospatial analysis you should definitely give our platform a try and see for yourself. If you’re not but know someone who is, spread the word! With our recently launched platform, we are very excited to help domain experts get to insights faster by helping them find the right datasets, smartly distribute their computations across thousands of machines, and reduce the burden of dealing with data quality issues and the technical nuances of satellite data. You can read more about our surface reflectance correction and how to use it in our platform here.

Big data, GIS, Mapping

New Beginnings

After 5 exhilarating years at Tableau, on December 29th I said my goodbyes and walked out of the Seattle offices for the last time.

Tableau was a great learning experience for me. Watching the company grow nearly 7 fold, going through an IPO and becoming the gold standard for business intelligence…

I  worked with very talented people, delivered great features and engaged with thousands of enthusiastic customers.

It was a blast.

But there is something about starting anew. It is the new experiences and challenges we face that make us grow. And there’re very few things I like more in life than a good challenge 🙂

I flew out of Seattle on December 30th to start a new life in Santa Fe, New Mexico and join a small startup called Descartes Labs on January 2nd.

The Descartes Labs data refinery

At Descartes Labs, we’re building a data refinery for remote sensing data. It is being used for a growing set of scenarios, from detecting objects in satellite images using deep learning to creating the most accurate crop forecasts on the market and understanding the spread of dengue fever, all on a massively scalable computational infrastructure holding over 40 petabytes of image data, with 50+ terabytes of new data added daily. If you haven’t heard of us already, check us out. You won’t be disappointed.

I will continue writing, but Tableau won’t be the primary focus of new content on the blog anymore. I will try to answer Tableau-related questions as time permits. I must say, though, that several of Descartes Labs’ customers also use Tableau, so I am using Tableau quite often at my new job as well.

To the new year and new beginnings!

 

Data Preparation, Python, Text analytics

Build Your Own Data Pipelines with Tableau Command Line Utilities & Scheduled Tasks

During one of my 2016 Tableau Conference talks, I shared an example data pipeline that periodically retrieved tweets containing the session’s hashtag, tokenized them and appended them to an extract on Tableau Server, paired with an auto-refreshing dashboard.

My “magic trick” started by showing a sorry-looking word cloud with only two tweets, which slowly filled up with the contents of tweets from the audience as the session progressed.

Tableau Conference 2016 - Accelerate Advanced Analytics

While hitting Twitter every few minutes worked well for a short demo, typically hourly or daily updates make more sense in real life scenarios such as text analytics over social media data or geocoding street addresses of newly acquired customers.

I got a lot of requests for making this into a blog post so I repurposed the demo to do sentiment analysis every night over tweets from the day prior.

It has 3 core components:

  1. A Python script that contains the logic to retrieve and analyze Twitter data and write the results to a CSV
  2. A batch file that runs the Python script, takes its outputs and uses Tableau command line utilities to append the contents of the CSV to an extract on Tableau Server
  3. A scheduled task that triggers the batch file once a day and runs this pipeline

The Python Script

You can download the Python scripts from HERE. The zip archive contains analyzetweets.py, shown below, and config.py, which holds your Twitter credentials. You could embed everything in one Python file, but if you’re going to share your screen for a demo, it might be safer to keep the credentials separate 🙂

Python code snippet for Twitter sentiment analysis
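Since the script above is only a screenshot, here is a hedged sketch of what analyzetweets.py roughly does. The config.py field names, the hashtag and the output columns are my assumptions rather than the exact code in the image; it relies on the two packages installed via pip below.

# analyzetweets.py -- a sketch, not the exact code from the screenshot above.
# Assumes config.py defines consumer_key, consumer_secret, access_token and access_token_secret.
import csv
from datetime import date, timedelta

from twitter import Twitter, OAuth  # "twitter" package (Python Twitter Tools)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

import config

HASHTAG = "#data16"  # hypothetical hashtag; use your own
OUTPUT = "sentimentscores.csv"

api = Twitter(auth=OAuth(config.access_token, config.access_token_secret,
                         config.consumer_key, config.consumer_secret))
analyzer = SentimentIntensityAnalyzer()

# Pull tweets from the day prior that contain the hashtag.
yesterday = (date.today() - timedelta(days=1)).isoformat()
results = api.search.tweets(q="%s since:%s" % (HASHTAG, yesterday), count=100)

with open(OUTPUT, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["created_at", "user", "text", "compound_sentiment"])
    for tweet in results["statuses"]:
        scores = analyzer.polarity_scores(tweet["text"])  # neg/neu/pos/compound
        writer.writerow([tweet["created_at"], tweet["user"]["screen_name"],
                         tweet["text"].replace("\n", " "), scores["compound"]])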

For this sample code to work you will need to install two Python packages, which you can easily get via pip. VaderSentiment is a lexicon- and rule-based sentiment analysis tool; the twitter package is used to query Twitter.

pip install twitter
pip install VaderSentiment

You also need your PATH variable set correctly so your operating system can find Python.exe and these libraries, not to mention a Twitter developer account so you can access Twitter’s data. Here is a good tutorial on how to set one up.

Note that if you use this as a template to run your own code that doesn’t do sentiment analysis or use Twitter data, you won’t need either of these packages.

The Batch File

The batch file navigates into the folder containing the Python script, executes it, then takes its output (sentimentscores.csv) and uses “tableau addfiletoextract” to append its contents to an existing extract (with the same set of columns as the CSV file) on the server. You can copy-paste the content below into a text file and save it with a .bat extension.

@CD C:\Users\bberan\Documents\twitterDemo
@CALL python analyzetweets.py
for %%I in ("C:\Program Files\Tableau\Tableau 10.0\bin") do set TableauFolder=%%~sI
@CALL %TableauFolder%\tableau addfiletoextract --server https://your-Tableau-server --username yourUserName --password "yourPassword" --project "TheProjectName" --datasource "TheNameofTheExtractDataSourceToAppendTo" --file "C:\Users\bberan\Documents\twitterDemo\sentimentscores.csv"

The Scheduled Task

Windows Task Scheduler is a handy and relatively unknown tool that comes preinstalled on every Windows computer (Unix variants also have similar utilities, like cron).

Launch it from your Start Menu and simply create a task from the Actions tab that points to the batch file.

Creating an action to start a program with Windows Task Scheduler

Then using the Triggers tab, set the frequency you’d like to run your batch file.

Setting a refresh schedule using Windows Task Scheduler

Now you have a data pipeline that will run nightly, retrieve recent tweets from Twitter, run sentiment analysis on them and add the results to a Tableau extract on the server.

BONUS: Auto-refresh dashboard

If you plan to show the results on a dashboard that is on public display, you’d probably like the dashboard to refresh at a similar frequency to reflect the latest data. For this, all you need to do is embed the dashboard inside a web page with an HTML meta refresh tag, e.g. <meta http-equiv="refresh" content="300"> to reload the page every 5 minutes.

This was a rather simple proof of concept, but using this example as a template you can create multi-stage, scheduled pipelines for many ETL tasks and deliver answers to questions that are much more complex. Enjoy!

R

Scaling RServe Deployments

Tableau runs R scripts using RServe, a free, open-source R package. But if you have a large number of users on Tableau Server and use R scripts heavily, pointing Tableau to a single RServe instance may not be sufficient.

Luckily, you can use a load balancer to distribute the load across multiple RServe instances without having to invest in a commercial R distribution. In this blog post, I will show you how you can achieve this using another open source project called HAProxy.

Let’s start by installing HAProxy.

On Mac you can do this by running the following commands in the terminal

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Followed by

brew install haproxy

Create the config file that contains pointers to the Rserve instances.

In this case I created it in the folder ‘/usr/local/Cellar/haproxy/’ but it could have been any other folder.

global
    daemon
    maxconn 256

defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

listen stats
    bind :8080
    mode http
    stats enable
    stats realm Haproxy\ Statistics
    stats uri /haproxy_stats
    stats hide-version
    stats auth admin:admin@rserve

frontend rserve_frontend
    bind *:80
    mode tcp
    timeout client  1m
    default_backend rserve_backend

backend rserve_backend
    mode tcp
    option log-health-checks
    option redispatch
    balance roundrobin
    timeout connect 10s
    timeout server 1m
    server rserve1 localhost:6311 check maxconn 32
    server rserve2 anotherserver.abc.lan:6311 check maxconn 32

The highlights in the config file are the timeouts, the max connections allowed for each Rserve instance, the host:port for each Rserve instance, the load balancer listening on port 80, balancing done using the round-robin method, the server stats page configured on port 8080, and the username and password for accessing the stats page. I used a very basic configuration, but the HAProxy documentation has detailed info on all the options.

Let’s check that the config file is valid and we don’t have any typos:

BBERAN-MAC:~ bberan$ haproxy -f /usr/local/Cellar/haproxy/haproxy.cfg -c
Configuration file is valid

Now you can start HAproxy by passing a pointer to the config file as shown below:

sudo haproxy -f /usr/local/Cellar/haproxy/haproxy.cfg

Let’s launch Tableau and enter the host and port number for the load balancer instead of an actual RServe instance.

Connection information for the load balancer

Success!! I can see the results from R’s forecasting package in Tableau through the load balancer we just configured.

Results of R script evaluated through the load balancer

Let’s run the calculation one more time.

Now let’s look at the stats page for our HAProxy instance, in this case, per our configuration file, by navigating to http://localhost:8080/haproxy_stats.

Server statistics for the load balancer for Rserve instances

I can see the two requests I made and that they ended up being evaluated on different RServe instances as expected since round-robin load balancing forwards a client request to each server in turn.

Now let’s install it on a server that is more likely to be used in production and have it start up automatically.

I used a Linux machine (Ubuntu 14.04, specifically) for this. There are only a few small differences in the configuration steps. To install HAProxy, enter the following in a terminal window:

apt-get install haproxy

Now edit the haproxy file under the /etc/default/ directory and set ENABLED=1. It is 0 by default; setting it to 1 will run HAProxy automatically when the machine starts.

Now let’s edit the config file, which can be found at /etc/haproxy/haproxy.cfg, to match the example above.

And we’re ready to start the load balancer:

sudo service haproxy start

Now you can serve many more visualizations containing R scripts to a larger number of Tableau users. Depending on the amount of load you’re dealing with, you can start by running multiple RServe processes on different ports of the same machine, or you can add more machines to scale out further.

Time to put advanced analytics dashboards on more screens 🙂

R

Quick Tip : RServe tricks that can make your life easier

RServe offers a number of configuration options that can come in handy when working with R inside Tableau but they are not captured in detail in RServe’s documentation. Let’s talk about a few.

How can I start RServe with a configuration file in a custom location?

RServe has a default location for its configuration file, e.g. /etc/Rserv.conf on Linux/Mac, but you may not want to use that location, or you may have multiple config files that you switch between, so it is often much more convenient to set it explicitly.

You can do this in the following way when starting RServe from R:

On Windows

Rserve(args="--RS-conf C:\\PROGRA~1\\R\\R-215~1.2\\library\\Rserve\\Rserv.cfg")

On Linux/Mac

Rserve(args=" --no-save --RS-conf ~/Documents/Rserv.cfg")

You can also do this from a command line/terminal window, assuming Rserve.exe is in the path or you’re in the folder that contains Rserve.exe so that Rserve is recognized:

Rserve --RS-conf C:\Users\bberan\Rserv.cfg

You probably noticed that on Mac there is an extra argument, --no-save. On Mac, starting RServe requires using one of --save, --no-save or --vanilla, but what do they mean? The answer is in the R help:

BBERAN-MAC:~ bberan$ R --help
Usage: R [options] [< infile] [> outfile]
or: R CMD command [arguments]
Start R, a system for statistical computation and graphics, with the specified options, or invoke an R tool via the 'R CMD' interface.
Options:
--save                Do save workspace at the end of the session
--no-save             Don't save it
--no-environ          Don't read the site and user environment files
--no-site-file        Don't read the site-wide Rprofile
--no-init-file        Don't read the user R profile
--restore             Do restore previously saved objects at startup
--no-restore-data     Don't restore previously saved objects
--no-restore-history  Don't restore the R history file
--no-restore          Don't restore anything
--vanilla             Combine --no-save, --no-restore, --no-site-file,--no-init-file and --no-environ

Config files are very useful since they provide a centralized place to pre-load all the libraries you need, so you don’t have to load them as part of each request, which results in better performance. They also allow evaluating arbitrary R code as part of RServe startup, so you can load trained models, R script files, or even data that you may want to use as part of your analysis when running R code from Tableau.
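As an illustration, a minimal config file along those lines might look like the one below; the library names and file path are placeholders, and the Rserve documentation lists the full set of supported directives.

port 6311
remote enable
encoding utf8
eval library(forecast)
eval library(randomForest)
source C:/Users/bberan/rserve/startup.R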

How can I debug R scripts I have in Tableau by taking advantage of my R development environment?

As you’re writing scripts in Tableau, you may want to understand the different steps your data is going through, especially if you’re getting errors when the script is evaluated. There are many ways to debug. Using RServe in debug mode (Rserve_d.exe) outputs all the exchanges to the command line and can be a bit verbose. You can also insert statements like write.csv in your script to dump different data structures, but my favorite option is to do this in an environment like the basic R GUI or RStudio.

On Linux/Mac

If you started Rserve from your R console the way described at the beginning, you can insert print statements that print to the console, and even your plot statements will create visuals in R (if you have X11 installed), which can come in handy when debugging large and complex chunks of R code.

On Windows

You can achieve the same by starting Rserve from within your R session using:

run.Rserve(args="--no-save")

This takes over the current R session and turns it into an Rserve session, as opposed to typing Rserve(), which starts a new Rserve process by calling Rserve.exe.

I hope you find this useful.

R, Text analytics, Visualization

Trump vs. Clinton in N-grams

Presidential election campaigns are heating up and you all know what that means.

Word clouds 🙂

I still remember the visualizations summarizing the Romney-Obama debates in 2012, but one thing that slightly bothered me back then was that almost everything I saw just counted single words, so even one of the most memorable phrases of Obama’s campaign, “Four more years”, was lost.

Surely, there will be some interesting phrases candidates will use this year. How can you make sure that your analysis doesn’t miss them?

I am planning a series of posts on text analytics, and since tokenization is an important component of text analysis, let’s start by answering this question first.

This analysis relies on two R packages.

library(tm)
library(RWeka)

I wrote a number of helper functions to make analysis steps easier to follow.

# Helper functions
removeStopWords <- function(string, words) {
    stopifnot(is.character(string), is.character(words))
    splt <- strsplit(string, " ", fixed = TRUE)
    vapply(splt, function(x) paste(x[!tolower(x) %in% words], collapse = " "),character(1))
}

countWords <- function(y) { sapply(gregexpr(" ", y), function(x) { 1+sum(x>=0) } ) }

# Find N-grams within certain edit distance to avoid multiple subsets of same phrase
setToClosestMatch<-function(text) {sapply(seq_along(names(text)),function(x){
         ll <- agrep(pattern=names(text)[x],        
                     names(text)[-x],         
                     value=T,max=list(ins=1,del=1,sub=1))
ifelse(!is.na(ll[which.max(nchar(names(text)))]),ll[which.max(nchar(names(text)))],names(text)[x])})}

# To remove an, the, is etc. after finding n-grams 
reduceToRealNGrams <- function(ngrams,n){
ngrams[countWords(removeStopWords(ngrams, append(tm::stopwords("en"),c('going','can','will'))))>=n]}

# Tokenize 2 to 4 grams 
NgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 4,delimiters = " \\r\\n\\t.,;:\"()?!"))

Let’s start the analysis. I had one file for each candidate, containing transcripts for 5 speeches they gave between June 2015 and April 2016.

# Load the documents from specified folder
docs <- Corpus(DirSource("C:/Users/bberan/Downloads/Transcripts/"))
# Convert to lower case
docs <- tm_map(docs, tolower) 
# Remove common words : locations, candidates’ references to each other etc.
docs <- tm_map(docs, removeWords, c("des moines","new york","new hampshire","south carolina","united states","twin cities","san bernardino","hillary clinton","donald trump"))

This is an optional step that helps find phrases that differentiate the presidential candidates by assigning small weights to phrases used by both.

# Create term document matrix with tf-idf weighting
tdm <- TermDocumentMatrix(docs, control = list(tokenize = NgramTokenizer,weighting = weightTfIdf))

m <- as.matrix(tdm)

# separate Hillary and Trump content and sort by frequency
hillary <- sort(m[,1], decreasing=TRUE)
trump <- sort(m[,2], decreasing=TRUE)

# Get top 250 most common N-grams for each candidate
hillaryTopN<-hillary[order(-nchar(names(hillary[1:250])), hillary[1:250])]
trumpTopN<-trump[order(-nchar(names(trump[1:250])), trump[1:250])]

Since we are looking for 2- to 4-grams, R will find “Make America”, “Make America great” and “Make America great again” as separate n-grams. This step consolidates all of them into “Make America great again”. It also gets rid of n-grams that become smaller than a 2-gram after the removal of stop words (a, and, the, or...). For example, a 3-gram like “this or that” would be dropped as part of this step. Removing stop words this late makes sure our phrase structure is not broken, e.g. “to keep us safe” does not become “keep safe”.

# get rid of junk and overlapping n-grams
hillaryTopNConsolidated<- reduceToRealNGrams(unique(cbind(hillaryTopN,setToClosestMatch(hillaryTopN))[,2]),2)
trumpTopNConsolidated<- reduceToRealNGrams(unique(cbind(trumpTopN,setToClosestMatch(trumpTopN))[,2]),2)

Now that we have completed the “key phrase extraction” process, we can write this data to CSV and read it into Tableau to build a word cloud like this.

 N-grams frequently used by Hillary Clinton

It gives a much better idea than individual words but can we build something that gives even more context?

Of course by that I don’t mean a bar chart version of this. Despite all the criticism in the visualization community, word clouds do a decent job of conveying the information contained in a table with words and their respective counts for a number of reasons:

  • Counts in word clouds are not exact or reliable quantities. A speaker can repeat a word several times in a row while trying to remember what to say next or when they are interrupted by the audience. Even if that doesn’t happen, whether someone said something 10 vs. 11 times hardly tells you anything meaningful. So when reading a word cloud, what people look for is whether things are roughly the same or vastly different. To use an analogy: for a thermometer reading, there is no reason to display the temperature to the 6th decimal place if the thermometer is only accurate to one decimal place. Six decimal places give a false sense of accuracy and take up unnecessary space; large digits with one decimal place are much more useful in conveying the information.
  • Applying transformations like TF-IDF can change the word count values by orders of magnitude, which makes the accuracy of a bar chart even less useful.
  • If the corpus is large enough, word frequencies follow a power law pattern. Power law distributions apply to many aspects of human systems. The best known example is economist Vilfredo Pareto’s observation that wealth follows a “predictable imbalance”, with 20% of the population holding 80% of the wealth. The linguist George Zipf observed that word frequency also falls in a power law pattern, with a small number of high frequency words, a moderate number of common words and a very large number of low frequency words. Later, Jakob Nielsen observed power law distributions in website page views as well, which is why word clouds often work well for highlighting popular content on news aggregators or forums.

Maybe a good way of providing more context is finding the sentences that contain these common phrases that distinguish one candidate from another.

First I broke the text into sentences in R.

library(openNLP)
library(NLP)
library(tm)
splitToSentences <- function(text) { 
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = "en")
text <- as.String(text)
sentence.boundaries <- annotate(text, sentence_token_annotator)
sentences <- text[sentence.boundaries]
return(sentences)
}

docs[[2]] in my script holds Trump’s speeches, so to get the sentences for it:

splitToSentences(docs[[2]])

Then I wrote the results to CSV, imported the CSV into Tableau and wrote some custom SQL to do a join between sentences and n-grams using CONTAINS as the join criterion.
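If you would rather prepare that join outside the database, the same contains-style match can be sketched in a few lines of pandas; the file and column names here are assumptions, not what my workbook uses.

import pandas as pd

# Hypothetical inputs written out from the R steps above.
sentences = pd.read_csv("sentences.csv")  # columns: candidate, sentence
phrases = pd.read_csv("ngrams.csv")       # columns: candidate, phrase

# Cross join within each candidate, then keep rows where the sentence contains the phrase.
joined = sentences.merge(phrases, on="candidate")
contains = [p.lower() in s.lower() for s, p in zip(joined["sentence"], joined["phrase"])]
joined = joined[contains]

joined.to_csv("sentences_by_phrase.csv", index=False)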

A little formatting to give it newspaper styling and here is the final result.

 Trump, Clinton speech analysis on Tableau Public

You can click on the image to see it live on Tableau Public and download the workbook if you like. Selecting topics from the right will bring up all the sentences containing that phrase.

R, Visualization

Correspondence Analysis in Tableau with R

Correspondence analysis is an exploratory data analysis method for discovering relationships between two or more categorical variables. It is very often used for visualizing survey data since, if the matrix is large enough (which could be due to a large number of variables, but is also possible with a small number of high-cardinality variables), visual inspection of tabulated data or simple statistical analysis cannot sufficiently explain its structure. Correspondence analysis can remarkably simplify the representation of such data by projecting both row and column variables into a lower dimensional space that can often be visualized as a scatter plot, at a small loss of fidelity.

Let’s take a look at an example. Below is the data from the 2014 Auto Brand Perception survey by Consumer Reports, in which 1,578 randomly selected adults were asked what they considered exemplary attributes of different car brands. Respondents picked all that applied from a list consisting of Style, Performance, Quality, Safety, Innovation, Value and Fuel Economy.

We can convert this data into a contingency table in R and do a chi-square test, which tells us that there is a statistically significant association between car brands and their perceived attributes.

chisq.test(table(yourDataFrameGoesHere))

But often this is not sufficient, since my goal is to understand how different car makers are perceived: how people see my brand, how I compare with the competition, how to competitively position an existing product or how to bring a new product to market to fill a gap.

Let’s visualize this as a cross-tab in Tableau.

2014 Auto Brand Perception Survey Results

Even though there are only 7 choices and a single question in the survey, this table is hard to interpret.

Let’s apply correspondence analysis and see what our scatter plot looks like. Here the blue dots are cars; blue points closer to each other are more similar than points farther away. Red items (e.g. Style, being hovered over in the screenshot) are the attributes. The axes themselves do not correspond to independent, nameable dimensions, so the attributes are useful for orienting yourself when looking at the chart and help assign names to different areas of the scatter plot. If you imagine a line extending from the center of the plot towards each of the red points, the distance of a blue point to that line indicates how related it is to the particular attribute. For example, for Volvo, safety is the perception that dominates; the same can be said for Kia and Value. Subaru, meanwhile, is considered to be safe and to have good quality and value, while Porsche and Ferrari are mostly associated with Style and Performance, in roughly equal amounts.

Correspondence Analysis of Brand Perception Survey

This scatter plot explains 70% of the variance in the data. While it doesn’t capture everything, it is a lot easier to consume than cross-tabulation.
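The plots in this post are produced with R’s ca package called from Tableau, but the computation itself is compact. As an illustration of what the row and column coordinates actually are, here is a minimal numpy sketch of classical correspondence analysis (an SVD of the standardized residuals of the contingency table); the example counts are made up, not the survey data.

import numpy as np

def correspondence_analysis(table):
    # Classical CA of a contingency table of counts (rows x columns).
    # Returns principal coordinates for rows and columns plus the share of
    # inertia (variance) explained by each dimension.
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                  # correspondence matrix
    r = P.sum(axis=1)                # row masses
    c = P.sum(axis=0)                # column masses

    # Standardized residuals, then their SVD.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)

    row_coords = (U * sv) / np.sqrt(r)[:, None]     # principal coordinates of rows (e.g. car brands)
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]  # principal coordinates of columns (attributes)
    inertia_share = sv**2 / (sv**2).sum()
    return row_coords, col_coords, inertia_share

# Made-up brand x attribute counts, just to exercise the function.
counts = np.array([[120, 40, 60],
                   [30, 90, 50],
                   [60, 30, 110]])
rows, cols, inertia = correspondence_analysis(counts)
print(inertia[:2].sum())  # share of the structure a 2D scatter plot would capture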

The rows and columns used in computing the principal axes of the low-dimensional representation are called active points. Passive (supplementary) points/variables are projected onto the plot but not taken into account when computing the structure of the plot itself. For example, if there are two new cars in the market and you want to see their relative positioning in an existing plot, you can add them as supplementary points. If there are outliers, you can also choose to make them supplementary points so as not to skew the results. Supplementary variables, on the other hand, are typically exogenous variables, e.g. the age group or education level of the survey participant. In some cases you may prefer generating multiple plots instead, e.g. one per gender. You can mark a column or row as supplementary using the supcol and suprow arguments in the ca function call, e.g. ca(mydata, supcol=c(1,6)) makes the 1st and 6th columns in the table supplementary.

You can add more to this chart to explore further. For example, you can put the price of the car or its safety rating on color and see whether they align with perceived value or safety. Tesla, Ford and Fiat are all associated with value, while Tesla is not a budget car. Similarly, Volvo and Tesla both have a 5-star safety rating, but consumers associate Volvo with safety much more than any other brand. If you have multiple years of data, you can put years on the Pages shelf and watch how perception changed over time and whether your marketing campaigns were effective in moving it in the direction you wanted.

Correspondence analysis use cases are not limited to social sciences and consumer research. In genetics, for example, microarray studies use MCA to identify potential relationships between genes. Let’s pick our next example from a different domain.

If there are multiple questions in your survey, you can use Multiple Correspondence Analysis (MCA) instead. Our data for this example contains categorical information about different organisms. Whether they fly, photosynthesize, have a spine….

Categorical attributes of different organisms

For a moment, imagine the first column doesn’t exist so you have no knowledge about what organism each row is. How easy would it be to understand if there are groups in the data based on these attributes?

Let’s apply MCA to this dataset. In this case I put the attributes on the secondary axis, hid their marks and made their labels larger. I also applied some jitter to deal with overlapping marks.

I can clearly see groups like birds, mammals, plants, fungi and shellfish. If the data weren’t labeled, I would still be able to associate the points by looking at the chart and, by examining the common attributes of adjacent points, start developing an understanding of what types of organisms they might be.

Multiple correspondence analysis applied to organisms dataset

You can download the sample workbook from HERE.

R, Visualization

Quick Tip : Overlaying curves on Tableau scatter plots with R

Tableau provides a good set of trend line, reference line and band options but sometimes you want to overlay curves based on a custom equation. Logistic regression curves, sine curves, quantile regression curves…. And want these overlay curves to be smooth…

This is very easy to do by taking advantage of the technique I shared when building Coxcomb charts and radial treemaps. If you know the equation (or know how to get to it) and it can be described in Tableau’s calculation language, you can do so using a table calculation. But doing the fit dynamically involves R, and when you’re passing the data to R you need to do some basic NULL handling. Here are two examples showing what the results might look like. You can see that despite the very few points in my dataset and the large gaps between them, the curves look very smooth.

A sine curve and logistic regression curve overlay in Tableau

The key component is the bin field created on the variable that’s on the X axis. In Tableau, bins can be used to trigger densification by turning on the “Show Missing Values” option on the bin. Doing so adds NULL rows to the data backing the visualization, which you can then fill with values from table calculations (including SCRIPT_ functions). In your R script, you need to remove these artificially generated NULLs so they don’t confuse the curve-fitting procedure you’re applying.

I tied the bin size to a parameter so you can try different values to make the curves coarser or smoother.

If you want to take this technique a bit further, you could use one bin for each axis, which will allow you to create a grid. Then you can treat each cell like a pixel in a raster and draw shaded areas such as contours.

Below you can see two examples of this. The first one estimates the bivariate kernel density of the data points in the viz on the fly using R and draws the contours using the secondary axes of this dual (technically quadruple) axis chart.

Displaying kernel density as overlay in Tableau

The second chart uses the same data to fit a one-class SVM (support vector machine), often used for outlier/novelty detection, with a Gaussian radial basis function, then draws the decision boundary using the secondary axes (blue ellipse). Suspected outliers are shown in red while inliers are shown in white.

Displaying the boundaries of a one-class SVM as overlay in Tableau

You can download the sample Tableau workbook from HERE
