Tell me if this sounds familiar. You’re tasked with analyzing data, but at first glance you find files in different formats, mismatched column headers, and worst of all, inconsistent values. Those Hermès hand bags appear as “Hermes”, “Hermès”, and even “Hermis” in some cases. You realize that analyzing the data is really an exercise in cleaning and preparing data – but how best to tackle it?
You spend 80% of your time doing what?
The scenario above is pretty common. Ask any data professional about their allocation of time, and the answer you’ll hear is 80% – that is, 80% of their time is preparing data, and only 20% is conducting analysis. Industry leaders like DJ Patil quote the 80% figure, and it’s frequently cited at conferences like DataWeek and Strata. When such a large chunk of time is devoted to preparing data, selecting the most efficient approach is crucial.
From personal experience, I can relate. We recently had a project analyzing retail data from multiple stores, where store and location were combined. A fictional example looks like this:
“STARBU #00498 PORTLAND OR USA 97626”
In our case, there were misspellings of Starbucks, missing store numbers, and cases where states were not abbreviated (but then truncated). Before we could begin our analysis, we had to clean it up.
Approaching the problem
There are a few ways to tackle this sort of problem, depending on size of the data set, timeline of project, and technical skills available. The three most common approaches are:
- The Manual Approach: The most direct way is to have an analyst correct the values in Excel. Using a combination of sorting, copy and paste, and some formulas, the analyst brute forces the problem until they arrive at a “good enough” file to analyze.The benefit to this approach is you get immediate results, but caution is required. Because this work is done by hand, it’s prone to mistakes, inaccuracies, and it’s difficult to keep files up to date. If the data prep is a regular activity, or the data increases in volume, scaling the project could mean hiring additional analysts or outsourcing to platforms like oDesk or CrowdFlower.
- The Scripted Approach: Balancing technical skill and setup time, a middle of the road option is to write a script to prepare and clean the data. My background is in Linux, and I tend to favor the text processing tools available like bash, awk, and sed. If additional complexity is needed, more featured scripting languages like perl, python or php are also options.Just as in the manual approach, caution is required. Taken too far, scripts for large data preparations become brittle, requires supervision to run, and are difficult to maintain. I’ve seen organizations try to run their entire preparation and ETL process with bash and perl, and the results aren’t pretty.
- The Enterprise Approach: Solutions from Informatica, IBM and SAP have been around for years, and while powerful, requires a large upfront investment in skills, time and money. For a big, well understood data cleansing operations, these options can make sense, but be wary of using a sledge hammer when a sewing needle will do the job.
Innovators on the Horizon
For many years, these three approaches were the only ones available. Recently, some new arrivals have emerged that are promising advanced functionality for non-technical users, and radical productivity boosts. Some of the front runners include:
- Paxata: Backed by Accel and In-Q-Tel, the leadership team originates from industry heavy weights TIBCO, SAP and Tableau. Today, Paxata is most suited to ad-hoc analytic exercises, similar to the payments data example described above, where you start with a number of files to analyze. Paxata can import and blend files directly, without needing to be tied to a database or data warehouse. The data is cleansed with easy options for ngram and metaphone matching algorithms, and the output made available to BI tools like Tableau and Qlik via ODBC.
- Trifacta: Backed by Accel and Greylock, the concept originated from a Stanford research project called “DataWrangler”, and the leadership team draws significant executive expertise from both academia and Informatica. Trifacta allows users to blend, clean and refine data in a preview, then execute the transformation either locally or pass the job to a Hadoop backend.
- ClearStory: Backed by Google and Andreessen Horowitz, ClearStory encompasses both data blending and cleaning, like Paxata and Trifacta, but also adds a layer of visualization and annotation. ClearStory is different in this regard – Paxata and Trifacta both highlight their partnerships with BI tools Tableau and Qlik to facilitate visualization.
With a growing ecosystem of tools to address the cleaning and preparing step, hopefully the 80/20 rule will soon be behind us.
Image credit: “Face Palm” by Brandon Grasley