Data Cleansing (Data Scrubbing) the process of amending or removing data that is incorrect, incomplete, improperly formatted, or duplicated.
WARNING Hold on to your hat. This can get ugly.
The cost of cleaning data is tough to get at. I used to charge $150 to examine a data-set and $90/hour to do the cleansing. The final charge could range from $300 to well over $2000. But that depends on many factors, including:
- Amount of data
- How messy the data-set is
- If anything needs to be extracted from a PDF
- Number of sources that need to be merged
- If Excel is our tool, will VBA code need to be written?
- How much manual brute force must be applied
- Client’s budget
Other data cleansing service websites only say “contact us for consultation” and this is understandable because of the many variables that go into the cost of cleansing data.
3 POINTS ABOUT DATA CLEANSING
1. Of all the areas of Data Management, data cleansing is hardest to teach and is closest to The 12th Labor of Hercules: Kidnapping Cerberus.
Data cleansing/scrubbing/parsing is something I haven’t yet attempted to teach in my workshops. Over 15 years of experience, and data cleansing is the area where there is always something new in the way that data can be filthy, requiring a new level of cunning and guile. (How do you teach someone to capture a 3-headed dog at the entrance to Hades? I don’t know yet.)
Scrubbing data is conceptual and requires a lot of experience. It’s more strategic than anything else. Formulas can help you move faster but a solid strategy will always yield a better result than a quick solution with a vague strategy. Solid strategy is developed after a wide array of experiences. And that can’t be taught.
2. Data cleansing is not a $14/hr admin task. From outward appearances, data cleansing looks like drag-&-drop, but it’s not. There are patterns to look for. There are ways of isolating specific problems and handling them on their own.
When cleaning data you must minimize the level of human intervention. Copy-Paste is an absolute last resort. You don’t want to get into the middle of that, get interrupted by an important call and then remember where you were with the epic Copy-Paste job. It’s easy to lose one’s place with this and make an error or waste time.
One time, I used VBA code to peel apart 2000 rows of bold and non-bold text because the bold text was the data of interest. It was instant and it was clean.
You don’t get VBA coding for $14/hr.
3. Much of the crap data-sets I’ve seen were preventable. Whether we’re solopreneurs or an international company, Data Management has to be a conscious part of the culture, otherwise, we’re running an operation on crap data. When it starts hurting bad enough, it’s expensive to hire Hercules to handle your Cerberus.
Let’s probe a little deeper …
KAYE: BUSINESS ANALYST & UNWITTING DATA SCRUBBER
Kaye is a Business Analyst hired to review a company’s Classroom Operations data. She knows classrooms, processes, training, logistics. She can determine if a course is a loser or not.
What Kaye hates is filthy data! Even at $130/hr she hates cleaning data. Over lunch she rants to her friend Mindy, “It’s going to take the rest of the [insert expletive] week to clean the data I got today! 31 misspellings and abbreviations of Albuquerque in 1500 transactions worth a half-million dollars .”
Mindy asks, “If you know it’s Albuquerque, what’s the big deal?”
Kaye explains, “When the client wants to know the number of class cancellations, revenue and profit margins, every one of those 31 variations will show up like a unique city. The Albuquerque market can look like a flop when $500,000 is spread between 31 records instead of 1. The wrong person sees that and starts taking about pulling the courses.”
Mindy looks confused.
Let’s me say it this way, Kaye continues. “If there’s Albuquerque and $300,000 tied to it, that might be far below what’s needed to continue in that market. But wait! There’s $10,000 tied to ‘Albukerkee’ and $100,000 tied to ‘Albuquerquee’ for a total of $200,000 associated with the misspellings. $500,000 is worth staying in the market, but I’ve got to clean the data before we can see all of the $500,000.
“Oh damn! I get it.” Mindy says.
Kaye sighs and continues, “Now get this: 8 ways to abbreviate and misspell ‘County Hospital.’ I’m a Business Analyst, not a spellchecker!”
THE DATA CLEANSE (APPROACHING CERBERUS)
Understand this. Kaye’s high-flying Excel and database skills are in statistical functions, graphing and building dashboards. Using Excel’s CHAR function to find a carriage-return to scrub data is not in her arsenal. Yet, Kaye’s real $130/hr job can’t start until someone cleans that data, and it’s going to be Kaye
- 15+ miserable hours of brute force, tedious work outside of her skill set
Below is an embedded Excel worksheet with just 8 rows, 7 columns. Imagine thousands of rows, dozens of columns and 20 sheets! Now we’re within view of Cerberus!
The Recon: What are we dealing with?
- No one has a pager #. Get rid of this column.
- Phone numbers are in the address field
- Angela Wayne: her Office is in her Department field
- Kevin Hayes is in here 2x.
- Kevin Hayes Hayes
- Kevin Hayes
- Kevin Hayes has 2 different Join Dates
- Inconsistent date formats
- Mo’s Join Date is just ‘September’
- Toni has no Last Name, no Join Date, no Address
- What’s with the bizarre spelling of Albuquerque in the Office column?
This has to be cleaned before any analysis can happen.
For more details, I did a guest blogpost for Ann Emery at EmeryEvaluation.com called Pivot Table: Your Tool for Exposing Miscreant Data. Just a small glimpse into the reconnaissance ops in preparing to capture Cerberus.
SUMMARY: THE COST OF DATA CLEANSING (I.E. THE COST OF KIDNAPPING CERBERUS)
As you see, the answer to the question is: the cost of cleansing your data is complicated and expensive.
The cost is in:
- the money
- paying someone to do something that’s not their expertise AND they hate doing it
- the operational costs of the data being filthy
- The cost of implementing new processes so that the clean data doesn’t go back to Hades
Northwest Database Services charges $200 set-up fee per list, $125/hr. I had the aforementioned rates. Other services use data scientists and create tools that are subscription-based at several hundred dollars per month for successful data cleansing.
For small businesses, you’ve just got to avoid this whole situation as best you can. You can do this by having a Data Management mindset and taking precautions. Many of us run our operations on Excel (even in major companies that have expensive data services, there are departments fueled by Excel spreadsheets). This is where having skills on the front end can save you on the back end. Use data validation, conduct your own minor but continual data cleansing, keep data consolidated, and be on guard for any chance that you’re setting yourself up to have to go capture Cerberus. It’s not pretty, and its not cheap. Some businesses are simply stuck because they don’t have the money for a thorough data cleanse.
As always, I invite you: keep your data clean. Now you know why.
I’m retiring from the hands-on data-scrubbing but, you can contact me for live workshops and 1:1 Excel training to empower you to keep your data clean or guide you in the capture of your Cerberus.
Hercules & Cerberus Photo Credit: katty_meoOz at Google+
@Kevin Gilds MPA @OzData Lordy lordy! I hear so much about those. People inherit a spreadsheet from someone who quit or got fired. So, there's no telling what all the italics and colors mean. Do you just ignore it and plow ahead?
I was lucky. I never inherited those mysteries. I inherited one spreadsheet that had MASSIVE formulas in it. That forced me to step up because those formulas calculated commissions. "Someone else wrote the formulas" wouldn't be a good excuse if something went wrong.