This week, Chris Penn’s blog post was about “When Bad Data Can Be Okay.” I agree with everything he says about distinguishing between data that’s
- predictably wrong and still useful
- just plain unreliable and not useful
- mysterious, but you might be able to figure it out if you fiddle with it long enough
Chris closes by saying that rather than asking if our data is right, we should ask if it’s reliable. I would go a step further and suggest asking: is the data reliable *enough*? Consider three scenarios:
- You mail 500 coupons and 55 are returned as undeliverable: GOOD ENOUGH
- Out of 500 tissue samples, 55 of the wrong people are told that they have lung cancer: OH GOD, NO!
- 55 of 500 part numbers are missing, but all 500 have long been discontinued: GOOD ENOUGH
We need context to determine whether it’s bad data or “good enough.”
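To make that concrete, here is a minimal Python sketch of the same idea. Notice that the raw error rate is identical (55/500 = 11%) in all three scenarios; only the context-specific tolerance changes the verdict. The tolerance values are hypothetical, purely for illustration:

```python
# Minimal sketch: the same error rate can be "good enough" or a disaster,
# depending entirely on the tolerance your context allows.
# Tolerance values below are hypothetical, for illustration only.

scenarios = [
    ("Undeliverable coupons",       55, 500, 0.15),    # direct mail tolerates waste
    ("Wrong lung-cancer diagnoses", 55, 500, 0.0001),  # near-zero tolerance
    ("Missing discontinued parts",  55, 500, 1.0),     # nobody needs these anymore
]

for name, errors, total, tolerance in scenarios:
    rate = errors / total  # 55 / 500 = 11% in every scenario
    verdict = "GOOD ENOUGH" if rate <= tolerance else "OH GOD, NO!"
    print(f"{name}: {rate:.1%} error rate, tolerance {tolerance:.2%} -> {verdict}")
```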
Data is Never 100% Right, Not for Long …
… so, all we have available is “good enough.” As soon as you get a pristine dataset, a client changes their phone number, a supplier changes their part numbering system, someone goes online and creates a duplicate profile, rounding errors accumulate, or you yourself accidentally save a date that is completely wrong. Thus, we have to accept that a dataset is never “right.”
But if the data isn’t good enough for our purposes, the next questions are:
- Is it worth the time, money, and opportunity costs to make the bad data good enough?
- Can you ever get the bad data good enough, or might laws and privacy restrictions prevent access to what you want?
Accepting these shifting realities of data is a tough pill to swallow for young analysts and for people who don’t work directly with data. In the guest blog post PIVOT TABLES: YOUR TOOL FOR EXPOSING MISCREANT DATA, which I wrote for Ann Emery’s site, I admit that in my first few years of working with data, I wasn’t asking these questions. Only after enough embarrassment did it occur to me to first check whether I had bad data or good-enough data.
It’s interesting that Chris Penn found this topic blog-worthy. He must have identified it as a legitimate concern, as I have. It comes up in Excel workshops when we cover topics like dashboards, pivot tables, and the new Data Model feature in Excel 2013. Rather than diving in, calculating, and making graphs and summaries, the very first question must be about data quality:
Is this data good enough for what we’re trying to do?
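The guest post uses Excel pivot tables for that first look; as a rough sketch of the same habit in code, here is a short pandas example (the column names and data are made up) that uses pivot-table-style summaries to expose miscreant data before any analysis begins:

```python
import pandas as pd

# Hypothetical sample data; in practice you'd load your real dataset.
df = pd.DataFrame({
    "region": ["East", "West", "East", "east", "West", None],
    "sales":  [120.0, 95.5, 130.0, 88.0, -40.0, 210.0],
})

# Pivot-table-style summary: counts per category expose miscreant data
# before you build any dashboard, chart, or calculation on top of it.
print(df["region"].value_counts(dropna=False))
# Here this reveals a lowercase duplicate "east" and a missing region.

# A quick numeric summary flags impossible values, like negative sales.
print(df["sales"].describe())
```

Two lines of summary like this answer the “good enough?” question faster than any amount of charting after the fact.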
lightbulb image courtesy of smarnad at freedigitalphotos.net