I’m getting ready to teach a 2-session workshop on parsing data in Excel. As with other courses, before I teach them for the first time, I have to clarify the mindset to decide who would get the most from the course. Scrubbing data, reorganizing columns, peeling part names away from part codes … that’s all different from building a spreadsheet for managing a project. Why is this a big deal? Two reasons:
1. In “What Does It Cost To Clean My Data?” we meet Kaye, a Business Analyst who has had messy data dumped on her and she also inherits the job of cleaning it. Many of us are like Kaye: we’ve got our “real job” to do but are faced with hours or days of screwing around with source data before we can do what we are supposed to be doing.
2. Many people who consume data but don’t work with it are oblivious to the struggle and cost of getting useful data.
So, let’s explore this.
THE DATA MANAGEMENT MINDSETS
Validation & Data Quality require a mindset of “no surprises.” If there’s anything strange, we already know about it and addressed it.
Building Apps is a fun adventure where we balance artistry, functionality, time and budgets.
Data-Parsing? Just plain evil in a sexy way.
DATA-PARSING IS A SPORT!
Data parsing is what we all have to do at some point because
- We received data that isn’t formatted the way we need it.
- We received crap data that has to be cleaned up.
- We have our own crap data in multiple sources that has to be merged and cleaned.
Because many of us grudgingly plow through this, few people are really skilled at parsing data. We “kinda do it.” But like with any full contact sport, you can’t “kinda do it.” You can’t kinda get up on the back of a bull or kinda step onto an Olympic wrestling mat.
I was recently discussing a project with someone and warned that their data collection project would be easy. Unbeknownst to them, however, making the data useful is where the real problem lies. As they continue forward, I’m reminded of Alfred Lord Tennyson’s Charge of the Light Brigade:
into the valley of death rode the 600
theirs is not to reason why,
theirs is but to do and die
Gathering 2000 addresses off the internet is easy. But, having name, address, city and state all within a single cell isn’t a useful format. That’s not the end. That’s only gotten you on top of the bull in the chute. Now whatcha gone do?
A dataset is a foe that comes to us however it comes to us. So, before making contact with it, look for patterns, use pivot tables, do some sorting and filtering just to see what’s what. Diving in and starting to clean data without some assessment of your opponent is a death wish.
What we need to do is building strategy and ask questions.
- How many records are you dealing with?
- Are there duplicates or was that addressed by the source of the data?
- What does done look like?
- Is there any “noise” that can be deleted? Clear that stuff out of the way so that you can see. (Pager numbers, middle names, if you’re looking at parts inventories maybe you don’t need the items’ weights)
After the assessment, we need nerve, focus, experience, stamina, patience and strategy more than we need tools.
GANGNAM STYLE OR GUERRILLA STYLE?
We don’t get a choice when we’re parsing data. When building an app, we can negotiate with an end-user who wants too much automation. Not with data-parsing, and that’s why it’s both evil and sexy.
The data comes to us how it comes to us. If there are easy patterns and little variation, the competition will be graceful and quick. However, as is the case with extracting data from PDFs, that s#!t ain’t never easy. I run PDFs through at least 3 converters and go with the result that will be the least guerrilla style in bending the will of the dataset and forcing it to cooperate. I have never ever had data come out of a PDF converter immediately ready for consumption. When the ornery mess of data emerges from the PDF converter, the fight is on, baby!
Once the battle begins, most important is the ability to articulate a task in English so that you can choose the right attack. In the image to the right we need the city and state separated. There is a carriage-return that’s creating a problem.
In Excel a carriage return is CHAR(8). We can do a FIND for CHAR(8) and REPLACE with something that Excel’s text-to-columns feature will recognize.
Again, though, this is strategy. This is like stringing together wrestling moves to get the data separated by using CHAR(8), REPLACE, and Text-to-Columns. This is like the bullrider reacting in-the-moment to a spinner bull that changes directions. This the data-parsing mindset!
The sport of Data Parsing is intriguing because we have little control over what we get, and we have to produce. The people who are good at their full jobs, not just what they “really” do, are those who empower themselves around parsing data. They embrace the intrigue, learn strategies, and exercise patience. They save time, reduce stress and get on with life. Are you ready to get up on that bull or step onto that wrestling mat? If not, don’t “kinda do it.” Get someone who “really does it.”
Women Olympic Wrestling photo credit: familymwr via photopin cc
Jumping Bull photo credit: Bill Gracey via photopin cc
Bull in Chute photo credit: emilio labrador via photopin cc
USA Wrestling photo credit: LondonAnnie via photopin cc