Getting to the Sexy Data - Oz du Soleil

Hey! ‘Sexy’ isn’t my word. It’s used twice in the New York Times article, For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights.

‘Sexy’ is used once each by Data Scientists Michael Cavaretta and Matt Mohebbi in this article that describes the problem of getting to the sexy aspects of working with data; i.e., insights, facts and accurate predictions that help you make smart decisions.

[Image: Lips]


Verily, verily, before the analysis, big paychecks, and predictions, someone’s got to clean and shape the data that’s been accumulated. In one example, Iodine, a company that compiles data on the side-effects of medicines, is faced with merging data from multiple sources and each source uses a different word for the same symptom: drowsiness, somnolence, sleepiness.

No analysis before those are merged into a single term. Hence, Mohebbi says:

[Image: Matt Mohebbi quote]

According to the NYT article, data scientists report spending 50% to 80% of their time being data janitors. The article describes various companies that are designing software that can ease the janitorial work. And that leads to the second use of ‘sexy.’


It’s great to see the New York Times write about this un-sexy aspect of data management because the janitor work isn’t discussed enough. Yet, it’s a hurdle to much more than insights. Crap data (my formal term) has an impact wherever there’s data.


I recently talked with a real janitor who is in charge of pest control on a corporate campus. His data was such a mess that he couldn’t keep appropriate inventories of traps, goose repellent, and pesticides on hand. He had the data, but he and his team lacked the skill for the ‘data janitor’ work, and couldn’t get to the cool sexy data which would tell him what to order based on the time of year, known activity of each pest, and the shelf-life of the pesticides.

The pest control guy’s plight is more interesting to me than the Data Scientist’s. They both share similar hurdles. However, a true Data Scientist is likely to be in a situation where there are resources for cleaning data. Those tools/services that are described in the NYT article aren’t cheap. One of the companies, Paxata, offers Data Preparation Services for an individual: $3500/year. For a team: $10,000/year. Not gonna happen for the pest control team!


You are not alone! The NYT article opens a bigger conversation that needs to be discussed far more than Big Data. No one who works with data escapes being a data janitor. It also tends to take up a painful amount of time, even for analysts who have been formally trained. As students, they work with clean data, learn their statistics and data visualizations, and then something happens. Reality hands them 3 terms for ‘drowsiness’ and a healthy helping of misspellings for each across 100,000 rows of data. Also:

  • Data that’s locked in PDF files
  • Numbers that won’t calculate because they’re formatted as text
  • A mix of spreadsheets, text files, PDF and CSV files that need to be appended and merged
  • Incomplete data that prevents any kind of useful analysis
  • Lists that need to be compared for differences between them

… and almost no training on how to deal with any of it.
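To make two of those chores concrete, here is a minimal sketch in Python (my choice for illustration; the same ideas carry over to Excel or any other tool). It merges the three ‘drowsiness’ terms, tolerates minor misspellings, and converts numbers stored as text. The lookup table, the function names, and the 0.8 similarity cutoff are all assumptions made up for this example, and the number cleanup assumes US-style commas and dollar signs.

```python
import difflib

# Hypothetical lookup table: every known variant maps to one canonical term.
CANONICAL = {
    "drowsiness": "drowsiness",
    "somnolence": "drowsiness",
    "sleepiness": "drowsiness",
}


def normalize_symptom(raw, cutoff=0.8):
    """Return the canonical term for raw, tolerating minor misspellings."""
    key = raw.strip().lower()
    if key in CANONICAL:
        return CANONICAL[key]
    # Fuzzy-match against the known variants to catch typos.
    close = difflib.get_close_matches(key, list(CANONICAL), n=1, cutoff=cutoff)
    # None means "no confident match": flag the row for a human to review.
    return CANONICAL[close[0]] if close else None


def text_to_number(raw):
    """Convert a number stored as text to a float, or None if it isn't one."""
    # Assumes commas are thousands separators and '$' is the only currency mark.
    cleaned = raw.strip().replace(",", "").replace("$", "")
    try:
        return float(cleaned)
    except ValueError:
        return None
```

Returning None instead of guessing is deliberate: anything the sketch can’t resolve with confidence gets set aside for a person to look at, rather than silently merged into the wrong term.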

So, it’s ok to tell the truth: being a data janitor sucks!

But there’s good news. For most of us, adequate tools already exist, and are inexpensive.


“ye have not because ye ask not.”

Ask For Help

For whatever reason, asking for help is regularly treated as an undesirable option, ranked somewhere after long, silent suffering.

The Excel community is vast, and incredibly helpful. A lot of us enjoy helping others and responding to new challenges. Check out my list of Bad-Ass Excel Bloggers. We can help get you to the sexy data.

Pay For Help Or Training

Professional help or training is going to require a specialist because there aren’t any “data janitor” courses. Spend the money to have a specialist do the janitor work or teach your team how to do the janitor work.

The main point is that there’s help, and it doesn’t require special tools. When a specialist is brought in, it’s worth the money. Here are two examples:

  • A small company took 2 people off other regular jobs and assigned them to compare 2 lists of several thousand rows. After 2 days, they were through one-fourth of the project. When they turned it over to me, it took just 45 minutes to complete the whole job. It wasn’t magic. It was many years of doing janitor work and having seen many versions of their problem before.
  • An entire office was stopped because their payroll spreadsheet was wrong. After 2 days of the owners combing through the spreadsheet looking for the error, they asked for help. One hour later, we found the error, fixed it and forwarded the spreadsheet for payroll processing. Again, it’s a matter of experience and having a strategy for troubleshooting and janitor work.
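For a sense of why a list comparison like the one in the first example doesn’t have to take days, here is a minimal Python sketch. It is not the method used on that job, just one common approach: clean both lists the same way, then let set operations find the differences. The function name and the cleanup rules (trim whitespace, ignore case, skip blanks) are assumptions for illustration, and real lists usually need more cleanup than this.

```python
def compare_lists(list_a, list_b):
    """Report which entries appear in only one list, and which are in both.

    Entries are compared after trimming whitespace and lowercasing, an
    assumption that suits names or IDs but not every kind of data.
    """
    def clean(items):
        # Blank entries are dropped; duplicates collapse into one.
        return {item.strip().lower() for item in items if item.strip()}

    a, b = clean(list_a), clean(list_b)
    return {
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
        "in_both": sorted(a & b),
    }


result = compare_lists(["Acme ", "Bolt", "Crane"], ["acme", "Crane", "Dune"])
print(result["only_in_a"])  # entries missing from the second list
print(result["only_in_b"])  # entries missing from the first list
```

The hard part is rarely the comparison itself. It’s knowing which cleanup steps the lists need before the comparison can be trusted.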

When the owners of a company are digging into a spreadsheet that impacts 30+ people’s paychecks, it’s smart to spend the money for professional janitor work and move on.

More Data Literacy Conversations That Include Janitor Work

Early in 2014, data professionals Trina Chiasson and Dyanna Gregory led an effort to create an eBook with contributors from around the world. DATA+DESIGN: a simple introduction to preparing and visualizing information is a free and open-source guide for non-data people who need to work with data. Chiasson describes the original vision as such:

We all believed in the vision of making data simple.

But working with data can be far from simple. Data come in all different shapes, sizes, and flavors. There’s no one-size-fits-all solution to collecting, understanding, and visualizing information. Some people spend years studying the topic through statistics, mathematics, design, and computer science. And many people want a bit of extra help getting started.

DATA+DESIGN is purposefully written in plain language to help get people started. It’s also written to be tool-agnostic so that users aren’t distracted by lingo surrounding Excel, SQL, JavaScript, Regular Expressions, etc. Thus, it’s conceptual and espouses strategies more than how-to. I wrote the chapter Getting Data Ready for Cleaning, and was excited to see that the final product is a wonderfully comprehensive view of working with data. DATA+DESIGN is the full truth of what it takes to get to the sexy.

Please! Keep this conversation alive. Being a data janitor does suck. We just need to tell the truth about it and ask for help. We need more urgency around this problem of janitor work and the time it takes for all of us. Solutions are closer than you think. The sexy data is closer than you think.

useful data is sexy

Lips photo by Daniaah