Let’s talk Data Visualization today because the Joshua Katz study “Beyond Soda, Pop, or Coke” is stirring folks up again!
“I don’t sound anything like I’m from Boston!”
“Arkansas?! I’ve never even visited Arkansas!”
“Do I really sound like I’m from upper Wisconsin?”
As interesting as the study is, its insights are buried because Katz’s dashboard/report is weak.
BEYOND SODA, POP, OR COKE
This year, Katz, a PhD candidate at North Carolina State University, compiled responses and used data from a 2003 Harvard study on dialects in the continental United States. The result is a questionnaire you can fill out to get your own dialect map, showing the U.S. regions your dialect is most and least similar to. Here’s my map:
Katz’s study came to my attention when a friend in New Zealand took the quiz and got lots of dark, hot red in Wisconsin’s Northern Highland. This was certainly strange. It was also strange that my map had hot zones in southern Washington, eastern North Carolina and Virginia.
MAKING SENSE OF THESE MAPS
Where things begin to make sense is in the explanation of what the colors mean:
“The colors on the large heat map correspond to the probability that a randomly selected person in that location would respond to a randomly selected survey question the same way that you did.” NYT Sunday Review, How Y’all, Youse and You Guys Talk.
So, for my New Zealander friend, the red in upper Wisconsin only means that, of everywhere in the continental U.S., that’s where there’s the greatest likelihood of finding someone who’ll answer a random question the same way he did. ‘Likelihood’ is the key word. He didn’t keep his numerical scores, but that likelihood was probably quite small, with the likelihood in the rest of the U.S. being even smaller. Here’s my map with the numerical scores, telling a much different story:
Notice the spread
Most Similar Lafayette, LA: 47.6 (never been there)
Least Similar Boston, MA: 44.3 (lived in Cambridge, MA for 2 years)
Preliminary Conclusion: Splitting the difference between those two scores, there’s roughly a 46% likelihood (45.95%, to be exact) across the whole continental U.S. of someone answering a random question from the quiz the same way I did. OK … a less than 50% chance. WHOA!
We could rephrase this as “Lafayette is least dissimilar” to my dialect. We could even dig deeper into the math behind the survey. Example:
- 0.58% of respondents say they pronounce aunt the same as ain’t. (I probably got it from watching the Andy Griffith Show.)
- 8.42% pronounce the vowel in bag like the a in say.
A lot of such outlier scores will pull an overall score in one direction or another. Now contrast that with the map of a friend who’s from Minnesota. Her map is hottest in Minnesota, the 56.4% likelihood makes sense, and her first 3 Most Similar are all in the same region of the U.S. Those 3 pieces would lead a person to conclude, “Whoever took that dialect quiz is probably from Minnesota.”
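To put numbers on “the spread” in my own results, here’s a quick sketch in Python. The variable names are mine, purely for illustration; the two scores are the ones listed above, and nothing here reflects how Katz actually computes anything.

```python
# Scores copied from my results above (percent likelihood of a matching answer).
# The variable names are illustrative only.
most_similar = ("Lafayette, LA", 47.6)
least_similar = ("Boston, MA", 44.3)

# How far apart are my "hottest" and "coldest" cities?
spread = most_similar[1] - least_similar[1]

# The halfway point between the two scores.
midpoint = (most_similar[1] + least_similar[1]) / 2

print(f"Spread between {most_similar[0]} and {least_similar[0]}: {spread:.1f} points")
print(f"Midpoint of the two scores: {midpoint:.2f}%")
```

A spread of only about 3.3 points between my most and least similar cities is what the hot-red versus cool-blue coloring hides.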
COMMENTS ABOUT DATA VISUALIZATION
For those of us who work with datasets and have to present results, there are good lessons to be had. Data visualization is powerful! Without a visual, understanding data can be like trying to hear music by reading notes on a staff. A visual shows us differences, variations, etc. But the reaction to Beyond Soda, Pop, or Coke shows us a number of things.
1. Our audience can take a portion of a result and draw their own conclusions without looking into the definitions or asking questions. They aren’t wrong or lazy; it’s just that not everyone works with data and automatically sees which questions to ask. When creating dashboards, we have to do whatever we reasonably can to make a presentation make sense, but also prepare for whatever response our audience may have, because we can’t control everything.
2. It’s important that Katz included the numbers along with the sexy map because both are needed for any meaningful insight, even if the insight is a suspicion that something is awry. If we want to go digging, we have more information on where to start and how to define what we’re looking for.
3. Dashboards are useful for measuring the same data from different angles. The map and the numbers in the dialect results are a small dashboard showing the same thing 2 different ways. I think there could be additional indicators to make individual results even clearer. What might have kept my New Zealand friend from asking if he really sounds like a Northern Highland Wisconsinite?
- An additional map that would probably be shades of light blue to dark blue.
- Instead of giving everyone 5 Most Similar and 5 Least Similar regions, there wouldn’t be a score at all unless the numbers were within a certain range; e.g., a “Most Similar” needs to be above 50%.
- Place the percentages directly on the map so that it’s clear when maps are side-by-side that one person’s 60% hot-red Northern Highland can’t be interpreted the same as someone else’s 40% hot-red Northern Highland.
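Here’s a minimal Python sketch of two of those ideas: only labeling a region “Most Similar” when it clears a cutoff, and coloring every map on the same fixed scale so one person’s red means the same as another’s. The function names, the 50% cutoff and the 0 to 100 scale are my own assumptions for illustration. They are not part of Katz’s quiz.

```python
def label_most_similar(scores, threshold=50.0):
    """Return only the regions scoring above the threshold, highest first.

    `scores` maps a region name to a percent likelihood. The 50% default
    is an assumed cutoff, not something taken from the quiz itself.
    """
    kept = [(region, pct) for region, pct in scores.items() if pct > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


def shade(pct, low=0.0, high=100.0):
    """Map a score to a 0-to-1 color intensity on a scale shared by all maps.

    Because `low` and `high` are fixed, the same percentage always gets
    the same shade, no matter whose map it appears on.
    """
    clamped = max(low, min(high, pct))
    return (clamped - low) / (high - low)


# My two scores from earlier, plus my Minnesota friend's top score.
mine = {"Lafayette, LA": 47.6, "Boston, MA": 44.3}
friend = {"Minnesota": 56.4}

print(label_most_similar(mine))    # nothing clears the assumed 50% cutoff
print(label_most_similar(friend))  # Minnesota does
print(shade(47.6), shade(56.4))    # my "hottest" is cooler than her "hottest"
```

With a shared scale, my hottest region would be drawn visibly cooler than my friend’s Minnesota, which is exactly the difference the current maps hide.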
The dialect study is truly fascinating. But the final report/dashboard should have been just a little more informative, to make it clearer that “No. We aren’t trying to guess where you’d sound most at home.” As part of the message here on my website, this is another aspect of data management. We can do the analysis, but when it’s time to present the data, we should ask whether we’ve done enough to help our audience get the message.