The Path to Hell is Paved with Poor Assumptions

Validating data may cost time, but refraining from it will cost more.

How wonderful is the life of a data journalist. There is so much data publicly available that whenever you are in need of a story, all you need to do is go to an interesting data repository and start questioning it and, lo and behold… you have a story. Now all you need to do is write it down in clear, readable prose and maybe throw in some exciting visuals and you’ve got something truly compelling that will surely get people talking.

Take this story at fivethirtyeight.com for example: “Mapping Kidnappings in Nigeria”. This story was published shortly after Boko Haram kidnapped over 300 schoolgirls, the event that started the international “Bring back our girls” campaign. It features a nifty interactive map that shows how the number of kidnappings rose rapidly in the past decade. Naturally it created a lot of buzz on social media…

[Screenshot: critical social media comments on the story]

As I am sure you will agree, this is not exactly the kind of publicity any journalism organization would want, but fivethirtyeight.com is actually an organization that prides itself on being dedicated to journalism. Its foundational manifesto specifically states “… one of our roles will be to critique incautious uses of statistics when they arise elsewhere in news coverage.” As commendable as that goal is, it does give the impression that they would think twice before publishing a story that treats ‘media reports’ as ‘discrete events’.

I admit I do not know the precise number of articles devoted to the murders of presidents Lincoln and Kennedy, but when I typed the phrase “president of the united states murdered” into Google, it yielded “about 41,700,000 results”. If we were to take all of those as discrete events, we’d all sincerely believe that being the president of the United States is the deadliest job in the world.

Yes, this example veers into the ridiculous, but then, it was a ridiculous mistake to make, especially for an experienced data journalism organization.

In all fairness to fivethirtyeight.com however, they did live up to one of the hallmarks of good journalism, namely that of transparency. Rather than pull the article, they owned up to their mistake. The article is still available, but now it starts with an admission of guilt, followed by an apology and a long explanation of the many errors made.

[Screenshot: FiveThirtyEight’s editor’s note acknowledging the errors]

So how come this mistake was made in the first place? Unfortunately, the editors did not explain that. However, the comments that I have read seem to agree on two things:

  1. “Poor proxy variables”.
  2. “Data has no meaning without context.”

Proxy Variables

The blog ‘Adventures in Statistics’ defines a proxy variable as “… an easily measurable variable that is used in place of a variable that cannot be measured or is difficult to measure.” In this instance, the journalist relied on news reports about kidnappings since she could not rely on official police reports. Unfortunately for her, and her editors, she forgot that, paraphrasing Erin Simpson, “All trend analysis using <a news article database> has to take into account the exponential increase in news stories which generate the data.”

In other words… if you’re using proxy variables, think carefully whether they really are applicable. In fact, it is probably best to check with both a statistician and an expert in the field you are covering. This will take time, but validating the data is a data journalist’s first responsibility.
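Simpson’s point can be sketched in a few lines of Python. The numbers below are entirely invented for illustration; the point is that raw counts pulled from a growing news database can soar while the share of coverage, a far better signal, stays completely flat.

```python
# Hypothetical illustration: raw counts from a news-report database
# exaggerate trends unless normalized by total reporting volume.
# All numbers are made up for demonstration purposes.

years = [2005, 2008, 2011, 2014]
kidnapping_reports = [20, 80, 320, 1280]          # reports mentioning kidnappings
total_reports = [1_000, 4_000, 16_000, 64_000]    # all reports coded for the country

for year, k, total in zip(years, kidnapping_reports, total_reports):
    rate = k / total
    print(f"{year}: {k} raw reports, {rate:.1%} of all coverage")
```

Here the raw count of kidnapping reports grows 64-fold, yet the share of coverage is a constant 2.0% in every year: the “trend” is an artifact of the database growing, not of the world changing.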

Data without context is meaningless.

The original criticism by Erin Simpson actually provided some good questions. Had they been asked, answered and used in the original analysis, they would have yielded quite an interesting, and validated, story.

  1. “Total number of stories coded for Nigeria over time (what is the shape of that curve)?”
  2. “What are the total number of events generated for Nigeria over time? (What is the shape of that curve?)”
  3. “How does the number of kidnappings compare to the number of coded events? Same shape? Key differences?”
  4. “How many overall events are coded with a specific geolocation? How many get coded to a centroid? (And where is the centroid?)”
  5. “How many kidnapping events are coded with a specific geolocation? Does that change over time?”
  6. “How does this information track with other open source reporting? HRW, UN, WB local NGO crime reporting? Can we corroborate trends?”

So why didn’t they take the time to ask these questions? Data visualization expert Alberto Cairo offers several suggestions for data journalism organizations to help them prevent making these costly mistakes. In my humble opinion, they all apply to this case.

“Data and explanatory journalism cannot be done on the cheap.”

Traditionally, data journalists worked in large news organizations with an excellent network and many resources. Organizations like fivethirtyeight.com lack those and would thus struggle to find the required expertise in time before publication.

“Data and explanatory journalism cannot be produced in a rush.”

This was likely the most crucial element in this example. In an environment which needs stories to be produced daily, journalists may well not have the time to stop, think, and verify that the way they have questioned their data set is actually valid.

“Part of your audience knows more than you do.”

Of course, that has always been the case, since journalists are not expected to be lawyers, engineers, or physicians. However, the combination of journalistic transparency and public data means readers can verify your conclusions and, if they find fault with them, let you (and the world) know. It is an additional risk that data journalism organizations need to incorporate into their work processes and business models.

“Data journalists cannot survive in a cocoon.”

As professor Paul Bradshaw mentioned in his lecture on “Setting Up ‘Data Newswires'”, the accuracy of the data needs to be checked by addressing the following four points:

  1. Who collected the data?
  2. When did they collect it?
  3. How did they collect it?
  4. Find another source of the same sort of data for comparison.

In other words, the data journalist needs to either know the domain herself or work with someone who does. This could actually work out well for the organization in combination with the previous suggestion. By reaching out to devoted audience members who you have reason to believe are experts, one can both increase audience satisfaction and verify the validity of one’s data.

What about time?

In data journalism, you cannot afford a hidden trade-off between time and validity. Once you have gathered a dataset for a story, you always need to take the time to validate your data and make sure you ask the right questions. Not doing so can have disastrous consequences that will make people question the value of your organization.

One potential solution to this problem is to make your audience part of the process. Turn the story into a series that is updated regularly. Start with an introduction that explains the data set and the question you want answered. Then invite audience members to collaborate with you. If they don’t have the knowledge themselves, they might know someone who does.

Of course, this does require another process of verification: checking whether these experts are indeed who they claim to be. However, over time you will build up a network of reliable experts you can count on. Better to have them assist before publication than criticize your story after it.

If data journalism cannot afford a hidden trade-off between time and validity, then it is best to be open about it and get people to keep coming back to you.


Sources:

Bradshaw, Paul (2014). “Setting up ‘Data Newswires'”.

Cairo, Alberto (July 9, 2014). “Data journalism needs to up its own standards”.

Chalabi, Mona (May 13, 2014). “Mapping kidnappings in Nigeria”.

Frost, Jim (September 22, 2011). “Proxy Variables: The Good Twin of Confounding Variables”.

Simpson, Erin (May 13, 2014). “If a data point has no context, does it have any meaning?”

The Functional Art (May 13, 2014). “When plotting data, ask yourself: Compared to what, to whom, to where, and to when?”


6 thoughts on “The Path to Hell is Paved with Poor Assumptions”

  1. I really like your last part, about time. The solution you suggested is a good approach to the problem, and it is a nice way to interact with readers in Web 3.0. However, why is data journalism powerful? Because big data builds the model for us, unlike our traditional way, where people build a model from a limited sample to explain why. Big data allows people to turn to correlation instead of seeking causation.

    So, if we use crowdsourcing to tell our story, does that mean we reduce the power of data? Or does that mean we split the whole dataset into segments? During this highly engaged process, does the potential reliability risk increase so much that it outweighs the benefit?
    As you said in the presentation, data journalism is never cheap; it needs time, money, and enough wisdom.

    Anyway, I am very interested in your suggestion, and I am wondering if we will have a chance to apply your idea in a certain program. I am serious.


    • Splitting the whole dataset can be done, but I wouldn’t recommend doing it blindly. The Guardian once did it though with great results, but they built a specific website for it and clearly had material that allowed this to be done. You can read about it here: http://www.niemanlab.org/2009/06/four-crowdsourcing-lessons-from-the-guardians-spectacular-expenses-scandal-experiment/

      However, this obviously cannot happen with all data. As Wernard details in his blog, the power of big data is that it’ll allow us to see patterns we wouldn’t have been able to detect otherwise. But as mentioned in my response to Gabrielle below… you can also rely on individual members of the audience who are clearly motivated to get things right. How different would the article mentioned above have been if the journalist had gone to Erin Simpson before and asked for advice on how to use this dataset?

      As for your last question, you mean whether we’ll have a chance to ask for audience participation in our fact checking mission? I have no idea, first we need to think of an article to fact check. But it might be fun to see if it could be done, yes.


  2. Very informative and strong blog! I enjoyed both your blog and your presentation of it! So congratulations on that!
    “Now all you need to do is write it down in clear, readable prose and maybe throw in some exciting visuals and you’ve got something truly exciting that will surely get people talking.” It seems you are reading my thoughts, as nowadays people seem to look for these kinds of articles, easy and fun to read, skipping more complex and serious ones due to lack of time or motivation… But that’s another topic.
    I do agree that data journalism needs a lot to be fulfilled. And I do have one question regarding the “Part of your audience knows more than you do” part. Of course, journalists can’t know it all, but do you think we can trust the audience? I believe it’s a thin line, as there are a lot of misconceptions floating around that create mass opinion which might be… wrong. What’s your opinion on this? How and when should we trust readers’ opinions?


    • Do I believe data journalists should trust their audience? Yes, I do. Of course we can’t do it blindly… as Wernard points out below. His response is something I agree with, so I’ll let that stand as part of my own answer to your question.

      But I do want to extend it with this parable:

      “The editor of the Dutch Financial Daily (Financieel Dagblad) once remarked how before they had a paywall up… the most viewed story that year had been about a famous but now very old Dutch clown becoming a father again. However, the following years readers could only access 10 free articles a month… and they did not waste them by reading light, fluffy pieces.”

      Although almost everyone may enjoy reading those short and silly pieces (I admit I did read it myself), when people are motivated to seek out a certain site they will go for that site’s strengths. Which is why, imho, an organization dedicated to data journalism should make sure they process their data correctly. Because chances are high that its audience will itself have an interest in data processing, the data sets, and the domain itself.

      In other words… the audience of a data journalist will almost certainly contain people who are both highly motivated to getting the facts right, and who have the appropriate skills and knowledge to be of assistance. And a data journalist will be able to recognize these individuals over time and reach out to them. Of course, credentials will need to be checked… just like with any consultants in regular media, and April’s point about motivation needs to be considered as well. But overall… if you want your audience to trust you, you need to trust your audience.

      And as a good data journalist, you’ll actually have the tools at hand to verify that trust. 🙂


  3. I actually really like the point about making the audience part of the process. I think it’s a good way of dealing with the problem of time and workload. Of course, you can’t trust the audience blindly. But, as I have mentioned elsewhere, the audience could be self-regulating. The idea is that if enough people collaborate with you, they will correct themselves. This might be an overly idealistic view of crowd sourcing, and obviously there are some pitfalls that need to be avoided. For instance, it would only work if enough people participate… I still think the idea is solid though.


    • Thanks for answering Gabrielle’s question for me, Wernard. 🙂

      A blog I personally love is that of Norman Tebbit, an 83-year-old British politician who writes a weekly blog for the Daily Telegraph. You can find it here: http://blogs.telegraph.co.uk/news/author/normantebbit/

      What I love about the blog aren’t his views, but the unique way he presents them. All posts are divided into two parts. The first part contains his vision on current events, and in the second he summarizes and comments on the views expressed by his audience in the past few weeks. He’ll name those commenters whom he found noteworthy, both supporters and detractors. It’s a very old-fashioned, polite way of blogging that I haven’t seen anywhere else. And one that imho shows his respect for and trust in the audience.

