The Path to Hell is Paved with Poor Assumptions

Validating data may cost time, but skipping it will cost more.

How wonderful is the life of a data journalist. There is so much data publicly available that whenever you are in need of a story, all you need to do is go to an interesting data repository and start questioning it and, lo and behold, you have a story. Now all you need to do is write it down in clear, readable prose and maybe throw in some visuals, and you’ve got something truly exciting that will surely get people talking.

Take this story, for example: “Mapping Kidnappings in Nigeria”. It was published shortly after Boko Haram kidnapped over 300 schoolgirls, the event that sparked the international “Bring Back Our Girls” campaign. It features a nifty interactive map that shows how the number of kidnappings rose rapidly in the past decade. Naturally it created a lot of buzz on social media…


As I am sure you will agree, this is not exactly the kind of publicity any journalism organization would want, but the publisher of this story is actually an organization that prides itself on being dedicated to journalism. Its foundational manifesto specifically states “… one of our roles will be to critique incautious uses of statistics when they arise elsewhere in news coverage.” As commendable a goal as that is, it does give the impression that they would then think twice before publishing a story that refers to ‘media reports’ as ‘discrete events’.

I admit I do not know the precise number of articles devoted to the murders of presidents Lincoln and Kennedy, but when I typed the phrase “president of the united states murdered” into Google, it yielded “about 41,700,000 results”. If we were to take all of those as discrete events, we’d all sincerely believe that being the president of the United States is the deadliest job in the world.

Yes, this veers into the ridiculous, but this is a ridiculous mistake to make, especially for an experienced data journalism organization.
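The difference between counting reports and counting events is easy to show in a few lines. The sketch below is only an illustration: the outlets, dates and places are invented, and treating reports with the same date and place as one event is a simplifying assumption, not the method used by any real news database.

```python
# Hypothetical media reports, invented purely for illustration.
reports = [
    {"outlet": "Outlet A", "date": "2014-01-10", "place": "Town X"},
    {"outlet": "Outlet B", "date": "2014-01-10", "place": "Town X"},
    {"outlet": "Outlet C", "date": "2014-01-10", "place": "Town X"},
    {"outlet": "Outlet A", "date": "2014-02-03", "place": "Town Y"},
]

# Assumption: reports sharing a date and a place describe the same event.
distinct_events = {(report["date"], report["place"]) for report in reports}

print("media reports:  ", len(reports))
print("discrete events:", len(distinct_events))
```

Four reports, two events. The more outlets cover a story, the wider that gap becomes.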

In all fairness to the publisher, however, they did live up to one of the hallmarks of good journalism, namely that of transparency. Rather than pull the article, they owned up to their mistake. The article is still available, but now it starts with an admission of guilt, followed by an apology and a long explanation of the many errors made.


So how come this mistake was made in the first place? Unfortunately, the editors did not explain that. However, the comments that I have read seem to agree on two things:

  1. “Poor proxy variables”.
  2. “Data has no meaning without context.”

Proxy Variables

The blog ‘Adventures in Statistics’ defines a proxy variable as “… an easily measurable variable that is used in place of a variable that cannot be measured or is difficult to measure.” In this instance, the journalist relied on news reports about kidnappings since she could not rely on official police reports. Unfortunately for her, and her editors, she forgot that, paraphrasing Erin Simpson, “All trend analyzing using <a news article database> has to take into account the exponential increase in news stories which generate the data.”

In other words… if you’re using proxy variables, think carefully about whether they really are applicable. In fact, it is probably best to check with both a statistician and an expert in the field you are covering. This will take time, but validating the data is a data journalist’s first responsibility.
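To see why a growing news archive can fool you, here is a minimal sketch of the simplest check: divide the proxy by the total volume of stories. All numbers are invented for illustration, and the function name is my own.

```python
# Hypothetical yearly counts, invented purely for illustration.
total_stories = {2010: 10_000, 2011: 20_000, 2012: 40_000, 2013: 80_000}
kidnapping_reports = {2010: 50, 2011: 100, 2012: 200, 2013: 400}


def reports_per_thousand(reports, totals):
    """Express the proxy as a rate per 1,000 stories in the archive."""
    return {year: 1000 * reports[year] / totals[year] for year in sorted(reports)}


rates = reports_per_thousand(kidnapping_reports, total_stories)
for year, rate in rates.items():
    print(year, kidnapping_reports[year], f"{rate:.1f} per 1,000 stories")
```

In this made-up example the raw count grows eightfold, yet the rate stays flat at 5 per 1,000 stories. The "trend" is nothing more than the archive getting bigger.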

Data without context is meaningless.

The original criticism by Erin Simpson actually provided some good questions. Had they been asked, answered and used in the original analysis, they would have yielded quite an interesting, and validated, story.

  1. “Total number of stories coded for Nigeria over time (what is the shape of that curve)?”
  2. “What are the total number of events generated for Nigeria over time? (What is the shape of that curve?)”
  3. “How does the number of kidnappings compare to the number of coded events? Same shape? Key differences?”
  4. “How many overall events are coded with a specific geolocation? How many get coded to a centroid? (And where is the centroid?)”
  5. “How many kidnapping events are coded with a specific geolocation? Does that change over time?”
  6. “How does this information track with other open source reporting? HRW, UN, WB local NGO crime reporting? Can we corroborate trends?”
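Questions 4 and 5 can be sketched the same way. The example below assumes, hypothetically, that each event record carries a year, a latitude and a longitude, and that events without a precise location were coded to a single country centroid. The coordinates and records are placeholders, not real data.

```python
from collections import defaultdict

# Placeholder coordinates standing in for the fallback "country centroid".
COUNTRY_CENTROID = (10.0, 8.0)

# Hypothetical event records, invented purely for illustration.
events = [
    {"year": 2012, "lat": 10.0, "lon": 8.0},
    {"year": 2012, "lat": 6.45, "lon": 3.39},
    {"year": 2013, "lat": 10.0, "lon": 8.0},
    {"year": 2013, "lat": 10.0, "lon": 8.0},
    {"year": 2013, "lat": 11.85, "lon": 13.16},
    {"year": 2013, "lat": 9.06, "lon": 7.49},
]


def centroid_share_by_year(events, centroid):
    """Return, per year, the fraction of events coded exactly to the centroid."""
    totals = defaultdict(int)
    at_centroid = defaultdict(int)
    for event in events:
        totals[event["year"]] += 1
        if (event["lat"], event["lon"]) == centroid:
            at_centroid[event["year"]] += 1
    return {year: at_centroid[year] / totals[year] for year in sorted(totals)}


shares = centroid_share_by_year(events, COUNTRY_CENTROID)
print(shares)
```

If a large share of the dots on your map sit on the centroid, the map is showing where the coding gave up, not where the kidnappings happened.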

So why didn’t they take the time to ask these questions? Data visualization expert Alberto Cairo offers several suggestions to help data journalism organizations avoid these costly mistakes. In my humble opinion, they all apply to this case.

“Data and explanatory journalism cannot be done on the cheap.”

Traditionally, data journalists worked in large news organizations with an excellent network and many resources. Organizations like the one in this example lack those and would thus struggle to find the required expertise in time for publication.

“Data and explanatory journalism cannot be produced in a rush.”

This was likely the most crucial element in this example. In an environment which needs stories to be produced daily, journalists may well not have the time to stop, think, and verify that the way they have questioned their data set is actually valid.

“Part of your audience knows more than you do.”

Of course, that has always been the case since journalists are not expected to be lawyers, engineers, or physicians. However, the combination of journalistic transparency and public data means readers can verify your conclusions and, if they find fault with them, let you (and the world) know. It is an additional risk that data journalism organizations need to incorporate into their work processes and business models.

“Data journalists cannot survive in a cocoon.”

As professor Paul Bradshaw mentioned in his lecture on “Setting Up ‘Data Newswires'”, the accuracy of the data needs to be checked by asking the following four questions:

  1. Who collected the data?
  2. When did they collect it?
  3. How did they collect it?
  4. Find another source of the same sort of data for comparison.
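The fourth step lends itself to a quick sanity check. The sketch below compares the direction of year-on-year change in two sources and lists the years in which they disagree. Both series are invented, and the function names are my own. It is a rough first filter under those assumptions, not a statistical test.

```python
# Hypothetical yearly counts from two independent sources, invented for illustration.
news_database = {2010: 50, 2011: 100, 2012: 200, 2013: 400}
ngo_reports = {2010: 40, 2011: 45, 2012: 42, 2013: 44}


def yearly_changes(series):
    """Map each year (after the first) to its change from the previous year."""
    years = sorted(series)
    return {later: series[later] - series[earlier]
            for earlier, later in zip(years, years[1:])}


def disagreeing_years(source_a, source_b):
    """List the years in which one source rises while the other does not."""
    changes_a = yearly_changes(source_a)
    changes_b = yearly_changes(source_b)
    shared_years = sorted(changes_a.keys() & changes_b.keys())
    return [year for year in shared_years
            if (changes_a[year] > 0) != (changes_b[year] > 0)]


conflicts = disagreeing_years(news_database, ngo_reports)
print("Years where the sources disagree on direction:", conflicts)
```

A disagreement does not tell you which source is wrong. It tells you where to start asking the first three questions again.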

In other words, the data journalist needs to either know the domain herself or work with someone who does. This could actually work out well for the organization in combination with the previous suggestion. By reaching out to devoted audience members whom you have reason to suspect are experts, one can both increase audience satisfaction and verify the validity of one’s data.

What about time?

In data journalism, you cannot afford a hidden trade-off between time and validity. Once you have gathered a dataset for a story, you always need to take the time to validate your data and make sure you ask the right questions. Not doing so can have disastrous consequences that will make people question the value of your organization.

One potential solution for this problem is to make your audience part of the process. Turn the story into a series that is updated regularly. Start with an introduction that explains what the data set is and what question you want to see answered. Then invite audience members to collaborate with you. If they don’t have the knowledge themselves, they might know someone who does.

Of course, this does require another process of verification: whether these experts are indeed who they claim to be. However, over time you will build up a network of reliable experts you can count on. Better to have them assist before publication than criticize your story after it.

If data journalism cannot afford a hidden trade-off between time and validity, then it is best to be open about it and get people to keep coming back to you.


Bradshaw, Paul (2014). “Setting up ‘Data Newswires’”.

Cairo, Alberto (July 9, 2014). “Data journalism needs to up its own standards”.

Chalabi, Mona (May 13, 2014). “Mapping kidnappings in Nigeria”.

Frost, Jim (September 22, 2011). “Proxy Variables: The Good Twin of Confounding Variables”.

Simpson, Erin (May 13, 2014). “If a data point has no context, does it have any meaning?”

The Functional Art (May 13, 2014). “When plotting data, ask yourself: Compared to what, to whom, to where, and to when?”