The Path to Hell is Paved with Poor Assumptions

Validating data may cost time, but refraining from it will cost more.

How wonderful is the life of a data journalist. There is so much data publicly available that whenever you are in need of a story, all you need to do is go to an interesting data repository and start questioning it and, lo and behold… you have a story. Now all you need to do is write it down in clear, readable prose and maybe throw in some exciting visuals, and you’ve got something truly exciting that will surely get people talking.

Take this story, for example: “Mapping Kidnappings in Nigeria”. It was published shortly after Boko Haram kidnapped over 300 schoolgirls, the event that started the international “Bring Back Our Girls” campaign. It features a nifty interactive map that shows how the number of kidnappings rose rapidly in the past decade. Naturally it created a lot of buzz on social media…


As I am sure you will agree, this is not exactly the kind of publicity any journalism organization would want, but the publisher is actually an organization that prides itself on being dedicated to journalism. Its foundational manifesto specifically states “… one of our roles will be to critique incautious uses of statistics when they arise elsewhere in news coverage.” As commendable a goal as that is, it does give the impression that they would think twice before publishing a story that refers to ‘media reports’ as ‘discrete events’.

I admit I do not know the precise number of articles devoted to the murders of presidents Lincoln and Kennedy, but when I typed the phrase “president of the united states murdered” into Google, it yielded “about 41,700,000 results”. If we were to take all of those as discrete events, we’d all sincerely believe that being the president of the United States is the deadliest job in the world.

Yes, this veers into the ridiculous, but this is a ridiculous mistake to make, especially for an experienced data journalism organization.
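The difference between counting reports and counting events can be made concrete with a few lines of Python. This is only a minimal sketch: the reports, town names and outlets are invented for illustration, and grouping by date and place is a deliberately naive stand-in for the much harder work of deciding which reports describe the same event.

```python
from collections import Counter

# Hypothetical news reports, invented for illustration: (event date, place, outlet).
reports = [
    ("2014-01-05", "Town X", "Outlet A"),
    ("2014-01-05", "Town X", "Outlet B"),
    ("2014-01-05", "Town X", "Outlet C"),
    ("2014-02-11", "Town Y", "Outlet A"),
    ("2014-02-11", "Town Y", "Outlet B"),
]

# Assumption: reports that share a date and a place cover the same event.
# Real deduplication needs far more care than this simple key.
reports_per_event = Counter((date, place) for date, place, _outlet in reports)

print(f"{len(reports)} reports describe {len(reports_per_event)} distinct events")
```

Counting rows would give five “kidnappings” here; counting distinct events gives two.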

In all fairness, however, they did live up to one of the hallmarks of good journalism, namely transparency. Rather than pull the article, they owned up to their mistake. The article is still available, but now it starts with an admission of guilt, followed by an apology and a long explanation of the many errors made.


So how come this mistake was made in the first place? Unfortunately, the editors did not explain that. However, the comments that I have read seem to agree on two things:

  1. “Poor proxy variables”.
  2. “Data has no meaning without context.”

Proxy Variables

The blog ‘Adventures in Statistics’ defines a proxy variable as “… an easily measurable variable that is used in place of a variable that cannot be measured or is difficult to measure.” In this instance, the journalist relied on news reports about kidnappings since she could not rely on official police reports. Unfortunately for her, and her editors, she forgot that, paraphrasing Erin Simpson, “All trend analyzing using <a news article database> has to take into account the exponential increase in news stories which generate the data.”

In other words… if you’re using proxy variables, think carefully about whether they really are applicable. In fact, it is probably best to check with both a statistician and an expert in the field you are covering. This will take time, but validating the data is a data journalist’s first responsibility.
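To see why growth in overall news volume matters, here is a minimal Python sketch. All the numbers are invented for illustration and do not come from any real database; the idea is simply to express the kidnapping reports as a share of all reports per year before reading a trend into them.

```python
# Hypothetical yearly counts, invented purely for illustration.
kidnapping_reports = {2010: 20, 2011: 40, 2012: 80, 2013: 160}
total_reports = {2010: 1_000, 2011: 2_000, 2012: 4_000, 2013: 8_000}


def share_per_year(subset, total):
    """Return the subset count as a fraction of the total count for each year."""
    return {year: subset[year] / total[year] for year in sorted(subset)}


shares = share_per_year(kidnapping_reports, total_reports)
for year, share in shares.items():
    print(f"{year}: {kidnapping_reports[year]:>4} reports, {share:.1%} of all coverage")
```

In this made-up example the raw count rises eightfold, yet the share of coverage stays at 2.0% every year: the apparent trend is entirely explained by there being more news stories overall.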

Data without context is meaningless.

The original criticism by Erin Simpson actually provided some good questions. Had they been asked, answered and used in the original analysis, they would have yielded quite an interesting, and validated, story.

  1. “Total number of stories coded for Nigeria over time (what is the shape of that curve)?”
  2. “What are the total number of events generated for Nigeria over time? (What is the shape of that curve?)”
  3. “How does the number of kidnappings compare to the number of coded events? Same shape? Key differences?”
  4. “How many overall events are coded with a specific geolocation? How many get coded to a centroid? (And where is the centroid?)”
  5. “How many kidnapping events are coded with a specific geolocation? Does that change over time?”
  6. “How does this information track with other open source reporting? HRW, UN, WB local NGO crime reporting? Can we corroborate trends?”

So why didn’t they take the time to ask these questions? Data visualization expert Alberto Cairo offers several suggestions to help data journalism organizations avoid making these costly mistakes. In my humble opinion, they all apply to this case.

“Data and explanatory journalism cannot be done on the cheap.”

Traditionally, data journalists worked in large news organizations with an excellent network and many resources. Organizations like this one lack those and would thus struggle to find the required expertise in time before publication.

“Data and explanatory journalism cannot be produced in a rush.”

This was likely the most crucial element in this example. In an environment which needs stories to be produced daily, journalists may well not have the time to stop, think, and verify that the way they have questioned their data set is actually valid.

“Part of your audience knows more than you do.”

Of course, that has always been the case since journalists are not expected to be lawyers, engineers, or physicians. However, the combination of journalistic transparency and public data means readers can verify your conclusions and, if they find fault with them, let you (and the world) know. It is an additional risk that data journalism organizations need to incorporate into their work processes and business models.

“Data journalists cannot survive in a cocoon.”

As professor Paul Bradshaw mentioned in his lecture on “Setting Up ‘Data Newswires’”, the accuracy of the data needs to be checked by working through the following four checks:

  1. Who collected the data?
  2. When did they collect it?
  3. How did they collect it?
  4. Find another source of the same sort of data for comparison.

In other words, the data journalist needs to either know the domain herself or work with someone who does. This could actually work out well for the organization in combination with the previous suggestion. By reaching out to devoted audience members whom you have reason to suspect are experts, one can both increase audience satisfaction and verify the validity of one’s data.
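Bradshaw’s fourth check, comparing against another source of the same sort of data, can be sketched roughly as follows. The two series and the 50% tolerance are assumptions made up for this illustration; what counts as “close enough” is exactly the kind of judgment a domain expert should weigh in on.

```python
# Hypothetical yearly counts from two independent sources, invented for illustration.
news_based = {2011: 50, 2012: 90, 2013: 170, 2014: 320}
ngo_based = {2011: 48, 2012: 55, 2013: 51, 2014: 60}


def growth_factor(series):
    """Return the last year's count divided by the first year's count."""
    years = sorted(series)
    return series[years[-1]] / series[years[0]]


def corroborates(a, b, tolerance=0.5):
    """Return True when the two growth factors differ by at most `tolerance`,
    measured relative to the larger factor. The default is an arbitrary choice."""
    growth_a, growth_b = growth_factor(a), growth_factor(b)
    return abs(growth_a - growth_b) / max(growth_a, growth_b) <= tolerance


print(f"news-based growth: x{growth_factor(news_based):.2f}")
print(f"other source growth: x{growth_factor(ngo_based):.2f}")
print("trend corroborated:", corroborates(news_based, ngo_based))
```

With these invented numbers the news-based series grows more than sixfold while the second source barely moves, so the check fails. That does not tell you which source is right, only that the trend needs explaining before it is published.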

What about time?

In data journalism, you cannot afford a hidden trade-off between time and validity. Once you have gathered a dataset for a story, you always need to take the time to validate your data and make sure you ask the right questions. Not doing so can have disastrous consequences that will make people question the value of your organization.

One potential solution for this problem is to make your audience part of the process. Turn the story into a series that is updated regularly. Start with an introduction which explains what the data set is and what question you want to see answered. Then invite audience members to collaborate with you. If they don’t have the knowledge themselves, they might know someone who does.

Of course, this does require another process of verification: whether these experts are indeed who they claim to be. However, over time you will build up a network of reliable experts you can count on. Better to have them assist before publication than criticize your story after it.

If data journalism cannot afford a hidden trade-off between time and validity, then it is best to be open about it and get people to keep coming back to you.


Bradshaw, Paul (2014). “Setting up ‘Data Newswires’”.

Cairo, Alberto (July 9, 2014). “Data journalism needs to up its own standards”.

Chalabi, Mona (May 13, 2014). “Mapping kidnappings in Nigeria”.

Frost, Jim (September 22, 2011). “Proxy Variables: The Good Twin of Confounding Variables”.

Simpson, Erin (May 13, 2014). “If a data point has no context, does it have any meaning?”

The Functional Art (May 13, 2014). “When plotting data, ask yourself: Compared to what, to whom, to where, and to when?”


When Worlds Collide in Journalism

The need for fact checking in journalism only grows as we become more aware of the world around us.

We live in the middle world, at least according to Richard Dawkins. Our brains have evolved to make sense of our direct environment and cannot fathom the rules of the cosmos or the microscopic. That’s what makes big data so fascinating, because it reveals patterns that challenge our intuition. We may be able to explain parts of it, but the entire picture? That is far too complex for us to understand, let alone explain.

That was the message I took home with me last Tuesday after attending the SAP big data college tour in Eindhoven. I’m fairly sure that wasn’t their main point – the benefits of working for SAP were mentioned a couple of times – but it nevertheless made the biggest impression. And it got me thinking as to how this applies to journalism.

We live in the middle world, but ours is not the same as that of our ancestors. In her critique of Miller’s article “Kindness, Fidelity and other Sexually Selected Virtues”, Catherine Driscoll states that she finds it difficult to believe that a sense of ethics could evolve as a sexual signal, because we all know how suitors may lie in order to present a false picture of themselves. Miller replied that our distant ancestors lived in small, isolated tribes, which made hiding your true self quite challenging. The critic was thus judging what happened in the past with her own mindset and was completely unaware of this.

Our own parents grew up in a world recovering from a world war, a world without home computers, but also a world in which news played an important role in everybody’s lives. Journalism became a force to be reckoned with in the 20th century, and reliable news agencies and journalists were greatly respected. Nevertheless, if you visit your parents this weekend and ask for newspaper clippings from their childhood, you’ll likely be disappointed by the ‘mundane’ reporting. Journalists, and parents, are a product of their world, and what they considered important and appropriate news may not be what journalists and young adults in our world believe it to be.

This does not make our parents small-minded, nor does it make us broad-minded, much as we’d like to think so. Generally speaking, we have become more aware of the world beyond our borders and how international developments may influence our own lives and vice versa. But we grew up in a world where globalisation has become the norm.

Note how I keep using the term ‘world’ here, not ‘time’. Different though they might be, our world is much more similar to our parents’ than to that of a poor farmer in Niger who risked his life trying to get to Europe last year. His story is told in the German periodical Der Spiegel (the article is in English), and I highly recommend you check it out. It is a beautifully written description of a world completely different from our own, but which nevertheless interacts with ours in ways that we are mostly unaware of.

According to the Columbia Journalism Review, Der Spiegel is “home to what is most likely the world’s largest fact checking operation.” Back in 2010, it had 80 full-time positions for fact-checkers, most of whom were consulted during or even before writers started on their articles. And when you see the type of articles that are presented here, you can understand why. The story blends vivid witness accounts with dry facts in a way that both moves and educates the reader. Without facts, a sceptic would write it off as a sob story, and without the witness accounts, particularly the last line, it wouldn’t have the same punch.

In Chapter 2 of the Verification Handbook, Verification Fundamentals: Rules to Live By, Steve Buttry states that journalists need to ask two questions when verifying stories:

  • How do they know that?
  • How else do they know that?

These days, people will often point to sources on the Internet to answer these questions. That is why a good journalist tries to locate the original source, as detailed by Claire Wardle in her chapter Verifying User-Generated Content, also in the Verification Handbook. According to her, there are four elements of content to check and confirm:

  • Provenance: Is this the original piece of content?
  • Source: Who uploaded the content?
  • Date: When was the content created?
  • Location: Where was the content created?

Personally, I believe another element needs to be examined as well:

  • Why do people check out this content?

Journalists are paid to take the time to examine the four elements, but ordinary people will often quote, link, or share a site because of the trust they (or their peers) put in it. And the reason for their trust likely lies in their world view, even if they themselves are unaware of this. Journalists, however, do need to be more aware of this, and try to explain it to their own audience.

The world has become more complex, for journalists as well as their audience. Before, journalists ‘just’ needed to be experts in fact checking, interviews, and investigations. Now they are not only faced with much more data than in earlier decades, but they also need to be aware of the meta-data and share this with their audience. An audience that is frequently unwilling to accept that the big world they now live in is in fact made up of a network of small, interconnected worlds.



  • Buttry, Steve (2014). “Verification Fundamentals: Rules to Live By”. Verification Handbook.
  • Driscoll, Catherine (2007). “Why Moral Virtues are Probably not Sexual Adaptations”. Moral Psychology volume 1 The evolution of Morality: Adaptations and Innateness.
  • Hage, Willem van (November 4, 2014). Big Data College Tour at Eindhoven.
  • Miller, Geoffrey (2007). “Kindness, Fidelity and other Sexually Selected Virtues”. Moral Psychology volume 1 The evolution of Morality: Adaptations and Innateness.
  • Goos, Hauke; Riedmann, Bernhard (October 21, 2014). “Death in the Sahara: An Ill-Fated Attempt to Reach Fortress Europe”. Der Spiegel.
  • Silverman, Craig (April 9, 2010). “Inside the World’s Largest Fact Checking Operation”. Columbia Journalism Review.
  • Wardle, Claire (2014). “Verifying User-Generated Content”. Verification Handbook.