Interrogating Data: A Science Writer’s Guide to Data Journalism

A graphic showing vertical bands of color that shift from hues of blue, at left, to red, at right. — The “Warming Stripes” graphic represents, in a simple visual form, the change in average annual temperature (in this case, globally) over the past 100+ years. Ed Hawkins/University of Reading (CC BY 4.0)

In many ways, all science writers are already data journalists. What do science writers do when they report on a newly published study? They dive into the details of the paper’s results; they ask experts for opinions on potential flaws in the methodology; they seek to connect the conclusion to their readers’ lives. Such investigation is driven by a desire to find evidence from the most authoritative sources and present it as clearly as possible. Reporting on data requires the same skill set.

Both data journalism and science writing boil down to “taking something complicated and trying to make it understandable,” explains Sara Chodosh, an assistant editor and graphics producer at Popular Science. Once a data journalist has answered their own questions, they go through the same process with an imagined reader: What can the data tell the audience that will help them grasp a larger pattern or concept?

Data journalism, whether it takes the form of a static visualization, an interactive feature, or simply a bit of additional analysis to add context to a breaking news piece, can bring scientific results to the foreground of a story and make them accessible for readers. Take, for example, the maps in a Reuters article that put Australia’s bushfires into perspective. Marvel at an interactive astronomy chart published by National Geographic that allows readers to explore our solar system’s moons. Explore the Climate Central report on shifting snowfall levels, which invites local journalists and meteorologists to repurpose the data in order to connect changing weather patterns directly to their audiences.

In its simplest definition, data journalism is the practice of using numbers and trends to tell a story. It requires a variety of skills: research to find the correct dataset, analysis to determine what kind of story this dataset may tell, and presentation to share that story with readers. And these skills are within reach for many science writers, even without any programming background: Simply ask questions, and you will find the central tenet of a story.

Research: Choose Your Data

The first step with any data story is finding a dataset to analyze. For science writers, one natural source is the results section of any paper that you believe tells a compelling story. Many scientists release their unanalyzed data on open-access platforms such as Dryad and GitHub, a practice that allows others, whether scientists or journalists, to explore and build upon published results. And even data that are not shared through open-access channels are often available on request.

Either way, the choice to use the results of one particular study in a data story requires careful vetting; consider the authors’ credentials and pore over their methods section before diving in.

Priyanka Runwal, a science writer and data reporter at Climate Central, points out that the process of finding a dataset may depend on the assignment. In some instances, one may have a question in mind (say, “How many Americans have been tested for COVID-19?”) and search for a specific dataset that answers this question. In others, one may come upon an intriguing dataset (say, the Global Health Security Index) and seek to formulate a question from it.

In examining a potential dataset for use in a project, consider whether the data tell a compelling story. Are there evident trends or interesting outliers? Would readers want to explore a figure, or would they prefer to jump ahead to the conclusion? A story explaining a review of biodiversity hotspots, for example, may benefit from a map or chart showing where these habitats are located around the world and how they are threatened by humans. In contrast, focusing heavily on numerical results from different trials in a story about testing for a new medical treatment may distract readers from understanding the qualitative conclusions about what the treatment so far seems to accomplish and the necessary steps to come.

Besides these questions of reader value, consider logistical concerns. Are the data downloadable? Have they been released under Creative Commons licenses? What do all of the data labels represent? Do you understand the study methods, caveats, and implications, or will you need to ask a scientist or press officer for clarification?

Digging for Data

Beyond scientific papers themselves, many public and journalist-friendly data sources exist. Here are a few:

World Health Organization’s Global Health Observatory, a repository for international data on a wide variety of health indicators.
Centers for Disease Control and Prevention (CDC), the U.S.’s central source for health information, including data and info sheets on issues ranging from flu cases to wildfire prevention.
National Oceanic and Atmospheric Administration: National Centers for Environmental Information (NOAA: NCEI), America’s central source for weather and natural-disaster data.
Climate Central, a nonprofit climate research organization that caters to local reporters and meteorologists through its Climate Matters program.
Cochrane Reviews, a repository of medical evidence. (Members of the National Association of Science Writers get free access to this resource.)
Global Biodiversity Information Facility (GBIF), an open-access biodiversity platform hosting over 1 billion species-occurrence records from both institutions and citizen-science platforms.
International Union for Conservation of Nature (IUCN) Red List, a source of endangered-species data; the Red List has an application programming interface (or API), which lets researchers request large amounts of data programmatically rather than downloading files by hand. Journalists can apply for an API key to use the interface.
Data Is Plural, a collection of “useful/curious datasets” collected by BuzzFeed News data editor Jeremy Singer-Vine. Singer-Vine sends out additions to the collection in a free weekly newsletter.
Information Is Beautiful, a publication dedicated to data visualization, has made all the datasets behind its visualizations freely available. These datasets are cleaned and updated as needed, making them easy for aspiring data journalists to explore.
Google’s Dataset Search allows users to search for data on any topic, with easily navigable filters for dataset formats and usage rights.
Tabula, a tool for turning PDFs into data files. DocumentCloud, a similar tool, also boasts an open-source repository of public documents that have gone through this process.
Freedom of Information Act (FOIA) requests, for investigative stories that require journalists to request information from public institutions. The Data Journalism Handbook includes a FOIA primer by investigative journalist Djordje Padejski.

Analysis: Rely on Your Curiosity to Turn a Spreadsheet into a Story

Once you have a dataset, the next step is to find patterns in the numbers. Data analysis can often feel like chipping away at a stone in order to make a sculpture; you may start with a massive spreadsheet and spend days isolating the specific variables or data points that will illustrate a trend to your readers.

You may make this process more targeted by asking questions of your dataset as though it were an interview subject, suggests Peter Aldhous, a science reporter at BuzzFeed News and data journalism professor at the University of California, Santa Cruz, and the University of California, Berkeley. As he says: “What can the data tell me that I want to know?”

Common questions to consider may include: How do you need to clean the data (through standardizing names, changing labels, geocoding, and so on) to ensure that categories match up and all necessary information is present? What role does each variable play in the source study or in other similar datasets? Which variable may be used as an indicator of a larger trend? What analysis is necessary to show that trend—for example, what other variable might you compare to the first, or what groups of data points might you compare to each other?
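
For readers curious about what cleaning looks like in practice, here is a minimal Python sketch, using only the standard library, of standardizing names so that categories match up. The rows, the column names (state, cases), and the alias table are all invented for illustration; they do not come from any dataset mentioned in this article, and a real project would likely need a longer lookup table.

```python
# Hypothetical raw rows: the same state appears under several spellings,
# and one row is missing its case count.
raw_rows = [
    {"state": " new york ", "cases": "1,200"},
    {"state": "NY", "cases": "300"},
    {"state": "Florida", "cases": ""},
    {"state": "florida", "cases": "450"},
]

# Hypothetical lookup table mapping known variants to one standard label.
ALIASES = {"ny": "New York", "new york": "New York", "florida": "Florida"}


def clean_row(row):
    """Return a copy of the row with a standardized name and a numeric count."""
    key = row["state"].strip().lower()
    name = ALIASES.get(key, row["state"].strip())
    text = row["cases"].replace(",", "").strip()
    cases = int(text) if text else None  # None marks a missing value
    return {"state": name, "cases": cases}


def total_by_state(rows):
    """Sum cases per standardized state, skipping rows with missing counts."""
    totals = {}
    for row in map(clean_row, rows):
        if row["cases"] is not None:
            totals[row["state"]] = totals.get(row["state"], 0) + row["cases"]
    return totals


totals = total_by_state(raw_rows)
print(totals)
```

Skipping rows with missing counts is itself an analytic choice, so it belongs in your notes and, eventually, in your methodology.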

Don’t let your curiosity run too wild, though. Aldhous cautions that, like other sources, data can mislead you if you aren’t careful.

A graph showing the COVID-19 cases over different regions of the US, between March and July of 2020. — As COVID-19 cases spike in Florida, Texas, and other southern states, the data-visualization volunteers at The COVID Tracking Project at *The Atlantic* often use regional charts to show how these current outbreaks compare to the U.S.’s most infamous outbreak thus far (in the northeastern U.S. in March and April 2020). Charts on the project’s website are automatically updated daily, along with the project’s database. The COVID Tracking Project (CC BY-NC-4.0)

Duncan Geere, a freelance data journalist and former editor at Information Is Beautiful, puts his warning this way: “Figure out what the data is showing, but also what it’s not showing.” What are the limitations in this dataset, due either to flaws in the methods used to compile the data or to discrepancies between what the data reveal and the story you want to tell? How might you want to filter the data to account for limitations, outliers, or missing pieces? What biases may have been present in the compilation? Closely examining data-collection methods is especially crucial when the data are describing people.
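
One way to start asking what a dataset is not showing is to count its gaps and flag unusual values before doing anything else. The Python sketch below is one possible approach, not a standard recipe: the daily counts are invented, and the cutoff of 1.5 times the interquartile range is a common convention adopted here as an assumption. A flagged value is a prompt to check the source, not proof of an error.

```python
import statistics

# Hypothetical daily counts; None marks days with no report.
daily_counts = [12, 15, None, 14, 13, 250, 16, None, 11, 14]


def summarize_gaps_and_outliers(values):
    """Count missing entries and flag values outside the 1.5 * IQR fences."""
    present = [v for v in values if v is not None]
    # quantiles(n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]
    return {"missing": len(values) - len(present), "outliers": outliers}


report = summarize_gaps_and_outliers(daily_counts)
print(report)
```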

Geere suggests writing down aspects of a dataset that you find interesting, as well as questions that come up, as you explore the data. “I reason that, if I find this particular aspect of the data interesting, then my audience will as well,” he says.

It may take some time to home in on what variable or trend from a dataset tells the most compelling story. Runwal leans into this exploration, she says. “For me, it requires patience, and eyeballing numbers for a while to actually make sense of them.” To this end, you may test several different methods of filtering or analyzing your data before deciding which focus will be most informative for your readers.

Patience is also key in the analysis process because code (even a supposedly simple Excel formula) often breaks. When that happens, online resources abound: forums such as the National Institute for Computer-Assisted Reporting (NICAR) listserv, Stack Overflow, and even social media sites can help you solicit advice from more experienced data reporters. Geere recommends the Data Visualization Society, which boasts an active Slack server including both journalists and visualization experts from other fields.

Finally, just as responsible writers record interviews and save their notes, responsible data journalists keep careful track of every step in their analysis. You want your work to be reproducible, both by other people in your newsroom—data journalists aren’t safe from copy-editing and fact-checking—and by readers.

As Sam Leon, data-investigations lead at the international NGO Global Witness, explains in a chapter of the Data Journalism Handbook on methodologies, data can easily be “distorted and mis-represented” through errors at analysis stages. Such errors can range from a typo introduced while cleaning data to an analytic choice that misrepresents correlation as causation. (See “Resources for Data Journalism Novices” for a list of popular programs, ranging from programming platforms to free online services for building graphics.)

JPL mission history infographic with colorful lines arcing from illustrations of planets. — To advertise a new website inviting NASA fans to make their own space-themed infographics, the Jet Propulsion Laboratory (JPL) showcased a graphic of their own. The chart explores JPL mission history with a colorful time series. NASA/JPL-Caltech

Presentation: What Should Readers Take Away from a Data Story?

Just as your questions can drive your data analysis, potential questions from your audience can drive your presentation. “Good science communication thinks about its audience,” Geere says. Good data visualizations do the same; they tell a story that the audience will be able to follow, whether that audience is highly science-literate readers of a trade publication or young readers of an educational site.

Geere outlines the basics of storytelling through data in a blog post: Like any other story, he explains, visualizations need a beginning (an entry point), a middle (answers to readers’ key questions), and an end (a final takeaway for the reader, whether this is a better understanding of a scientific issue or a connection to their own life).

Different visualization formats can highlight different aspects of a dataset. Kaiser Fung, data-science expert and founder of the blog Junk Charts (which highlights errors in data visualization in the media), lays out some ground rules in an article for the Data Journalism Handbook site. Pie charts (if they can’t be avoided) should be designed with careful consideration to color and order of sections, as readers’ eyes will be drawn to the largest sections. Bar charts and dot plots allow for easy comparison between groups. Scatter plots call attention to trends, and regression lines may be added to guide readers’ interpretation.
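
For those wondering where a regression line comes from, the Python sketch below fits a straight line to paired values by ordinary least squares, the usual method behind the trend line on a scatter plot. The function name fit_line and the numbers are invented for illustration and are not real measurements; charting tools typically compute this line for you.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-style sum of x and y, and the spread of x around its mean.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical values: a measurement taken every five years.
years = [2000, 2005, 2010, 2015, 2020]
readings = [0.40, 0.55, 0.60, 0.80, 0.95]

slope, intercept = fit_line(years, readings)
print(f"Change per year: {slope:.3f}")
```

The slope is the average change in the reading per year. A fitted line describes a trend in the data; it does not by itself show that one variable causes the other.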

But there are more ways to present data than in static charts. In recent years, data reporters have increasingly sought out new ways of making their stories interactive, from Johns Hopkins’s COVID-19 tracker, which shows the virus’s global spread, to Stacker’s data-based slideshows, which add photos and context to each figure in the datasets upon which they rely. (Disclosure: Stacker is my employer.) Interactive features can help readers narrow or broaden the scope of a story according to their interest, to see how the data directly apply to them. And such features do not necessarily require extensive coding, either; searchable databases, for example, which are essentially public spreadsheets hosted by journalistic organizations, are a useful tool for readers to find specific information and do their own research.

However you present your data work, though, one guideline is always relevant: Make it simple. “I am always striving to make things that you can look at and immediately, as clearly as possible, understand what you’re being shown,” Aldhous says.

Chodosh agrees, noting that the more data are pared down, the easier it is to follow a story. Ensure that readers can follow one variable or one group of values at a time, and test your visualization by showing it to colleagues who aren’t familiar with the data. Simple color schemes, large text, and clear captions can also help make visualizations more accessible to readers who may otherwise have trouble following them.

In addition to considering your audience in the presentation of your data itself, consider your audience in writing a methodology section. A methodology can be a direct link to your code, a precise series of steps, or simply a paragraph at the end of your article. The complexity and location of your methodology section should depend on your audience: How much do you anticipate that this audience will want to understand precisely how you arrived at your conclusions?

In its most basic form, a methodology section should include a clear link to your data source and the major steps you took to analyze the data, written in simple language without jargon, as well as any caveats or major exceptions.

Several free resources can help you build data visualizations without coding; see the resource list at the end of the article. (But fair warning: If you travel down the path into the world of data reporting, you may find yourself seeing coding as a means of accomplishing more complex and more customizable presentations. For more coding resources, check out NICAR, as well as data journalism courses on Coursera, Codecademy, and the Northeastern University School of Journalism’s Storybench publication.)

Resources for Data Journalism Novices

Here are some popular programs that data journalists use for analysis:

Microsoft Excel is the classic software for organizing data; formulas, filters, and pivot tables can be used to easily clean and pare down large datasets. Microsoft has extensive support forums, and blog posts such as this list of ten key Excel functions to know, by Adept Marketing’s Katie Cunningham, can help you get started.
Google Sheets has many of the same organization and formula capacities of Excel, but it’s free and hosted online. This online hosting makes Google Sheets a preferred platform for collaboration on data projects, whether you’re working on an analysis with a colleague or linking multiple spreadsheets together.
RStudio is a popular program for running R, a programming language used by scientists and data reporters alike for conducting exploratory analysis and making graphics; there are endless examples of visualizations built with the code package ggplot2. RStudio is free and open-source, and resources such as .Rddj and R for Journalists can help you get started.
The Jupyter Notebook is an open-source program, run through your web browser, that can support R, Python, and other programming languages. In a Jupyter Notebook, you can scrape data from the web, analyze it, and create visualizations; this program is preferred by some journalists because Notebooks (as the program’s files are called) may be easily shared and reproduced. To see the potential of this program and try it out yourself, check out this GitHub collection of tutorials and analysis examples.
QGIS is an open-source platform for analysis that involves Geographic Information Systems, or GIS. In QGIS, you can geocode data points, analyze them based on location, and build charts to present that analysis. This article by Canadian Broadcasting Corporation data journalist Jacques Marcoux further explains the potential of GIS, and this free course by the University of Texas’s Knight Center for Journalism can help you get started.
MySQL is an open-source database platform that runs SQL (pronounced “sequel”), a popular (though older) query language for coders who deal with large databases. The program allows users to import data, then run queries on specific variables, filter data, and find patterns; see this free course module from Stanford for more information or test out the language with the SQL Murder Mystery.
OpenRefine is an open-source, Java-based program that allows users to clean messy datasets, analyze them, and reconcile data with web services such as taxonomic databases and Wikidata, all without writing code themselves. You can export your workflow as a JSON file, making it easy to share and reproduce projects. This DigitalNomad tutorial can help you get started.
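
To give a sense of what running a query looks like, here is a minimal sketch of the import, filter, and count pattern described in the SQL entry above. It uses SQLite, which ships with Python, as a stand-in for MySQL; the SQL itself would read much the same in either. The table name, column names, and records are invented for illustration.

```python
import sqlite3

# In-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sightings (species TEXT, country TEXT, year INTEGER)")

# Hypothetical records standing in for an imported dataset.
conn.executemany(
    "INSERT INTO sightings VALUES (?, ?, ?)",
    [
        ("Panthera onca", "Brazil", 2018),
        ("Panthera onca", "Brazil", 2019),
        ("Panthera onca", "Mexico", 2019),
        ("Ara macao", "Brazil", 2019),
    ],
)

# Filter to one year, then count records per country.
query = """
    SELECT country, COUNT(*) AS n
    FROM sightings
    WHERE year = ?
    GROUP BY country
    ORDER BY n DESC, country
"""
rows = conn.execute(query, (2019,)).fetchall()
print(rows)
conn.close()
```

Saving the query text alongside your notes is a simple way to keep this step of an analysis reproducible.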

Here are some online programs you can use to build data visualizations:

Datawrapper, an online platform (with free and premium versions) that can generate charts, tables, and graphs without coding.
Tableau Public, a free publication service for data visualizations that also features a blog, sample data, how-to videos, and other resources.
FlowingData, a blog run by data-viz expert Nathan Yau that features exceptional data visualization projects as well as tutorials and courses for members.
The R Graph Gallery, a collection of visualizations and instructive tools focusing on the R packages tidyverse and ggplot2.
Data-Driven Documents (or D3), a JavaScript library that allows users to build interactive visualizations in a web browser. Mike Bostock’s tutorial on Observable is a good place to start.
Geojournalism, a collection of tutorials and examples geared towards helping environmental journalists use geographic data.

Betsy Ladyzhets is a data journalist and science writer based in Brooklyn, New York. She is a research associate at Stacker, where she manages the publication’s Science and Lifestyle verticals. She’s also a member of the National Association of Science Writers and a volunteer for the COVID Tracking Project. Find her on Twitter @betsyladyzhets, and check out her newly minted newsletter, the COVID-19 Data Dispatch.