- Count something interesting
- Make friends with math
- The joy of text
- How to do a data project
- Create a Github account and publish a document in Markdown (due by next class)
- Critique a piece of data journalism (due by next class)
- List your 10 favorite restaurants in a CSV file (due by next class)
- List 10 news-related Twitter accounts that you find interesting (due by next class)
Every piece of data is a story. So is how and why that data was collected. And when data isn't collected, that too is a story.
From Nate Silver's "What the Fox Knows":
You may have heard the phrase the plural of anecdote is not data. It turns out that this is a misquote. The original aphorism, by the political scientist Ray Wolfinger, was just the opposite: The plural of anecdote is data.
Wolfinger’s formulation makes sense: Data does not have a virgin birth. It comes to us from somewhere. Someone set up a procedure to collect and record it. Sometimes this person is a scientist, but she also could be a journalist.
The missing, speeding cops
In the case of the Sun-Sentinel investigation, the story of how they collected their data for the cop speeds database and analysis – using official toll booth records and then driving around with GPS devices – is interesting enough. But there's the story of the data that wasn't collected – cops pulled over for speeding – because it just didn't exist (for various bureaucratic and institutional reasons and oversights).
In this respect, data journalism has the same roots as any journalism: no dataset has the complete answers within it. And knowing what datasets even exist, and why others simply do not exist, is a result of thorough research and understanding your beat.
Let's watch a video
From 2011, in the state of Florida:
HOLLYWOOD, Fla. – A city of Miami police officer was caught driving 120 mph and a Florida Highway Patrol trooper followed him for several minutes before issuing him a ticket, said FHP.
(A complete video of the chase is here)
Public affairs reporting?
A video of one cop pulling over another is certainly sensational news, but is this a story in the public interest?
Since police are entrusted with enforcing the law, it is in the public interest to know if they are alleged to break the law. Even if it's just this one incident, the public is right to ask: How did this happen?
Here's a TV news report of the incident
From the TV news report, we can speculate that this cop's speeding wasn't a one-time event.
What are the data points we can assume exist, just from this video?
- The time that the officer began the chase
- The location (geocoordinates) of the officer when the chase began
- The time that the suspect finally pulled over
- The location of the stop
- The license plate of the driver
- The age of the driver
- The maximum speed clocked by the arresting officer
- The duration of the stop
- The result of the stop (arrest, citation, nothing)
- The jurisdiction of the arresting officer
- The jurisdiction of the arrested officer
- The speed limit for that road
- The average speed of caught speeders
What other data points might exist?
At the end of the news clip, the state trooper says:
This is not…this is not a first time occurrence with y'all. Y'all come from that way all the time…this Miami Police car…and we never catch it!
Clearly, sightings of this speeding cop car have happened before. So maybe there are past instances when this car and/or this particular driver have been stopped.
What useful data most likely doesn't exist?
On the other hand, the arrested Miami officer is surprised – "I didn't know you were stopping me, officer!" – which indicates that whether or not he sped in the past, being arrested (as a cop, at least) for speeding is a first-time occurrence for him.
We might assume that police officers don't often pull each other over, no matter the speed. If this is the case, then official documentation that Florida cops are speeding needlessly (and illegally) may not exist.
Look to the past
Speeding laws exist for a reason, ostensibly to prevent high-speed accidents. So if police have been frequently speeding, it's possible that over a span of years, they've had their share of unfortunate accidents.
This is one of the data sources the Sun-Sentinel newspaper used in their 3-month investigation of speeding cops:
Database specialist Dana Williams analyzed seven years of accident reports and found that speeding cops in Florida had caused 320 crashes, killing and maiming at least 21 people. Only one officer went to jail – for 60 days.
The victims included a 14-year-old girl killed by a sheriff’s deputy driving twice the speed limit to a routine traffic stop and a college student now severely brain damaged after a police officer slammed into him going 104 mph for no apparent reason.
The crash data provided another angle – that police officers receive special treatment. Of the accidents blamed on police speeding, only 12 percent of the officers were ticketed. By contrast, 55 percent of other motorists who were speeding when they crashed received a citation.
Here's an early graf from their first story, "For cops, no limit":
Speeding cops can kill. Since 2004, Florida officers exceeding the speed limit have caused at least 320 crashes and 19 deaths. Only one officer went to jail — for 60 days.
The third part in their series, Ruined lives, focuses on the victims, both police and bystanders:
Is a history of past tragic incidents enough "data"?
Over the span of 8 years, do fewer than 20 officer-involved fatal crashes indicate a trend? Answering that would require, at a minimum, a comparison against the total number of fatal crashes in Florida over the same 8 years. But even that may not be a direct enough comparison. We would have to know under what conditions and circumstances these fatal crashes took place, as not all of them involved off-duty officers. Given the emergency-related responsibilities of police officers, the conditions under which they drive may be different than for the average citizen.
Over an eight-year time span, 19 deaths and 320 crashes may not seem like a lot, especially if the incidents are spread out over various Florida law enforcement agencies, and if no one is keeping track.
What is the indirect data?
(Note: I'm deliberately presenting a jumbled chronology of the Sun-Sentinel's data-discovery process, going from more obvious to less obvious: the analysis of past accident records may have been done after, or in conjunction with, the data collection described below.)
With no direct evidence, the Sun-Sentinel reporters turned to indirect evidence which, with a little math, would strongly suggest that at least a few police officers were frequently breaking the law as well as their departments' internal regulations:
Video of the trooper chasing and eventually handcuffing the uniformed officer went viral, and the stories drew hundreds of comments. We suspected plenty of other cops were routinely speeding, but how could we document it?
We considered GPS devices in police cruisers, but too few agencies used the technology, and those that did immediately put up a fight about releasing the data.
Then it dawned on us: Florida’s toll system, SunPass, records the date, location and time down to the hundredth of a second when a car passes through a toll booth. If we got those records for police vehicles, we could calculate their speed based on the distance and time it took to go from one toll location to the next.
SunPass officials initially told us the data was not public, but ultimately agreed with our position and released 1.1 million toll transactions for 3,900 South Florida police transponders. Three months and many miles later, we published the results of our investigation (“Above the Law,” sunsentinel.com/speedingcops).
The Sun-Sentinel posted a database of the records that indicate speeding (72,000+ incidents). Here's what that looked like:
It's important to note that at least three of those fields were not in the original SunPass data: travel time, distance, and average speed…which makes sense: why would the SunPass toll booths need to collect that data (even if they could)?
The Sun-Sentinel's solution was simple and intuitive:
The investigation combined technology and data with old-fashioned shoe-leather reporting. Obtaining the SunPass data was just the first step.
To determine how fast the cops were driving, we needed to know the distance between toll booths, and to our surprise, the state did not have precise mileages. We ruled out measuring distances with our car odometers, which can be off for a lot of reasons, and went with the suggestion of traffic engineers – a portable GPS device.
Garmin and other manufacturers make units for joggers and cyclists that fit into the palm of your hand and are accurate to within a few feet. We went with a Garmin Edge that you can pick up on Amazon for $150.
With one person driving and another operating the GPS, we carefully measured each leg of our toll highways, logging a total of 2,500 miles over three counties.
Is this complicated math? No, it is literally grade school arithmetic. But it is a spectacularly clever demonstration of why math is important in our thinking. A simple calculation revealed a truth that was simply ignored in the official recordkeeping.
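To make that arithmetic concrete, here is a minimal Python sketch of the calculation: average speed is the distance between two toll locations divided by the time between the two toll readings. This is not the Sun-Sentinel's actual code. The function name, the timestamp format, and the sample readings are all made up for illustration.

```python
from datetime import datetime


def average_speed_mph(first_reading: str, second_reading: str, distance_miles: float) -> float:
    """Return the average speed, in mph, between two toll readings.

    The timestamp format is an assumption; real toll records may differ.
    """
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    start = datetime.strptime(first_reading, fmt)
    end = datetime.strptime(second_reading, fmt)
    hours = (end - start).total_seconds() / 3600
    if hours <= 0:
        raise ValueError("the second toll reading must come after the first")
    return distance_miles / hours


# Hypothetical readings: 10 measured miles covered in exactly 5 minutes
speed = average_speed_mph("2012-01-01 06:30:00.00", "2012-01-01 06:35:00.00", 10.0)
print(round(speed, 1))  # 120.0
```

Note that this only gives the average speed over the whole leg. A driver who averages 120 mph between two toll booths was going at least that fast at some point along the way.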
A summary of their findings:
Nearly 800 officers from a dozen police agencies drove from 90 to 130 mph during the previous year, often while off-duty and commuting to or from work. Most did not appear to be fighting crime – they were city cops outside their jurisdictions.
For the Miami cop Fausto Lopez caught by the trooper, the SunPass data was especially damning (note what happens to his driving after his arrest):
What is the impact of data journalism?
Faced with the reporting and data analysis by the Sun-Sentinel reporters, Florida agencies launched internal investigations even before the first story was published on Feb. 12, 2012:
The extent of the problem uncovered by the newspaper shocked South Florida’s police brass. All the agencies started internal investigations. “Excessive speed,” Margate Police Chief Jerry Blough warned his officers, is a “blatant violation of public trust.”
In June 2012, the Sun-Sentinel reported that 138 officers statewide had been punished, including 39 Miami-Dade detectives who had lost their take-home cars for a month. The Sun-Sentinel also reported that the police chief said that one or more officers would be fired.
The Sun-Sentinel well-deservedly won the 2013 Pulitzer for Public Service.
Incidentally, this continues to be a public affairs data story: The state trooper who ticketed the Miami cop is suing 25 Florida law enforcement agencies, alleging that they violated the Driver's Privacy Protection Act. The trooper filed a public records request and found that her state license information had been accessed by "88 different officers in 25 different agencies over 200 times in just a three-month span".
Learn to count
An anecdote about the power of simple math combined with similarly simple but careful attention to detail, via Atul Gawande's book, "Better: A Surgeon's Notes on Performance" (p. 254):
MY THIRD ANSWER for becoming a positive deviant: Count something. Regardless of what one ultimately does in medicine—or outside medicine, for that matter—one should be a scientist in this world. In the simplest terms, this means one should count something…
When I was a resident I began counting how often our surgical patients ended up with an instrument or sponge forgotten inside them. It didn’t happen often: about one in fifteen thousand operations, I discovered. But when it did, serious injury could result. One patient had a thirteen-inch retractor left in him that tore into his bowel and bladder. Another had a small sponge left in his brain that caused an abscess and a permanent seizure disorder. Then I counted how often such mistakes occurred because the nurses hadn’t counted all the sponges as they were supposed to or because the doctors had ignored nurses’ warnings that an item was missing. It turned out to be hardly ever.
The numbers began to make sense. If nurses have to track fifty sponges and a couple of hundred instruments during an operation – already a tricky thing to do – it is understandably much harder under urgent circumstances or when unexpected changes require bringing in lots more equipment.
Our usual approach of punishing people for failures wasn’t going to eliminate the problem, I realized. Only a technological solution would—and I soon found myself working with some colleagues to come up with a device that could automate the tracking of sponges and instruments.
A data-reporting project
Every substantial reporting project will be a substantial investment of time. But an obstacle particular to student journalists is a lack of access (at first) to primary sources.
Data alone doesn't solve this problem. But if you keep in mind the record-keeping gaps common to most institutions, and our overall tendency as humans to forget history, then independent data collection (and research, and analysis) is itself an act of journalism, and can also help you get up to speed on a beat.
The majority of the class grade will come from a final project. Throughout the quarter, we'll learn the tools to become more efficient at working with data. But conceiving of data projects, and doing the research and data collection for them, can begin as soon as the first class.
Here's an example of an outstanding data reporting project: The NFL’s Uneven History Of Punishing Domestic Violence by FiveThirtyEight's Allison McCann.
Here's what I liked about it:
- It seeks to find evidence for something that "everyone knows"
- It used existing data sources and improved upon them.
- It explains the data in the context of the NFL's history
- Sign up for a Github account.
- Email me the name of your Github account.
- Create a New Repository and name it however you’d like (some tips here).
- Create a new README.md file. Write anything you like, but do it in Markdown
- Note: This repo will be public. As soon as you email me the name of your Github account, I’ll set you up with a private repo, to which you can post homework assignments. Don’t post homework assignments to your public repo (unless you want to, in which case, bravo).
Find a story that purports to use data, read it, and answer the following questions:
- What are the datasets used here, and where did they come from?
- What is the “birth story” of the data, e.g. how were they created in the first place?
- What claims does the story make based on the data?
- What are the limitations of the data?
- Can you find where this (original) data exists online? If so, post the relevant URLs.
Write this memo in Markdown format and post it to your private Github repo. If this hasn’t been set up for you, then just email it to me.
Here are examples of data-based stories that you can use for this assignment if you don’t want to look for your own:
- Drugging Our Kids - San Jose Mercury News
- Unseen Toll: Wages of Millions Seized to Pay Past Debts - ProPublica
- When Caregivers Harm - ProPublica
- Water’s edge: the crisis of rising sea levels - Reuters
- Children and Guns: The Hidden Toll - New York Times
- Sliver of Medicare Doctors Get Big Share of Payouts - New York Times
- The Undeserving Poor - Reuters
- Stop and Seize - Washington Post
- LAPD misclassified nearly 1,200 violent crimes as minor offenses - Los Angeles Times
- Child-care scams rake in thousands - Milwaukee Journal Sentinel
- ‘Stop-and-Frisk’ Is All but Gone From New York - New York Times
Create a comma-separated values (CSV) file with your 10 favorite restaurants. Include these fields: name,address,city,state,category,yelp_url
Send it to me by email or post it to your private Github repo.
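You can type this file by hand in a text editor or export it from a spreadsheet. If you'd rather generate it with a script, here is a minimal sketch using Python's built-in csv module. The file name restaurants.csv and the sample row are placeholders, not real data. Only the six field names come from the assignment.

```python
import csv

# The six fields required by the assignment, in order
FIELDS = ["name", "address", "city", "state", "category", "yelp_url"]

# Placeholder row for illustration; replace with your own ten restaurants
restaurants = [
    {
        "name": "Example Taqueria",
        "address": "123 Main St, Suite 4",
        "city": "Palo Alto",
        "state": "CA",
        "category": "Mexican",
        "yelp_url": "https://www.yelp.com/biz/example-taqueria",
    },
]

# newline="" lets the csv module control line endings itself
with open("restaurants.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(restaurants)
```

One reason to use a CSV library instead of joining strings yourself: the sample address contains a comma, and the writer wraps that value in quotes so it stays in a single column.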
Make a list of the Twitter account names (or their URLs), one account per line. Send it to me via email or post it on Github as a text file.
An introduction to public affairs reporting and the core skills of using data to find and tell important stories.
- Count something interesting
- Make friends with math
- The joy of text
- How to do a data project
Just because it's data doesn't make it right. But even when all the available data is flawed, we can get closer to the truth with mathematical reasoning and the ability to make comparisons, small and wide.
- Fighting bad data with bad data
- Baltimore's declining rape statistics
- FBI crime reporting
- The Uber effect on drunk driving
- Pivot tables
Learn how to take data into your own hands. There are two kinds of databases: the kind someone else has made, and the kind you have to make yourself.
- The importance of spreadsheets
- Counting murders
- Making calls
- A crowdsourced spreadsheet
Phillip Reese of the Sacramento Bee will discuss how he uses data in his investigative reporting projects.
- Phillip Reese speaks
Mapping can be a dramatic way to connect data to where readers are and to what they recognize.
- Why maps work
- Why maps don't work
- Introduction to Fusion Tables and TileMill
A continuation of learning mapping tools, with a focus on borders and shapes.
- Working with KML files
- Intensity maps
- Visual joins and intersections
The first of several sessions on learning SQL for the exploration of large datasets.
- MySQL / SQLite
- Select, group, and aggregate
- Where conditionals
- SFPD reports of larceny, narcotics, and prostitution
- Babies, and what we name them
The ability to join different datasets is one of the most direct ways to find stories that have been overlooked.
- Inner joins
- One-to-one relationships
- Our politicians and what they tweet
Sometimes, what's missing is more important than what's there. We will cover more complex join logic to find what's missing from related datasets.
- Left joins
- NULL values
- Which Congressmembers like Ellen DeGeneres?
A casual midterm covering the range of data analysis and programming skills acquired so far.
- A midterm on SQL and data
- Data on military surplus distributed to U.S. counties
- U.S. Census QuickFacts
The American democratic process generates loads of interesting data and insights for us to examine, including who is financing political campaigns.
- Polling and pollsters
- Following the campaign finance money
- Competitive U.S. Senate races
With Election Day coming up, we examine the practices of polling as a way to understand various scenarios of statistical bias and error.
- Statistical significance
- Poll reliability
Do your on-the-ground reporting
- No class because of Election Day Coverage
While there are many tools and techniques for building data graphics, there is no magic visualization tool that will make a non-story worth telling.
- Review of the midterm
- The importance of good data in visualizations
- How visualization can augment the Serial podcast
One of the most tedious but important parts of data analysis is just cleaning and organizing the data. Being a good "data janitor" lets you spend more time on the more fun parts of journalism.
- Dirty data
Simon Rogers, data editor at Twitter, talks about his work, how Twitter reflects how communities talk to each other, and the general role of data journalism.
- Ellen, World Cup, and other masses of Twitter data
When the data doesn't directly reveal something obvious, we must consider what its structure and its metadata imply.
- Proxy variables
- Thanks Google for figuring out my commute
- How racist are we, really?
- How web sites measure us
Discussion of final projects before the Thanksgiving break.
Holiday - no class
Holiday - no class
Last-minute help on final projects.
In-class presentations of our final data projects.