COMM 273D | Fall 2014

Tuesday, September 30

DIY Databases

Learn how to take data into your own hands. There are two kinds of databases: the kind someone else has made, and the kind you have to make yourself.

Topics

The importance of spreadsheets
Counting murders
Making calls
A crowdsourced spreadsheet

Homework

Create a data memo for your beat (due by next class)
Read Phillip Reese's stories and write a data critique of them (due by next class)
Practice pivot tables with homicide data (due by next class)


Relevant Daily Show clip: A Shot in the Dark


Samantha Bee attempts to uncover statistics about the excessive use of lethal force by the police, only to discover that this data is mysteriously nonexistent. (7:09)

It would be nice if all of the data which sociologists require could be enumerated because then we could run them through IBM machines and draw charts as the economists do. However, not everything that can be counted counts, and not everything that counts can be counted. – William Bruce Cameron, “Informal Sociology: A Casual Introduction to Sociological Thinking”

If there is one correlation we've seen so far, it's the unfortunate inverse relationship between the importance of a dataset and its reliability. We have solid data on how much we tweet per second, what we pay for a cab, and how often we pass through a subway turnstile, but not enough on the occurrence of sexual assaults, officer-involved homicides (or speeding), or the size of the homeless population.

This is not so different from life in general, where the more important something is, the harder it is to get. If data is the record of an observation, then we should expect that the more significant an observation, the more significant its political, bureaucratic, and personal implications.

Homicide Watch

In many cases, detectives and officers do not know what happens to a case once an arrest is made. They never discover the outcome.

From Laura Amico, founder of Homicide Watch

Homicide Watch goes beyond collecting the names and demographics of victims and suspects. It also tracks the resolution of each case: when the suspects were arrested, what they were charged with, and eventually, whether they were convicted. The judicial process can take years and often falls beyond the scope of what the police collect, which is why there is no official database connecting arrests to convictions.

Collecting the data herself allows Amico to fact-check officials independently.

The database also tells me where my holes are. About once a week I go through the database and I look for what’s missing: obits, photos, times, anything, and I try to fill in what’s missing. Doing that means I’m always ready to slice and dice the numbers at a moment’s notice. Just recently we had three homicides in one night. As I was writing the story it occurred to me that, combined with deaths earlier in the week, that week might have the most deaths of any this year. I was able to prove that in just minutes and get that story up. I haven’t seen any other reporter with a database robust and agile enough to do that. - An interview
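The "most deaths of any week this year" check described above can be sketched in a few lines once each death has a date attached. This is only an illustration: the dates below are made up, and grouping by ISO week is an assumption about how "week" is defined, not a description of how Homicide Watch does it.

```python
from collections import Counter
from datetime import date

# Hypothetical records: one date per homicide (illustrative values, not real data).
homicide_dates = [
    date(2014, 9, 1),
    date(2014, 9, 3),
    date(2014, 9, 16),
    date(2014, 9, 18),
    date(2014, 9, 18),
    date(2014, 9, 18),
]


def deaths_per_week(dates):
    """Count deaths per ISO (year, week number) pair."""
    counts = Counter()
    for d in dates:
        iso = d.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return counts


weekly = deaths_per_week(homicide_dates)
# most_common(1) returns the single week with the highest count.
worst_week, worst_count = weekly.most_common(1)[0]
print("Deadliest week:", worst_week, "with", worst_count, "deaths")
```

The point is less the code than the precondition: the question takes minutes to answer only because every case was already entered with a date.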

What running Homicide Watch has taught me about crime in America

On the Media interview

Interview on Prosecutor's Discretion

Innovation in Journalism Goes Begging for Support

Sage wisdom

There are two kinds of databases: those that you create yourself and those that were created by somebody else, usually without your convenience in mind.

Meyer, Philip (2002-02-25). Precision Journalism: A Reporter's Introduction to Social Science Methods (p. 191).

I cover the point of CSV and spreadsheets in this tutorial: Why Spreadsheets?

Homicide and shooting data

Slate is providing its year of gun deaths as a CSV, which you can download here.

Fatal Encounters, which is attempting to record all instances of officer-related deaths, has posted their spreadsheet here.

Deadspin also launched a police-involved shooting database project, covering the last three years of U.S. incidents. Their Google spreadsheet of submitted incidents can be found here.

Another hand-compiled database: 10 years of arrest-related death records in New Jersey, compiled by NJ Advance Media.

The difficulty in categorizing deaths

via: Of Course Tamerlan Tsarnaev Is on Slate’s List of People Killed With Guns

The death of Tamerlan Tsarnaev illustrates how difficult it can be to count gun-related deaths.

For starters, when such a list is used to make a point about "gun violence victims", some would (strongly) disagree with Tsarnaev being characterized as a victim.
Even as a shooting-related death, law-enforcement-involved or not, Tsarnaev's death is ambiguous. He was hit by police gunfire, but he was also run over by the SUV driven by his brother. His death certificate states the cause of death as “gunshot wounds of torso and extremities and blunt trauma to head and torso.” Slate's editor told The Atlantic that their dataset includes "killed-by-gun-plus-other-thing" incidents.
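One way to see why the categorization matters: if each record lists every contributing cause instead of a single category, the "gun death" total depends on the counting rule you choose. The sketch below uses hypothetical cases and field names; it is not how Slate structures its data.

```python
# Hypothetical cases; each lists every contributing cause rather than one category.
cases = [
    {"id": "case-1", "causes": ["gunshot"]},
    {"id": "case-2", "causes": ["gunshot", "blunt trauma"]},
    {"id": "case-3", "causes": ["stabbing"]},
]


def count_gun_deaths(records, include_mixed=True):
    """Count deaths involving a gunshot.

    With include_mixed=True, "gun plus other thing" deaths are counted.
    With include_mixed=False, only deaths whose sole cause is a gunshot count.
    """
    total = 0
    for record in records:
        causes = record["causes"]
        if "gunshot" not in causes:
            continue
        if include_mixed or causes == ["gunshot"]:
            total += 1
    return total


inclusive = count_gun_deaths(cases, include_mixed=True)
strict = count_gun_deaths(cases, include_mixed=False)
print("Inclusive count:", inclusive)
print("Strict count:", strict)
```

Neither number is wrong. What a reader needs is for the dataset to state which rule it used.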

We'll cover more of these kinds of ambiguities when reading "Final Forms", by Kathryn Schulz.

Independent counting

The Homicide Watch project, founded by Laura and Chris Amico:

From [Homicide Watch](http://contentsmagazine.com/articles/homicide-watch-an-interview/):

Homicide Watch is sometimes used as an example of “data-driven” journalism, but I have to say, it doesn’t feel like a database or the kind of analytical, number-crunching project I associate with that term. It’s a very human-centric site. How does data enable what you do as a reporter? And conversely, how do your editorial choices shape your use of data?

Chris: In a lot of conversations about data in journalism, “data” is synonymous with “numbers.” We do plenty of numbers stories, but when we talk about data, we’re talking about organizing the information a reporter gathers into a regular and recurring structure. There are patterns in beat reporting, and we can build software that knows what we’re usually looking for, and that will tell you what you’re missing. For example, every murder victim has a name, age, race, gender, place of death (at the scene or at the hospital), cause of death (shooting, stabbing), and so on. We know when suspects are arrested, for which case, and how those cases end. That’s all data. What we’ve done that’s different, I think, is woven the data-centric parts of the site into the narrative parts. There’s really no separation. If the site works as it should, a user can land on a story and immediately see which victims and suspects are involved, understand the backstory, and where the case stands. Users shouldn’t have to guess about these things.
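The idea of software that "will tell you what you're missing" can be sketched as a fixed list of expected fields checked against each record. The field names below are taken loosely from the interview, but their exact spelling, and the sample records, are assumptions for illustration only.

```python
# Fields mentioned in the interview; the exact names here are hypothetical.
EXPECTED_FIELDS = [
    "name",
    "age",
    "race",
    "gender",
    "place_of_death",
    "cause_of_death",
]


def missing_fields(record):
    """Return the expected fields that are absent or blank in a record."""
    return [field for field in EXPECTED_FIELDS if record.get(field) in (None, "")]


# Hypothetical victim records (not real data).
victims = [
    {
        "name": "Victim A",
        "age": 34,
        "race": "white",
        "gender": "male",
        "place_of_death": "hospital",
        "cause_of_death": "shooting",
    },
    {
        "name": "Victim B",
        "age": None,
        "gender": "female",
        "place_of_death": "",
        "cause_of_death": "stabbing",
    },
]

# Print a to-do list of gaps for the reporter to fill in.
for victim in victims:
    gaps = missing_fields(victim)
    if gaps:
        print(victim["name"], "is missing:", ", ".join(gaps))
```

A spreadsheet does the same job with less ceremony: one column per expected field, and blank cells are the to-do list.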

Why timely, reliable data on mass killings is hard to find

Homework

Create a data memo for your beat

Due by next class
Similar to what you had to do for your other classes, draft a memo focused on the kinds of data that you know exist on your beat, that you hope to find, or that you plan to collect yourself.

List five different sources (actual or hopeful) as well as:
- Why this data source is interesting and what you’re curious about.
- Where this data exists, or where you expect to find it
- Anticipated problems in collecting or analyzing it
This is not a final draft. I just want you to start thinking of data as soon as possible, and we’ll work on refining and researching the possibilities in the next couple of weeks. Mostly, this is less about your grasp of data work than about the research you’ve done so far on your beat.

How to submit

In your private GitHub repo, create a new file named “draft-data-beat-memo.md” (yes, it will be a Markdown file).

Read Phillip Reese's stories and write a data critique of them

Due by next class
We’re having a guest speaker on Tuesday: Phillip Reese, the computer-assisted reporting genius at the Sacramento Bee. He and his colleague, Cynthia Hubert, were finalists for the 2014 Pulitzer in Investigative Reporting for their reporting on a Las Vegas mental hospital that bused more than 1,500 psychiatric patients out to 48 states in 5 years.

Read two of the following stories, write up at least five questions you have about the reporting or data analysis process, and prepare to challenge Phillip:
How to submit

In your private GitHub repo, create a new file named “phillip-reese-questions.md” (yes, it will be a Markdown file).

Practice pivot tables with homicide data

Due by next class
Using the homicide-related datasets we looked at in class, practice exploring them with pivot tables, then write a quick analysis of what you tried and what you found. It’s OK if the data seems difficult to summarize. If that’s the case, explain the problems you ran into (because these problems are the same as ones we’ll be seeing in other datasets).

Here are the Google spreadsheets that you can copy and then pivot on:
Note: In Google Spreadsheets, you can make a copy of the spreadsheet by going to File->Make a Copy…. Or, if you want to view it in Excel, select File->Download as… and you can choose CSV or Excel format.
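If you download a sheet as CSV, it can help to see what a pivot table actually computes: a count for every combination of a row field and a column field. The sketch below is a minimal illustration using Python's standard library; the column names `year` and `cause` and the sample rows are hypothetical and are not the headers of the class spreadsheets.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV contents standing in for a downloaded spreadsheet.
SAMPLE_CSV = """year,cause
2012,shooting
2012,shooting
2012,stabbing
2013,shooting
2013,stabbing
2013,stabbing
"""


def pivot_counts(rows, row_key, col_key):
    """Count rows for each (row_key value, col_key value) combination."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[row_key]][row[col_key]] += 1
    return table


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
table = pivot_counts(rows, "year", "cause")

# One line per year, with the count of each cause in that year.
for year in sorted(table):
    print(year, dict(table[year]))
```

To try this on a real download, replace `io.StringIO(SAMPLE_CSV)` with an opened file and use that file's actual column headers. Note that `csv.DictReader` reads every value as text, so years and ages are strings until you convert them.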

How to submit

In your private GitHub repo, create a new file named “homicide-data-analysis.md” (yes, it will be a Markdown file).

Course schedule

Tuesday, September 23

The singular of data is anecdote

An introduction to public affairs reporting and the core skills of using data to find and tell important stories.
- Count something interesting
- Make friends with math
- The joy of text
- How to do a data project
Thursday, September 25

Bad big data

Just because it's data doesn't make it right. But even when all the available data is flawed, we can get closer to the truth with mathematical reasoning and the ability to make comparisons, small and wide.
- Fighting bad data with bad data
- Baltimore's declining rape statistics
- FBI crime reporting
- The Uber effect on drunk driving
- Pivot tables
Tuesday, September 30

DIY Databases

Learn how to take data into your own hands. There are two kinds of databases: the kind someone else has made, and the kind you have to make yourself.
- The importance of spreadsheets
- Counting murders
- Making calls
- A crowdsourced spreadsheet
Thursday, October 2

Data in the newsroom

Phillip Reese of the Sacramento Bee will discuss how he uses data in his investigative reporting projects.
- Phillip Reese speaks
Tuesday, October 7

The points of maps

Mapping can be a dramatic way to connect data to where readers are and to what they recognize.
- Why maps work
- Why maps don't work
- Introduction to Fusion Tables and TileMill
Thursday, October 9

The shapes of maps

A continuation of learning mapping tools, with a focus on borders and shapes
- Working with KML files
- Intensity maps
- Visual joins and intersections
Tuesday, October 14

Introduction to SQL for Data Journalism

The first in several sessions on learning SQL for the exploration of large datasets.
- MySQL / SQLite
- Select, group, and aggregate
- Where conditionals
- SFPD reports of larceny, narcotics, and prostitution
- Babies, and what we name them
Thursday, October 16

A needle in multiple haystacks

The ability to join different datasets is one of the most direct ways to find stories that have been overlooked.
- Inner joins
- One-to-one relationships
- Our politicians and what they tweet
Tuesday, October 21

Haystacks without needles

Sometimes, what's missing is more important than what's there. We will cover more complex join logic to find what's missing from related datasets.
- Left joins
- NULL values
- Which Congressmembers like Ellen DeGeneres?
Thursday, October 23

Midterm Malarkey with Military Surplus

A casual midterm covering the range of data analysis and programming skills acquired so far.
- A midterm on SQL and data
- Data on military surplus distributed to U.S. counties
- U.S. Census QuickFacts
Tuesday, October 28

Campaign Cash Check

The American democratic process generates loads of interesting data and insights for us to examine, including who is financing political campaigns.
- Polling and pollsters
- Following the campaign finance money
- Competitive U.S. Senate races
Thursday, October 30

Predicting the elections

With Election Day coming up, we examine the practices of polling as a way to understand various scenarios of statistical bias and error.
- Statistical significance
- Poll reliability
- Forecasting
Tuesday, November 4

Election day (No class)

Do your on-the-ground reporting
- No class because of Election Day Coverage
Thursday, November 6

Storytelling with Data Visualization

While there are many tools and techniques for building data graphics, there is no magic visualization tool that will make a non-story worth telling.
- Review of the midterm
- The importance of good data in visualizations
- How visualization can augment the Serial podcast
Tuesday, November 11

Dirty data, cleaned dirt cheap

One of the most tedious but important parts of data analysis is just cleaning and organizing the data. Being a good "data janitor" lets you spend more time on the more fun parts of journalism.
- Dirty data
- OpenRefine
- Clustering
Thursday, November 13

Guest speaker: Simon Rogers

Simon Rogers, data editor at Twitter, talks about his work, how Twitter reflects how communities talk to each other, and the general role of data journalism.
- Ellen, World Cup, and other masses of Twitter data
Tuesday, November 18

What we say and what we do

When the data doesn't directly reveal something obvious, we must consider what its structure and its metadata imply.
- Proxy variables
- Thanks Google for figuring out my commute
- How racist are we, really?
- How web sites measure us
Thursday, November 20

Project prep and discussion

Discussion of final projects before the Thanksgiving break.
Tuesday, November 25

Thanksgiving break

Holiday - no class
Thursday, November 27

Thanksgiving break

Holiday - no class
Tuesday, December 2

Project wrapup

Last-minute help on final projects.
Thursday, December 4

Project Show-N-Tell

In-class presentations of our final data projects.
