- Fighting bad data with bad data
- Baltimore's declining rape statistics
- FBI crime reporting
- The Uber effect on drunk driving
- Pivot tables
- Investigate Uber's purported impact on SF's DUI reports Due by next class
- Explore a snapshot of SF crime incident reports from 2014 with pivot tables Due by next class
- Read several essays on tracking homicides and shootings Due by next class
- Read The New Precision Journalism, Chapter 8: Databases Due by next class
Jump to the full details on homework assignments
In the previous class, we saw a scenario in which there was effectively zero data, and how investigative reporters used an indirectly related but plentiful and reliable dataset to make a convincing case that police weren't policing themselves.
Today's scenario is different. There is seemingly enough data. But on closer investigation, the data itself proves vulnerable at every level to inaccuracies, through miscounting if not outright manipulation.
So what's the point of knowing how to use data at all if it can be so easily subverted?
In 2010, Baltimore Sun crime reporter Justin Fenton not only built a clear, logical refutation around something as infuriatingly opaque and politically controversial as rape statistics, but did it so convincingly that his work immediately effected a dramatic change in policy. And he started with the same flawed, public data that the police and the politicians had access to but apparently overlooked.
We'll attempt to recreate Fenton's analysis based on what he published. Obviously, it's easy to follow in someone's footsteps, but the exercise is worth it to see the variety and scope of comparison and investigation Fenton did to build his case. His project is a textbook case of how to analyze numbers – even the tainted kind – to get closer to the truth.
A dramatic and strange decline
In the year 2009, the Baltimore police reported 158 rapes. Moving past the sentiment that one such crime is too many, the first thing to question is: how significant is 158? Is that more than expected for a city of Baltimore's size? Fewer? The easiest route of comparison is to look at that number historically, i.e. the number of rape reports, year over year:
via the Baltimore Sun's graphics department:
If we restrict our inquiry to the past decade, we might conclude that 2009 is typical in terms of reported rapes, though from 2000 to 2003 we do see a noticeable drop, implying that in 2009 Baltimore is near its historic low in this statistic.
However, when we expand the time frame further back into the 1980s:
– it appears that Baltimore has experienced a dramatic decline: nearly 80 percent from a high of 749 rape reports in 1992.
It's worth asking: is 80 percent a lot? According to the caption of the Baltimore Sun graphic, yes, as FBI statistics indicate an 8 percent drop nationwide.
Control for population
What if Baltimore's population changed significantly?
The first thing we must do is control for changes in population. We should expect Baltimore's crime overall to drop 80 percent if 80 percent of its population has moved out.
A Google search for "Baltimore population" seems to invalidate that theory:
Overall success in crime prevention?
It's possible that Baltimore has had exceptional success with reducing crime overall compared to the rest of the nation. It's possible there has been a significant investment in police funding, as well as an implementation of innovative law enforcement strategies (a real-life Hamsterdam?).
We can test that theory by comparing across categories. If it holds, we should see a dramatic drop in other categories of crime, such as homicide. Let's look at the FBI Uniform Crime Reporting numbers for Baltimore's homicides and rapes (officially labeled "Murder and nonnegligent manslaughter" and "Forcible rape," respectively).
Because the population fluctuates, it's more relevant to look at crime rates: in this case, number of crimes per 100,000 citizens:
| Year | Homicide rate | Rape rate |
|------|---------------|-----------|
Charting that data leads to the graph below: whatever has led to the drop in rape reports has not affected the rate of homicides:
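Since a rate per 100,000 is just a scaled division, the arithmetic behind each cell in the table above can be sketched in a couple of lines. The population figure below is an assumption for illustration, not an official census count:

```python
# Crime rate per 100,000 residents: incidents divided by population, scaled up.
def rate_per_100k(incidents, population):
    return incidents / population * 100_000

# e.g. 158 rape reports in a city of roughly 637,000 people
print(round(rate_per_100k(158, 637_000), 1))  # about 24.8 per 100,000
```

Working with rates rather than raw counts is what makes year-over-year and city-to-city comparisons meaningful when populations differ.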
Ratio of homicides to rapes, nationwide and locally
In fact, the ratio of rape reports to homicides is so low that it places Baltimore among a very few cities in which there are more homicides than rape reports, according to this Baltimore Sun infographic:
More homicides than rapes: Nationally, there are five rapes for every murder committed. Out of 272 cities with a population of 100,000 or more, Baltimore is one of five recording more murders than rapes in 2009.
Volume and handling of rape calls
If the incidence of sexual assault had truly dropped, then we would expect not just a drop in the rape reports handled by detectives, but a drop in emergency calls about rape. The Baltimore Sun included this graph, showing a relatively steady volume of calls from 2003 to 2009:
Unfortunately, the Sun doesn't have the number of 911 calls about rape in years before 2003, so it's not possible to see if there was a corresponding drop (or not) as there was in rape reports.
However, the Baltimore Sun's analysis of how rape calls were handled, from 2003 through June 2010, revealed "that four in 10 calls to 911 during [the past 5 years] never made it to detectives specializing in sex crimes, having been dismissed by police officers at the scene with no report taken."
A leader in unfounded rape cases
For the calls that did make it to Baltimore detectives, more than 30 percent of them would be "marked unfounded by detectives, meaning police believed the victim was lying." In this statistic, Baltimore led the country.
via the Baltimore Sun:
An advocate interviewed by Fenton offers this rhetorical question:
"Why is it that women in the greater Baltimore area are more disposed to lying about sexual assault than anyplace else in America?" asked Branson of Turn Around. "Is it in the water? What exactly would make us the ones most likely to tell a story about being sexually assaulted?"
Case work at the detective-level
Baltimore detectives interviewed by Fenton offered this rebuttal: the increase in "unfounded" reports was due to [better police work](http://www.baltimoresun.com/news/bs-md-ci-rapes-20100519,0,2195896,full.story):
Current and former sex offense detectives in Baltimore defended their investigations. Part of their mission, they say, involves rooting out illegitimate complaints that in the past would result in wasted effort and false arrests.
Many reports of rape are made for "ill gain, in order to gain assistance or cover up not coming home," said one of the commanders of the unit, Lt. Thomas Uzarowski, in a March interview.
"The bottom line is, the case is only unfounded when the investigative facts prove the crime did not occur," said Uzarowski, who retired from the department this month. "It's not an opinion. It's not anything other than where the facts fall."
This is a claim that's hard to directly measure. But Fenton scrutinizes the rape report numbers by individual detectives:
The Baltimore squad that investigates sexual assaults and child abuse comprises 50 detectives. One of them, Detective Anthony Faulk Jr., is responsible for one-fifth of the unfounded reports, shelving 14 cases last year, including the alleged attack in midtown. No other detective had more than six such cases, and some have none. Attempts to reach Faulk through the department and police union were unsuccessful.
It's not a conclusive number, but it casts doubt on "better police work" as the explanation. If that were the case, how could one detective have such a lopsided rate of unfounded reports? Is it the luck of the draw, or is he doing the most thorough investigative work? Or is it, as critics believe, that there's simply no standard for evaluating rape claims, allowing detectives to dismiss valid cases too easily?
Impact of Baltimore Sun investigation
Even before Fenton's first story hit the presses, Baltimore Mayor Stephanie Rawlings-Blake ordered an audit after she was contacted by reporters.
Baltimore Police Commissioner Frederick H. Bealefeld III declined to be interviewed in Fenton's first story, but after The Sun's story broke, Bealefeld admitted that there was a "crisis" and said of The Sun's reporting, "I can't tell you that we would have seen that by ourselves."
The Sun kept following the numbers during the months after Fenton's initial report. Before his June 2010 story, rape cases were down 15 percent for the year. By the end of August, the number of rapes investigated by Baltimore police had risen 20 percent over the year before.
Via the Sun's graphics department:
A major category of crime is up substantially in Baltimore, and police and city officials are pleased.
A year after The Baltimore Sun revealed that the city led the country in the number of rape reports discarded by detectives — part of what women's advocates and victims said was a broader pattern of ignoring sexual assaults — the number of rapes being reported is up more than 50 percent.
The power of numbers
Rape occupies a special category of crime: it most often takes place behind closed doors and between victim and perpetrator who are acquainted, making it a crime whose incidence is less directly affected by the actions and policies of law enforcement. That alone makes rape difficult to count, even before acknowledging the unique social stigma of the crime.
Rape is different from other crimes. Not only does it involve a violation more profound than any other crime but it also comes with a social stigma that forces victims to relive the pain again and again. No one suggests that a victim of a carjacking was really asking for it. No one asks whether an assault might really have been consensual. When a robbery victim is on the witness stand, the most private details of her life are not dissected under cross examination.
There's little argument that rape is a heinous crime. But there's apparently much disagreement on how to judge an accusation. This uncertainty makes it difficult to objectively know whether or not rapes are being thoroughly investigated. And so, in the decade before Fenton's 2010 report, how many valid accusations were dismissed? And how many more would still be ignored today were it not for The Sun's investigation?
Below is the chart of rape and homicide rates for Baltimore through 2013, according to FBI statistics:
Just as profound as the Sun investigation's impact on Baltimore's rape statistics (presumably a result of the overhaul of the Baltimore Police's sex crimes unit) is the realization that all the numbers – all the data, good and bad – were easily accessible to the FBI, the Baltimore police, and the politicians.
But data being accessible is not at all the same as the data being known. Kudos to Justin Fenton and the Baltimore Sun for paying attention and building a decisive investigation despite the inherently bad data.
Examining San Francisco DUIs with Pivot Charts
An introduction to Pivot Charts for the summarization of data.
Other examples of crime stat questions
On October 28, a pathologist ruled the death of Tiara Groves a homicide by “unspecified means.” This rare ruling means yes, somebody had killed Groves, but the pathologist couldn’t pinpoint the exact cause of death…With the stroke of a computer key, she was airbrushed out of Chicago’s homicide statistics.
The change stunned officers. Current and former veteran detectives who reviewed the Groves case at Chicago’s request were just as incredulous. Says a retired high-level detective, “How can you be tied to a chair and gagged, with no clothes on, and that’s a [noncriminal] death investigation?” (He, like most of the nearly 40 police sources interviewed for this story, declined to be identified by name, citing fears of disciplinary action or other retribution.)
Was it just a coincidence, some wondered, that the reclassification occurred less than two weeks before the end of the year, when the city of Chicago’s final homicide numbers for 2013 would be tallied? “They essentially wiped away one of the murders in the city, which is crazy,” says a police insider. “But that’s the kind of shit that’s going on.”
For the case of Tiara Groves is not an isolated one. Chicago conducted a 12-month examination of the Chicago Police Department’s crime statistics going back several years, poring through public and internal police records and interviewing crime victims, criminologists, and police sources of various ranks. We identified 10 people, including Groves, who were beaten, burned, suffocated, or shot to death in 2013 and whose cases were reclassified as death investigations, downgraded to more minor crimes, or even closed as noncriminal incidents—all for illogical or, at best, unclear reasons.
Many officers of different ranks and from different parts of the city recounted instances in which they were asked or pressured by their superiors to reclassify their incident reports or in which their reports were changed by some invisible hand. One detective refers to the “magic ink”: the power to make a case disappear. Says another: “The rank and file don’t agree with what’s going on. The powers that be are making the changes.”
Detroit – The Detroit Police Department is systematically undercounting homicides, leading to a falsely low murder rate in a city that regularly ranks among the nation's deadliest, a Detroit News review of police and medical examiner records shows.
The police department incorrectly reclassified 22 of its 368 slayings last year as "justifiable" and did not report them as homicides to the FBI as required by federal guidelines. There were at least 59 such omissions over the past five years, according to incomplete records obtained from the police department through the Freedom of Information Act.
The Uber effect
In June 2010, the Uber ride-sharing service launched in San Francisco. Besides changing our transportation habits, and the bottom-line of existing cab services, Uber believes it has also had a profound effect on public safety.
From their blog: DUI rates decline in Uber cities
We estimate that the entrance of Uber in Seattle caused the number of arrests for DUI to decrease by more than 10%. These results are robust and statistically significant. The diagram below illustrates the “Uber effect” relative to the baseline of DUIs.
…Uber is responsible for approximately -.7 DUIs per day, or more than a 10% reduction overall. However, this approach is inherently weakened by the fact that many things could have caused DUI to go down around the time when Uber entered. In order to test the robustness of this estimate, we use San Francisco as a control city in a “differences-in-differences” framework. The result is consistent:
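The "differences-in-differences" framework Uber describes can be sketched with made-up numbers: compare the before/after change in the city where the service launched against the same change in a control city, and attribute the leftover difference to the intervention. The daily DUI averages below are hypothetical, not Uber's actual figures:

```python
# Differences-in-differences with illustrative numbers (not Uber's actual data).
# The change in the treated city, minus the change in the control city,
# is the estimated effect of the intervention.
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    return (treated_after - treated_before) - (control_after - control_before)

# Hypothetical daily DUI averages: the treated city falls from 6.0 to 5.1,
# while the control city drifts from 5.0 to 4.8 over the same period.
effect = diff_in_diff(6.0, 5.1, 5.0, 4.8)
print(round(effect, 1))  # -0.7 DUIs per day attributed to the intervention
```

The point of the control city is to subtract out whatever citywide or economic trends would have moved the numbers anyway.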
The Washington Post's Wonk Blog took a second look (and included Lyft's entrance) at San Francisco and concluded that more research was needed, but the raw numbers don't seem to contradict Uber's claims. The WaPo produced this chart from SF DUI numbers:
Counter-claim: It's all about the policing
Some law enforcement officials remain skeptical, SCPR.org reported:
Seattle Police Detective Drew Fowler said Uber shouldn’t be taking credit for reducing DUIs.
“We have seen a decrease in the number of DUI arrests made," Fowler said. "If part of that can be assigned to the introduction of Uber, fantastic. But I don’t think proving the veracity of that is going to be very easy to do.”
Fowler said it's more likely that DUIs have fallen because of a crackdown on drunk driving as part of a statewide initiative called Target Zero.
Sometimes crime numbers drop because there are fewer police to make reports. Here, the Seattle detective thinks an increased police presence drove down DUI rates.
And it always pays to remember: correlation does not equal causation. It seems intuitive that Uber would allow more people to get home drunk without driving. But maybe those people stayed home before Uber was around, due to the inconvenience of having to find a ride home otherwise. Maybe Uber's real effect is that more people get to party who otherwise wouldn't – which is not a bad feat, mind you.
Let's make a pivot chart!
So what's the truth? Finding all the datasets across the different domains (such as liquor sales and bar business over time) is beyond our ability. But one thing we can do is verify just one piece of Uber's evidence, something well within our means: the DUI reports in San Francisco.
The big picture: we want to re-create the Washington Post Wonkblog's chart, but we need to get the data in the right form. In fact, much of your difficult work in visualization (or any analysis) will be getting the data into the right form for whatever chart-making tool you're using.
We start out with a spreadsheet with every DUI report in the San Francisco crime incident database. Every report has its own row, and we may be able to count the total number of incidents (by counting up all the rows), but we can't figure out how many incidents there are by, say, per month and year:
We need something that looks like this: a row for every month-year combination, and a column that counts the number of incidents in that month-year.
And then we can chart it.
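For readers who prefer code to spreadsheets, the month-year counting described above can be sketched in plain Python. The date strings here are hypothetical stand-ins for the SFPD export:

```python
from collections import Counter
from datetime import datetime

# A tiny stand-in for the incident export: one date string per report row
reports = ["01/15/2009", "01/20/2009", "02/03/2009", "02/14/2010"]

# Derive a "YYYY-MM" key for each incident, then count rows per key --
# the same summary the pivot table will produce
def count_by_month(dates):
    return Counter(datetime.strptime(d, "%m/%d/%Y").strftime("%Y-%m") for d in dates)

counts = count_by_month(reports)
print(counts["2009-01"])  # 2 incidents fall in January 2009 in this toy sample
```

This is exactly the transformation the pivot table performs: collapse one-row-per-incident data into one-row-per-group counts.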
As an exercise, let's first chart the rate of DUIs by hour.
- Open Google spreadsheet
- Make a copy
- Add some frozen panes, stretch out some fields
- Add a column to the end of the sheet, name it hour, add the formula =HOUR(timecolumn), and calculate the formula down all the rows.
- Now make a Pivot Table
For the Rows editor, we want to Group by: hour (you can uncheck the Show totals box, too). This enumerates all the possible values in the hour column.
Now we need a second listing in our pivot table: how many DUIs were reported per hour; in other words, the kind of summary we want is a count of the rows. So in the Values section, it doesn't matter which column you pick, but for consistency's sake, just pick the incident number column.
By default, the pivot table will try to Display the SUM of the incident numbers, which makes no sense for what we want to do. Change the Display option to COUNT.
- To visualize this, we select from the menu: Insert a new Chart. Check the Use column A as labels (remember that the first column has what will be the hour labels) and select the Line chart (Google Charts may recommend a few irrelevant chart types at first):
As expected, based on Uber's findings, people tend to get caught driving drunk at night (and comparatively rarely at, say, 10 AM).
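The hour exercise can be mimicked the same way in plain Python, again with made-up time values standing in for the incident data:

```python
from collections import Counter
from datetime import datetime

# Illustrative "time" values like those in a crime incident export (24-hour clock)
times = ["23:15", "01:40", "23:55", "14:05", "22:30"]

# The spreadsheet's =HOUR(timecolumn) step becomes .hour here,
# and the pivot table's COUNT becomes a Counter over the hours
hours_counted = Counter(datetime.strptime(t, "%H:%M").hour for t in times)
print(hours_counted[23])  # 2 reports in the 11 PM hour of this toy sample
```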
Let's move on to our original goal: charting by month and year.
DUIs by month and year
Return to the original sheet (it should be called "Sheet1" in the bottom tabs) and now add a new column, year_and_month. We want the values in this column to be a combination of the year and month derived from the date column. This can all be done in one formula:
This is an admittedly esoteric-looking formula, but it's important: we need to extract just the year and the month so that we are summarizing (i.e. counting) incidents per year and month, and not by each day. The new column will look like this:
You can either make a new pivot table, or return to the previous one. If you do the latter, you have to Edit [the] range since we've added a new column to the report.
Now, when you Add field to Rows, the year_and_month option should show up. Select that. Then for Values, again pick a column and choose Summarize by: COUNT.
The pivot table will now look like this:
And then insert a new chart as before to get something pretty similar to what the WaPo Wonkblog has:
Simplifying the data
In the Wonkblog piece, they note a curious trend in both the SF and Philly DUI numbers:
It's also striking that San Francisco and Philadelphia show a steep and parallel rise in DUIs long before these services ever came to town; on both charts, it looks as if DUI numbers may be returning to an older normal as much as they've been falling. Perhaps these services have arrived on the market just in time to ride the benefits of an improving economy? (More theories on this welcome below as well).
Let's make some theories of our own. We can't do anything for Philly right now, but let's check out our SF numbers. Eyeballing the chart, it seems the spike happened somewhere between 2008 and 2009 (Uber launched in June 2010).
The problem with our current pivot table is that there's too much data. So go back to the original spreadsheet and add two more columns: year and month (you can figure out the formulas yourself). Then re-adjust your pivot table to include those columns.
Let's make a simpler chart: instead of grouping the Rows by year_and_month, simply group by year and make a column chart of the results:
2009 seems like a banner year for DUI reports. Let's look even closer at that number by examining the data at a monthly level.
Crosstab across year and month
So far, we've been using the pivot tables in the simplest way: to aggregate across one dimension (counting incidents per hour, for instance). But pivot tables are especially useful for multiple dimensions, such as incidents per year, per month.
Well, we kind of did that with the "year_and_month" field, but that only generated a two-column table. You'll see the difference if you do the following:
- In the Rows section of the Report Editor, change the Group by field to the year column
- In the Columns section, add the month field
- Set the Values as before
You'll now get a nice cross-tabulation of the data:
Notice anything strange? In February 2009 (the column headed by 2), there were 60 DUI reports. In February 2008 and 2010, there were 35 and 26 incidents, respectively. Make a chart of this pivot table to see the difference visually:
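The two-dimensional version, a count keyed by both year and month, can be sketched in plain Python as well (with hypothetical dates):

```python
from collections import Counter
from datetime import datetime

# Illustrative incident dates spanning the Februarys discussed above
dates = ["02/01/2008", "02/09/2009", "02/12/2009", "02/20/2010"]

# Cross-tabulate on two dimensions at once: key each count by (year, month),
# the same grid the pivot table's Rows and Columns sections build
table = Counter()
for d in dates:
    dt = datetime.strptime(d, "%m/%d/%Y")
    table[(dt.year, dt.month)] += 1

print(table[(2009, 2)])  # 2 incidents in February 2009 in this toy sample
```

Each (row, column) cell of the pivot table corresponds to one (year, month) key here.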
In the homework, you'll practice the use of Pivot Tables. I've posted a complete walkthrough for the homework here.
Note: I’ve created a video walkthrough on using pivot tables for this data. Check it out here.
In class, we started to examine a claim by Uber that its ride-sharing service has likely caused a decrease in drunk-driving behavior. One of their datasets is the San Francisco Police Department’s incident reports categorized as “driving under the influence.”
I’ve created another SFPD dataset: all the incidents from 2003 to 2013 that were categorized as DUIs, “DRUNKENNESS”, or “KIDNAPPINGS”.
Using the mathematical reasoning we examined in the Baltimore Sun’s coverage of declining rape statistics, use this dataset to either support or cast doubt on Uber’s analysis.
In other words, your analysis must be able to complete this statement:
If it were true that Uber’s launch in San Francisco in June 2010 had a noticeable impact on DUI reports, then we expect to see (this or that) in (yadda yadda yadda) kind of comparison. As the data shows, we (do/do not).
Hint: Read the Washington Post Wonkblog analysis and note what they did not fully investigate.
Make copies of one or the other or both. And then just explore them. In class, we looked at raw counts of categories of incidents, such as DUIs. But there’s more to the data, including subcategories of the incidents and how they were resolved. Examine a category of crime. Slice it up by time or date intervals. Make a pivot chart and even a visualization if you’d like. Then write up a quick one-pager describing to me what you did and anything you noticed.
Read the following essays and turn in a one-page report answering these questions:
- Why is there no comprehensive and official data source on either number of homicides, gun-related homicides, or police-related homicides?
- What official sources exist and what are their limitations? And why do those limitations exist?
- How have independent researchers been able to collect data?
- What obstacles have they run into?
The essays:
- What Slate learned from trying, and failing, to record every death by gun in America since Newtown.
- “Homicide Watch: An Interview”
- Nobody Knows How Many Americans The Police Kill Each Year
- Why the Fatal Encounters project exists
Read from the 1991 edition online: http://www.unc.edu/~pmeyer/book/Chapter8.htm. No writeup necessary, but you should probably read it rather than risk being confused in the next class.
An introduction to public affairs reporting and the core skills of using data to find and tell important stories.
- Count something interesting
- Make friends with math
- The joy of text
- How to do a data project
Just because it's data doesn't make it right. But even when all the available data is flawed, we can get closer to the truth with mathematical reasoning and the ability to make comparisons, small and wide.
- Fighting bad data with bad data
- Baltimore's declining rape statistics
- FBI crime reporting
- The Uber effect on drunk driving
- Pivot tables
Learn how to take data in your own hands. There are two kinds of databases: the kind someone else has made, and the kind you have to make yourself.
- The importance of spreadsheets
- Counting murders
- Making calls
- A crowdsourced spreadsheet
Phillip Reese of the Sacramento Bee will discuss how he uses data in his investigative reporting projects.
- Phillip Reese speaks
Mapping can be a dramatic way to connect data to where readers are and to what they recognize.
- Why maps work
- Why maps don't work
- Introduction to Fusion Tables and TileMill
A continuation of learning mapping tools, with a focus on borders and shapes
- Working with KML files
- Intensity maps
- Visual joins and intersections
The first in several sessions on learning SQL for the exploration of large datasets.
- MySQL / SQLite
- Select, group, and aggregate
- Where conditionals
- SFPD reports of larceny, narcotics, and prostitution
- Babies, and what we name them
The ability to join different datasets is one of the most direct ways to find stories that have been overlooked.
- Inner joins
- One-to-one relationships
- Our politicians and what they tweet
Sometimes, what's missing is more important than what's there. We will cover more complex join logic to find what's missing from related datasets.
- Left joins
- NULL values
- Which Congressmembers like Ellen DeGeneres?
A casual midterm covering the range of data analysis and programming skills acquired so far.
- A midterm on SQL and data
- Data on military surplus distributed to U.S. counties
- U.S. Census QuickFacts
The American democratic process generates loads of interesting data and insights for us to examine, including who is financing political campaigns.
- Polling and pollsters
- Following the campaign finance money
- Competitive U.S. Senate races
With Election Day coming up, we examine the practices of polling as a way to understand various scenarios of statistical bias and error.
- Statistical significance
- Poll reliability
Do your on-the-ground reporting
- No class because of Election Day Coverage
While there are many tools and techniques for building data graphics, there is no magic visualization tool that will make a non-story worth telling.
- Review of the midterm
- The importance of good data in visualizations
- How visualization can augment the Serial podcast
One of the most tedious but important parts of data analysis is just cleaning and organizing the data. Being a good "data janitor" lets you spend more time on the more fun parts of journalism.
- Dirty data
Simon Rogers, data editor at Twitter, talks about his work, how Twitter reflects how communities talk to each other, and the general role of data journalism.
- Ellen, World Cup, and other masses of Twitter data
When the data doesn't directly reveal something obvious, we must consider what its structure and its metadata implies.
- Proxy variables
- Thanks Google for figuring out my commute
- How racist are we, really?
- How web sites measure us
Discussion of final projects before the Thanksgiving break.
Holiday - no class
Holiday - no class
Last-minute help on final projects.
In-class presentations of our final data projects.