COMM 273D | Fall 2014

Thursday, September 25

Bad big data

Just because it's data doesn't make it right. But even when all the available data is flawed, we can get closer to the truth with mathematical reasoning and the ability to make comparisons, small and wide.


  • Fighting bad data with bad data
  • Baltimore's declining rape statistics
  • FBI crime reporting
  • The Uber effect on drunk driving
  • Pivot tables


  • Investigate Uber's purported impact on SF's DUI reports Due by next class
  • Explore a snapshot of SF crime incident reports from 2014 with pivot tables Due by next class
  • Read several essays on tracking homicides and shootings Due by next class
  • Read The New Precision Journalism, Chapter 8: Databases Due by next class

Jump to the full details on homework assignments

In the previous class, we saw a scenario in which there was effectively zero data, and how investigative reporters used an indirectly related but plentiful and reliable dataset to make a convincing case that police weren't policing themselves.

Today's scenario is different. There is seemingly enough data. But on closer investigation, the data itself is vulnerable at every level to inaccuracies, through miscounting if not outright manipulation.

So what's the point of knowing how to use data at all if it can be so easily subverted?

In 2010, Baltimore Sun crime reporter Justin Fenton not only built a clear, logical refutation of something as infuriatingly opaque and politically controversial as rape statistics, but did it so convincingly that his work immediately effected a dramatic change in policy. And he started with the same flawed, public data that the police and the politicians had access to but apparently overlooked.

We'll attempt to recreate Fenton's analysis based on what he published. Obviously, it's easy to follow in someone's footsteps, but the exercise is worth it to see the variety and scope of comparison and investigation Fenton did to build his case. His project is a textbook case of how to analyze numbers – even the tainted kind – to get closer to the truth.

A dramatic and strange decline

In the year 2009, the Baltimore police reported 158 rapes. Moving past the sentiment that one such crime is too many, the first thing to question is: how significant is 158? Is that more than expected for a city of Baltimore's size? Fewer? The easiest route of comparison is to look at that number historically, i.e. the number of rape reports, year over year:

via the Baltimore Sun's graphics department:


If we restrict our inquiry to the past decade, we might conclude that 2009 is typical in terms of reported rapes, though from 2000 to 2003 we do see a noticeable drop, implying that in 2009 Baltimore is near its historic low in this statistic.

However, when we expand the time frame further back into the 1980s:


– it appears that Baltimore has experienced a dramatic decline: nearly 80 percent from a high of 749 rape reports in 1992.

It's worth asking: is 80 percent a lot? According to the caption of the Baltimore Sun graphic, yes, as FBI statistics indicate an 8 percent drop nationwide.

Control for population

What if Baltimore's population changed significantly?

The first thing we must do is control for changes in population. We should expect Baltimore's crime overall to drop 80 percent if 80 percent of its population has moved out.

A Google search for "Baltimore population" seems to invalidate that theory:


Overall success in crime prevention?

It's possible that Baltimore has had exceptional success with reducing crime overall compared to the rest of the nation. It's possible there has been a significant investment in police funding, as well as an implementation of innovative law enforcement strategies (a real-life Hamsterdam?).

We can test that theory by comparing across categories: if Baltimore has gotten better at preventing crime in general, we should see a dramatic drop in other categories of crime, such as homicide. Let's look at the FBI Uniform Crime Reporting numbers for Baltimore's homicides and rapes (officially labeled "Murder and nonnegligent manslaughter" and "Forcible rape", respectively):

Year Population Homicides Rape
1985 771,097 213 592
1986 783,542 240 660
1987 764,893 226 595
1988 763,880 234 517
1989 763,138 262 541
1990 736,014 305 687
1991 748,099 304 701
1992 755,517 335 749
1993 732,968 353 668
1994 739,180 321 637
1995 712,209 322 683
1996 716,446 330 641
1997 719,587 312 480
1998 662,253 313 469
2000 651,154 261 366
2001 660,826 256 296
2002 671,028 253 178
2003 644,554 270 204
2004 634,279 276 182
2005 641,097 269 162
2006 637,556 276 138
2007 624,237 282 146
2008 634,549 234 137
2009 638,755 238 158

Because the population fluctuates, it's more relevant to look at crime rates: in this case, the number of crimes per 100,000 residents:

Year Homicide rate Rape rate
1985 27.6 76.8
1986 30.6 84.2
1987 29.5 77.8
1988 30.6 67.7
1989 34.3 70.9
1990 41.4 93.3
1991 40.6 93.7
1992 44.3 99.1
1993 48.2 91.1
1994 43.4 86.2
1995 45.2 95.9
1996 46.1 89.5
1997 43.4 66.7
1998 47.3 70.8
2000 40.1 56.2
2001 38.7 44.8
2002 37.7 26.5
2003 41.9 31.6
2004 43.5 28.7
2005 42.0 25.3
2006 43.3 21.6
2007 45.2 23.4
2008 36.9 21.6
2009 37.3 24.7
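The rate columns above are simple arithmetic: the raw count divided by that year's population, scaled to 100,000 residents. A quick sketch in Python, using the 1992 figures from the FBI table:

```python
def per_100k(count, population):
    """Convert a raw crime count to a rate per 100,000 residents."""
    return count / population * 100_000

# 1992 figures from the FBI table above: 335 homicides, 749 rape reports,
# and a population of 755,517
print(round(per_100k(335, 755_517), 1))  # 44.3 (homicide rate)
print(round(per_100k(749, 755_517), 1))  # 99.1 (rape rate)
```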

Charting that data leads to the graph below: whatever has led to the drop in rape reports has not affected the rate of homicides:


Ratio of homicides to rapes, nationwide and locally

In fact, the ratio of rape reports to homicides is so low that it places Baltimore among a very few cities in which there are more homicides than rape reports, according to this Baltimore Sun infographic:

More homicides than rapes: Nationally, there are five rapes for every murder committed. Out of 272 cities with a population of 100,000 or more, Baltimore is one of five recording more murders than rapes in 2009.

City Homicides Rapes
New Orleans 174 98
Baltimore 238 158
Baton Rouge 75 55
Newark 80 68
Detroit 361 335

Volume and handling of rape calls

If the incidence of sexual assault had truly dropped, then we would expect not just a drop in the rape reports handled by detectives, but a drop in emergency calls about rape. The Baltimore Sun included this graph, showing a relatively steady volume of calls from 2003 to 2009:


Unfortunately, the Sun doesn't have the number of 911 calls about rape in years before 2003, so it's not possible to see whether there was a drop corresponding to the one in rape reports.

However, the Baltimore Sun's analysis of how rape calls were handled, from 2003 through June 2010, revealed "that four in 10 calls to 911 during [the past 5 years] never made it to detectives specializing in sex crimes, having been dismissed by police officers at the scene with no report taken."


A leader in unfounded rape cases

For the calls that did make it to Baltimore detectives, more than 30 percent of them would be "marked unfounded by detectives, meaning police believed the victim was lying." In this statistic, Baltimore led the country.

via the Baltimore Sun:


An advocate interviewed by Fenton offers this rhetorical question:

"Why is it that women in the greater Baltimore area are more disposed to lying about sexual assault than anyplace else in America?" asked Branson of Turn Around. "Is it in the water? What exactly would make us the ones most likely to tell a story about being sexually assaulted?"

Case work at the detective-level

Baltimore detectives interviewed by Fenton offered this rebuttal: the increase in "unfounded" reports was due to better police work:

Current and former sex offense detectives in Baltimore defended their investigations. Part of their mission, they say, involves rooting out illegitimate complaints that in the past would result in wasted effort and false arrests.

Many reports of rape are made for "ill gain, in order to gain assistance or cover up not coming home," said one of the commanders of the unit, Lt. Thomas Uzarowski, in a March interview.

"The bottom line is, the case is only unfounded when the investigative facts prove the crime did not occur," said Uzarowski, who retired from the department this month. "It's not an opinion. It's not anything other than where the facts fall."

This is a claim that's hard to directly measure. But Fenton scrutinizes the unfounded-report numbers of individual detectives:

The Baltimore squad that investigates sexual assaults and child abuse comprises 50 detectives. One of them, Detective Anthony Faulk Jr., is responsible for one-fifth of the unfounded reports, shelving 14 cases last year, including the alleged attack in midtown. No other detective had more than six such cases, and some have none. Attempts to reach Faulk through the department and police union were unsuccessful.

It's not a conclusive number, but it raises some doubt about "better police work" being the reality. If that were the case, how could one detective have such a lopsided rate of unfounded reports? Is it the luck of the draw, or is he doing the most thorough investigative work? Or is it, as critics believe, that there's simply no standard for evaluating rape claims, allowing detectives to too easily dismiss valid cases?

Impact of Baltimore Sun investigation

Even before Fenton's first story hit the presses, Baltimore Mayor Stephanie Rawlings-Blake ordered an audit after she was contacted by reporters.

Baltimore Police Commissioner Frederick H. Bealefeld III declined to be interviewed in Fenton's first story, but after The Sun's story broke, Bealefeld admitted that there was a "crisis" and said of The Sun's reporting, "I can't tell you that we would have seen that by ourselves."

The Sun kept following the numbers in the months after Fenton's initial report. Before his June 2010 story, rape cases were down 15 percent for the year. By the end of August, the number of rapes investigated by Baltimore police had risen 20 percent over the year before.

Via the Sun's graphics department:


"A year later, progress in Baltimore sex offense investigations"

A major category of crime is up substantially in Baltimore, and police and city officials are pleased.

A year after The Baltimore Sun revealed that the city led the country in the number of rape reports discarded by detectives — part of what women's advocates and victims said was a broader pattern of ignoring sexual assaults — the number of rapes being reported is up more than 50 percent.

The power of numbers

Rape occupies a special category of crime: it most often takes place behind closed doors, between a victim and perpetrator who are acquainted, and so its incidence is less directly affected by the actions and policies of law enforcement. That alone makes rape difficult to count, even before acknowledging the unique social stigma of the crime.

As the Baltimore Sun's editorial board put it:

Rape is different from other crimes. Not only does it involve a violation more profound than any other crime but it also comes with a social stigma that forces victims to relive the pain again and again. No one suggests that a victim of a carjacking was really asking for it. No one asks whether an assault might really have been consensual. When a robbery victim is on the witness stand, the most private details of her life are not dissected under cross examination.

There's little argument that rape is a heinous crime. But there's apparently much disagreement on how to judge an accusation. This uncertainty makes it difficult to objectively know whether or not rapes are being thoroughly investigated. So in the decade before Fenton's 2010 report, how many valid accusations were dismissed? And how many more would still be ignored today were it not for The Sun's investigation?

Below is the chart of rape and homicide rates for Baltimore through 2013, according to FBI statistics:


Just as profound as the Sun investigation's impact on Baltimore's rape statistics (presumably a result of the overhaul of the Baltimore Police's sex crimes unit) is the realization that all the numbers – all the data, good and bad – were easily accessible to the FBI, the Baltimore police, and the politicians.

But data being accessible is not at all the same as the data being known. Kudos to Justin Fenton and the Baltimore Sun for paying attention and building a decisive investigation despite the inherently bad data.

Examining San Francisco DUIs with Pivot Charts

An introduction to Pivot Charts for the summarization of data.

Map: The locations of SFPD DUI stops from 2003 to 2013


Other examples of crime stat questions

  • The Truth About Chicago’s Crime Rates

    On October 28, a pathologist ruled the death of Tiara Groves a homicide by “unspecified means.” This rare ruling means yes, somebody had killed Groves, but the pathologist couldn’t pinpoint the exact cause of death…With the stroke of a computer key, she was airbrushed out of Chicago’s homicide statistics.

    The change stunned officers. Current and former veteran detectives who reviewed the Groves case at Chicago’s request were just as incredulous. Says a retired high-level detective, “How can you be tied to a chair and gagged, with no clothes on, and that’s a [noncriminal] death investigation?” (He, like most of the nearly 40 police sources interviewed for this story, declined to be identified by name, citing fears of disciplinary action or other retribution.)

    Was it just a coincidence, some wondered, that the reclassification occurred less than two weeks before the end of the year, when the city of Chicago’s final homicide numbers for 2013 would be tallied? “They essentially wiped away one of the murders in the city, which is crazy,” says a police insider. “But that’s the kind of shit that’s going on.”

    For the case of Tiara Groves is not an isolated one. Chicago conducted a 12-month examination of the Chicago Police Department’s crime statistics going back several years, poring through public and internal police records and interviewing crime victims, criminologists, and police sources of various ranks. We identified 10 people, including Groves, who were beaten, burned, suffocated, or shot to death in 2013 and whose cases were reclassified as death investigations, downgraded to more minor crimes, or even closed as noncriminal incidents—all for illogical or, at best, unclear reasons.

    Many officers of different ranks and from different parts of the city recounted instances in which they were asked or pressured by their superiors to reclassify their incident reports or in which their reports were changed by some invisible hand. One detective refers to the “magic ink”: the power to make a case disappear. Says another: “The rank and file don’t agree with what’s going on. The powers that be are making the changes.”

  • Detroit police routinely underreport homicides

    Detroit – The Detroit Police Department is systematically undercounting homicides, leading to a falsely low murder rate in a city that regularly ranks among the nation's deadliest, a Detroit News review of police and medical examiner records shows.

    The police department incorrectly reclassified 22 of its 368 slayings last year as "justifiable" and did not report them as homicides to the FBI as required by federal guidelines. There were at least 59 such omissions over the past five years, according to incomplete records obtained from the police department through the Freedom of Information Act.

The Uber effect

In June 2010, the Uber ride-sharing service launched in San Francisco. Besides changing our transportation habits (and the bottom line of existing cab services), Uber believes it has also had a profound effect on public safety.

From their blog: DUI rates decline in Uber cities

We estimate that the entrance of Uber in Seattle caused the number of arrests for DUI to decrease by more than 10%. These results are robust and statistically significant. The diagram below illustrates the “Uber effect” relative to the baseline of DUIs.


…Uber is responsible for approximately -.7 DUIs per day, or more than a 10% reduction overall. However, this approach is inherently weakened by the fact that many things could have caused DUI to go down around the time when Uber entered. In order to test the robustness of this estimate, we use San Francisco as a control city in a “differences-in-differences” framework. The result is consistent:
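The "differences-in-differences" framework Uber describes amounts to two subtractions: take the before/after change in the treated city, then subtract the before/after change in the control city. Here's a toy sketch with made-up daily DUI averages (none of these numbers come from Uber's actual analysis):

```python
# Hypothetical average DUI arrests per day -- illustrative only, not Uber's data
seattle_before, seattle_after = 7.0, 6.1  # treated city: Uber entered Seattle
sf_before, sf_after = 6.5, 6.3            # control city: San Francisco

seattle_change = seattle_after - seattle_before  # change in treated city
sf_change = sf_after - sf_before                 # change in control city

# The diff-in-diff estimate: the part of Seattle's drop not shared by SF
uber_effect = round(seattle_change - sf_change, 1)
print(uber_effect)  # -0.7 DUIs per day in this toy example
```

The point of the control city is to net out whatever was happening to DUIs everywhere (economy, enforcement trends) so that only the treatment-specific change remains.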


The Washington Post's Wonkblog took a second look at San Francisco (and factored in Lyft's entrance) and concluded that more research was needed, but the raw numbers don't seem to contradict Uber's claims. The WaPo produced this chart from SF DUI numbers:


Counter-claim: It's all about the policing

Some law enforcement officials remain skeptical, as reported:

Seattle Police Detective Drew Fowler said Uber shouldn’t be taking credit for reducing DUIs.

“We have seen a decrease in the number of DUI arrests made," Fowler said. "If part of that can be assigned to the introduction of Uber, fantastic. But I don’t think proving the veracity of that is going to be very easy to do.”

Fowler said it's more likely that DUIs have fallen because of a crackdown on drunk driving as part of a statewide initiative called Target Zero.

Sometimes crime numbers drop because there are fewer police to make reports. Here, the Seattle detective thinks an increased police presence drove down DUI rates.

And it always pays to remember: correlation does not equal causation. It seems intuitive that Uber would allow more people to get home drunk without driving. But maybe those people stayed home before Uber was around, due to the inconvenience of having to find a ride home otherwise. Maybe Uber's real effect is that more people get to party who otherwise wouldn't – which is not a bad feat, mind you.

Let's make a pivot chart!

So what's the truth? Finding all the datasets across the different domains (such as liquor sales and bar business over time) is beyond our ability. But one thing we can do is verify a single piece of Uber's evidence, something well within our means: the DUI reports in San Francisco.

The big picture: we want to re-create the Washington Post Wonkblog's chart, but first we need to get the data into the right form. In fact, much of the difficult work in visualization (or any analysis) is getting the data into the right form for whatever chart-making tool you're using.

We start out with a spreadsheet of every DUI report in the San Francisco crime incident database. Every report has its own row, so we can count the total number of incidents (by counting all the rows), but we can't readily figure out how many incidents there are per, say, month and year:


We need a table that looks like this: a row for every month-year combination, and a column counting the number of incidents in that month-year.


And then we can chart it.

As an exercise, let's first chart the rate of DUIs by hour.

  1. Open Google spreadsheet
  2. Make a copy
  3. Add some frozen panes, stretch out some fields
  4. Add a column to the end of the sheet and name it hour
  5. Add the formula =HOUR(E2) (assuming E is the time column) and calculate the formula down all the rows.


  6. Now make a Pivot Table
  7. For the Rows editor, we want to Group by: hour (you can uncheck the Show totals box, too). This enumerates all the possible values in the hour column, i.e. 0 (midnight) to 23 (11 PM):


  8. Now we need to create a second listing in our pivot table: how many DUIs were reported per hour; in other words, the kind of summary we want is a count of the rows. So in the Values section, it doesn't matter which column you pick. But for consistency's sake, just pick incidntnum.

  9. By default, the pivot table will try to Display the SUM of the incident numbers…which makes no sense for what we want to do. Change the Display option to COUNT


  10. To visualize this, we select from the menu: Insert a new Chart. Check the Use column A as labels (remember that the first column has what will be the hour labels) and select the Line chart (Google Charts may recommend a few irrelevant chart types at first):


As expected, based on Uber's findings, people tend to get caught driving drunk at night (and comparatively rarely at, say, 10 AM).
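The spreadsheet steps above, deriving an hour column and then counting rows per hour, map directly onto a groupby in pandas, in case you prefer code to pivot-table menus. A minimal sketch with a few stand-in rows (not the real SFPD sheet):

```python
import pandas as pd

# A few stand-in DUI rows; the real sheet has incidntnum, date, time, etc.
df = pd.DataFrame({
    "incidntnum": [101, 102, 103, 104],
    "time": ["23:15", "01:40", "23:50", "10:05"],
})

# Same idea as the spreadsheet: add an hour column, then count rows per hour
df["hour"] = pd.to_datetime(df["time"], format="%H:%M").dt.hour
by_hour = df.groupby("hour").size()
print(by_hour[23])  # 2 -- two incidents fall in the 11 PM hour
```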

Let's move on to our original goal: charting by month and year.

DUIs by month and year

Return to the original sheet (it should be called "Sheet1" in the bottom tabs) and now add a new column: year_and_month.

We want the values in this column to be a combination of the year and month derived from the date column. This can all be done in one formula:


This is an admittedly esoteric-looking formula…but it's important because we need to extract just the year and the month so that we are summarizing (i.e. counting) incidents per year and month, and not by each day. The new column will look like this:
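For comparison, the same year-and-month derivation is compact in pandas, where one strftime call collapses full dates into year-month buckets (stand-in dates here; this is not the spreadsheet formula itself):

```python
import pandas as pd

# Stand-in dates; the point is that days collapse into year-month buckets
df = pd.DataFrame({"date": ["2009-02-03", "2009-02-20", "2010-03-05"]})

# Keep just the year and month so counting summarizes per month, not per day
df["year_and_month"] = pd.to_datetime(df["date"]).dt.strftime("%Y-%m")
by_month = df.groupby("year_and_month").size()
print(by_month["2009-02"])  # 2 -- both February 2009 rows share one bucket
```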


You can either make a new pivot table, or return to the previous one. If you do the latter, you have to Edit [the] range since we've added a new column to the report.

Now, when you Add field to Rows, the year_and_month option should show up.

Select that. And then for Values, again pick a column and choose Summarize by: COUNT

The pivot table will now look like this:


And then insert a new chart as before to get something pretty similar to what the WaPo Wonkblog has:



Simplifying the data

In the Wonkblog piece, they note a curious trend in both the SF and Philly DUI numbers:

It's also striking that San Francisco and Philadelphia show a steep and parallel rise in DUIs long before these services ever came to town; on both charts, it looks as if DUI numbers may be returning to an older normal as much as they've been falling. Perhaps these services have arrived on the market just in time to ride the benefits of an improving economy? (More theories on this welcome below as well).

Let's make some theories of our own. We can't do anything for Philly right now, but let's check out our SF numbers. Eyeballing the chart, it seems like the spike happened somewhere between 2008 and 2009 (Uber launched in June 2010).

The problem with our current pivot table is that there's too much data. So go back to the original spreadsheet and add two more columns, simply year and month (you can figure out the formulas yourself). Then re-adjust your pivot table to include those columns.

Let's make a simpler chart: Instead of grouping the Rows by year_and_month, simply group by year and make a column chart of the results:


2009 seems like a banner year for DUI reports. Let's look even closer at that number by examining the data at a monthly level.

Crosstab across year and month

So far, we've been using the pivot tables in the simplest way: to aggregate across one dimension (counting incidents per hour, for instance). But pivot tables are especially useful for multiple dimensions, such as incidents per year, per month.

Well, we kind of did that with the year_and_month field, but that only generated a two-column table. You'll see the difference if you do the following:

  1. In the Rows section of the Report Editor, change the Group by field to the year column.
  2. In the Columns section, add the month field.
  3. Set the Values as before

You'll now get a nice cross-tabulation of the data:


Notice anything strange? In February 2009 (the column headed by 2), there were 60 DUI reports. In February 2008 and 2010, there were 35 and 26 incidents respectively. Make a chart of this pivot table to see the difference visually:
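This year-by-month cross-tabulation can be sketched in pandas with a single pivot_table call; here's a toy version over stand-in rows (hypothetical incident numbers, not the real SFPD extract):

```python
import pandas as pd

# Stand-in DUI rows with hypothetical incident numbers
df = pd.DataFrame({
    "year":       [2008, 2009, 2009, 2009, 2010],
    "month":      [2,    2,    2,    3,    2],
    "incidntnum": [901,  902,  903,  904,  905],
})

# Rows grouped by year, columns by month, values = a count of incidents
crosstab = df.pivot_table(index="year", columns="month",
                          values="incidntnum", aggfunc="count", fill_value=0)
print(crosstab.loc[2009, 2])  # 2 -- two February 2009 incidents in this toy data
```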


In the homework, you'll practice the use of Pivot Tables. I've posted a complete walkthrough for the homework here.


  • Explore a snapshot of SF crime incident reports from 2014 with pivot tables

    Due by next class

    The San Francisco Police Department posts all of their crime incident reports online. I’ve created a couple of 2014 snapshots: January to March and May to July.

    Make copies of one or the other or both. And then just explore them. In class, we looked at raw counts of categories of incidents, such as DUIs. But there’s more to the data, including subcategories of the incidents and how they were resolved. Examine a category of crime. Slice it up by time or date intervals. Make a pivot chart and even a visualization if you’d like. Then write up a quick one-pager describing to me what you did and anything you noticed.

  • Read The New Precision Journalism, Chapter 8: Databases

    Due by next class

    Read from the 1991 edition online: No writeup necessary, but you should probably read it rather than risk being confused in the next class.

Course schedule

  • Tuesday, September 23

    The singular of data is anecdote

    An introduction to public affairs reporting and the core skills of using data to find and tell important stories.
    • Count something interesting
    • Make friends with math
    • The joy of text
    • How to do a data project
  • Thursday, September 25

    Bad big data

    Just because it's data doesn't make it right. But even when all the available data is flawed, we can get closer to the truth with mathematical reasoning and the ability to make comparisons, small and wide.
    • Fighting bad data with bad data
    • Baltimore's declining rape statistics
    • FBI crime reporting
    • The Uber effect on drunk driving
    • Pivot tables
  • Tuesday, September 30

    DIY Databases

    Learn how to take data in your own hands. There are two kinds of databases: the kind someone else has made, and the kind you have to make yourself.
    • The importance of spreadsheets
    • Counting murders
    • Making calls
    • A crowdsourced spreadsheet
  • Thursday, October 2

    Data in the newsroom

    Phillip Reese of the Sacramento Bee will discuss how he uses data in his investigative reporting projects.
    • Phillip Reese speaks
  • Tuesday, October 7

    The points of maps

    Mapping can be a dramatic way to connect data to where readers are and to what they recognize.
    • Why maps work
    • Why maps don't work
    • Introduction to Fusion Tables and TileMill
  • Thursday, October 9

    The shapes of maps

    A continuation of learning mapping tools, with a focus on borders and shapes
    • Working with KML files
    • Intensity maps
    • Visual joins and intersections
  • The first in several sessions on learning SQL for the exploration of large datasets.
    • MySQL / SQLite
    • Select, group, and aggregate
    • Where conditionals
    • SFPD reports of larceny, narcotics, and prostitution
    • Babies, and what we name them
  • Thursday, October 16

    A needle in multiple haystacks

    The ability to join different datasets is one of the most direct ways to find stories that have been overlooked.
    • Inner joins
    • One-to-one relationships
    • Our politicians and what they tweet
  • Tuesday, October 21

    Haystacks without needles

    Sometimes, what's missing is more important than what's there. We will cover more complex join logic to find what's missing from related datasets.
    • Left joins
    • NULL values
    • Which Congressmembers like Ellen Degeneres?
  • A casual midterm covering the range of data analysis and programming skills acquired so far.
    • A midterm on SQL and data
    • Data on military surplus distributed to U.S. counties
    • U.S. Census QuickFacts
  • Tuesday, October 28

    Campaign Cash Check

    The American democratic process generates loads of interesting data and insights for us to examine, including who is financing political campaigns.
    • Polling and pollsters
    • Following the campaign finance money
    • Competitive U.S. Senate races
  • Thursday, October 30

    Predicting the elections

    With Election Day coming up, we examine the practices of polling as a way to understand various scenarios of statistical bias and error.
    • Statistical significance
    • Poll reliability
    • Forecasting
  • Tuesday, November 4

    Election day (No class)

    Do your on-the-ground reporting
    • No class because of Election Day Coverage
  • While there are many tools and techniques for building data graphics, there is no magic visualization tool that will make a non-story worth telling.
    • Review of the midterm
    • The importance of good data in visualizations
    • How visualization can augment the Serial podcast
  • Tuesday, November 11

    Dirty data, cleaned dirt cheap

    One of the most tedious but important parts of data analysis is just cleaning and organizing the data. Being a good "data janitor" lets you spend more time on the more fun parts of journalism.
    • Dirty data
    • OpenRefine
    • Clustering
  • Thursday, November 13

    Guest speaker: Simon Rogers

    Simon Rogers, data editor at Twitter, talks about his work, how Twitter reflects how communities talk to each other, and the general role of data journalism.
    • Ellen, World Cup, and other masses of Twitter data
  • Tuesday, November 18

    What we say and what we do

    When the data doesn't directly reveal something obvious, we must consider what its structure and its metadata implies.
    • Proxy variables
    • Thanks Google for figuring out my commute
    • How racist are we, really?
    • How web sites measure us
  • Thursday, November 20

    Project prep and discussion

    Discussion of final projects before the Thanksgiving break.
  • Tuesday, November 25

    Thanksgiving break

    Holiday - no class
  • Thursday, November 27

    Thanksgiving break

    Holiday - no class
  • Tuesday, December 2

    Project wrapup

    Last-minute help on final projects.
  • Thursday, December 4

    Project Show-N-Tell

    In-class presentations of our final data projects.