- Working with KML files
- Intensity maps
- Visual joins and intersections
- A map of political change in the United States (due by next class)
- Install Sequel Pro or SQLite Manager (optional; due by next class)
Today we discuss more on the theory and practical use cases of mapping, and learn to (superficially) work with shapefile data.
Next week, when we are deep in the soul-crushing debugging of Structured Query Language, think of shape-mapping, particularly where borders intersect and overlap, as a visual representation of "inner" and "outer" joins.
Related tutorial: Fusion Tables Intensity Maps with Custom Shapes
Voronoi cholera map
Continuing from last lesson's mention of John Snow…
Below is an interactive version of John Snow's cholera map. The red dots represent the casualties, the orange dots represent the pumps.
The shaded region around each pump represents the area for which that pump is the closest one, as the crow flies. Thus, the red dots within each pump's shaded region represent the casualties that were nearer to that pump than to any other pump.
This is a simple Voronoi diagram to help better quantify the deaths of people whose main water source was the Broad Street pump.
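The idea behind the shaded regions can be sketched in a few lines of Python: assign each death to whichever pump is nearest in a straight line, then count the deaths per pump. This is only a rough illustration. The coordinates are made up on an arbitrary grid, "Pump B" and "Pump C" are hypothetical names, and none of it is the data or code behind the interactive map.

```python
from collections import Counter
from math import hypot

# Hypothetical coordinates on an arbitrary grid; NOT Snow's actual data.
pumps = {
    "Broad Street": (5.0, 5.0),
    "Pump B": (10.0, 2.0),
    "Pump C": (1.0, 9.0),
}
deaths = [(5.5, 4.5), (6.0, 6.0), (9.0, 2.5), (2.0, 8.0), (4.0, 5.5)]


def nearest_pump(point, pumps):
    """Return the name of the pump closest to `point`, as the crow flies."""
    x, y = point
    return min(
        pumps,
        key=lambda name: hypot(x - pumps[name][0], y - pumps[name][1]),
    )


# Each death "belongs" to the Voronoi region of its nearest pump.
deaths_per_pump = Counter(nearest_pump(d, pumps) for d in deaths)

for name, count in deaths_per_pump.most_common():
    print(name, count)
```

A Voronoi diagram draws the region boundaries explicitly, but the underlying question for each point is the same: which pump is closest?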
Technical note: There are a few technical problems with this map, particularly my poor implementation of the D3 voronoi function and how the "deaths" were geocoded (the stacked dots all belong to a particular building/geopoint)…but it's meant to be a rough demonstration of the voronoi concept.
More technical notes: The digitized death point data comes from Rusty Dodson of NCGIA/Santa Barbara. The code to draw the map was inspired by Waldo Tobler's work here. The Voronoi-drawing code is a crude alteration of Mike Bostock's demonstration.
John Snow's hand-drawn Voronoi diagram
The Voronoi map helps with visualizing each pump's "region of influence", but drawing the boundaries based on as-the-crow-flies distance misses an important nuance: lugging water containers, whether by hand or by wheeled cart, cannot be done as if you were a bird. In other words, people will gravitate to the water pump whose roads (or walkways) are most convenient, even if the crow-flies distance is farther.
This was a nuance that Dr. Snow captured in this lesser-known version of his cholera map (which is considered a Voronoi-type diagram, even though the concept had not been formalized under that name in Snow's time).
Via Steven Johnson in "The Ghost Map: The Story of London's Most Terrifying Epidemic":
After presenting to the Epidemiological Society, Snow had realized that his original map was still vulnerable to a miasmatic interpretation. Perhaps the concentration of deaths around the Broad Street pump was merely evidence that the pump was releasing noxious fumes into the air. And so Snow realized he needed a way to represent graphically the foot-traffic activity around the pump that he had so painstakingly reconstructed. He needed to show lives, not just deaths; he needed to show the way the neighborhood was actually traversed by its residents.
via the John Snow archive:
The key takeaway is that the raw geographic data (the location of the pumps and the location of the dead residents) was not enough to make a scientific argument. Snow had to understand how the dead lived: the hand-drawn loop around the Broad Street pump is based on his understanding of traffic patterns at the human level, giving a more precise illustration of how people accessed the Broad Street pump.
And while the deaths inside the boundary line are interesting, if not conclusive, evidence, the boundary line also makes it easy to identify counterexamples, i.e. anomalies that seem to contradict Dr. Snow's hypothesis.
These counterexamples underscore the strength of Dr. Snow's reporting and scientific method. There are at least two kinds of counterexamples that are interesting:
There are several buildings within the voronoi-boundary that have strikingly few deaths. Dr. Snow investigated these individually. An example from Tufte's retrospective analysis:
There is a brewer in Broad Street, near the pump, and on perceiving that no brewer's men were registered as having died of cholera, I called on Mr. Huggins, the proprietor. He informed me that there were above seventy workmen employed in the brewery, and that none of them had suffered from cholera, at least in a severe form…The men are allowed a certain quantity of malt liquor, and Mr. Huggins believes they do not drink water at all…There is a deep well in the brewery, in addition to the New River water.
There are clusters of deaths that are far from the Broad Street pump. In his investigation, Dr. Snow found that many residents willingly traveled a farther distance to the Broad Street pump because it was thought to have better-quality water.
So to continue hammering on the point: it's not the visualization, it's the reporting.
Near/inside a shape: Earthquakes and schools
via Center for Investigative Reporting: Palo Alto schools near seismic hazards
A good example of "joining" data: The shapefiles that represent the quicksand zones and fault lines are one dataset. The locations of schools are another dataset. Where the two intersect: that may be a story.
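To make the "inside a shape" idea concrete, here is a minimal sketch of a point-in-polygon test using the standard ray-casting approach. The hazard zone is a made-up rectangle and the school names and coordinates are hypothetical. Real shapefile work would normally lean on a GIS tool or library rather than hand-rolled geometry, and this is not how the CIR map was built.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a rightward ray from `point` crosses.

    An odd number of crossings means the point is inside.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical hazard zone (a simple rectangle) and hypothetical schools.
hazard_zone = [(0, 0), (4, 0), (4, 3), (0, 3)]
schools = {"School A": (1, 1), "School B": (5, 2), "School C": (3.5, 2.5)}

# The "join": keep only the schools whose location falls inside the zone.
schools_in_zone = sorted(
    name for name, loc in schools.items() if point_in_polygon(loc, hazard_zone)
)
print(schools_in_zone)
```

The list of schools that survives the test is the intersection of the two datasets, which is where the reporting starts.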
Outside a shape: Campaign contributions outside a district
via WNYC's John Keefe:
Still, no one can really claim to be pulling their support–at least financially–directly from the people they represent.
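The "outside a shape" question has a direct database analogue: rows in one table that have no match in another. Below is a rough sketch using Python's built-in sqlite3 module. The table names, donors, ZIP codes, and amounts are all invented for illustration, and matching on ZIP code is a simplifying assumption, not WNYC's method.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE district_zips (zip TEXT PRIMARY KEY);
    CREATE TABLE contributions (donor TEXT, zip TEXT, amount INTEGER);
    INSERT INTO district_zips VALUES ('10001'), ('10002');
    INSERT INTO contributions VALUES
        ('Donor A', '10001', 500),
        ('Donor B', '94301', 1000),
        ('Donor C', '10002', 250),
        ('Donor D', '60601', 750);
""")

# Inner join: only contributions whose ZIP matches a district ZIP.
inside = conn.execute("""
    SELECT c.donor FROM contributions AS c
    INNER JOIN district_zips AS d ON c.zip = d.zip
    ORDER BY c.donor
""").fetchall()

# Left join, keeping rows with no match: contributions from outside the district.
outside = conn.execute("""
    SELECT c.donor FROM contributions AS c
    LEFT JOIN district_zips AS d ON c.zip = d.zip
    WHERE d.zip IS NULL
    ORDER BY c.donor
""").fetchall()

print("inside:", inside)
print("outside:", outside)
```

The inner join corresponds to points that land inside the shape; the left join filtered on NULL corresponds to everything that falls outside it.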
In U.S. presidential elections, the most common way to show election results is via a state-by-state voting map. Via the NYT:
Reflect on why this map just works: if you're a long-time U.S. resident, you have a decent fix on what these shapes represent. In many cases, such as with "Florida", you're able to find it on this map much faster than you would in a standard list/table.
However, if you were completely unfamiliar with U.S. geography, this map would seem to illustrate an incredibly close political race, one in which "Red" seems to have won. That's because, without prior political knowledge, this kind of map ends up being a map not of political votes but of geographic area, such that states like Montana and Ohio have an outsized and undersized visual impact, respectively.
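A quick back-of-the-envelope calculation shows how large the mismatch is for Montana and Ohio. The figures below are approximate total land-plus-water areas and the 2012 electoral vote counts, rounded for illustration; treat them as rough inputs rather than authoritative data.

```python
# Approximate figures, rounded for illustration:
# area in square miles, electoral votes as of the 2012 election.
states = {
    "Montana": {"area_sq_mi": 147_000, "electoral_votes": 3},
    "Ohio": {"area_sq_mi": 44_800, "electoral_votes": 18},
}

# Square miles of map "ink" per electoral vote.
area_per_vote = {
    name: s["area_sq_mi"] / s["electoral_votes"] for name, s in states.items()
}

for name, value in area_per_vote.items():
    print(f"{name}: about {value:,.0f} square miles per electoral vote")

ratio = area_per_vote["Montana"] / area_per_vote["Ohio"]
print(f"Montana gets roughly {ratio:.0f}x the map area per electoral vote")
```

On a geographic map, each of Montana's electoral votes is drawn with far more area than each of Ohio's, which is the distortion a reader unfamiliar with U.S. politics would absorb.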
To underscore this point, look at the vote tally by county: unless you have an impressive memorization of county-level demographics, this map very strongly conveys the impression of a Republican victory:
The cartogram is one kind of compromise to deal with the population-vs-geography mismatch. In the map below, each electoral vote is given the same size, and then placed roughly where it corresponds in relative geography. The tradeoff is that, well, it's not as easy to reflexively locate many of the less-influential states, such as Iowa:
Illustrative policy map: gun permit policies
Here's a non-election way of using the U.S. map to quickly show the state of state laws:
NYT Graphic: Easing Restrictions on Gun Permits
NYT Graphic: Restoring Gun Rights
Illustrative policy map: failure-to-protect laws
We discussed BuzzFeed reporter Alex Campbell's investigation into failure-to-protect laws in the previous class. I'm including some notes and discussion here, as they are relevant for today's assignment.
Here's the illustrative U.S. map used in Campbell's story:
The deeper the red, the harsher the maximum sentence for a failure-to-protect conviction.
This is a decent example of using the U.S. map to illustrate the spread of policy, but I highlight this story's map in particular because BuzzFeed includes the spreadsheet used to build it. Here's what the sheet's layout looks like (split in half to better fit the page here):
Campbell's investigation also includes this sidebar with how cases were found, and the supporting spreadsheets are included. There's also a list of the kinds of data BuzzFeed asked for from each state.
In class, we saw at least two examples of using the United States map to illustrate the prevalence of policy: BuzzFeed’s investigation into failure-to-protect laws and NYT’s look at restrictions on gun permits.
Check out this tutorial: Fusion Tables Intensity Maps with Custom Shapes
And then make your own U.S. policy map. It can be like BuzzFeed’s, in which there are several colors for the several categories of severity. Or it can be like NYT’s, which places two shaded U.S. maps side by side to show how much policy has changed.
First, pick a policy. One example would be: states that allow gay marriage today versus 10 years ago. But don’t do that, because that’s been done plenty of times in the past week.
Then, make the map, which will involve making a spreadsheet similar to BuzzFeed’s example and then importing it into Fusion Tables. Or, you could try TileMill, in which case, you would handcode the MSS styles as appropriate. Whatever you feel most comfortable with.
This is not really a test of your mapping skills, as the state-level KML data is provided for you. Instead, it’s an exercise in reporting and organizing your notes. The fact that you organize it in a spreadsheet then makes it trivial to turn into an illustrative chart.
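As a rough illustration of why the spreadsheet does most of the work, here is a sketch that turns one value per state into a color category and writes the result as CSV, which a mapping tool could then match to the state shapes by name. The state names, sentence lengths, bucket cutoffs, and hex colors are all hypothetical placeholders, not figures from BuzzFeed's spreadsheet.

```python
import csv
import io

# Hypothetical rows: (state, maximum sentence in years). NOT real figures.
rows = [("State A", 1), ("State B", 10), ("State C", 25), ("State D", 0)]

# Upper bound of each bucket (in years) and its fill color, lightest to darkest.
# Both the cutoffs and the colors are arbitrary choices for this sketch.
BUCKETS = [
    (0, "#ffffff"),
    (5, "#fcbba1"),
    (15, "#fb6a4a"),
    (float("inf"), "#a50f15"),
]


def color_for(years):
    """Return the fill color of the first bucket whose upper bound covers `years`."""
    for upper, color in BUCKETS:
        if years <= upper:
            return color


# Write the spreadsheet, with the derived color column, as CSV text.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["state", "max_sentence_years", "fill_color"])
for state, years in rows:
    writer.writerow([state, years, color_for(years)])

csv_text = out.getvalue()
print(csv_text)
```

The reporting decides what goes in the value column and where the cutoffs fall; the coloring itself is mechanical.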
Time for some database fun. The lab computers have Sequel Pro installed. If you want to do database work at home, it’s up to you to install this software on your own machine. Here’s a guide to get you started.
Note: You should probably at least try getting SQLite Manager to work on your own computer. It’s fairly easy to get up and running, and it lets you build datamaps with the SQLite format (which you might want to do later on).
An introduction to public affairs reporting and the core skills of using data to find and tell important stories.
- Count something interesting
- Make friends with math
- The joy of text
- How to do a data project
Just because it's data doesn't make it right. But even when all the available data is flawed, we can get closer to the truth with mathematical reasoning and the ability to make comparisons, small and wide.
- Fighting bad data with bad data
- Baltimore's declining rape statistics
- FBI crime reporting
- The Uber effect on drunk driving
- Pivot tables
Learn how to take data in your own hands. There are two kinds of databases: the kind someone else has made, and the kind you have to make yourself.
- The importance of spreadsheets
- Counting murders
- Making calls
- A crowdsourced spreadsheet
Phillip Reese of the Sacramento Bee will discuss how he uses data in his investigative reporting projects.
- Phillip Reese speaks
Mapping can be a dramatic way to connect data to where readers are and to what they recognize.
- Why maps work
- Why maps don't work
- Introduction to Fusion Tables and TileMill
A continuation of learning mapping tools, with a focus on borders and shapes
- Working with KML files
- Intensity maps
- Visual joins and intersections
The first in several sessions on learning SQL for the exploration of large datasets.
- MySQL / SQLite
- Select, group, and aggregate
- Where conditionals
- SFPD reports of larceny, narcotics, and prostitution
- Babies, and what we name them
The ability to join different datasets is one of the most direct ways to find stories that have been overlooked.
- Inner joins
- One-to-one relationships
- Our politicians and what they tweet
Sometimes, what's missing is more important than what's there. We will cover more complex join logic to find what's missing from related datasets.
- Left joins
- NULL values
- Which Congressmembers like Ellen Degeneres?
A casual midterm covering the range of data analysis and programming skills acquired so far.
- A midterm on SQL and data
- Data on military surplus distributed to U.S. counties
- U.S. Census QuickFacts
The American democratic process generates loads of interesting data and insights for us to examine, including who is financing political campaigns.
- Polling and pollsters
- Following the campaign finance money
- Competitive U.S. Senate races
With Election Day coming up, we examine the practices of polling as a way to understand various scenarios of statistical bias and error.
- Statistical significance
- Poll reliability
Do your on-the-ground reporting
- No class because of Election Day Coverage
While there are many tools and techniques for building data graphics, there is no magic visualization tool that will make a non-story worth telling.
- Review of the midterm
- The importance of good data in visualizations
- How visualization can augment the Serial podcast
One of the most tedious but important parts of data analysis is just cleaning and organizing the data. Being a good "data janitor" lets you spend more time on the more fun parts of journalism.
- Dirty data
Simon Rogers, data editor at Twitter, talks about his work, how Twitter reflects how communities talk to each other, and the general role of data journalism.
- Ellen, World Cup, and other masses of Twitter data
When the data doesn't directly reveal something obvious, we must consider what its structure and its metadata implies.
- Proxy variables
- Thanks Google for figuring out my commute
- How racist are we, really?
- How web sites measure us
Discussion of final projects before the Thanksgiving break.
Holiday - no class
Holiday - no class
Last-minute help on final projects.
In-class presentations of our final data projects.