- A midterm on SQL and data
- Data on military surplus distributed to U.S. counties
- U.S. Census QuickFacts
- Do the Midterm Due Thursday, October 30
Jump to the full details on homework assignments
The notes for this section consist entirely of the description for the dataset used for the midterm, example queries to help you understand the dataset, and then a few questions for you to tackle on your own.
Update: Here are the answers
Regarding the time-series (or histogram) chart, by political party, of Congressmembers and the number of Tweets mentioning "ebola" as of October 20, 2014.
SELECT members.first_name, members.last_name, members.party, members.current_role, members.state, social_accounts.twitter_screen_name, tweets.text, DATE(tweets.created_at) as tweet_date FROM members JOIN social_accounts ON members.bioguide_id = social_accounts.bioguide_id JOIN tweets ON social_accounts.twitter_screen_name = tweets.screen_name WHERE tweets.text LIKE '%ebola%' ORDER BY tweet_date
I've made a couple of editorial choices: the database contains only 2 tweets (both by Democrats) in the year 2013, and only 2 tweets by an Independent. The outliers would skew the chart (i.e. create a lot of empty space) so I've filtered them out, so that the chart only shows 2014 tweets by Democrats and Republicans.
Along the x-axis, I've chosen to do the week number (which can be found using Google Spreadsheet's
WEEKNUM function, or MySQL's
WEEK function) derived from each tweet's
The resulting chart:
Note: Alternatively, if you wanted to reduce the amount of Pivot Table work, you could've done the aggregations inside of SQL.
SELECT members.party, WEEK(tweets.created_at) AS tweet_week, YEAR(tweets.created_at) AS tweet_year, COUNT(*) AS tweet_count FROM members JOIN social_accounts ON members.bioguide_id = social_accounts.bioguide_id JOIN tweets ON social_accounts.twitter_screen_name = tweets.screen_name WHERE tweets.text LIKE '%ebola%' GROUP BY party, tweet_year, tweet_week ORDER BY party, tweet_year, tweet_week
For SQLite (remember the list of format strings for the
SELECT members.party, strftime('%W', tweets.created_at) AS tweet_week, strftime('%Y', tweets.created_at) AS tweet_year, COUNT(*) AS tweet_count FROM members JOIN social_accounts ON members.bioguide_id = social_accounts.bioguide_id JOIN tweets ON social_accounts.twitter_screen_name = tweets.screen_name WHERE tweets.text LIKE '%ebola%' GROUP BY party, tweet_year, tweet_week ORDER BY party, tweet_year, tweet_week
The following section is all about the midterm and the data we're using.
Download the data:
The 1033 Program
The 1033 Program was established to allow the U.S. Defense Dept. to give away its surplus equipment to U.S. law enforcement agencies. More than $5.1 billion of military gear has been distributed to the nearly 8,000 law enforcement agencies who participate in the program.
Congress initially authorized the transfer of military gear for the War on Drugs. The 1033 program was later open to any U.S. law enforcement agency, though preference is given to requests related to counter-drug and counter-terrorism requests.
The program has, of late, come under scrutiny because of photos of heavily-equipped police during the civil unrest in Ferguson during 2014.
NPR has some documentation arising from their analysis of the federal 1033 Program data and statewide data that they've acquired. You may want to read it to figure out what some of the data columns mean.
The California Governor's Office of Emergency Services has posted a helpful 1033 Program page in which you can find links to more background information and data.
MuckRock has been making public records requests for additional data from each state. I haven't included their tables into this midterm, but their reporting is well-worth reading to understand the issues and context of the 1033 program.
The five data tables in the midterm database
- The SQL version.
The SQLite version
leso - the main table of interest. It contains the list of military surplus items distributed by the 1033 program to law enforcement agencies from January 2006 to July 2014. It is derived from the official spreadsheet maintained by the Law Enforcement Support OFfice, which you can download in Excel format here.
county_ansi - County names and FIPS codes. I've "cleaned" up the
countyfield so that it matches the ways counties were (mis)spelled in the
lesotable. Read more about ANSI and FIPS here.
psc - A list of Product Service Codes and descriptions, from the Federal Procurement Data System. You may find it useful for browsing the general categories of things that the 1033 program distributes. Read more about PSCs here.
- census_quickfacts - A variety of U.S. Census demographic and population statistics per county. Please read the DataDict.txt file provided on the U.S. Census's overview page.
Making sense of the U.S. Census QuickFacts table
To reiterate: the
census_quickfacts table is not going to make much sense because the headers are things like
AGE775213 (Persons 65 years and over, percent, 2013) and
RHI725213 (Hispanic or Latino, percent, 2013). So you must read the Census Quickfacts Data Dictionary to make sense of it.
Technical data munging note
Like every data set we've dealt with so far, the Census and LESO data is what we call, "messy". For example, the original LESO spreadsheet, which I downloaded (direct link) via the California state site, comprised separate lists divided by alphabetical ordering of the states. So I had to manually copy and paste the sheets together to make one sheet.
The Census QuickFacts table was not in itself "messy"; the problem is that the LESO data only refers to state and county names, in all uppercase, and the LESO data does not follow the ANSI standard, apparently. For example, "Bristol City, Virginia" is referred in LESO as just "BRISTOL". Whereas "Baltimore city" is correctly keyed as "BALTIMORE CITY". The inconsistencies were without rhyme or reason, leaving me to apply custom queries to fix up at least several hundreds of rows.
The upshot: if you think you have a cool story from this midterm data, don't run with it just yet. We'll want to doublecheck that I didn't mess up my data munging in such a way that it impacts your queries.
Most of the work will involve just inspecting the tables and what they contain. Here are some example queries to demonstrate what's in them.
Quick syntax note
When referring to a column that has a space in the name, such as
Item Cost, you must enclose the table name either inside of backticks (the little used key in the top-left corner of your keyboard, under the tilde sign), `
SELECT `Item Cost` FROM leso
This will also work:
SELECT leso.`Item Cost` FROM leso
And this if you really want to be explicit about table names:
SELECT `leso`.`Item Cost` FROM leso
SELECT county_ansi.state, county_ansi.county, AGE775213 as senior_pop, county_boundaries.geometry FROM county_ansi INNER JOIN census_quickfacts ON county_ansi.fips = census_quickfacts.fips INNER JOIN county_boundaries ON county_ansi.fips = county_boundaries.fips
Exporting the result of this table to Fusion Tables creates a map like this:
Deriving a column totaling the LESO acquisition costs
leso table includes a
Quantity column and an
Acquisition Cost. If you want to find the total cost per line item, then you need a column in which
Quantity is multiplied by
SELECT `Item Name`, UI, Quantity,`Acquisition Cost`, (Quantity * `Acquisition Cost`) AS totes_cost FROM leso ORDER BY Quantity DESC LIMIT 10
|Item Name||UI||Quantity||Acquisition Cost||totes_cost|
|5 56 EXPENDED BALL BRASS||LB||7700||1.00||7700.00|
If you want, you can create a new column named
totes_cost, specify it as an
DECIMAL and then update it with the calculation:
UPDATE leso SET totes_cost = (Quantity * `Acquisition Cost`);
This way, you don't have to keep doing the multiplication for every query.
What categories of military surplus aren't being distributed?
Not every kind of military surplus is requested or eligible to be shipped out to your local police department. To find out which, do a
LEFT JOIN from the
psc to the
SELECT psc.`PSC CODE`, psc.`PRODUCT AND SERVICE CODE FULL NAME`, psc.`PRODUCT AND SERVICE CODE NOTES`, psc.`PRODUCT AND SERVICE CODE INICLUDES` FROM psc LEFT JOIN leso ON psc.`psc code` = leso.psc_code WHERE leso.psc_code IS NULL /* include only PSC Codes that are 4 characters long */ AND LENGTH(psc.`PSC CODE`) = 4
Apparently, nuclear bombs and flamethrowers are not yet eligible to be acquired through the 1033 Program:
|PSC CODE||PRODUCT AND SERVICE CODE FULL NAME||PRODUCT AND SERVICE CODE NOTES||PRODUCT AND SERVICE CODE INICLUDES|
|1025||Guns, over 150 mm through 200 mm Includes Firing Platforms; Mounts; Gun Shields.|
|1030||Guns, over 200 mm through 300 mm Includes Gun Yokes; Rammers; Reflectors.|
|1035||Guns, over 300 mm||Breech Mechanisms; Training Gears; Power Drives.|
|1040||Chemical Weapons and Equipment||Flame Throwers; Smoke Generators.|
|1045||Launchers, Torpedo and Depth Charge||Depth Charge Tracks; Torpedo Tubes.|
|1055||Launchers, Rocket and Pyrotechnic||Airborne Rocket Launchers adaptable to guided missile use.|
|1070||Nets and Booms, Ordnance||This class includes nets and booms for harbor defense only.|
|1075||Degaussing and Mine Sweeping Equipment|
|1105||Nuclear Bombs||This class includes nuclear weapons (including bombs), which are designed to be dropped from an aircraft.||Ballistic cases, tail assemblies, retardation devices, and other peculiar components which are not classifiable elsewhere.|
|1110||Nuclear Projectiles||This class includes nuclear weapons which are designed to be propelled from a recoilless rifle, gun, howitzer, or the like, and which are not designed to be self propelled.||Ogive sections, body sections, bases, and other peculiar components which are not classifiable elsewhere.|
For all of these questions, besides giving the answer, please include the SQL quer(ies) and other spreadsheet/data steps you took to get to the answer.
- Query the database and create a list of every county (with state) that has acquired at least one mine-resistant armored vehicle. Then map that list (Fusion Tables will probably be the easiest method). The map but not exactly the same, as the list behind this NYT interactive map of armored vehicles.
- Query the database and find the most expensive single item that can be acquired through the 1033 program. Then generate a list of all counties that has so far acquired at least one of these things (you'll likely do two queries to answer this question).
- Query the database to get the top 10 counties ordered by the total number of guns acquired through the 1033 program. Your answer should look very similar to what NPR found under the Total Guns Acquired graph (e.g. 3,452 guns for Los Angeles)
- Query the database to get the top 10 counties ordered by number of guns acquired per 1,000 people using the Census's 2013 population estimate. Again, your answers should look similar (but not exact) to what NPR found, e.g. 28 guns/1,000 people in Franklin, KY.
- Create a time-series/histogram showing something you find interesting in the 1033 Program data. Examples: Number of gas masks versus night-goggles distributed by year. Monetary value of pants and trousers versus rifles, by year.
- Find all of the counties that have not acquired a single thing in the
lesotable, and then map those counties. If you order your list by population, the top two counties should be Kings County and Bronx County in New York (for reasons that should make sense if you have lived in New York).
- Find the county that has acquired the most from the 1033 Program in terms of total acquisition cost and generate a list of the items that county has acquired.
Write a 300-word story memo on an interesting query (or queries) of your own. The memo should include examples/results of a data query and why you think said results are newsworthy or worth further investigation. Your memo, ideally, would involve some additional research to see if anyone else has found anything related to your inquiry. A chart/graphic is optional.
Example story ideas:
- Finding the smallest counties (by population) that have acquired the most heavy equipment despite no obvious need for such equipment. Is that equipment in use, and how? What is the maintenance and training cost of the equipment?
- Which counties have acquired a high amount of surplus equipment in earlier years (2006 to 2007) but very little, or none at all, in recent years?
- What kinds of equipment have only recently been distributed through the 1033 Program, but are currently in seemingly high demand?
- What are the surplus items that have no acquisition cost? And is the lack of cost consistent across all records of those items (and similar items)? Is the lack of cost because of sloppy data entry? Or because of the type of item?
Submit the answers to these questions, including all relevant queries, links, screenshots, etc., as a Markdown file named midterm.md in your private Github repo.
- Create a file named
midterm.mdin your private Github repo
- Answer the question in the lecture notes.
- Create a file named
An introduction to public affairs reporting and the core skills of using data to find and tell important stories.
- Count something interesting
- Make friends with math
- The joy of text
- How to do a data project
Just because it's data doesn't make it right. But even when all the available data is flawed, we can get closer to the truth with mathematical reasoning and the ability to make comparisons, small and wide.
- Fighting bad data with bad data
- Baltimore's declining rape statistics
- FBI crime reporting
- The Uber effect on drunk driving
- Pivot tables
Learn how to take data in your own hands. There are two kinds of databases: the kind someone else has made, and the kind you have to make yourself.
- The importance of spreadsheets
- Counting murders
- Making calls
- A crowdsourced spreadsheet
Phillip Reese of the Sacramento Bee will discuss how he uses data in his investigative reporting projects.
- Phillip Reese speaks
Mapping can be a dramatic way to connect data to where readers are and to what they recognize.
- Why maps work
- Why maps don't work
- Introduction to Fusion Tables and TileMill
A continuation of learning mapping tools, with a focus on borders and shapes
- Working with KML files
- Intensity maps
- Visual joins and intersections
The first in several sessions on learning SQL for the exploration of large datasets.
- MySQL / SQLite
- Select, group, and aggregate
- Where conditionals
- SFPD reports of larceny, narcotics, and prostitution
- Babies, and what we name them
The ability to join different datasets is one of the most direct ways to find stories that have been overlooked.
- Inner joins
- One-to-one relationships
- Our politicians and what they tweet
Sometimes, what's missing is more important than what's there. We will cover more complex join logic to find what's missing from related datasets.
- Left joins
- NULL values
- Which Congressmembers like Ellen Degeneres?
A casual midterm covering the range of data analysis and programming skills acquired so far.
- A midterm on SQL and data
- Data on military surplus distributed to U.S. counties
- U.S. Census QuickFacts
The American democratic process generates loads of interesting data and insights for us to examine, including who is financing political campaigns.
- Polling and pollsters
- Following the campaign finance money
- Competitive U.S. Senate races
With Election Day coming up, we examine the practices of polling as a way to understand various scenarios of statistical bias and error.
- Statistical significance
- Poll reliability
Do your on-the-ground reporting
- No class because of Election Day Coverage
While there are many tools and techniques for building data graphics, there is no magic visualization tool that will make a non-story worth telling.
- Review of the midterm
- The importance of good data in visualizations
- How visualization can augment the Serial podcast
One of the most tedious but important parts of data analysis is just cleaning and organizing the data. Being a good "data janitor" lets you spend more time on the more fun parts of journalism.
- Dirty data
Simon Rogers, data editor at Twitter, talks about his work, how Twitter reflects how communities talk to each other, and the general role of data journalism.
- Ellen, World Cup, and other masses of Twitter data
When the data doesn't directly reveal something obvious, we must consider what its structure and its metadata implies.
- Proxy variables
- Thanks Google for figuring out my commute
- How racist are we, really?
- How web sites measure us
Discussion of final projects before the Thanksgiving break.
Holiday - no class
Holiday - no class
Last-minute help on final projects.
In-class presentations of our final data projects.