- MySQL / SQLite
- Select, group, and aggregate
- Where conditionals
- SFPD reports of larceny, narcotics, and prostitution
- Babies, and what we name them
- Read the SQL tutorials Due by next class
- Investigate your own name Due by next class
Jump to the full details on homework assignments
Learning Structured Query Language, abbreviated as SQL, will be the most technically difficult part of this course. But for good reason: if you can read and write SQL, then not only have you learned a whole new language, you have also inadvertently become a programmer, one who is (clumsily) on the path of telling computers what to do rather than the other way around.
I've written a series of tutorials for this class on SQL. The good news is that if you've mastered spreadsheets and pivot tables, then you're already clever enough to know all the concepts that underlie our use of SQL. The bad news is, you certainly won't feel very clever as you diagnose one obtuse syntax error after another. But that can be overcome with the kind of repetition and practice you need to master any new language.
SQL in the wild
We've looked at important data that was scarce and unimportant data that was plentiful. But sometimes data that is both important and plentiful is just sitting there unexamined because no one knows quite knows what to do with it. In their investigation, "Unfit for Duty", Sarasota Herald-Tribune reporters Matt Doig's and Anthony Cormier's first step was to analyze Florida's police misconduct database:
This was a case where the government had this wonderful, informative dataset and they weren’t using it at all except to compile the information. I remember talking to one person at an office and saying: “How could you guys not know some of this? In five minutes of (SQL) queries you know everything about these officers?” They basically said it wasn't their job. That left a huge opportunity for us.
Why is SQL so frustrating to learn? It being a language, learning it involves a fair amount of tedious memorization and practice. The level of existential frustration you may experience will be similar to what it feels like to spend years learning French only to be able to say things less intelligently than a 5-year-old.
Why learn a whole new language? Sometimes, feelings are just more succinctly expressed in a particular language, such as Sacré bleu or J'adore le fromage dans la bibliothèque.
SQL lets us express our data desires in a structured way:
SELECT Descript, Date, Time, Location FROM sfpd_incidents WHERE Category="ASSAULT" AND Date > "2012";
Which results in a data table like this:
|BATTERY||2012-01-01||00:25||200.0 Block of MARKET ST|
|AGGRAVATED ASSAULT WITH A KNIFE||2012-01-01||02:07||600.0 Block of EDDY ST|
|AGGRAVATED ASSAULT WITH BODILY FORCE||2012-01-01||01:20||300.0 Block of MASON ST|
|INFLICT INJURY ON COHABITEE||2012-01-01||01:20||300.0 Block of MASON ST|
|INFLICT INJURY ON COHABITEE||2012-01-01||02:00||DRUMM ST / SACRAMENTO ST|
|BATTERY||2012-01-01||02:05||200.0 Block of KING ST|
|BATTERY||2012-01-01||22:22||400.0 Block of GEARY ST|
|BATTERY||2012-01-01||03:35||800.0 Block of MARKET ST|
|BATTERY||2012-01-01||12:50||1500.0 Block of OAKDALE AV|
|BATTERY||2012-01-01||08:38||800.0 Block of BRYANT ST|
Whereas with user-friendly spreadsheet software, we have to digest a vague instruction set, which assumes you've memorized the interface of a program such as Excel:
- Open up the spreadsheet named
- Open the Filter menu and filter for rows in which the
Categorycolumn is equal to
"ASSAULT"and in which the
Datecolumn contains a value greater than
- Hide all of the columns except
No, not at all. Part of the reason that SQL is very good at working with massive datasets is because it doesn't try to do the fancy kind of things that spreadsheets do. For example, you won't be writing SQL to (directly) generate a chart. Instead, you have to export from the database into a spreadsheet (or other visualization software).
Spreadsheets are great for manipulating, re-organizing, and visualizing data – but all those features make them inefficient for pure data-querying. A database, on the other hand, won't go out of its way to make things easy for you. But what it does, it does very well and very quickly, whether you're working with a thousand or a billion rows. http://www.nytimes.com/2013/04/19/opinion/krugman-the-excel-depression.html
Exploring SFPD data with SQL and spreadsheets
Below is a demonstration of both the thought process in exploring a database and the actual queries. I'm using the Sequel Pro GUI for Mac and a MySQL dump of the incident reports from the San Francisco Police Department's CABLE system.
I won't be using just SQL here; the standard spreadsheet techniques are still important for analysis and visualization.
After the SQL dump has been imported, we can retrieve and display all 1.49 million records with this query:
SELECT * FROM sfpd_incidents
The result is not much different than what we get when opening a spreadsheet:
Since we're not interested in mapping the records (just yet), we can be selective in which columns to retrieve. In this query, I ignore the location and coordinate fields:
SELECT Category, Descript, Date, Time FROM sfpd_incidents
|FRAUD||FORGERY, CREDIT CARD||2003-02-18||16:30|
|LARCENY/THEFT||GRAND THEFT PICKPOCKET||2003-02-18||16:05|
|DRUG/NARCOTIC||SALE OF BASE/ROCK COCAINE||2003-02-18||17:00|
|OTHER OFFENSES||PROBATION VIOLATION||2003-02-18||17:00|
|DRUNKENNESS||UNDER INFLUENCE OF ALCOHOL IN A PUBLIC PLACE||2003-03-31||22:08|
To filter for a specific category of crime, we use the
SELECT Category, Descript, Date, Time FROM sfpd_incidents WHERE Category = "VANDALISM"
|VANDALISM||MALICIOUS MISCHIEF, VANDALISM||2003-05-10||08:18|
|VANDALISM||MALICIOUS MISCHIEF, GRAFFITI||2003-01-13||23:44|
|VANDALISM||MALICIOUS MISCHIEF, BREAKING WINDOWS||2003-02-18||17:00|
|VANDALISM||MALICIOUS MISCHIEF, BREAKING WINDOWS||2003-02-18||20:16|
|VANDALISM||MALICIOUS MISCHIEF, BREAKING WINDOWS||2003-02-18||20:30|
|VANDALISM||MALICIOUS MISCHIEF, VANDALISM||2003-02-18||16:45|
|VANDALISM||MALICIOUS MISCHIEF, GRAFFITI||2003-02-17||07:00|
|VANDALISM||MALICIOUS MISCHIEF, GRAFFITI||2003-02-18||22:55|
|VANDALISM||MALICIOUS MISCHIEF, GRAFFITI||2003-02-18||23:58|
|VANDALISM||MALICIOUS MISCHIEF, VANDALISM||2003-02-19||00:15|
A common inquiry of crime statistics is: just how many of each kind of crime was reported?
In SQL, we use a
GROUP BY clause to aggregate the records by
Category, which is similar to a simple pivot table
SELECT Category, COUNT(*) FROM sfpd_incidents GROUP BY Category
|DRIVING UNDER THE INFLUENCE||4052|
We can order this list by count of reports in descending order and show only the top 5 with the
ORDER BY and
LIMIT clauses, respectively:
SELECT Category, COUNT(*) AS ct FROM sfpd_incidents GROUP BY Category ORDER BY ct DESC LIMIT 5
So from 2003 to 2013, these were the five most reported categories of incidents to the SFPD:
Change over time
Aggregate counts over long periods of time aren't particularly helpful. We can group by year to get the count of reports per year:
SELECT YEAR(Date) AS yr, COUNT(*) FROM sfpd_incidents GROUP BY yr ORDER BY yr
The count of reports by year:
Again, not particularly interesting. We don't know if one kind of crime went up while another went down. So we can make our grouping more granular: the count of incident reports, by year, and by category:
SELECT Category, YEAR(Date) as yr, COUNT(*) as count FROM sfpd_incidents GROUP BY yr, Category ORDER BY yr, Category
The result contains 403 rows, one for each combination of year and category of report:
|DRIVING UNDER THE INFLUENCE||2003||315|
Reduce the clutter
Let's step back a bit. Instead of looking at every year, let's look at the 5-year change in reports, i.e. 2008 and 2013:
SELECT Category, YEAR(Date) as yr, COUNT(*) as count FROM sfpd_incidents GROUP BY yr, Category HAVING yr = 2008 OR yr = 2013 ORDER BY yr, Category
The result is 74 rows:
|SEX OFFENSES, NON FORCIBLE||2013||13|
A time for pivoting
While this is a more manageable list, it's still not easy to discern which categories have changed the most from 2008 to 2013.
At this point, it's worth remembering why we turned to SQL in the first place: because our spreadsheet software couldn't handle an import of 1.5 million rows (Excel caps out at 1,048,576 rows). By using the
GROUP BY aggregate clause, we could reduce the data rows enough to then export into a spreadsheet right off the bat. In fact, the kind of grouping and filtering we've just accomplished may have been easier in pivot tables.
So again, there is no one perfect data tool: if you feel more comfortable organizing data in a spreadsheet, then it's fine to use a database simply to open a large dataset and export the parts you need.
Now that we have 74 rows of aggregate counts (for the year 2008 and 2013), we can simply copy and paste them into a spreadsheet, and arrange them so that we can do the "subtract one cell from the other" formulas that we're familiar with.
Pasting the list directly into a spreadsheet, and then doing a pivot table with
Category as Rows and
year as Columns, and then a SUM of the
count, will generate something that looks like this:
|DRIVING UNDER THE INFLUENCE||408||429|
Then we add two columns to calculate the change:
2013column minus the
Change pct: the
Changecolumn, divided by the
2008column. Then multiplied by 100 and rounded to make for a nice number.
When sorting the list by
Change Pct, here are the top 10 categories based on drop in incidents from 2008 to 2013
Reversing that sort, we can get the top 10 categories according to increase in incidents from 2008 to 2013:
|DRIVING UNDER THE INFLUENCE||408||429||21||5|
Watch me write out the query, execute it, and paste the results into a Google Spreadsheet to find the change from 2008 to 2013 in incident reports. In this demo, I don't use Pivot Tables, but just do some extra cutting-and-pasting as needed:
Preview of JOINs
Note: This is just a technical aside, for those of you who wonder if there's a way to avoid spreadsheets altogether.
We can actually do this all in a SQL query with a
JOIN clause and subqueries, which are two intermediate concepts we haven't yet learned. But here's a preview of the fun that will be:
SELECT ty.Category, COUNT(1) AS count_2008, ty.count_2013, (ty.count_2013 - COUNT(1)) AS `Change`, /* the difference in count */ ((ty.count_2013 - COUNT(1)) / COUNT(1)) AS `Change pct` /* the difference expressed as a ratio of year 2008 incidents*/ FROM sfpd_incidents AS tx INNER JOIN /* create a new table of just 2013 incidents, name it `ty` */ ( SELECT Category, COUNT(1) AS count_2013 FROM sfpd_incidents WHERE YEAR(Date) = 2013 GROUP BY Category ) AS ty ON tx.Category = ty.Category WHERE YEAR(tx.Date) = 2008 GROUP BY ty.Category ORDER BY `Change pct`
Which results in:
Pondering the numbers
Looking at the change in incident reporting over 2008 to 2013 brings up a few potential angles of investigation.
An increase in larceny
Among all the categories,
"LARCENY/THEFT" has the biggest jump in numbers of reports: 25,797 to 35,661, in 2008 and 2013, respectively. Let's run a query on the database to see that jump compared to the years pre-2008:
SELECT YEAR(Date) as yr, COUNT(*) as count FROM sfpd_incidents WHERE Category = 'LARCENY/THEFT' GROUP BY yr ORDER BY yr
Graphing that table results in:
My first guess would have been to attribute the drastic rise in thefts to the widespread adoption of smartphones. New York Mayor Michael Bloomberg blamed an apparent spike in 2012 crime to Apple lovers:
Major crime in New York City inched up this year, and Mayor Michael R. Bloomberg on Friday fingered the culprit: too many iPhones and iPads were being swiped…
“If you just took away the jump in Apple, we’d be down for the year,” said Marc La Vorgna, the mayor’s press secretary.
On the radio, Mr. Bloomberg said that Apple products appeared to be the preference for many thieves, noting that he was not including thefts of competing devices, like the Samsung Galaxy, in his count.
From 2011 to 2012, theft reports jumped from 25,161 to 30,119, so San Francisco may have experienced the same trend, though a closer look at the subcategories of theft (i.e. theft from auto, pickpocketing, grand theft, etc.) is necessary.
A decrease in narcotics
Let's turn our attention to categories of reports that have decreased and give credit where credit is due. According to our query,
DRUG/NARCOTIC reports have dropped by 4,780 (42%).
Of course, a lot has happened since 2008, including a relaxation of laws and attitudes towards cannabis, in California and across the nation. Is this the main cause of the drop in
DRUG/NARCOTIC reports from 2008 to 2013 in San Francisco?
The easiest way to confirm this is to group the data reports by subcategory, or the
SELECT Descript, YEAR(Date) as yr, COUNT(*) as count FROM sfpd_incidents WHERE Category = 'DRUG/NARCOTIC' GROUP BY yr, Descript HAVING yr = 2008 OR yr = 2013 ORDER BY yr, Descript
Using the previous strategy of pasting the results into a pivot table and creating a spreadsheet of the calculated changes, here are the top subcategories of narcotics reports, sorted by change in count:
|POSSESSION OF BASE/ROCK COCAINE||1617||393||-1224|
|SALE OF BASE/ROCK COCAINE||1270||245||-1025|
|POSSESSION OF NARCOTICS PARAPHERNALIA||2359||1353||-1006|
|POSSESSION OF BASE/ROCK COCAINE FOR SALE||914||284||-630|
|POSSESSION OF MARIJUANA||1108||764||-344|
|SALE OF MARIJUANA||391||133||-258|
|POSSESSION OF MARIJUANA FOR SALES||571||350||-221|
|SALE OF CONTROLLED SUBSTANCE||233||78||-155|
|SALE OF HEROIN||174||43||-131|
|POSSESSION OF CONTROLLED SUBSTANCE||404||296||-108|
|POSSESSION OF HEROIN FOR SALES||199||120||-79|
|POSSESSION OF CONTROLLED SUBSTANCE FOR SALE||251||173||-78|
|POSSESSION OF COCAINE||267||214||-53|
|SALE OF COCAINE||70||20||-50|
Marijuana-related reports do drop, but not as much as cocaine-related reports.
As always, our strategy should involve flipping our perspective: if cocaine-related reports have dropped, has any other kind of report increased?
Here is the table sorted in reverse-order of change:
|POSSESSION OF METH-AMPHETAMINE||383||956||573|
|POSSESSION OF METH-AMPHETAMINE FOR SALE||129||266||137|
|TRANSPORTATION OF METH-AMPHETAMINE||24||49||25|
|POSSESSION OF OPIATES||19||40||21|
|UNDER INFLUENCE OF DRUGS IN A PUBLIC PLACE||40||53||13|
Meth-related reports have increased by nearly 140%, though the change in terms of actual number of reports (~700) is significantly smaller than the change in cocaine-related reports (~3,000).
Investigating a drop in prostitution reports
One more category of reports is worth looking at: while
DRUG/NARCOTICS reports dropped by 42 percent from 2008 to 2013,
PROSTITUTION reports dropped by 84 percent. Like drug-related offenses, prostitution is a category of crime in which attitudes have seemingly shifted. In 2008, more than 40 percent of San Francisco residents voted in favor of decriminalizing prostitution. Though the measure ultimately failed (59 percent opposed), the political debate was heated, thanks to the measure's surprising popularity:
SAN FRANCISCO — When Proposition K was added to Tuesday’s ballot, many people likely snickered at the possibility that San Francisco might take its place alongside such prostitute-friendly havens as Amsterdam and a few rural counties in nearby Nevada.
But this week, it became readily apparent that city officials are not laughing anymore about the measure, which would effectively decriminalize the world’s oldest profession in San Francisco. At a news conference on Wednesday, Mayor Gavin Newsom and other opponents seemed genuinely worried that Proposition K might pass.
“This is not cute. This is not fanciful,” Mr. Newsom said, standing in front of the pink-on-pink facade of a closed massage parlor in the Tenderloin district. “This is a big mistake.”
Since 2008, laws on prostitution haven't changed as significantly as they have for marijuana usage. But technology has most certainly affected all parts of sex and society, both legal and illegal activity, and maybe this has impacted the prostitution business? Or perhaps this drop in reports is simply a result of a coding policy. Let's query the dataset for possible explanations.
For reference's sake, I've posted a spreadsheet of the more than 12,000 incident reports categorized as "PROSTITUTION" from 2003 to 2013.
Visualizing location of incidents
Even in cities where prostitution is illegal, the term "red light district" is used to refer where a concentration of such activity may be found. So instead of looking at bar graphs, let's use the location data in the SFPD incident reports to get an assessment of how the geography of activity has changed, if at all.
The SQL query here is simple, because we're only concerned with getting data in a form usable by a mapping application.
SELECT YEAR(Date) as yr, Descript, Date, Time, Y AS latitude, X AS longitude FROM sfpd_incidents WHERE Category = "PROSTITUTION" HAVING yr = 2008 OR yr = 2013
Using TileMill, we can generate a quick map. Here, the incidents in 2008 are orange, the incidents in 2013 are blue:
Here are the two years side-by-side:
Ignoring my crude mapping skills, we can at least tell that the number of reports may have decreased – the visual confusion is partly because records will be geocoded to a specific corner, so many of them will sit atop the exact same coordinate – but the general location of prostitution reports hasn't changed significantly.
As in the drug reports, it's worth taking a look at the
SELECT Descript, Year(Date) as yr, COUNT(1) as c FROM sfpd_incidents WHERE Category = 'PROSTITUTION' GROUP BY yr, Descript HAVING yr = 2008 OR yr = 2013 ORDER BY yr, Descript
|INMATE/KEEPER OF HOUSE OF PROSTITUTION||8||0||-8||-100|
|SOLICITS TO VISIT HOUSE OF PROSTITUTION||987||1||-986||-99.9|
|SOLICITS LEWD ACT||12||1||-11||-91.7|
|ENGAGING IN LEWD CONDUCT - PROSTITUTION RELATED||21||5||-16||-76.2|
|SOLICITS FOR ACT OF PROSTITUTION||602||171||-431||-71.6|
|PLACING WIFE IN HOUSE OF PROSTITUTION||3||1||-2||-66.7|
|INDECENT EXPOSURE - PROSTITUTION RELATED||1||31||30||3000|
The majority of
PROSTITUTION reports fall under
SOLICITS FOR ACT OF PROSTITUTION and
SOLICITS TO VISIT HOUSE OF PROSTITUTION. Both have fallen drastically in a five-year period, though the latter subcategory has all been eliminated: going from 987 reports in 2008 to just one report in 2013.
Let's make a chart
Let's visually quantify the drop in the main two subcategories of prostitution reports, across all the years from 2003 to 2013:
SELECT Descript, YEAR(Date) as yr, COUNT(1) as x FROM sfpd_incidents WHERE Category = "PROSTITUTION" AND Descript = "SOLICITS FOR ACT OF PROSTITUTION" OR Descript = "SOLICITS TO VISIT HOUSE OF PROSTITUTION" GROUP BY Descript, yr ORDER BY Descript, yr
The following results table can be pasted into a spreadsheet and pivoted (with
yr creating the rows and
Descript creating the columns).
|SOLICITS FOR ACT OF PROSTITUTION||2003||1111|
|SOLICITS FOR ACT OF PROSTITUTION||2004||825|
|SOLICITS FOR ACT OF PROSTITUTION||2005||732|
|SOLICITS FOR ACT OF PROSTITUTION||2006||645|
|SOLICITS FOR ACT OF PROSTITUTION||2007||899|
|SOLICITS FOR ACT OF PROSTITUTION||2008||602|
|SOLICITS FOR ACT OF PROSTITUTION||2009||521|
|SOLICITS FOR ACT OF PROSTITUTION||2010||476|
|SOLICITS FOR ACT OF PROSTITUTION||2011||385|
|SOLICITS FOR ACT OF PROSTITUTION||2012||224|
|SOLICITS FOR ACT OF PROSTITUTION||2013||171|
|SOLICITS TO VISIT HOUSE OF PROSTITUTION||2003||843|
|SOLICITS TO VISIT HOUSE OF PROSTITUTION||2004||627|
|SOLICITS TO VISIT HOUSE OF PROSTITUTION||2005||291|
|SOLICITS TO VISIT HOUSE OF PROSTITUTION||2006||563|
|SOLICITS TO VISIT HOUSE OF PROSTITUTION||2007||876|
|SOLICITS TO VISIT HOUSE OF PROSTITUTION||2008||987|
|SOLICITS TO VISIT HOUSE OF PROSTITUTION||2009||855|
|SOLICITS TO VISIT HOUSE OF PROSTITUTION||2010||117|
|SOLICITS TO VISIT HOUSE OF PROSTITUTION||2011||7|
|SOLICITS TO VISIT HOUSE OF PROSTITUTION||2012||2|
|SOLICITS TO VISIT HOUSE OF PROSTITUTION||2013||1|
The resulting area chart:
Researching the data
Never take data at face value. This interesting drop in prostitution reports could be a result of a data glitch or ignorance on my part.
Look for previously-reported facts: On April 19, 2011, the San Francisco Examiner published an article titled, San Francisco police crack down on Polk Street prostitution. One of the numerical details:
More cops are on the streets making arrests, Mannix said. During a 24-hour period last week, at least four prostitutes were arrested, including a 16-year-old runaway.
Translating the above statement into a query:
SELECT IncidntNum, Descript, Date, Time, PdDistrict, Resolution, Location FROM sfpd_incidents WHERE Category = 'PROSTITUTION' AND Date BETWEEN '2011-04-11' AND '2011-04-19' AND Location LIKE '%POLK%'
|110296727||PANDERING||2011-04-12||01:38||NORTHERN||ARREST, BOOKED||PINE ST / POLK ST|
|110296727||PIMPING||2011-04-12||01:38||NORTHERN||ARREST, BOOKED||PINE ST / POLK ST|
|110302437||SOLICITS FOR ACT OF PROSTITUTION||2011-04-13||23:00||NORTHERN||ARREST, CITED||CALIFORNIA ST / POLK ST|
|110302540||SOLICITS FOR ACT OF PROSTITUTION||2011-04-14||00:05||NORTHERN||ARREST, CITED||POLK ST / CALIFORNIA ST|
|110302653||SOLICITS FOR ACT OF PROSTITUTION||2011-04-14||01:05||NORTHERN||ARREST, CITED||POLK ST / CALIFORNIA ST|
|110308758||SOLICITS FOR ACT OF PROSTITUTION||2011-04-16||00:35||NORTHERN||ARREST, CITED||POLK ST / CALIFORNIA ST|
|110308952||SOLICITS FOR ACT OF PROSTITUTION||2011-04-16||02:15||NORTHERN||ARREST, CITED||POLK ST / CALIFORNIA ST|
|110309160||SOLICITS FOR ACT OF PROSTITUTION||2011-04-16||03:40||NORTHERN||ARREST, CITED||POLK ST / CALIFORNIA ST|
|110308827||SOLICITS FOR ACT OF PROSTITUTION||2011-04-16||01:29||NORTHERN||ARREST, CITED||POLK ST / CALIFORNIA ST|
There were four prostitution-related arrests on April 16, but frankly, that could just be a coincidence. Capt. Mannix doesn't specify if those four arrests all happened on Polk Street, for example.
Another story of a sweep, published Jan. 20, 2010: Prostitution sting nets 51 so far
A crackdown on street prostitution in response to complaints from Polk Street and Russian Hill residents has netted 51 arrests, but officials say the crackdown can’t last forever…Of the 51 people arrested so far in the current operation, 36 were women accused of prostitution offenses, according to Lt. Tom Cleary of the vice crimes unit. Fifteen were men paying for sex. Cleary emphasized that a majority of those arrested came from outside The City.
Using this aggregate SQL:
SELECT SUBSTRING(Date, 1, 7) as yearmth, COUNT(1) AS count FROM sfpd_incidents WHERE Category = 'PROSTITUTION' AND pddistrict = 'NORTHERN' AND YEAR(Date) BETWEEN 2008 AND 2013 GROUP BY yearmth order by count DESC
– we observe that the month of January 2010 seems to be a high-water mark in prostitution reports from the Northern district (which covers Polk St. and Russian Hill):
Should an 84% drop in prosecution reports from 2008 to 2013 be a surprise, given voter support for decriminalizing prostitution? Whatever the ambivalence of the average resident, I didn't find any direct reports of the SFPD deciding to go lenient on prostitution.
On June 16, 2010, the SF Weekly took an in-depth look at the SFPD's arguably aggressive anti-prostitution stings:
In San Francisco, where 41 percent of voters approved a measure that would have essentially decriminalized prostitution and where a sex worker is currently running for the Board of Supervisors, men are regularly arrested just for talking to undercover cops posing as streetwalkers. If a man avoids eye contact and walks on by, he's fine. But if he shows any interest, the decoy will approach him. By the time he knows she is selling sex, he's headed for trouble. After that, any expression of interest — even a sarcastic "Oops, don't have enough cash, guess I'll come back later" — is enough to land him in handcuffs…
A decade ago, almost all the men arrested in the program's stings had shown more than interest; they had agreed to sex acts for money, according to statistics from the District Attorney's office. But in recent years, as would-be johns have gotten savvy to police tactics, roughly a third of those arrested in the stings — between 100 and 200 a year — were cited simply for demonstrating, in the eyes of the police, that they had "an intent to commit prostitution." According to the California penal code, there's no "single circumstance or combination of circumstances" that proves this intent; it has to be determined case by case.
The article cites a figure from a 2009 audit of the city's budget: previously, the sting program was funded by the fees paid by johns. However, "as prostitution has shifted to the Internet, it has become more difficult for police officers to make arrests in street-level stings. Since 2005, the number of johns arrested has dropped steadily."
In 2011, the San Francisco Public Press reported on the controversial use of anti-trafficking resources to do street-level sweeps.
This past summer, San Francisco prostitution made the news again when business owners complained about activity in Union Square.
This article quotes police as being more interested the stemming the bigger issues surrounding prostitution, rather than focusing on street-level busts:
Police said they plan to clean up Union Square, but also want to focus on human trafficking and helping women who have been forced into prostitution.
So what's the explanation for the 84% drop in SFPD prostitution reports? I sent a request to SFPD's public information officer but haven't heard back. I'm still open to the possibility of a mis-query on my part, or a coding change in the SFPD CABLE system that classifies the reports differently. Either way, I hope this has given you some ideas on how to query data along a variety of dimensions and time frames to find leads for what may be interesting stories.
Now all you have to do is grind in the syntax. Good luck!
I’ve put together a series of step-by-step tutorials on basic SQL syntax. Read through them and repeat every exercise if necessary. Don’t just copy and paste; sometimes, typing out the queries forces you to slow down and think them through.
Using the Social Security Administration’s database of baby names, find out how popular (or unpopular) your name has been since 1980 (or 1880, if you wish).
Download the appropriate file for your system. You may want to try the 1980 file if your computer is having trouble with the full 1880 file.
To get some inspiration, you can check out the SSA baby names website in which they let you perform a limited search on names.
Please do this homework in a Markdown file in your private Github repo and call it,
- In the year 2013, find out how many babies had your name
- Find the highest-rank (and the year) that your name has achieved among baby names
- Between the years 1980 and 2013, find how many babies in total have been listed by the Social Security Administration in this data.
- Find out how many babies (roughly) have had your name from 1980 to 2013
- Find the year that had the most male babies born
- Find the year in which your name had the highest increase in names-per-100k-babies born
- Find the year in which your name had the highest decrease in names-per-100k-babies born
- Make a line chart showing how your name has changed in popularity over the years
- Find out who in our class had the most popular baby name in 1980.
- Find out who in our class had the most popular baby name in 2013.
- Find out who in our class has the name that the most babies throughout U.S. history have.
For this section, pick a adoptive state if you were not born in the U.S.
- In your home state, find out how many babies had your name in the year that you were born.
- In your home state, find out how many babies had your name in 2013
- Find the state in which your name is the most popular in 2013
- Find the state in which your name is least popular (but still registers, i.e. has a minimum of 5 babies) in 2013
Make a line chart with two lines:
- The popularity of your name in your home state
- The popularity of your name in California
(If you were born in California, pick another state, like Florida.)
An introduction to public affairs reporting and the core skills of using data to find and tell important stories.
- Count something interesting
- Make friends with math
- The joy of text
- How to do a data project
Just because it's data doesn't make it right. But even when all the available data is flawed, we can get closer to the truth with mathematical reasoning and the ability to make comparisons, small and wide.
- Fighting bad data with bad data
- Baltimore's declining rape statistics
- FBI crime reporting
- The Uber effect on drunk driving
- Pivot tables
Learn how to take data in your own hands. There are two kinds of databases: the kind someone else has made, and the kind you have to make yourself.
- The importance of spreadsheets
- Counting murders
- Making calls
- A crowdsourced spreadsheet
Phillip Reese of the Sacramento Bee will discuss how he uses data in his investigative reporting projects.
- Phillip Reese speaks
Mapping can be a dramatic way to connect data to where readers are and to what they recognize.
- Why maps work
- Why maps don't work
- Introduction to Fusion Tables and TileMill
A continuation of learning mapping tools, with a focus on borders and shapes
- Working with KML files
- Intensity maps
- Visual joins and intersections
The first in several sessions on learning SQL for the exploration of large datasets.
- MySQL / SQLite
- Select, group, and aggregate
- Where conditionals
- SFPD reports of larceny, narcotics, and prostitution
- Babies, and what we name them
The ability to join different datasets is one of the most direct ways to find stories that have been overlooked.
- Inner joins
- One-to-one relationships
- Our politicians and what they tweet
Sometimes, what's missing is more important than what's there. We will cover more complex join logic to find what's missing from related datasets.
- Left joins
- NULL values
- Which Congressmembers like Ellen Degeneres?
A casual midterm covering the range of data analysis and programming skills acquired so far.
- A midterm on SQL and data
- Data on military surplus distributed to U.S. counties
- U.S. Census QuickFacts
The American democratic process generates loads of interesting data and insights for us to examine, including who is financing political campaigns.
- Polling and pollsters
- Following the campaign finance money
- Competitive U.S. Senate races
With Election Day coming up, we examine the practices of polling as a way to understand various scenarios of statistical bias and error.
- Statistical significance
- Poll reliability
Do your on-the-ground reporting
- No class because of Election Day Coverage
While there are many tools and techniques for building data graphics, there is no magic visualization tool that will make a non-story worth telling.
- Review of the midterm
- The importance of good data in visualizations
- How visualization can augment the Serial podcast
One of the most tedious but important parts of data analysis is just cleaning and organizing the data. Being a good "data janitor" lets you spend more time on the more fun parts of journalism.
- Dirty data
Simon Rogers, data editor at Twitter, talks about his work, how Twitter reflects how communities talk to each other, and the general role of data journalism.
- Ellen, World Cup, and other masses of Twitter data
When the data doesn't directly reveal something obvious, we must consider what its structure and its metadata implies.
- Proxy variables
- Thanks Google for figuring out my commute
- How racist are we, really?
- How web sites measure us
Discussion of final projects before the Thanksgiving break.
Holiday - no class
Holiday - no class
Last-minute help on final projects.
In-class presentations of our final data projects.