In this lesson, we'll learn ways to have more flexible, "fuzzier" filters when querying data. At the very least, knowing these keywords will save you from having to write a tedious number of conditional statements just to get variations of a data value.
For example, instead of this:
SELECT * from sfpd_incidents WHERE Resolution = 'ARREST, BOOKED' OR Resolution = 'ARREST, CITED'
SELECT * from sfpd_incidents WHERE Resolution LIKE 'ARREST%'
Note: For this SQL lesson, I will be using the Sequel Pro GUI for the MySQL database engine and will be querying the SFPD incident reports categorized as
If you want to see exactly the same results I do, you'll want to download and import the__ MySQL database of 2003 to 2013 assault reports from the SFPD__.
I've also created the SQLite version of that database, which should functionally be the same as the MySQL version.
For both MySQL and SQLite, I've also created a database of all the SFPD reports from 2003 to 2013. All the queries should work the same, except you can explore all the different categories of crime. The tradeoff is that the database is much bigger, and so will be slower to download, import, and query. If you're completely new to all this, I would just go with the assaults database, just so any errors you make don't take even longer to figure out.
More comparison operators
So far, we've only looked at when a data field is equal or not equal to a given value, e.g. to find all incident reports in which the location was likely unknown or improperly geocoded:
SELECT * FROM sfpd_incidents WHERE Y = 0
But we can also look for relative comparisons, e.g. greater than, less than, and their or equal to variants:
Find all incidents north (i.e. greater than ) of
SELECT * from sfpd_incidents WHERE Y > 37.78
Find all incidents south (i.e. less than ) of
SELECT * from sfpd_incidents WHERE Y < 37.75
Find all incidents at or south of
SELECT * from sfpd_incidents WHERE Y <= 37.728080750
Relative comparisons of string literals
The greater than/less than operators will also work for string literals. This will effectively select all records that begin with
"AGGRAVATED ASSAULT", including
"AGGRAVATED ASSUALT WITH A KNIFE", but not
"B" comes after any term that starts with
"A", when you're sorting by alphabetical order:
SELECT * from sfpd_incidents WHERE Descript >= 'AGGRAVATED ASSAULT' AND Descript < 'AGGRAVATED ASSAULT ZZZZ'
This is not the ideal way to do this (we'll soon learn about
LIKE and wildcards), but it's worth exploring the concept.
Dates as (hacky) string literals
This is a more of a technical aside: Because MySQL and SQLite have a significant difference in handling date/time operations, for the purposes of this lesson, I will consider them to be just string literals. In other words, we won't be using something like Excel's
YEAR(some_cell) to extract just the year from
However, there is a way around this with string literals. To get all the records in the year
2004, we can simply define a lower bound, i.e.
"2004" and an upper bound
SELECT * from sfpd_incidents WHERE Date >= '2004' AND Date < '2005'
Very important: This is kind of a hack for date and time values, so it'll seem awkward and finicky. The most important rule here is: treat the numbers as string literals.
- This is good:
...WHERE Date >= '2012
- This is bad:
...WHERE Date >= 2012
To find records by year and month, such as April 2008, we can do:
SELECT * FROM sfpd_incidents WHERE Date >= '2008-04' AND Date < '2008-05'
To find records that happened in 2012 from 9PM to before midnight, we'll throw in the time field:
SELECT * FROM sfpd_incidents WHERE Date >= '2012' AND Date < '2013' AND Time >= '21:00' AND Time < '23:59'
To include the wee hours of the morning, say before 5AM, we can continue with this logic, but it'll start to look ugly:
SELECT * FROM sfpd_incidents WHERE Date >= '2012' AND Date < '2013' AND( Time >= '21:00' AND Time < '23:59' OR Time >= '00:00' AND Time < '05:00' )
Treating dates and times as just literal strings is an ugly hack for dealing with dates and times. We lose the ability, for example, to easily extract just the month from a value of
2012-05-12 or the hour from
07:30. This means finding all crimes that happened in December will be very inconvenient using the technique above.
But it will do for now, as this syntax works on both SQLite and MySQL. We can learn more time/date specific syntax later. For this lesson, let's work with what we know for now, which is strings and comparison operators.
If you're not a fan of a forest of less-than and greater-than symbols, both SQLite and MySQL support the
So the following two queries are equivalent:
SELECT * FROM sfpd_incidents WHERE Y BETWEEN 37.75 AND 37.78
SELECT * FROM sfpd_incidents WHERE Y >= 37.75 AND Y <= 37.78
IN operator can simplify a long series of
OR statements. The following two queries are equivalent:
SELECT * FROM sfpd_incidents WHERE Descript IN( 'BATTERY OF A POLICE OFFICER', 'ASSAULT ON A POLICE OFFICER WITH A DEADLY WEAPON', 'THREATS TO SCHOOL TEACHERS') AND Resolution IN( 'NONE', 'UNFOUNDED')
SELECT * FROM sfpd_incidents WHERE ( Descript = 'BATTERY OF A POLICE OFFICER' OR Descript = 'ASSAULT ON A POLICE OFFICER WITH A DEADLY WEAPON' OR Descript = 'THREATS TO SCHOOL TEACHERS') AND (Resolution = 'NONE' OR Resolution = 'UNFOUNDED')
Matching patterns with
IN operator is nice, but we often won't know all of the possible variations for a value. For example, how many kinds of
'AGGRAVATED ASSAULT' are there? The
LIKE operator lets us match by patterns.
LIKE, we can create a conditional statement like, "Find me all records in which the description begins with the word 'AGGRAVATED ASSAULT'". With
LIKE, we can also use the wildcard symbol,
% (percentage sign), which will match any sequence of characters:
SELECT * FROM sfpd_incidents WHERE Descript LIKE 'AGGRAVATED ASSAULT%'
This query will return any row in which
Descript begins with
AGGRAVATED ASSAULT, followed by anything. This will return 26,858 records, with 8 distinct values for
AGGRAVATED ASSAULT OF POLICE OFFICER, SNIPING AGGRAVATED ASSAULT OF POLICE OFFICER,BODILY FORCE AGGRAVATED ASSAULT ON POLICE OFFICER WITH A GUN AGGRAVATED ASSAULT ON POLICE OFFICER WITH A KNIFE AGGRAVATED ASSAULT WITH A DEADLY WEAPON AGGRAVATED ASSAULT WITH A GUN AGGRAVATED ASSAULT WITH A KNIFE AGGRAVATED ASSAULT WITH BODILY FORCE
(Note: You might be wondering, how do you just list the distinct values of a field? We'll find out in a later lesson on the
GROUP BY clause. For now, I'm listing the distinct
Descript values for your convenience)
The above query for
'AGGRAVATED ASSAULT%' will match any record with any variation of words after the string,
'AGGRAVATED ASSAULT'. We can look for variations of words before a given string by placing the wildcard symbol accordingly.
For example, to find all records in which
Descript ends with
SELECT * FROM sfpd_incidents WHERE Descript LIKE '%WITH A GUN'
The different types of
Descript in this results set are:
AGGRAVATED ASSAULT ON POLICE OFFICER WITH A GUN AGGRAVATED ASSAULT WITH A GUN ATTEMPTED HOMICIDE WITH A GUN ATTEMPTED MAYHEM WITH A GUN MAYHEM WITH A GUN
Surrounded by wildcards
Are there incidents that involved a gun, but that weren't described as "something something WITH A GUN"?
Let's be even more "wild" and use two wildcard symbols:
SELECT * FROM sfpd_incidents WHERE Descript LIKE '%GUN%'
As it turns out, apparently all the assault records (2,507 of them) that have the word
"GUN" in them, indeed take the form of,
"something something WITH A GUN".
What about police officer-involved incidents? The term
POLICE occurs in the middle of the
'AGGRAVATED ASSAULT ON POLICE OFFICER WITH A GUN'
Querying for all instances of
SELECT * FROM sfpd_incidents WHERE Descript LIKE '%POLICE%'
– returns 3,171 records, with 7 distinct values for
AGGRAVATED ASSAULT OF POLICE OFFICER, SNIPING AGGRAVATED ASSAULT OF POLICE OFFICER,BODILY FORCE AGGRAVATED ASSAULT ON POLICE OFFICER WITH A GUN AGGRAVATED ASSAULT ON POLICE OFFICER WITH A KNIFE ASSAULT BY POLICE OFFICER ASSAULT ON A POLICE OFFICER WITH A DEADLY WEAPON BATTERY OF A POLICE OFFICER
For our hacky time and date filtering, this would select all records in which the
Date took place in May:
SELECT * from sfpd_incidents WHERE Date LIKE '%-05-%'
This would match values such as:
The underscore wildcard
Sometimes we don't need the kind of catch-all power of the
% wildcard. The underscore character, i.e.
_, can act as a stand-in for a single character. For example, to catch all records in which the preposition
'ON' were used:
SELECT * FROM sfpd_incidents WHERE Descript LIKE 'AGGRAVATED ASSAULT O_'
= has its opposite,
!=, the opposite of
LIKE can be expressed with
SELECT * FROM sfpd_incidents
WHERE Descript NOT LIKE '%BATTERY%'
What's not to
Being able to search for patterns is very powerful. However, as you can imagine, searching for
'BA%' (e.g. everything that starts with the letters
'BA') is going to be more computationally complex than finding an exact match for
It's beyond the scope of this lesson to get into SQL performance tweaking. But the general rule is, don't be more flexible than you need to be, or else what could've been a half-second query will turn into a 2-hour query as your computer dutifully and unquestioningly carries out your insanely wide-ranged search.
Fuzzy matching is a powerful feature when working with messily defined data. But nothing is free, and so this more powerful kind of querying will require more time to successfully execute.
- Find all records that took place between the year 2003 and before April 2005
- Retrieve all incident reports from the
"MISSION"districts that are not described as
SELECT * FROM sfpd_incidents WHERE Date >= '2003' AND Date < '2005-04'
Using the 'BETWEEN' operator:
SELECT * FROM sfpd_incidents WHERE Date BETWEEN '2003-01-01' AND '2005-03-31'
SELECT * FROM sfpd_incidents WHERE PdDistrict IN('TENDERLOIN', 'MISSION', 'CENTRAL') AND Descript NOT LIKE '%AGGRAVATED%'