WHERE clause is where SQL starts to become very interesting for searching large datasets. With
WHERE, we can now filter the results according to conditions we explicitly set, such as, "Show all incident reports where the
Date is before 2009 and the
Category is listed as
The query for that looks like this:
SELECT * FROM sfpd_incidents WHERE Category = 'ASSAULT' AND Date > '2009'
Note: For this SQL lesson, I will be using the Sequel Pro GUI for the MySQL database engine and will be querying the SFPD incident reports categorized as
If you want to see exactly the same results I do, you'll want to download and import the__ MySQL database of 2003 to 2013 assault reports from the SFPD__.
I've also created the SQLite version of that database, which should functionally be the same as the MySQL version.
For both MySQL and SQLite, I've also created a database of all the SFPD reports from 2003 to 2013. All the queries should work the same, except you can explore all the different categories of crime. The tradeoff is that the database is much bigger, and so will be slower to download, import, and query. If you're completely new to all this, I would just go with the assaults database, just so any errors you make don't take even longer to figure out.
WHERE clause occurs in conjunction with
SELECT `column_1`, `column_2` FROM `some table` WHERE `a conditional statement that evaluates to true or false`
The SQL query will operate on all rows in which the
WHERE clause evaluates to true.
The simplest example of
WHERE using our
sfpd_incidents table looks like this:
SELECT * FROM sfpd_incidents WHERE 1 = 1
Try it out. The results should look exactly as they would if you omitted the
WHERE clause. Why is that? Because the statement, "is the number 1 equal to the number 1" is always true. So the
SELECT statement operates across all rows in the
Now give the
WHERE clause a statement that is always false:
SELECT * FROM sfpd_incidents WHERE 1 = 2
The result set should be empty, as the number 1 is never equal to the number 2.
Generally, we're interested in conditional statements that aren't tautologies. For example, in our dataset, we know that the assault reports contain a
Descript field which specifies the subcategories of assaults, such as
"MAYHEM WITH A KNIFE
A common form of conditional statement will look like this:
... WHERE column_name = 'SOME VALUE'
To retrieve only the assault reports that were described as being
SELECT * FROM sfpd_incidents WHERE Descript = 'STALKING'
A quick note about string literals
The value we want to filter for is enclosed in single-quotes. Double quotes will also work but single quotes are preferred. The technical description here is that single quotes denote a string literal.
In programming languages, a string refers to a sequence of text characters, whether it be
'hello world', or
'one plus 2'.
When we refer to
'STALKING', we are telling the query interpreter that we are interested in the string, 'STALKING', literally, as it exists in our dataset. Compare this to the keywords
FROM…we are not interested in those as literal values. Those happen to be keywords integral to the SQL language.
And the column name of
Descript is not a string literal. It is a word that points to the name of a column.
So to summarize: quote marks are important, pay attention to their usage very carefully, as omitting them will likely be the most frequent source of errors at this point. And if you're new to programming, the concept of a string literal will be strange. Here's an analogous example in English:
- Alice told me my name is "Bob" – i.e. My name is "Bob", according to someone named Alice.
- Alice told me, "My name is 'Bob'" – i.e. Alice told me that her name is actually "Bob"
In the second example, quotation marks are used to denote the string literal of "My name is 'Bob'", because I'm quoting a phrase as it was spoken, literally. In the first example, the words "my name is 'Bob'" are an assertion of my identity, rather than what "James" literally told me.
Confused? After writing many more queries (some of them with errors), it'll make more sense.
Read more technical notes [on string literals in SQLite's documentation](https://www.sqlite.org/lang_keywords.html.
A quicker note about numbers
In the given
sfpd_incidents datafiles, the
Y fields (longitude and latitude, respectively), are designated as numbers (floating point numbers, to be kind of exact). Unlike literal strings, literal numerical values do not need quote marks. In fact, setting off literal numerical values in quotes will probably return an erroneous result.
This is good:
SELECT X, Y FROM sfpd_incidents WHERE X = 0
As is this:
SELECT X, Y FROM sfpd_incidents WHERE X = 0.0
But this is not good. It might actually work, but it depends on your database configuration:
SELECT X, Y FROM sfpd_incidents WHERE X = '0'
We'll cover data types, i.e. the difference between strings, numbers, dates, etc., in a different lesson. If you follow the tutorial with the provided dataset, you should be able to continue in blissful ignorance…for now.
To create a filter in which something is not equal to a given value, we prefix the equals sign with an exclamation mark.
To retrieve all the non-
SELECT * FROM sfpd_incidents WHERE Descript != 'STALKING'
|003032800||ASSAULT||BATTERY||2003-04-10||10:16||PARK||ARREST, BOOKED||300 Block of WOODSIDE AV||-122.452194214||37.745666504|
|030206159||ASSAULT||BATTERY OF A POLICE OFFICER||2003-02-19||09:30||SOUTHERN||ARREST, BOOKED||400 Block of NATOMA ST||-122.406684875||37.781009674|
|030204181||ASSAULT||BATTERY||2003-02-18||18:15||CENTRAL||ARREST, BOOKED||300 Block of COLUMBUS AV||-122.407066345||37.798183441|
|030204329||ASSAULT||BATTERY||2003-02-18||19:00||TENDERLOIN||ARREST, BOOKED||300 Block of ELLIS ST||-122.412330627||37.784889221|
|030204329||ASSAULT||THREATS AGAINST LIFE||2003-02-18||19:00||TENDERLOIN||ARREST, BOOKED||300 Block of ELLIS ST||-122.412330627||37.784889221|
|030204711||ASSAULT||INFLICT INJURY ON COHABITEE||2003-02-18||20:40||BAYVIEW||ARREST, BOOKED||1900 Block of JENNINGS ST||-122.387695312||37.728080750|
AND operator, we can specify multiple conditions to filter upon:
SELECT * FROM sfpd_incidents WHERE Descript = 'STALKING' AND Resolution = 'UNFOUNDED'
AND is used to join multiple conditional statements, whereas with
SELECT, we used commas to refer to multiple columns:
SELECT Date, Time, Location FROM sfpd_incidents WHERE Descript = 'STALKING' AND Resolution = 'UNFOUNDED'
The resulting dataset includes the date, time, and location of all assault reports described as
STALKING and were considered to be
|2003-09-12||18:00||700 Block of POST ST|
|2004-01-30||17:00||0 Block of SACRAMENTO ST|
|2006-02-06||08:40||24TH ST / BRYANT ST|
|2006-12-16||20:50||2600.0 Block of MISSION ST|
|2007-01-19||09:00||600.0 Block of MONTGOMERY ST|
|2009-04-09||09:00||24TH ST / MISSION ST|
|2010-01-22||12:00||100.0 Block of WOOD ST|
|2010-04-05||07:45||20TH ST / CAPP ST|
|2010-11-09||19:15||500.0 Block of THE EMBARCADERONORTH ST|
|2011-03-13||08:39||1200.0 Block of STANYAN ST|
|2011-09-06||12:00||200.0 Block of MCALLISTER ST|
|2011-12-01||00:01||300.0 Block of BAY SHORE BL|
|2012-12-20||09:35||800.0 Block of HAMPSHIRE ST|
To find all
STALKING reports which were not
SELECT Date, Time, Location FROM sfpd_incidents WHERE Descript = 'STALKING' AND Resolution != 'UNFOUNDED'
If a query includes a series of
AND statements, then all of them must be true. The following query, for example, would return an empty set:
SELECT * FROM sfpd_incidents WHERE 1 = 1 AND 2 = 2 AND 1 = 2
In the context of our current dataset, this set of
AND statements would also be mutually exclusive:
SELECT * FROM sfpd_incidents WHERE Descript = "THREATS AGAINST LIFE" AND Descript = "STALKING"
This will return no results because each row in
sfpd_incidents has only one type of
Descript value. In other words, no incident report will be described as both
THREATS AGAINST LIFE and
If we do want reports that have either this or that, we use the
SELECT * FROM sfpd_incidents WHERE Descript = "THREATS AGAINST LIFE" OR Descript = "STALKING"
The following query will return all of the dataset rows even though the first
OR statement will always be false and the second one will be true only sometimes. The third
OR statement will always be true, thus, the entire conditional test will be true for every row:
SELECT * FROM sfpd_incidents WHERE 1 = 2 OR Descript = "STALKING" OR 1 = 1
A common mistake is to combine
AND operators in such a way that you get more or less than you actually wanted:
SELECT * from sfpd_incidents WHERE Resolution = "UNFOUNDED" AND Descript = "STALKING" OR Descript = "THREATS AGAINST LIFE"
Presumably, the author of the above query is thinking this:
I want all records that involved an
UNFOUNDEDreport of either
THREATS AGAINST LIFE
However, the query interpreter reads this as:
The user wants all records that are either:
- a report of
THREATS AGAINST LIFE
This is because
AND has precedence over the
This is not dissimilar to order of operations in arithmetic:
10 + 20 * 2 + 10 = 60
(10 + 20) * 2 + 10 = 70
(10 + 20) * (2 + 10) = 360
Thus, the original query looks like this to the interpreter:
SELECT * from sfpd_incidents WHERE (Resolution = "UNFOUNDED" AND Descript = "STALKING") OR Descript = "THREATS AGAINST LIFE"
If you run this on the 2003 to 2013 dataset, you'll end up with 27,311 rows.
When in doubt, don't be afraid to use parentheses to be more explicit about how you want the logic to flow. To fix the original query, one pair of parentheses will do:
SELECT * from sfpd_incidents WHERE Resolution = "UNFOUNDED" AND (Descript = "STALKING" OR Descript = "THREATS AGAINST LIFE")
The result will now only have 185 rows. In other words, there were relatively few cases of reported stalking or death threats in which the police deemed the complaint to be unfounded.
Yes, even I'll muck up the order of the clauses, like:
~~~sql SELECT Descript, Time, Date WHERE Descript = "BATTERY" FROM sfpd_incidents ~~~
Not using quotation marks to denote literal strings:
~~~sql SELECT Descript, Time, Date FROM sfpd_incidents WHERE Descript = BATTERY ~~~
Using quote marks to incorrectly denote things as literal strings:
~~~sql SELECT Descript, Time, Date FROM sfpd_incidents WHERE 'Descript' = 'BATTERY' AND 'Resolution = NONE' ~~~
Along the same lines, remember that literal numerical values should not be set off by quote marks. This is not recommended (though it still might work):
~~~sql SELECT Descript, Time, Date FROM sfpd_incidents WHERE X = '0' ~~~
This is fine: ~~~sql SELECT Descript, Time, Date FROM sfpd_incidents WHERE X = 0 ~~~
(we'll cover numerical and date datatypes in a later tutorial)
Even though SQL syntax is case-insensitive, when referring to a literal string value, this is not always the case with string literals that you're trying to match:
The following would work as intended:
~~~sql select descript, time, date from sfpd_incidents where descript = 'STALKING' ~~~
Descriptvalues are uppercase in the dataset, the following might not work depending on your database configuration:
~~~sql select descript, time, date from sfpd_incidents where descript = 'stalking' ~~~
Creating logically exclusive (or too-inclusive) statements because you've mussed up how your
ORstatements should be evaluated:
This would return no results: ~~~sql SELECT * FROM sfpd_incidents WHERE descript = 'STALKING' AND descript = 'THREATS AGAINST LIFE' ~~~
This would return more results than you likely intended, if you only cared about
SELECT * FROM sfpd_incidents WHERE Resolution = 'UNFOUNDED' AND Descript = 'THREATS AGAINST LIFE' OR Descript = 'BATTERY'
Misplaced parentheses. This is OK:
SELECT * FROM sfpd_incidents WHERE (Resolution = 'UNFOUNDED' OR Resolution = 'NONE') AND Descript = 'BATTERY'
This will result in an error:
SELECT * FROM sfpd_incidents (WHERE Resolution = 'UNFOUNDED' OR Resolution = 'NONE') AND Descript = 'BATTERY'
Now we're starting to see a little of the power of SQL. What we've learned is not much different than the kind of filtering you can do in a spreadsheet. But in the next lesson, we'll learn how to do even more flexible filtering with fuzzy matching operators.
- Produce a list of reports described as
"ATTEMPTED SIMPLE ASSAULT"that were either resolved as
- Modify the previous query to restrict the results to incidents in the
Produce a list of reports with only the columns
Resolution, that either:
- Took place in the
"NORTHERN"police district that were resolved as
"UNFOUNDED"but not described as
- Or took place in either the
"RICHMOND"districts that were described as
"BATTERY"but in which the resolution was
(The result should be 71 rows, in the 2003 to 2013 dataset)
- Took place in the
Simplify the following query:
SELECT Date, Time, Descript, Resolution FROM sfpd_incidents WHERE (Descript = 'AGGRAVATED ASSAULT WITH A GUN' OR Descript = 'THREATS AGAINST LIFE') AND (Descript != 'AGGRAVATED ASSAULT WITH A GUN' AND Resolution = 'NONE')