Orange is my favorite color

I normally stick to pretty vanilla SQL when I’m writing my database queries. I deviate only when necessary so six months down the road when I revisit a query I haven’t completely fooled myself. There’s a lot of SQL worth using on a regular basis though.

Let’s say you have two tables:

CREATE TABLE customers (
  id integer PRIMARY KEY
 ,name varchar(50)
);
CREATE TABLE options (
  customer_id integer REFERENCES customers (id)
 ,option_name varchar(50)
);

INSERT INTO customers VALUES (1, 'Brian');
INSERT INTO customers VALUES (2, 'Jennifer');
INSERT INTO customers VALUES (3, 'Jason');

INSERT INTO options VALUES (1, 'Superuser');
INSERT INTO options VALUES (1, 'Editor');
INSERT INTO options VALUES (2, 'Editor');
INSERT INTO options VALUES (3, 'Admin');
INSERT INTO options VALUES (3, 'Subscriber');

We’ve got customers with 1-n options, nice and normalized. It’s pretty common to want to report on one of these options to get a list of, say, all of the editors in the system. You could write the following query:

SELECT id
  ,name
  ,(SELECT COUNT(*)
    FROM options
    WHERE customer_id = customers.id
    AND option_name = 'Editor') AS isEditor
FROM customers

This would give you the results you want with either a 0 or a non-zero number in the computed field isEditor. In ColdFusion, you could treat this as a boolean to test if the person is an editor and display your report accordingly:

id name isEditor
1 Brian 1
2 Jennifer 1
3 Jason 0

It turns out there is a problem with this approach. Because your subquery is performing a COUNT(*) on the options table, it needs to scan the entire table (which may or may not use an index) in order to report how many it found. Pretend the options table had a few million rows in it and you can get an idea for where this query might go sideways (ignore that there may be many other valid and better ways to model this data – it is a trivialized example after all).

I wrote about using EXISTS before in a WHERE clause to dramatically speed up certain kinds of queries with large numbers of values to filter against.

Using EXISTS

Well you can also use the EXISTS to generate a column in your query too, using syntax like:

SELECT id
  ,name
  ,EXISTS (SELECT 1
   FROM options
   WHERE customer_id = customers.id
   AND option_name = 'Editor') AS isEditor
FROM customers

Of course, we get the same data:

id name isEditor
1 Brian 1
2 Jennifer 1
3 Jason 0

The beauty of this approach is that the EXISTS subquery terminates as soon as it finds a single row. Instead of scanning the entire table for a result, the first hit here will return a boolean true and it will move on to matching the next record in the query.

Results

In this simple example, the performance difference between the two is stark. The query planner for my database, PostgreSQL, shows the COUNT(*) approach adds an aggregate calculation:

Aggregate  (cost=17.95..17.97 rows=1 width=0) (actual time=0.018..0.020 rows=1 loops=3)

By virtue of skipping this step, the EXISTS approach executes about 33% faster. Now this isn’t a silver bullet. In fact, many real-world situations may see the reverse behavior in terms of which approach is faster. Database query planners take many variables into consideration when executing a query that can trump even the most clever optimization but nonetheless it’s a handy tool for your toolbox.

The important thing is to have a few options at your disposal and understand how to use EXPLAIN to view and understand the query.

8 Comments

  1. Andrew said:

    on January 31, 2009 at 10:47 am

    I realize this is a simplified example, but my main issue with these 2 sql selects is that they are not equivalent. At the postgres level EXISTS returns a boolean, where as count(*) returns a bigint.

    As an example, since options doesn’t have a unique contraint on customer_id and option_name it’s possible that a customer_id has multiple Editor option_name entries, and the result of the 2 sql will be different. Now if the first select was ‘… select count(*) > 0 from options …’ then replacing it with exists would be equivalent.

    However, I whole heartily agree with using EXPLAIN.

  2. brian said:

    on January 31, 2009 at 10:51 am

    @Andrew – I knew someone was going to point it out! Technically you are correct in the integer vs boolean and your fix would correct it. Consuming those two queries above with ColdFusion however would be equivalent. Both approaches would allow you to do say:


    <cfif isEditor>...</cfif>

    Thanks though, you are definitely right. I thought about fixing the first query but it was late and I wanted to go home. :)

  3. Zack Steinkamp said:

    on January 31, 2009 at 5:25 pm

    Not sure if Postgres has a “LEFT JOIN”, but in MySql, this can be done without a subquery…

    SELECT name, IF(option_name = "Editor", 1, 0) FROM customers LEFT JOIN options ON (customers.id = options.customer_id) AND options.option_name='Editor';

    The nature of the LEFT JOIN is that it includes all rows from the “left” table, and uses NULLs for values from the right table if there is not a row to join.

  4. brian said:

    on February 1, 2009 at 9:24 am

    @Zack – The concept is intended to be database agnostic although Postgres does support all JOIN syntaxes (INNER, OUTER, CROSS, etc). Is the IF() a MySQL extension for processing outer joins? There are many ways to skin the cat of coalescing values but the power of the EXISTS approach is that it can be significantly more narrow than a typical JOIN would afford. My query should also be portable across most database platforms, although I believe database portability to be a red herring concern for most web applications.

    I find the more correctly normalized my schema, the more hoops I need to jump through when it comes time to reporting. Some people accomplish it with reporting packages but I typically roll up some queries as needed. I have yet to find a really good reporting suite with an acceptable end-user interface for non-technical users. Actuate has some tools on top of BIRT that come close but I’m still waiting for “the one” I can drop into my app.

  5. Zack Steinkamp said:

    on February 1, 2009 at 9:47 am

    Brian — IF() is a built-in function to MySql that you can use in your SELECT fields to mutate a value.

    Without the IF(), the second column value would either be “Editor” or NULL. If I was building the app, I’d leave it at that and test for NULL when rendering the output.

    I’m curious to see a performance comparison to the EXISTS(-subquery-) method vs. the LEFT JOIN method…

  6. brian said:

    on February 1, 2009 at 10:38 am

    Sounds like the equivalent of Postgres’ COALESCE(). I’ll try running some additional analysis when I get some time for the outer join as well as some other scenarios to better flesh out the trade offs in approaches.

  7. Ben Nadel said:

    on February 2, 2009 at 10:17 am

    IF(), COALESCE()… you can always use a CASE statements in the SELECT:

    (CASE WHEN EXISTS( … ) THEN 1 ELSE 0 END) AS IsEditor

  8. John Whish said:

    on February 24, 2009 at 7:05 am

    Just a thought, but wouldn’t you be better off using a derived table? As far as I know the subquery will execute for each row, whereas a derived table runs once.

{ RSS feed for comments on this post}