
[Partial screenshot of a pgFouine report]

Have you had your application in production for a while? Maybe you have a little traffic now and you’re starting to see some places where performance could be better? The easiest place to find gains that will benefit your entire application is usually the database. Because the schema tends to be one of the first things we brainstorm as developers, it is often abused through the rest of the development process as requirements evolve. While I work predominantly with PostgreSQL, the tips below will benefit your application no matter what database you use.

Where to start?

The first question should be: where do I start? If you’re like me, you probably have a handful of queries that smell like they need attention, but you might also be surprised by how long apparently simple queries take to run because of secondary factors. The best way to find out? Turn on logging or performance metrics for your database.

In PostgreSQL 8.3, you can set the following in your postgresql.conf:

log_destination = 'syslog'
logging_collector = on
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'
log_min_duration_statement = 100 # log any query taking more than 100ms; set to 0 to log all queries
log_line_prefix = 'user=%u,db=%d '

If you add an entry for “local0.* /var/log/database” to your syslog.conf file and restart both syslog and Postgres, you’ll see your queries logged along with timing information. These log files are usable on their own, but running a log file parser against them gives you more bang for the buck. We use pgFouine, which is a terrible name but a great utility that produces reports showing the Top N Slowest queries, the Top N Most Frequently Run queries, and the Top N Queries That Used the Most Time (time * number of queries). In an instant, this tells you which queries need your attention and how urgently.
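
As a rough sketch of the reporting step (the exact flags depend on your pgFouine version, so check its documentation rather than taking this verbatim), generating an HTML report from the syslog output looks something like this:

# syslog.conf entry from the paragraph above
local0.*    /var/log/database

# assumed pgFouine invocation; -file is the basic option shown in the pgFouine docs
pgfouine.php -file /var/log/database > report.html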

In either case, find where you want to start and try the following…

Does every table have a primary key?

Through the development process, it’s surprisingly easy to accidentally remove or miss an index or primary key. While you probably aren’t missing a primary key on a major lookup table, the query below found three auxiliary tables that were missing primary keys three years after my application went into production!

The following query finds all tables that don’t have an index or primary key. The information_schema pieces work on any database system that supports the standard (PostgreSQL, SQL Server, MySQL…); the index check against pg_catalog.pg_indexes is PostgreSQL-specific, so substitute your vendor’s catalog view if you’re on something else:

SELECT c.table_schema, c.table_name, c.table_type
FROM information_schema.tables c
WHERE c.table_schema NOT IN ('information_schema', 'pg_catalog')
  AND c.table_type = 'BASE TABLE'
  AND NOT EXISTS (SELECT i.tablename
                  FROM pg_catalog.pg_indexes i
                  WHERE i.schemaname = c.table_schema
                    AND i.tablename = c.table_name
                    AND i.indexdef LIKE '%UNIQUE%')
  AND NOT EXISTS (SELECT cu.table_name
                  FROM information_schema.key_column_usage cu
                  WHERE cu.table_schema = c.table_schema
                    AND cu.table_name = c.table_name)
ORDER BY c.table_schema, c.table_name;
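
If the query turns up a table that should have had a key all along, the fix is usually a one-liner. The table and column names below are hypothetical stand-ins for whatever the query surfaces in your schema:

-- add a primary key to a table that is missing one (hypothetical names)
ALTER TABLE tbl_audit_log ADD PRIMARY KEY (id);

-- or add a plain index if the table just needs faster lookups on a column
CREATE INDEX idx_audit_log_attendee ON tbl_audit_log (attendee_id);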

How big is your database?

I’ll go out on a limb and say 95% of the web applications out there have a small dataset. If you can fit your dataset into memory, you’ll see huge performance gains compared to going to disk. Memory caching and management are a big part of why “real” relational databases are so much faster than file-based databases like MS Access. Usually you can’t measure the size of the database by simply looking at raw disk usage, though; you need to profile it another way. The following query will give you the size of each database in PostgreSQL:

SELECT datname, pg_size_pretty(pg_database_size(datname)) FROM pg_database;

Google is your friend for finding the equivalent for MS SQL and MySQL.
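
Back in Postgres, if one database looks suspiciously large, you can break the number down per table using the same built-in size functions; a quick sketch:

-- ten largest tables, including their indexes and TOAST data
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;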

Views are convenient, but not necessarily fast

In my event registration system, I have two views that bring together personal data with registration data: vueLookupAttendee and vueLookupAssignment. Under each view is roughly a five-way join, and for reporting I would join the two views together. In one case, I even re-joined a view to itself!

Now, if you’re shaking your head and wondering why, I’ll say: it was easy. I started out with two views; then I needed them joined together, and eventually I needed the data from the view joined back and it was convenient to use the views.

But what’s happening under the sheets? A poorly optimized 14-way join that was really easy to code but takes two full seconds to return 150 records. Ouch! I revisited this particular query and explicitly joined the tables together, experimenting with join order to find the optimal performance. In Postgres, putting EXPLAIN ANALYZE in front of the query produces a detailed query plan that makes it easy to see how the query is working (or not, as the case may be). In particular, you want to look for sequential scans, which are slow compared to index scans on anything but small tables (see my second tip!):

-> Seq Scan on tblLookupGrid f
(estimated cost=0.0..41.55 rows=1355 width=55)
(actual time=0.058..3.007 rows=1323 loops=1)
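
To get a plan like this yourself, just prefix the query you’re investigating. The join below reuses my view names purely for illustration; the join column is a hypothetical stand-in:

-- note: EXPLAIN ANALYZE actually executes the query, so be careful with
-- anything that writes data
EXPLAIN ANALYZE
SELECT a.*, s.*
FROM vueLookupAttendee a
JOIN vueLookupAssignment s ON s.attendeeID = a.attendeeID;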

Learn your vendor options

Most database vendors have hundreds of options, many of which can be set on a per-query or per-connection basis. In the previous example, my rewrite resulted in a 9-way join when all was said and done. It turns out this was still producing a poor query plan because Postgres has a setting called join_collapse_limit that defaults to 8: once a query involves more join items than that, the planner stops reordering them and simply follows the order written in the query. The ninth join in my query, regardless of indexing, would not join efficiently because it pushed me over that limit. Time to change it with a non-standard, vendor-specific option:

-- change the database default permanently...
ALTER DATABASE db SET join_collapse_limit = 10;
-- ...or just for the current session
SET join_collapse_limit = 10;
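
If you want to confirm the current value first, or keep the change scoped to a single transaction rather than the whole session, both are supported; a quick sketch:

-- check the current value
SHOW join_collapse_limit;

-- scope the change to one transaction so other queries keep the default
BEGIN;
SET LOCAL join_collapse_limit = 10;
-- ...run the big join here...
COMMIT;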

Changing this setting and rewriting my joins reduced my query from an original 2040ms down to 75ms! This is a query that is run many times per session and users noticed the improvement.

Now, I know many people try to abstract their SQL and data layer “just in case” they need to switch database vendors. But come on, really? Let’s be pragmatic here: your database is half of your application. You’re giving up performance and conveniences by sticking to the bare minimum SQL standard. If and when the day comes to switch platforms, it will be such a monumental undertaking (because whatever requires you to switch vendors must be a serious ceiling you’re running into elsewhere) that you’ll be able to dedicate the extra couple of hours to refactoring some vendor-specific SQL. Switching vendors also means changing query planners and inheriting a new set of defaults, eliminating any kind of nirvana-like abstraction for your database.

I’m not suggesting you tie your data layer to the database as tightly as possible, but you shouldn’t be afraid to sprinkle in some vendor-specific SQL or optimizations if it means your application runs better.

In Closing…

In rewriting my homegrown-framework application as a Model-Glue, Coldspring and Transfer application, I had the opportunity to revisit my code at every tier. The tips above are a handful of the ones that served me well in squeezing better performance out of my existing hardware and software infrastructure.

Got your own? Share it in the comments! Use the <code> tag to retain your formatting.

2 Comments

  1. Zack Steinkamp said:

    on October 9, 2008 at 5:53 pm

    Good stuff Brian! Definitely the “slow query log” (as us Mysql hacks call it) is usually the place to start.

    Not sure about Postgres, but there are many many memory tuning options in MySql that can make huge performance gains. It can be daunting though, and may be money well spent to hire an expert for a day for some tuning / training.

    For hardcore datasets, optimizing record lengths with regard to disk cluster size will buy additional performance too. Also, avoiding “blob” types will help reduce disk use (again, on MySql — other vendors may vary).

    I’m sure the list could grow to 100s of items ;-)

  2. Brian said:

    on October 10, 2008 at 8:06 am

    @Zack – memory tuning for sure. I think what many people don’t realize is that your DB can put the entire data set into RAM if it’s not so large; RAM is orders of magnitude faster than disk. Guess what happens then? :)

    We’ve looked at optimizing record lengths and stripe sizes, etc, but in benchmarking they generally lead to percentage improvements – solid, but not the orders of magnitude you can often find by just revisiting stuff you did awhile ago. At least, that’s what I find in my coding!

    Good tip on avoiding BLOBs – put your files and stuff in the filesystem. Better all the way around.
