Based on my geocoded zip code database research, I wound up purchasing the ZipList Can-USA GeoCode database. Here’s what I did to get their 848,000-record CSV file into a PostgreSQL database table. First, create a table to hold the data:

CREATE TABLE tblLookupZipCode (
vchCity varchar(33) DEFAULT '' NOT NULL
,booDST boolean DEFAULT false NOT NULL -- is daylight savings observed?
,decLatitude real DEFAULT 0.00 NOT NULL
,radLatitude real DEFAULT 0.00 NOT NULL
,decLongitude real DEFAULT 0.00 NOT NULL
,radLongitude real DEFAULT 0.00 NOT NULL
,chrAreaCode char(3) DEFAULT '' NOT NULL -- telephone area code
,chrRegion char(2) DEFAULT '' NOT NULL -- state/province
,chrCountry char(1) DEFAULT 'U' NOT NULL -- U for USA, C for Canada
,sntOffsetGMT smallint DEFAULT 0 NOT NULL -- the number of hours offset from greenwich mean time
,vchPostalCode varchar(7) DEFAULT '' PRIMARY KEY -- use varchar, NOT char!
) WITHOUT OIDS;

I store the latitude and longitude both in the original decimal degrees and in radians (for “distance from” searches) so I don’t have to convert them on every search. The data from ZipList looks like this:

"City","ST","ZIP","AC","County","TZ","DST","Country","Lat","Long","Type","FIPS"
"Scotch Lake","NS","B1Y 3X1","902","","EST+1","Y","C","46.1800","-60.3700",,
"Barrachois","NS","B1Y 3X2","902","","EST+1","Y","C","45.7300","-63.2700",,
"Millville","NS","B1Y 3X3","902","","EST+1","Y","C","45.0000","-64.8200",,
"Little Bras d'Or","NS","B1Y 3X4","902","","EST+1","Y","C","46.2300","-60.2800",,
"Hillside Boularderie","NS","B1Y 3X5","902","","EST+1","Y","C","46.2300","-60.3800",,
"Florence","NS","B1Y 3X6","902","","EST+1","Y","C","46.2700","-60.2700",,
"Mill Creek","NS","B1Y 3X7","902","","EST+1","Y","C","45.7679","-64.3645",,
"Point Aconi","NS","B1Y 3X8","902","","EST+1","Y","C","46.3300","-60.3000",,
"Bras d'Or","NS","B1Y 3X9","902","","EST+1","Y","C","46.2500","-60.2800",,
"Hillside Boularderie","NS","B1Y 3Y1","902","","EST+1","Y","C","46.2300","-60.3800",,

To get the ZipList database into a format suitable for importing, a few of the fields need to be modified. ZipList uses a time zone field like “EST” or “PST-3”, and I want just the number of hours offset from GMT, so I imported the CSV file (ZCUGOEM.txt) into Microsoft Access and created the following query:

SELECT field1 AS vchCity
, IIf(field7='Y','1','0') AS booDST
, field9 AS decLatitude
, field9 * 0.017453292519943 AS radLatitude
, field10 AS decLongitude
, field10 * 0.017453292519943 AS radLongitude
, field4 AS chrAreaCode
, field2 AS chrRegion
, field8 AS chrCountry
, IIf(field6='NST',-3,
IIf(field6='EST',-5,
IIf(field6='CST',-6,
IIf(field6='MST',-7,
IIf(field6='PST',-8,
IIf(field6='EST+1',-4,
IIf(field6='GMT+1',1,
IIf(field6='PST-1',-9,
IIf(field6='PST-2',-10,
IIf(field6='PST-3',-11,
IIf(field6='PST-4',12,
IIf(field6='PST-5',11,
IIf(field6='PST-6',10,
IIf(field6='PST-7',9,0)))))))))))))) AS sntOffsetGMT
, field3 AS vchPostalCode
FROM ZCUGOEM;

I should note that the FIPS code wasn’t of interest to me, so I’ve ignored it here. The 0.017453292519943 multiplier is just π/180, which converts the degree values into radians. I used IIF (the Access version of a CASE statement) to convert the time zone into a straight number, with 0 as the fall-through for anything unrecognized. In my simplified system, you’ll notice I’m cheating by changing the 3.5 hour Newfoundland (NST) offset to 3 hours. Oops! You could use this approach to do whatever modifications you need for your situation. From here, export the results from MS Access to a CSV file (zipcodes.csv) with quotes around all fields.
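As an aside, if you don’t have Access handy, you could do the same transformation inside PostgreSQL after loading the raw ZipList file into a staging table. Here’s a rough, untested sketch of that route; tblRawZipList and its lowercase column names are made-up placeholders for a staging table with all-text columns, and the built-in radians() function stands in for the π/180 multiplication:

-- Rough, untested sketch. tblRawZipList (and its lowercase column names)
-- is a hypothetical staging table holding the raw ZipList file as text.
INSERT INTO tblLookupZipCode
SELECT city
, (dst = 'Y')
, lat::real
, radians(lat::real)
, lng::real
, radians(lng::real)
, ac
, st
, country
, CASE tz
    WHEN 'NST' THEN -3
    WHEN 'EST' THEN -5
    WHEN 'CST' THEN -6
    WHEN 'MST' THEN -7
    WHEN 'PST' THEN -8
    WHEN 'EST+1' THEN -4
    WHEN 'GMT+1' THEN 1
    WHEN 'PST-1' THEN -9
    WHEN 'PST-2' THEN -10
    WHEN 'PST-3' THEN -11
    WHEN 'PST-4' THEN 12
    WHEN 'PST-5' THEN 11
    WHEN 'PST-6' THEN 10
    WHEN 'PST-7' THEN 9
    ELSE 0
  END
, zip
FROM tblRawZipList;

Anyway, back to the Access export.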

Now you should have a roughly 62MB CSV file. As a tip, if you use the pipe character “|” as your delimiter and skip the quoting, you can cut that file down to about 45MB. That’s 17MB of just quotes! Now upload your CSV file (~3MB when zipped) to your Postgres server, log in with psql, and run the following:

COPY tblLookupZipCode FROM '/path/to/zipcodes.csv' DELIMITERS ',' CSV QUOTE '"';
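If you went with the pipe-delimited, no-quotes file from the tip above, the same load would look something like this instead (zipcodes.txt here is just whatever you named that export):

COPY tblLookupZipCode FROM '/path/to/zipcodes.txt' DELIMITERS '|';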

Wait a couple of minutes. It should finish without problems. Now you’ll want to vacuum this thing for performance reasons, so run the following:

VACUUM VERBOSE ANALYZE tblLookupZipCode;

This will take some time (maybe 10 minutes!), so be patient. I think you could make this faster by eliminating the PRIMARY KEY on the table, importing the data, and THEN adding the PRIMARY KEY constraint. I didn’t try this, though.
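If you want to experiment with that, the sketch would be: leave the PRIMARY KEY off vchPostalCode in the CREATE TABLE, run the COPY as above, and then add the key afterwards:

-- untried: add the primary key only after the bulk load
ALTER TABLE tblLookupZipCode ADD PRIMARY KEY (vchPostalCode);

Then run the same VACUUM ANALYZE as before.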

Now you can use the great-circle distance formula for “distance from” searches in your web app!
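For example, a “ten nearest postal codes” query using the pre-computed radian columns might look something like this. The 3959 is the Earth’s radius in miles, the LEAST/GREATEST clamp just keeps floating point rounding from pushing acos() out of its valid range, and the starting postal code is one from the sample data above:

-- sketch: ten closest postal codes to a given one, great-circle distance in miles
SELECT z.vchPostalCode, z.vchCity, z.chrRegion
, 3959 * acos(LEAST(1.0, GREATEST(-1.0,
      sin(o.radLatitude) * sin(z.radLatitude)
    + cos(o.radLatitude) * cos(z.radLatitude)
    * cos(z.radLongitude - o.radLongitude)))) AS miles
FROM tblLookupZipCode z, tblLookupZipCode o
WHERE o.vchPostalCode = 'B1Y 3X1'
AND z.vchPostalCode <> o.vchPostalCode
ORDER BY miles
LIMIT 10;

This does a full scan of the table, of course; for a production search you’d probably want to narrow things down with a bounding box on decLatitude and decLongitude first.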
