
Are you sick of hearing about the cloud yet? If not, then you’ll probably be eager to read this article. If you are, then you should read this article for an easy way to take advantage of cloud infrastructures like Amazon’s S3 in order to speed up your web site or application.

This article is going to focus on storing static web assets – your media, JavaScript and CSS – on Amazon’s S3: the whys, the hows, and what you need to know to switch from a standard web server like Apache or IIS to hosting the files with Amazon. If you don’t yet have YSlow installed in Firefox, I would recommend getting it now since it will offer you a wealth of insight into how your site is actually performing across a dozen or so categories.

Why would you want to use S3?

In one word, performance. Web browsers allow a fixed number of simultaneous connections to each host as a way of balancing per-connection speed against the overall load time of a page. Many web browsers are limited to just two simultaneous connections per host. Think about your home page – does it have 20 or 30 images? Even if they are only 500 bytes each, you still pay for a DNS lookup, a TCP connection and handshake, the transfer and the load of each asset. Much of this process can be cached, but the per-host limit remains, and YSlow can show you visually how well your site’s asset downloads are parallelized. There are two ways to get around this:

  • Combine multiple assets into one file to reduce the download count (a minimal sketch follows this list)
    • Combine multiple images into a sprite file
    • Concatenate your CSS – even print and screen styles can share a single file!
    • Combine commonly-used JavaScript files into fewer bundles
  • Spread the assets of your site out across more hosts
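
As a simple illustration of the first approach, here is a minimal CFML sketch (my own example, not from the article – the filenames and paths are hypothetical) that concatenates several stylesheets into one file so a page needs only a single CSS request:

    <!--- Read each stylesheet and append it to one combined file --->
    <cfset combined = "" />
    <cfloop list="reset.css,layout.css,print.css" index="fileName">
    	<cfset combined = combined & fileRead(expandPath("/css/" & fileName)) & chr(10) />
    </cfloop>
    <cfset fileWrite(expandPath("/css/all.css"), combined) />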

There are two reasons why I like S3. One is that it has an incredible number of tools and users. The second is that you can go from a “poor man’s” accelerator, simply hosting your assets on S3, to a full-blown CDN – with literally the click of a mouse – via Amazon’s CloudFront service, which uses S3 as its source. CloudFront, like CDN pioneer Akamai, puts servers all around the world close to users and serves your files from the machine closest to each user. This results in fewer network hops, lower latency and faster transfers, improving the user experience. Other players include CacheFly, Limelight and SimpleCDN.

Another great use of S3 is for hosting user-uploaded files. Especially when you have a cluster of servers, you may need to make those files available across all of the nodes, which normally involves some form of synchronization. S3 can fill that role while providing dirt-cheap storage ($0.15/GB per month plus $0.10/GB for transfer). Plus, there are no concerns about backing up those files – S3 automatically stores several copies of each file across its network for redundancy.

Lastly, S3 supports HTTP headers like Expires and Cache-Control which can tell web browsers to keep your static assets in the local cache. Subsequent page views will have a “Primed Cache” experience that radically reduces the data required to load the page, often to as little as just the raw HTML.
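
For example, setting headers along these lines on an asset (the date and max-age here are illustrative – pick whatever far-future policy suits your release cycle) tells browsers they may reuse their cached copy for up to a year:

    Expires: Thu, 15 Apr 2010 20:00:00 GMT
    Cache-Control: max-age=31536000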

Did I mention it’s cheap?

What are the Gotchas?

S3 is not a web server. If you’re used to something like Apache and mod_gzip, well, you’re in for a few changes:

  • Amazon can serve gzipped content (which shrinks the size of the file sent over the wire, something most browsers support), but it doesn’t automatically negotiate it. In other words, you need to compress the files yourself and conditionally serve the compressed versions to browsers that support them.
  • Most people recommend setting up an FQDN like cdn.motorsportreg.com so you can point a DNS CNAME record at s3.amazonaws.com (see the sample DNS record after this list). When configured, you can access files at any of the following URLs:
    • http://bucketname.s3.amazonaws.com/file.jpg
    • http://s3.amazonaws.com/bucketname/file.jpg
    • http://vanity.yourdomain.com/file.jpg – where vanity.yourdomain.com is both the name of your bucket and a CNAME pointing to s3.amazonaws.com (the CNAME is something you set up in your DNS; the bucket name needs to match the hostname so Amazon can map the request to the right bucket)
  • Amazon will let you set HTTP headers for caching and the MIME type, but you must do it on a per-file basis; there is no mime.types file that automagically determines the right value for you. This can be automated but must be accounted for.
  • SSL is supported, but Amazon has a wildcard SSL certificate for *.s3.amazonaws.com. That means https://bucketname.s3.amazonaws.com works without browser certificate mismatch issues, but a vanity CNAME like https://vanity.yourdomain.com will throw a fit when accessed over SSL because the certificate is for *.s3.amazonaws.com and not vanity.yourdomain.com. If you serve CSS from S3 over SSL using the vanity approach, your site will appear completely unstyled in WebKit browsers like Chrome and Safari. Bummer! As of today, Amazon will not let you use your own SSL certificate, so if SSL is a requirement for you, you’ll need to skip the vanity approach. And if you do skip it, don’t be tempted to use the vanity URL over HTTP and the non-vanity URL over HTTPS – you want the same reference everywhere, or else the user will download those files a second time!
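
For reference, the vanity setup described above boils down to a single DNS record roughly like the following (the hostname is illustrative, and it assumes a bucket named cdn.motorsportreg.com so the bucket name matches the host):

    cdn.motorsportreg.com.    IN    CNAME    s3.amazonaws.com.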

Migration Guide

Despite these limitations, we’re using S3 with great success by automating our deployments to account for the gotchas and we’re now serving only CFML requests from our two-node cluster. Last month, S3 storage and transfer for our medium-sized web application cost us about $1.22.

  1. Open an Amazon Web Services account and add S3
  2. Get a free S3 client like S3Fox or CloudBerry Explorer. The latter will let you set HTTP headers in a GUI application before you get to the point where you’re doing it programmatically, and you’ll need that capability in order to set expiry headers, MIME types and so forth. S3Fox is so easy to pop into Firefox that it doesn’t hurt to have it around as well.
  3. Create two buckets, one for compressed (gzipped) content and one for uncompressed. We named ours cdn-sitename and cdnz-sitename, where cdnz represents the gzipped bucket. IMO, it’s preferable to have two buckets with identical files in each rather than have different names for the compressed file. This makes switching between the two much simpler and you can always rely on a single path/filename regardless of how it will eventually be served. KISS!
  4. If you need to serve assets over SSL, do NOT use the CNAME vanity approach; that way you avoid the SSL certificate mismatch. Instead, just use the bucket names and access the files as https://bucketname.s3.amazonaws.com/…
  5. Big sites like YouTube and Yahoo take browser download parallelism to the max by using more than one hostname for assets. Rather than just “static.domain.com”, they have static1 and static2 and maybe more. Figure out a strategy for which assets go on which host – you don’t want the assignment to be random, or the same asset will be referenced from different hosts and downloaded more than once (a sketch of one deterministic approach follows this list). This will let each page split up its downloads as much as possible.
  6. Upload your files using CloudBerry. Be aware that by default, CloudBerry uploads any file greater than 10 MB in “chunks” and then masks those chunks in the UI; in reality, on S3, you wind up with multiple 10 MB chunks instead of a single file, so be sure to disable this under preferences. When you’re uploading, set the Expires header to a date in the far future – the maximum allowed by the RFC is one year out. You should also set the MIME types (image/gif for GIFs, text/css for CSS, etc.) or the browser won’t know what to do with the file when you access it directly (although in my experience it will still work when included in a web page).
  7. Modify your site to use a prefix on your static assets like:
    <img src="#request.prefix#/images/foo.gif" />

    Using the prefix in this way is helpful during development on your local machine where you can set prefix to an empty string and work completely offline or use local files independent of S3. In production, you populate the prefix and can turn on/off your references to S3 with a single configuration. In fact, if S3 were to have an outage, we could switch to our local web servers with a simple config file change.

  8. Use YUI Compressor or another tool to minify your JS and CSS on the fly and even combine files (see the sketch after this list). This lets you keep your heavily commented and nicely formatted JS and CSS during development without worrying about bloating the files for deployment. No more compromise!
  9. Compress text files like CSS, JavaScript, XML and HTML with gzip before you upload them. Keep the filenames the same, but be sure to only upload the compressed files into the compressed bucket, and set the Content-Encoding header to “gzip” and the Vary header to “Accept-Encoding”. This will make sure that proxies and browsers do the right thing. It also gives you some appreciation for all of that kung-fu that Apache or IIS do for you out of the box.
  10. You can speed up your site even further by locally including third-party files you would otherwise reference remotely. For example, if you use Google Analytics, you can download ga.js and append it to your local builds (a sketch of this also follows the list); just remember that you need to refresh it periodically. I’ll share my Ant script, which does this automatically during deployment, in the near future.
  11. In your application, you need code like the following to determine HTTP vs. HTTPS and uncompressed vs. compressed support:
    <cfif cgi.server_port_secure>
    	<cfif findNoCase("gzip", cgi.HTTP_ACCEPT_ENCODING)>
    		<cfset request.prefix = "https://cdnz-sitename.s3.amazonaws.com" />
    	<cfelse>
    		<cfset request.prefix = "https://cdn-sitename.s3.amazonaws.com" />
    	</cfif>
    <cfelse>
    	<cfif findNoCase("gzip", cgi.HTTP_ACCEPT_ENCODING)>
    		<cfset request.prefix = "http://cdnz-sitename.s3.amazonaws.com" />
    	<cfelse>
    		<cfset request.prefix = "http://cdn-sitename.s3.amazonaws.com" />
    	</cfif>
    </cfif>

    Basically we’re building a source prefix that switches between HTTP and HTTPS and the two buckets you created for holding compressed and uncompressed content. So long as every static asset on your site has #request.prefix# at the beginning of the src, embed or link, you’ll be good to go.

  12. Monitor your web server access logs for static asset requests you missed – HTML email templates and scheduled tasks are common culprits for stale references. Within a week or two, your access logs should be free of static asset requests.
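
To illustrate step 5, here is a minimal CFML sketch of one deterministic way to pick a host for an asset so that a given file is always referenced from the same hostname. This is my own illustration rather than anything from our deployment, and the hostnames are hypothetical:

    <cffunction name="assetHost" returntype="string" output="false">
    	<cfargument name="assetPath" type="string" required="true" />
    	<!--- Hypothetical host list; hashing the path means the same asset always maps to the same host --->
    	<cfset var hosts = listToArray("http://cdn1-sitename.s3.amazonaws.com,http://cdn2-sitename.s3.amazonaws.com") />
    	<cfset var hostIndex = (asc(left(hash(arguments.assetPath), 1)) mod arrayLen(hosts)) + 1 />
    	<cfreturn hosts[hostIndex] />
    </cffunction>

An image tag would then use #assetHost("/images/foo.gif")# as its prefix, in the same spirit as the #request.prefix# example in step 7.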
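
Steps 8 and 9 can also be scripted. The sketch below is only an illustration of the idea – the real automation is the Ant script promised below, and the file paths and YUI Compressor jar name here are assumptions. It minifies a stylesheet by shelling out to YUI Compressor and then gzips the result with the JVM’s built-in classes, keeping the same filename for the compressed copy:

    <!--- Minify with YUI Compressor (assumes java is on the path and the jar name matches your download) --->
    <cfexecute name="java"
               arguments="-jar yuicompressor-2.4.2.jar --type css -o #expandPath('/build/all.css')# #expandPath('/css/all.css')#"
               timeout="120" />

    <!--- Gzip the minified file into a staging folder for the cdnz bucket, keeping the same filename --->
    <cfset cssBytes = fileReadBinary(expandPath("/build/all.css")) />
    <cfset fileOut = createObject("java", "java.io.FileOutputStream").init(expandPath("/build/gzipped/all.css")) />
    <cfset gzipOut = createObject("java", "java.util.zip.GZIPOutputStream").init(fileOut) />
    <cfset gzipOut.write(cssBytes) />
    <cfset gzipOut.close() />
    <cfset fileOut.close() />

You would then upload /build/all.css to the cdn bucket and /build/gzipped/all.css to the cdnz bucket, with the Content-Encoding and Vary headers from step 9 on the latter.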
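
Finally, for step 10, a quick sketch of pulling ga.js down at build time so it can be appended to a combined JavaScript bundle (the URL is the standard Google Analytics location; the bundle path is hypothetical):

    <!--- Fetch Google Analytics and append it to the local JS bundle --->
    <cfhttp url="http://www.google-analytics.com/ga.js" method="get" result="gaResult" />
    <cfset bundlePath = expandPath("/build/all.js") />
    <cfset fileWrite(bundlePath, fileRead(bundlePath) & chr(10) & gaResult.fileContent) />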

Next: Automation

It took me several days to get this right the first time. And now I need to do it again every time we update our static assets? How about automating this whole thing? Lucky for you, I’m going to give you the fruits of a full week of labor on my part – my Ant script! It does everything from checking my static assets out of Subversion, to automatically pulling in remote assets like Google Analytics’ ga.js, to compressing and uploading the two versions of each file to my buckets on S3. Look for this in the next week or so!

13 Comments

  1. Aaron Longnion said:

    on July 19, 2009 at 10:50 pm

    Thanks for the how-to. Very helpful. I look forward to the ANT script, too.

  2. Andy said:

    on July 19, 2009 at 11:50 pm

    Thanks so much for mentioning CloudBerry Explorer! I just want to mention that CloudBerry Explorer is absolutely free. There is no such thing as a trial version.

  3. Brian said:

    on July 20, 2009 at 7:55 am

    @Andy – thanks for the clarification; I thought I had a trial beta or something but I’ve updated the post to reflect that Cloudberry is free. It’s a great app (minus the transparent chunking enabled by default! I uploaded a file about 10 times and couldn’t hit it in a web browser before I figured it out by browsing with S3fox)

  4. Mario Rodrigues said:

    on July 22, 2009 at 9:22 am

    This is great stuff. I’m using S3 right now for loading content, but I haven’t yet figured out how to gzip the files in CF before uploading to S3.

    Looking forward to your next post.

  5. S3 Browser said:

    on July 25, 2009 at 11:41 am

    You can also use S3 Browser – Windows Client for Amazon S3.

  6. Mike said:

    on July 31, 2009 at 2:17 pm

    “Another great use of S3 is for hosting user-uploaded files.” That’s what I would like to do, but how do I provide a way for users to upload their files (videos in my case)? Since videos are big, I’d like to provide a progress bar too.

    Also, based on what you’ve written in your article, it sounds like you’re a pro at optimizing sites for performance. Do you offer your skills for hire?

  7. S3 Browser said:

    on July 31, 2009 at 10:10 pm

    YUI Uploader is a great way to upload files with progress bar.
    *Flash and JavaScript are required.

  8. Brian said:

    on August 1, 2009 at 7:23 am

    @Mike – I wouldn’t claim to be the world’s foremost expert but I know enough to make things snappy. Last time I checked, MotorsportReg.com scored a 98 in YSlow. :)

    Drop me a line at brian at vfive.com and we’ll talk.

  9. Deploying assets to Amazon S3 with Ant » ghidinelli.com said:

    on September 2, 2009 at 9:19 am

    [...] my previous post on storing web assets on Amazon S3, I promised to share the Ant script I developed from a week of work and testing to jumpstart your [...]

  10. Mohan said:

    on December 15, 2009 at 12:24 pm

    Do we need to download the SSL client certificate to store and retrieve objects from Amazon S3 via https://?

    I do not think so. But would like to get some expert opinion.

  11. Andy said:

    on December 15, 2009 at 12:33 pm

    You don’t need an SSL certificate to retrieve data from Amazon using https:// . Most of the file managers also support SSL, there is usually a checkbox to turn it on.

  12. Mohan said:

    on December 15, 2009 at 12:45 pm

    From an application standpoint – e.g., for a Java app using S3 to store objects – do the certificates need to be placed on the application servers where the app is deployed?

  13. Andy said:

    on December 15, 2009 at 12:54 pm

    I am not sure Amazon S3 supports client-side certificates, but I can assure you that you don’t need one. You don’t have to deploy anything extra on your application server to work with S3 using SSL.
