Poor man’s CDN


Posted by Jonathan Conway on 2008-05-09

UPDATE 2008-08-12
There’s a new patch for Varnish if you’re running on OpenSolaris, which can be found here

It’s been a long while since my last blog post. Things have been pretty crazy over at vzaar, so apologies for the lack of content.

I finally had a few minutes of time tonight to write up something that I’ve been meaning to share for a while now.

Over the past few years I’ve been talking with a few startups that have a lot of potential but very little cash. I’ve seen many a startup fail not because it wasn’t popular, but because the gap between becoming popular and getting enough investment or paying customers through the door was too large. So lately I’ve been spending most of my waking hours thinking about how best to squeeze every last bit of performance, and every last penny, out of both software and hardware. Even if you work for a cash-rich company, a little prudence can save hundreds of thousands of pounds over time, and can be the difference between that nice little bonus cheque and a boot up the ass.

Storing terabytes of data soon becomes expensive and a headache to manage. Sure, ZFS eases things in terms of management, but you’ve got to ask yourself: do you have the time and resources to spend? And as your site becomes ever more popular, do you have enough cash to keep the bandwidth fires going?

One of the things I decided to talk about was making a poor man’s storage + CDN that can scale to quite large demands while keeping costs as low as possible. Before I go any further on the CDN front: if you have enough cash or bargaining power, it will be cheaper and more performant to go with a proper content delivery network like Limelight. I make no false claims.

However, in the first stages of bootstrapping, a lot of companies simply can’t take the risk of bargaining for a good rate with a CDN when they don’t know how popular the site will be. You may be the most unpopular site on the web. On the other hand, you could be the next Facebook, and there’s no way of knowing in the beginning.

The solution I present merely simplifies and reduces the cost of content storage and delivery via everyone’s favourite Amazon web service, S3. It isn’t a true CDN, as you’ll probably realise, but if you’ve read this far then you probably don’t care and this alternative is good enough.

My solution first started out as a combination of just Nginx, Merb and memcached with some rather complicated scripts in between sweeping the cached content from the file system.

After a while I came to realise that my solution was just way too complicated and that I was yak shaving. A few weeks later I happened to read how WordPress basically do the same thing, but use Varnish to handle the caching of content.

They have a setup where Nginx proxies to Varnish. If the requested object is found in the cache then it’s served; otherwise Varnish forwards the request to a PHP script that retrieves the object from S3 and serves it. They also have a ‘hawtness’ algorithm that decides whether to put the object into the cache. My solution doesn’t cover that, but it wouldn’t be hard to add it to the Merb app.
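I don’t know how the ‘hawtness’ algorithm actually works, so the following is only a guess at the general idea: count recent requests per object and only hand out a cacheable max-age once an object crosses a threshold. The class name, the threshold, the window and the header values are all hypothetical and are not part of pmcdn. It also assumes Varnish derives its TTL from the backend’s Cache-Control max-age, which is worth checking against your Varnish version.

```ruby
# Hypothetical sketch of a "hotness" check. This is NOT WordPress's
# algorithm and NOT part of pmcdn; every name and number is an assumption.
class HotnessTracker
  def initialize(threshold: 3, window: 60)
    @threshold = threshold   # requests needed before an object counts as hot
    @window = window         # how far back (in seconds) requests are counted
    @hits = Hash.new { |hash, key| hash[key] = [] }
  end

  # Record a request for +path+ at time +now+ (in seconds) and return true
  # once the path has been requested at least +threshold+ times within the
  # window.
  def hot?(path, now = Time.now.to_f)
    recent = @hits[path]
    recent << now
    recent.reject! { |t| now - t > @window }
    recent.size >= @threshold
  end

  # Cache-Control value the app could send back: hot objects get an hour,
  # cold ones get max-age=0 so the cache has no reason to keep them.
  def cache_control_for(path, now = Time.now.to_f)
    hot?(path, now) ? "max-age=3600" : "max-age=0"
  end
end
```

In the Merb app this would sit in the action that serves the asset, setting the response header from cache_control_for before streaming. Note that the hit list lives in process memory, so each Merb instance would keep its own counts.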

Our solution is very similar to WordPress’s setup, except that we use a small Merb app to serve the assets from S3.

Why Merb and not Rails? I needed something with high concurrency, plus Merb can do some pretty funky stuff with regards to rendering, which you’ll see in a bit.

The first thing we need to do is set up Nginx. I’ll be setting this up on an Ubuntu server, so change these instructions as you see fit. We’ll also be installing the fair proxy load balancer Nginx module (although not using it for now).

Download, unpack and build Nginx and the load balancer module:

cd /usr/local/src
curl -O http://sysoev.ru/nginx/nginx-0.6.30.tar.gz
git clone git://github.com/gnosek/nginx-upstream-fair.git

tar xvzf nginx-0.6.30.tar.gz
cd nginx-0.6.30
./configure --prefix=/opt/nginx  --with-openssl=/usr/lib \
--with-sha1=/usr/lib --with-http_flv_module \
--with-http_ssl_module --with-http_gzip_static_module \
--add-module=/usr/local/src/nginx-upstream-fair

make
sudo make install

Next up is Varnish. Again, we’ll be building from source.

cd /usr/local/src
curl -O http://heanet.dl.sourceforge.net/sourceforge/varnish/varnish-1.1.2.tar.gz
tar xvzf varnish-1.1.2.tar.gz
cd varnish-1.1.2
./configure
make
sudo make install

If you get any errors while building, make sure you install any related libs. If you don’t know the package name for a lib, then running ‘aptitude search NAME_OF_LIB’ should bring it up.

Now to configure Varnish. Here’s a quick VCL file I knocked up. It’s not perfect and is designed to be tweaked, but is good enough to get started with. I’ve annotated the file to give you a quick heads up.


//We'll be using 2 merb instances b1 and b2
backend b1 {
        set backend.host = "127.0.0.1";
        set backend.port = "4000";
}

backend b2 {
        set backend.host = "127.0.0.1";
        set backend.port = "4001";
}

//If you want to be able to expire stuff from the cache
//from life cycle events in your Merb/Rails/Ramaze/Mack app
//then you'll need to set which IPs you can do this from
acl purge {
        "localhost";
        "127.0.0.1";
}

sub vcl_recv {
        //Handle PURGE first: both branches below end in lookup,
        //so an ACL check placed after them would never run
        if (req.request == "PURGE") {
                if (!client.ip ~ purge) {
                        error 405 "Not allowed.";
                }
                lookup;
        }
        //We'll be serving small files from the first merb instance
        if (req.request == "GET" && req.url ~ "\.(gif|jpg|swf|css|js)$") {
                set req.backend = b1;
                lookup;
        } else {
                //And everything else such as movies etc from another
                //instance
                set req.backend = b2;
                lookup;
        }
}
//Manual cache expiry stuff
sub vcl_hit {
        if (req.request == "PURGE") {
                set obj.ttl = 0s;
                error 200 "Purged.";
        }
}

sub vcl_miss {
        if (req.request == "PURGE") {
                error 404 "Not in cache.";
        }
        //Just something to stop us serving terabytes of data to
        //search engines and making us poor :(
        if (req.http.user-agent ~ "spider") {
                error 503 "Not presently in cache";
        }
}
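
The purge ACL and the PURGE handling in the VCL are meant to be driven from your app’s life cycle events. As a rough sketch of what that might look like from Ruby, the helper below sends a PURGE request straight to Varnish with Net::HTTP. The method names are my own, and the address and port are assumptions (port 8080 is where Varnish is started further down). The Host header has to match whatever Varnish saw when the object was cached; behind Nginx’s proxy_pass that is likely the upstream name rather than the public domain, so treat the value you pass in as something to verify.

```ruby
require "net/http"

# Build a PURGE request for +path+. Net::HTTP has no ready-made PURGE class,
# so the generic request class is used with a custom method name.
# +host+ must match the Host header Varnish saw when it cached the object.
def build_purge_request(path, host)
  # Arguments: method name, request has body?, response has body?, path, headers
  Net::HTTPGenericRequest.new("PURGE", false, true, path, { "Host" => host })
end

# Send the PURGE directly to Varnish, bypassing Nginx. The request has to
# come from an address listed in the purge ACL.
# 127.0.0.1:8080 is an assumption based on how Varnish is started later on.
def purge(path, host, varnish_host: "127.0.0.1", varnish_port: 8080)
  response = Net::HTTP.start(varnish_host, varnish_port) do |http|
    http.request(build_purge_request(path, host))
  end
  # The VCL is written to answer 200 (purged), 404 (not in cache)
  # or 405 (client not allowed).
  response.code
end
```

A call such as purge("/assets/images/test1.jpg", "cache1") could then go in whatever hook runs after an asset is replaced or deleted.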

Save the VCL file above as /usr/local/etc/default.net.vcl.

Now, before we configure Nginx, let’s download and take a look at the Merb app that does all the retrieval and streaming from S3. I’ve put it on GitHub and it can be found at http://github.com/jaikoo/pmcdn/tree/master.

git clone git://github.com/jaikoo/pmcdn.git

You’ll need to edit the s3_conf.yml file in the config directory and change the BUCKET_NAME in the assets controller before we can get going. After that’s done, cd into the Merb app’s directory and start up two Merb instances.

cd pmcdn
merb -d -c 2

Hopefully you won’t have any errors, so it’s on to configuring Nginx.

#/etc/nginx/nginx.conf
#I previously created a user for nginx www-data
user www-data;
worker_processes  1;

error_log  /var/log/nginx/error.log;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    access_log  /var/log/nginx/access.log;

    sendfile        on;
    #tcp_nopush     on;

    #keepalive_timeout  0;
    keepalive_timeout  65;
    tcp_nodelay        on;

    gzip  on;

    #this should really go in include /etc/nginx/sites-enabled/*;
    upstream cache1 {
      server 127.0.0.1:8080;
    }

    server {
        #Change this to your asset server domain
        server_name asset1.jiggahaeyo.com;
        listen 80;
        proxy_max_temp_file_size 0;
        proxy_next_upstream off;
        proxy_read_timeout 60;
        proxy_intercept_errors on;

        error_page 404 @404;
        location @404 {
            rewrite .* /404.html last;
        }

        location / {
            proxy_pass http://cache1;
        }
    }

}

Now we need to start up Varnish, but on a different port. Its default port is 80, and if your asset server were on a separate machine from your main app you could get away with skipping Nginx and just using Varnish as is. For our example, though, we’ll start Varnish on port 8080.


/usr/local/sbin/varnishd -a 0.0.0.0:8080 -f /usr/local/etc/default.net.vcl

Then it’s Nginx’s turn:

/etc/init.d/nginx start

Next, make sure you’ve already uploaded an asset to the bucket that you declared in the assets controller. To test this out, if you have an image called test1.jpg in the namespace images, then the URL would look something like http://asset1.jiggahaeyo.com/assets/images/test1.jpg. If you’re tailing the Merb log, you’ll see that the first time you make this request the Merb app queries S3 and streams the file back through Varnish and Nginx. Hit the URL again and you’ll be served the file again, but this time the Merb app won’t be hit at all, as the file is coming from Varnish.

The Merb app automatically grabs everything that matches /assets* and forwards it to the ‘show’ action in the assets controller. We then remove the ‘/assets’ bit from the request path and pass the rest on as the path to the object in S3, which is retrieved and streamed back without loading the whole file into memory.
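
To make those two steps concrete, here is a standalone sketch of the idea rather than the actual pmcdn code. The method names are made up, and I’m assuming the leading slash is dropped along with ‘/assets’ so that the result looks like an S3 key. The streaming helper works on any IO-like object; in the real app the source would be the S3 response and the sink the client connection.

```ruby
require "stringio"

ASSET_PREFIX = "/assets"

# Turn a request path such as "/assets/images/test1.jpg" into the S3 key
# "images/test1.jpg". Returns nil for paths outside the prefix or with
# nothing after it. Dropping the leading slash is an assumption.
def s3_key_for(request_path)
  return nil unless request_path.start_with?("#{ASSET_PREFIX}/")
  key = request_path[(ASSET_PREFIX.length + 1)..-1]
  key.empty? ? nil : key
end

# Copy +source+ to +sink+ in fixed-size chunks so that only one chunk is
# held in memory at a time. Returns the number of bytes copied.
def stream_in_chunks(source, sink, chunk_size = 16 * 1024)
  total = 0
  # IO#read with a length returns nil at end of stream, which ends the loop.
  while (chunk = source.read(chunk_size))
    sink.write(chunk)
    total += chunk.bytesize
  end
  total
end

# Usage with in-memory stand-ins for the S3 response and the client socket:
source = StringIO.new("pretend this is a large movie file")
sink = StringIO.new
stream_in_chunks(source, sink, 8)
```

The chunk size is a trade-off: bigger chunks mean fewer reads and writes, smaller chunks keep memory per request lower when many large files are being served at once.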

OK, this is all cool and stuff, but I know what you’re thinking: although caching saves us S3 content serving costs, we’re just pushing the cost of bandwidth onto our hosting provider. Bandwidth’s not cheap! Or is it? You’ll find that there are numerous hosts out there who give some fantastic deals when it comes to bandwidth. Joyent, for example, do a pretty good deal on their accelerators, which come with 10TB of bandwidth. True, the file storage available on those accelerators isn’t much, but who cares when we’re using S3!

Also, if our app is partitioned, we can put an asset cache on each of the partitioned nodes and use any spare bandwidth there, rather than eating all the bandwidth of just one asset server.
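
One way to spread requests over several asset caches, sketched here purely as an illustration, is to pick the host from a checksum of the asset path. The same path then always maps to the same cache, so each object is only fetched from S3 by one node. The second host name and the helper name are hypothetical; only asset1.jiggahaeyo.com appears in this post.

```ruby
require "zlib"

# asset1 is the example domain from the Nginx config; asset2 is a
# hypothetical second cache node added for illustration.
ASSET_HOSTS = %w[asset1.jiggahaeyo.com asset2.jiggahaeyo.com].freeze

# Pick a host deterministically from the path's CRC32 so repeated requests
# for the same asset always land on the same cache.
def asset_url(path, hosts = ASSET_HOSTS)
  host = hosts[Zlib.crc32(path) % hosts.size]
  "http://#{host}#{path}"
end
```

The catch with a plain modulo is that adding or removing a host remaps most paths, so the caches start cold again after any change to the list.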

The knowledgeable among you are probably now thinking that this is all cool, but what about geographical caching? You get that with a true CDN. Surely the poor man’s CDN should do this too? Well, indeed it can! However, I’ll leave that to John Buswell’s article in o3 magazine on global load balancing, which can be found here, as he explains it far better than I could.