Poor man’s CDN


It’s been a long while since my last blog post. Things have been pretty crazy over at vzaar so apologies for lack of content.

I finally had a few minutes of time tonight to write up something that I’ve been meaning to share for a while now.

Over the past few years I’ve been talking with a few startups who have a lot of potential but little in the way of cash. I’ve seen many a startup fail not because they weren’t popular but because the gap between becoming popular and getting enough investment or paying customers through the door was too large. So lately I’ve been spending most of my waking hours thinking about how best to squeeze every last bit of performance and every last penny out of both software and hardware. Even if you work for a cash-rich company, a little prudence can save hundreds of thousands of pounds over time and can be the difference between that nice little bonus cheque and a boot up the ass.

Storing terabytes of data soon becomes expensive and a headache to manage. Sure, ZFS eases things in terms of management, but you’ve got to ask yourself: do you have the time and resources to spend? And as your site becomes ever more popular, do you have enough cash to keep the bandwidth fires going?

One of the things I decided to talk about was making a poor man’s storage + CDN that can scale to quite large demands while keeping costs as low as possible. Before I go further on the CDN front: if you have enough cash or bargaining power, it will be cheaper and more performant to go with a proper content delivery network like Limelight. I make no false claims.

However, in the first stages of bootstrapping, a lot of companies simply can’t take the risk of negotiating a rate with a CDN when they don’t know how popular the site will be. You may be the most unpopular site on the web. On the other hand you could be the next Facebook, and there’s no way of knowing in the beginning.

The solution I present here merely simplifies and reduces the cost of content storage and delivery via everyone’s favourite Amazon web service, S3. It isn’t a true CDN, as you’ll probably realise, but if you’ve read this far then you probably don’t care and this alternative is good enough.

My solution first started out as a combination of just Nginx, Merb and memcached with some rather complicated scripts in between sweeping the cached content from the file system.

After a while I came to realise that my solution was just way too complicated and that I was yak shaving. It was a few weeks later that I happened to read how Wordpress basically do the same thing but instead use Varnish to handle the caching of content.

They have a setup where Nginx proxies to Varnish. If the requested object is found in the cache then it’s served; otherwise Varnish forwards the request to a PHP script that retrieves the object from S3 and serves it. They also have a ‘hawtness’ algorithm that decides whether to put the object into the cache. My solution doesn’t cover that, but it wouldn’t be hard to add it to the Merb app.

Our solution is very similar to Wordpress’s setup, except that we use a small Merb app to serve the assets from S3.
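
The cache-or-fetch flow described above can be sketched in a few lines of Ruby. This is only a toy model, not the real stack: a Hash stands in for Varnish, a block stands in for the Merb app that talks to S3, and the class and method names are made up for illustration.

```ruby
# Toy read-through cache. The Hash plays the part of Varnish and the block
# passed to the constructor plays the part of the app fetching from S3.
# All names here are hypothetical.
class ReadThroughCache
  attr_reader :origin_hits

  def initialize(&fetch_from_origin)
    @store = {}
    @fetch_from_origin = fetch_from_origin
    @origin_hits = 0
  end

  # Serve from the cache when possible; otherwise fetch once and remember it.
  def get(path)
    return @store[path] if @store.key?(path)

    @origin_hits += 1
    @store[path] = @fetch_from_origin.call(path)
  end

  # Manual expiry: returns true if something was actually removed.
  def purge(path)
    !@store.delete(path).nil?
  end
end

cache = ReadThroughCache.new { |path| "bytes for #{path}" }
cache.get('/assets/images/test1.jpg') # miss: goes to the origin
cache.get('/assets/images/test1.jpg') # hit: origin is not touched
puts cache.origin_hits
```

The point of the design is that the origin (and therefore S3) only gets asked once per object until that object is expired.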

Why Merb and not Rails? I needed something with high concurrency and plus Merb can do some pretty funky stuff with regards to rendering which you’ll see in a bit.

First thing we need to do is set up Nginx. I’ll be setting this up on an Ubuntu server so change these instructions as you see fit. We’ll also be installing the fair proxy load balancer Nginx module (although not using it for now).

Download, unpack and build Nginx and the load balancer module:

cd /usr/local/src
curl -O http://sysoev.ru/nginx/nginx-0.6.30.tar.gz
git clone git://github.com/gnosek/nginx-upstream-fair.git

tar xvzf nginx-0.6.30.tar.gz
cd nginx-0.6.30
./configure --prefix=/opt/nginx  --with-openssl=/usr/lib \
--with-sha1=/usr/lib --with-http_flv_module \
--with-http_ssl_module --with-http_gzip_static_module \
--add-module=/usr/local/src/nginx-upstream-fair

make
sudo make install

Next up is Varnish. Again we’ll be building from source.

cd /usr/local/src
curl -O http://heanet.dl.sourceforge.net/sourceforge/varnish/varnish-1.1.2.tar.gz
tar xvzf varnish-1.1.2.tar.gz
cd varnish-1.1.2
./configure
make
sudo make install

If you get any errors while building, make sure you install any related libs. If you don’t know the lib name then doing an ‘aptitude search NAME_OF_LIB’ should bring up the package name.

Now to configure Varnish. Here’s a quick VCL file I knocked up. It’s not perfect and is designed to be tweaked, but is good enough to get started with. I’ve annotated the file to give you a quick heads up.


//We'll be using 2 merb instances b1 and b2
backend b1 {
        set backend.host = "127.0.0.1";
        set backend.port = "4000";
}

backend b2 {
        set backend.host = "127.0.0.1";
        set backend.port = "4001";
}

//If you want to be able to expire stuff from the cache
//from life cycle events in your Merb/Rails/Ramaze/Mack app
//then you'll need to set which IPs you can do this from
acl purge {
        "localhost";
        "127.0.0.1";
}

sub vcl_recv {
        //Check PURGE requests first. Every branch below ends in lookup,
        //so if this check came last the ACL would never be applied
        if (req.request == "PURGE") {
                if (!client.ip ~ purge) {
                        error 405 "Not allowed.";
                }
                lookup;
        }
        //We'll be serving small files from the first merb instance
        if (req.request == "GET" && req.url ~ "\.(gif|jpg|swf|css|js)$") {
                set req.backend = b1;
                lookup;
        } else {
                //And everything else such as movies etc from another
                //instance
                set req.backend = b2;
                lookup;
        }
}

//Manual cache expiry stuff
sub vcl_hit {
        if (req.request == "PURGE") {
                set obj.ttl = 0s;
                error 200 "Purged.";
        }
}

sub vcl_miss {
        if (req.request == "PURGE") {
                error 404 "Not in cache.";
        }
        //Just something to stop us serving terabytes of data to
        //search engines and making us poor:(
        if (req.http.user-agent ~ "spider") {
                error 503 "Not presently in cache";
        }
}

Save the above file in /usr/local/etc/default.net.vcl. Now, before we configure Nginx, let’s download and take a look at the Merb app that does all the retrieval and streaming from S3. I’ve put it on Github and it can be found at http://github.com/jaikoo/pmcdn/tree/master.

git clone git://github.com/jaikoo/pmcdn.git

You’ll need to edit the s3_conf.yml file in the config directory and change the BUCKET_NAME in the assets controller before we can get going. After that’s done, cd into the Merb app’s dir and start up two Merb instances.

cd pmcdn
merb -d -c 2

Hopefully you won’t have any errors, so it’s on to configuring Nginx.

#/etc/nginx/nginx.conf
#I previously created a user for nginx www-data
user www-data;
worker_processes  1;

error_log  /var/log/nginx/error.log;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    access_log  /var/log/nginx/access.log;

    sendfile        on;
    #tcp_nopush     on;

    #keepalive_timeout  0;
    keepalive_timeout  65;
    tcp_nodelay        on;

    gzip  on;

    #this should really go in include /etc/nginx/sites-enabled/*;
    upstream cache1 {
        server 127.0.0.1:8080;
    }

    server {
        #Change this to your asset server domain
        server_name asset1.jiggahaeyo.com;
        listen 80;
        proxy_max_temp_file_size 0;
        proxy_next_upstream off;
        proxy_read_timeout 60;
        proxy_intercept_errors on;

        error_page 404 @404;
        location @404 {
            rewrite .* /404.html last;
        }

        location / {
            proxy_pass http://cache1;
        }
    }

}

Now we need to start Varnish on a different port first. Its default port is 80, and if your asset server were on a separate machine from your main app you could get away with not using Nginx and just use Varnish as is, but for our example we’ll start Varnish on port 8080.


/usr/local/sbin/varnishd -a 0.0.0.0:8080 -f /usr/local/etc/default.net.vcl

Then it’s Nginx’s turn:

/etc/init.d/nginx start

Next make sure you’ve already uploaded an asset to the bucket that you declared in the assets controller. To test this out, if you have an image called test1.jpg in the namespace images then the URL would look something like http://asset1.jiggahaeyo.com/assets/images/test1.jpg. If you’re tailing the Merb log you’ll see that the first time you make this request the Merb app queries S3 and streams the file back. Hit the URL again and you’ll be served the file again, but this time the Merb app won’t be hit at all as it’s coming from Varnish.
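
If you later want to expire a cached asset from application code, the purge ACL in the VCL is what lets a request from localhost do it. Here is a rough sketch using only Ruby’s standard library. The method names are mine, the host and port simply mirror the varnishd command used in this post, and the optional Host header is an assumption: if your Varnish hashes on the Host header, the purge has to carry the same Host that Varnish saw on the original request, so check what your Nginx actually sends upstream.

```ruby
require 'net/http'

# Build a PURGE request for a cached path. PURGE is not a standard HTTP verb,
# so we construct a generic request (no request body, response body expected).
# host_header is optional and hypothetical; set it if your cache keys on Host.
def build_purge_request(path, host_header = nil)
  request = Net::HTTPGenericRequest.new('PURGE', false, true, path)
  request['Host'] = host_header if host_header
  request
end

# Send the PURGE straight to Varnish. The defaults assume Varnish is listening
# on port 8080 on the same machine, as in the varnishd command in this post.
def purge_from_varnish(path, host = '127.0.0.1', port = 8080, host_header = nil)
  Net::HTTP.start(host, port) do |http|
    http.request(build_purge_request(path, host_header))
  end
end

# For example, from a life cycle hook in your app:
#   response = purge_from_varnish('/assets/images/test1.jpg')
# With the VCL in this post you would expect a 200 when the object was
# purged and a 404 when it was not in the cache.
```

Nothing is sent until purge_from_varnish is called, so you can try the request builder on its own first.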

The Merb app automatically grabs everything that matches /assets* and forwards it to the ‘show’ action in the assets controller. We then remove the ’/assets’ bit from the request path and pass that on as the path to the object in S3 which is retrieved and streamed back without loading the whole file into memory.
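
The path handling described above amounts to something like the following. This is a minimal sketch rather than the code from the pmcdn repository; the method name is hypothetical, and dropping the slash after ‘/assets’ is my assumption about how the S3 keys are laid out.

```ruby
# Map an incoming request path onto an S3 object key by dropping the
# leading '/assets' prefix (and the slash that follows it, if any).
# Paths that don't start with '/assets' are returned unchanged.
def s3_key_for(request_path)
  request_path.sub(%r{\A/assets/?}, '')
end

puts s3_key_for('/assets/images/test1.jpg') # images/test1.jpg
```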

OK, this is all cool and stuff, but I know you’re thinking that although caching saves us S3 content serving costs, we’re just pushing the cost of bandwidth onto our hosting provider. Bandwidth’s not cheap! Or is it? You’ll find that there are numerous hosts out there who give some fantastic deals when it comes to bandwidth. Joyent, for example, do a pretty good deal on their accelerators, which come with 10TB of bandwidth. True, the file storage available on those accelerators isn’t much, but who cares when we’re using S3!

Also if say our app is partitioned we can have an asset cache on each of our partitioned nodes utilizing any spare bandwidth rather than eating all the bandwidth from just one asset server.

The knowledgeable among you are probably now thinking that this is all cool, but what about geographical caching? You get that with a true CDN. Surely the poor man’s CDN should do this too? Well indeed it can! However, I’ll leave it to John Buswell’s article in o3 magazine on global load balancing, which can be found here, as he explains it far better than I could.

Posted by jonathan 6 days ago

The universal paperclip


Working with a dual Ruby web framework stack has its perks. Today I wanted to use Paperclip in my Merb app as well as my Rails app. So ten minutes later I packaged up Paperclip as a gem and tweaked it ever so slightly to work on both Rails and Merb. The hardest thing was getting the manifest right.

It just goes to show how well designed and well factored plugins like Thinking Sphinx and Paperclip are, that I could do this so quickly.

Gem can be found here if you’re interested. Any bugs let me know.

Posted by jonathan about 1 month ago

Thinking Sphinx for Merb... And Rails


Update 2008-04-17
Pat’s integrated my changes and also removed the horrible double Rake task hack I did. Now the Rake task works automatically. I’ve removed my branch and now I mirror Pat’s master branch. Please use Pat’s branch instead as his one will always be the most up to date.

It started off this morning before breakfast when I wanted to get Pat’s excellent Thinking Sphinx plugin going in Merb. The next thing I knew, it was all done with enough time to spare for some tea. However, my first attempt was too tightly coupled to Merb when it didn’t need to be. So tonight I cleaned it up and now it’s flexible enough to be used in both Merb and Rails apps.

My experiment can be found here but I warn you to tread with care as it’s still work in progress. However it works quite well for me as we use both Rails and Merb at work. The gem is based on Thinking Sphinx trunk so please read that first if you have any problems.

To get going:


git clone git://github.com/jaikoo/thinking-sphinx.git

cd thinking-sphinx

rake package

sudo gem install pkg/thinking-sphinx.0.8.0.gem


In your Merb app add the following:

init.rb:
dependency 'thinking_sphinx'

Rakefile:
require 'thinking_sphinx/tasks/merb'

For your Rails app:

Create a file called search.rb in your initializers dir and insert:
require 'thinking_sphinx'

Add a file called thinking_sphinx.rake in lib/tasks and add:
require 'thinking_sphinx/tasks/rails'

Then do the usual Thinking Sphinx setup such as adding the index info to your models, rake thinking_sphinx:configure, rake ts:index etc…

Thanks again to Pat Allan for creating such a well designed and abstracted Sphinx plugin.

Future? I’m eyeing up adding DataMapper support which I’m quite excited about… Hopefully I’ll get some time in the next few days to work on this.

Update: Pat announced the release of his rewrite of the Thinking Sphinx plugin. Check it out here!

Posted by jonathan about 1 month ago

Cleaning up constants


When it comes to a Rails or a Merb app I like to make sure there is a clean separation of concerns. One of the ways I do this is modularising the behaviour of an object into mixins.

For example, a movie object might need some logic that relates to video transcoding. Rather than have this explicitly part of the main class I move it into a mixin, this for me makes code that is:

Of course, like everything, this requires thought, and I’m careful not to go mad with this pattern. The last thing I want is something that resembles the old EJB 2.0 madness, where all the business logic lived in separate service objects and all that was left was an anaemic model, and you needed an IDE like IntelliJ to discover what on earth was going on and where.
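
To make the mixin idea concrete, here is a small sketch of a movie class with its transcoding-related behaviour pulled out into a module. The module, its constant and its method are all invented for illustration; a real transcoding mixin would obviously do more than work out a filename.

```ruby
# Hypothetical mixin holding behaviour related to transcoding.
# It expects the including class to provide a source_path method.
module Transcodable
  SUPPORTED_FORMATS = %w[flv mp4].freeze

  # Name of the file a transcode to the given format would produce.
  def transcoded_filename(format)
    unless SUPPORTED_FORMATS.include?(format)
      raise ArgumentError, "unsupported format: #{format}"
    end

    "#{File.basename(source_path, '.*')}.#{format}"
  end
end

# The main class stays small; transcoding concerns live in the mixin.
class Movie
  include Transcodable

  attr_reader :source_path

  def initialize(source_path)
    @source_path = source_path
  end
end

puts Movie.new('/tmp/holiday.avi').transcoded_filename('flv') # holiday.flv
```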

When it comes to constants I’ve noticed a few people extolling the virtues of putting constants in the environment.rb (dev/test/production) as an easy way of having globally accessible constants. In my opinion this should be avoided if at all possible.

Instead I favour either putting my constants in the model that they relate to, as this increases readability and stops pollution/collisions, or, my current favourite, grouping the constants by context into their own YAML files and having them loaded and accessed from a module. This keeps my models clean while keeping my code readable and more maintainable.

For example something like this:


require 'yaml'

module Tasks
  module Config

    #Loaded from the YAML file once, then memoized
    def self.message_queue_ip
      @@message_queue_ip ||= YAML.load_file(
        "#{Merb.root}/config/tasks_conf.yml"
      )[Merb.environment]['message_queue']
    end

    #Rest of your constants....
  end
end

#And access it with
Tasks::Config.message_queue_ip

Of course this is a trivial example but you see how it automatically loads the correct value based upon the current environment without any conditionals. Yep, it’s all pretty much common sense but I still seem to always get handed apps that don’t follow simple rules of good OO design.
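
If you want to try the idea outside of Merb, here is a self-contained sketch of the same pattern. The module name, the load! method and the sample IP addresses are all made up, and a temp file stands in for config/tasks_conf.yml.

```ruby
require 'yaml'
require 'tempfile'

# Same idea without Merb: one YAML file keyed by environment name.
# QueueConfig and load! are hypothetical names for this sketch.
module QueueConfig
  # Read the file once and keep only the section for this environment.
  def self.load!(path, environment)
    @settings = YAML.load_file(path).fetch(environment)
  end

  def self.message_queue_ip
    @settings.fetch('message_queue')
  end
end

# A throwaway file standing in for config/tasks_conf.yml.
file = Tempfile.new(['tasks_conf', '.yml'])
file.write(<<~CONF)
  development:
    message_queue: 127.0.0.1
  production:
    message_queue: 10.0.0.5
CONF
file.close

QueueConfig.load!(file.path, 'production')
puts QueueConfig.message_queue_ip # 10.0.0.5
```

Using fetch rather than [] means a missing environment or key fails loudly at load time instead of handing back nil later on.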

Posted by jonathan about 1 month ago

Euruko 2008


It looks like quite a few of us UK Ruby guys are going to be at Euruko in Prague this coming weekend which should be fun. I’ll be flying out this Friday around 13:45 on an easyJet flight from Stansted with Peter and Jamie.

I’ve also heard that all the bamboo’ers are going to be there, so it will be fun to catch up on old times. If you’re going to be at Euruko it’ll be good to chat and have some beers!

Posted by jonathan about 1 month ago

Daemonize RabbitMQ


I got an email this morning asking how to daemonize RabbitMQ when running with the STOMP adapter. It’s pretty simple really, all you need to do is add ‘-detached’ to the RABBIT_ARGS in the makefile.

Or you could replace the start_server task with this one:

start_server:
    $(MAKE) -C $(RABBIT_SOURCE_ROOT)/erlang/rabbit run \
        RABBIT_ARGS='-pa '"$$(pwd)/$(EBIN_DIR)"' -rabbit \
            stomp_listeners [{\"0.0.0.0\",61613}] \
            extra_startup_steps [{\"STOMP-listeners\" \
                        ,rabbit_stomp,kickstart,[]}] -detached'

Posted by jonathan about 1 month ago

OH HAI RabbitMQ!


So a few weeks ago I unfairly called RabbitMQ complex without clarifying why. When it comes down to it there’s nothing hard about getting up and running with RabbitMQ, especially these days with the Stomp and HTTP adapters out there. However, the one sore point that a lot of Ruby people have when I recommend this fine messaging machine to them is that there isn’t much documentation on getting up and running with Ruby. Well, there’s nothing more pathetic than a man who grumbles but doesn’t do anything about it, and so as promised I made a screencast. It’s actually based on the excellent RabbitMQ and Stomp tutorial that can be found on LShift’s great blog. Of course I altered it to have a bit of a Ruby twist with a really simple Rack/Merb version of hello world.

Well I’m still nursing a hangover this morning after a few too many fine beers and champagne that was to be had at Pizzaonrails last night. It was good to see everyone last night…

Posted by jonathan 2 months ago

Friday round up 2008-02-29


Britain was built on Queues!

A long while ago I wrote about messaging systems and I’d just like to give a quick update on that. A couple of weeks ago Twitter released Starling, and although it’s simplistic I’ve found it very stable and even easier to use as it speaks the same protocol as memcache. Something that might be of interest to people out there: Chris Wanstrath is working on an evented version of Starling, which should give some performance gains.

Another queuing system I’ve been playing around with lately is conveyor, which is based on the excellent thin and uses HTTP as its transport protocol. Usually the purist in me would recoil from the thought of using a fat protocol like HTTP for something like messaging. However, I bet for most cases it performs Good Enough©. There’s always going to be a trade off between simplicity and performance, and if you really want speed & features then I still recommend you look at RabbitMQ, but for most cases something like conveyor will definitely do.

Like Starling it’s extremely easy to get started as conveyor’s interface is just a web server that conforms to Rails like REST conventions. Documentation is a bit sparse but the test suite is comprehensive and easy to grok.

I’m too poor to afford a CDN so I made one

Finally I was going to describe a setup for a poor man’s/cheap CDN… However it’s been a crazy week so I’ll leave it till the weekend.

Posted by jonathan 2 months ago

Merb, Datamapper AND merbful_authentication my lovelies!


I promised a few folks a screencast of Merb and Datamapper a long time ago. I mean it’s all very nice reading about it all on my shiny blog but what about video? Nothing better than being able to watch some geek p0rn while drinking a glass of wine. Heck why not be like me and do that on the London Underground. Glass of wine in one hand, iPod Touch in the other and trying to fend off the heathen hordes with my foot. Oh it’s a jolly old place, that Victoria line.

But I digress. I could have just given you a Merb and Datamapper tutorial, but that wouldn’t have been fun would it? I mean even the Queen’s auntie has written one of those, so no I had to go one further! I give you the Merb, Datamapper and merbful_authentication screencast!! Let it not be said that I’m not a generous and pleasing chap (most pleasing indeed or so in my naughty little dream last night).

Well you can tell that this isn’t rehearsed as I left in all the bloopers including me frantically wishing I’d used in-memory sessions rather than cookies for this demo.

Posted by jonathan 4 months ago

W3top losers


Take a look at this site. Notice anything familiar about it? Yep, that would be my Twitter (http://www.twitter.com) username and also my tweets. However, that is definitely not my picture (it’s one of my friends) and it gives false information about me, such as that I live in Lancaster (I don’t) and that I’m seeking women (I’m a little old and tired to be playing the gigolo game). I don’t mind my tweets being aggregated by a third party, but I don’t like my online persona being hijacked.

It turns out that w3top is some online dating site that thinks it’s cool to steal other people’s personas in order to try and bulk up their fake ‘userbase’ for one reason or another. I very much doubt those reasons could be for the greater good of man.

Posted by jonathan 4 months ago

Goodbye, Hello.


Well yesterday was my last day at New Bamboo. My day consisted mostly of consuming large amounts of xmas fare, fine red wines and drunken chat in the pub before getting lost in a thick freezing fog that had engulfed Angel. I somehow ended up doing a mile and a half walk to Kings Cross through strangely empty streets and side alleys, no doubt caused in part by the eerily thick fog and commuters eagerly leaving for home earlier than usual.

So what does the future hold? Well I can now happily reveal that I’ll be working for a London-based start-up by the name of Vzaar; check out their blog to find out more about them if you’re interested. I’m totally stoked that I’ve been given a chance to work at Vzaar as I strongly believe that video is one of the richest ways of getting your message across on the web.

Right, I’m off to do some last minute xmas shopping. Apparently because of the credit crunch affecting buyer trends, much joy is to be found with bargains! Have a great xmas everyone:).

Posted by jonathan 4 months ago

I've got the Power!!


W00t! I’ve just received my invite into the Powerset Labs community! I’m totally stoked as I’ve been waiting for quite a few months now to get my hands on some of Powerset’s next generation search.

I’ve only taken a quick look around the labs site but it’s pretty obvious that a decent amount of thought went into its development. One of the nice features in the discussion forums is the ability to mod discussions à la Digg, though I can pretty much guarantee that the comments will be a lot more constructive than those found on Digg.

I’m off to play now. Don’t be surprised if I can’t be found for a few days;).

Posted by jonathan 5 months ago

Uploading and feedback with Nginx


I was in the office the other day, you know, doing the usual: drinking coffee and unleashing my Carry On-esque humour on those unfortunate enough to be listening, when the conversation turned to file upload strategies and progress reporting solutions in Rails.

Now, for various reasons that other people have talked about before, Rails was instantly rejected as a solution for uploading large files. The usual answer to this problem is to use a bank of Merbs and feed progress back to the client. It was during this conversation that Mr Tanner brought up the subject of the Nginx upload progress module.

Now we’re big fans of Nginx because it’s fast, lightweight and easy on resources, so I immediately looked into this. The module basically counts the number of bytes it has received and allows the client to retrieve the progress of the upload via AJAX and JSON. After the file upload is complete, Nginx pushes the payload to the upstream server. This gets around the problems people have had using Nginx with Merb/upload progress. Progress reporting is handled by Nginx instead, which for me allows more flexibility in choosing the upstream server/application. Merb’s great. It’s fast and lightweight, but if you’re using it just for file uploads then it could be a bit of a waste. It would be easy to write a simple Mongrel handler to deal with the file uploads, or to use the soon-to-be-released Wisteria, a Ruby micro framework by Kirk Haines.

Wisteria is a micro framework that doesn’t try to do anywhere near as much as Rails or Merb, and because of this it manages approx. 2000 req/s for a simple hello world application. These figures make it a framework to definitely consider in the future if you’re doing large file uploads or dealing with requests that need high performance but need none of the polish that Merb or Rails provide.

Of course there is the problem of validation if you want to verify both the upload payload and any other fields such as descriptions, tags etc. In my architectures I always push off as much as I can to be handled asynchronously. Take a popular video sharing site that has to accept uploaded videos with a description and some tags, then transcode each video into the appropriate format. I would probably validate the incoming data, then push it onto a message queue (without the actual file payload, of course) to be dealt with by another process when resources are available.
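
As a rough illustration of that flow, the sketch below validates the metadata and then queues a job that points at the stored file instead of carrying the payload. Ruby’s built-in Queue stands in for a real message broker here, and the method, constant and field names are all hypothetical.

```ruby
# Ruby's built-in thread-safe Queue stands in for a real message broker.
TRANSCODE_QUEUE = Queue.new

# Validate the metadata, then enqueue a job that refers to the uploaded
# file by path. Returns a list of error messages (empty on success).
def accept_upload(description:, tags:, stored_path:)
  errors = []
  errors << 'description is required' if description.to_s.strip.empty?
  errors << 'at least one tag is required' if tags.empty?
  return errors unless errors.empty?

  # Only the metadata and the file location go onto the queue.
  TRANSCODE_QUEUE << { path: stored_path, description: description, tags: tags }
  []
end

accept_upload(description: 'Holiday video', tags: %w[beach], stored_path: '/uploads/1.avi')
puts TRANSCODE_QUEUE.size
```

A separate worker process would then pop jobs off the queue and do the transcoding whenever it has capacity.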

Now the conundrum for me is to do with validation. In the lightweight process that’s dealing with the file upload I probably don’t want to be loading the entire model of the ORM I’m using, such as ActiveRecord or DataMapper. However, I do want to use the validations that are part of the model. Wouldn’t it be great if I could just mix my validations into my ORM model and also share them with a lightweight DAO that I’m just using as a temporary container? This way I wouldn’t be duplicating business logic.
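
In plain Ruby, the shape I have in mind looks something like this. It deliberately uses no ORM at all: the module, the Struct and the class are stand-ins invented for this sketch, and whether ActiveRecord or DataMapper validations can really be shared this way is exactly the open question.

```ruby
# Validation rules kept in one place. The module only relies on the
# including object responding to title and description.
module VideoValidations
  def validation_errors
    errors = []
    errors << 'title is required' if title.to_s.strip.empty?
    errors << 'description is too long' if description.to_s.length > 500
    errors
  end

  def valid?
    validation_errors.empty?
  end
end

# Lightweight container for the upload process; no ORM is loaded.
UploadForm = Struct.new(:title, :description) do
  include VideoValidations
end

# Stand-in for the full model, which would mix in the same rules
# alongside its persistence behaviour.
class Video
  include VideoValidations
  attr_accessor :title, :description
end

puts UploadForm.new('', 'short').validation_errors.inspect
```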

To my knowledge the AR validations are too tightly coupled with AR for me to do this in a lightweight manner. However I’m sure that I could do this with DataMapper because of the way it was developed using BDD. I’ll experiment more tonight to see if this is indeed feasible.

Posted by jonathan 5 months ago

Symbol shortcut


I just found this out today which is pretty sweet. Sometimes I need a symbol but compose it using a string and then call to_sym on that. Turns out there’s a shorter more concise way of doing this:

   @ivar2 = "_two" 
   :"my_string#{@ivar2}" 

This returns a symbol :my_string_two. Pretty cool I thought.
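
Here are the two forms side by side, using a local variable instead of an instance variable so the snippet runs on its own:

```ruby
suffix = '_two'

via_to_sym  = "my_string#{suffix}".to_sym # build a string, then convert it
via_literal = :"my_string#{suffix}"       # interpolate inside the symbol literal

puts via_literal.inspect       # :my_string_two
puts via_to_sym == via_literal # true
```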

Posted by jonathan 6 months ago

Validate with DataMapper


So, a week or two ago I was modelling something with DataMapper and noticed that I needed some validation that went beyond the built-in validation macros. Now I could have just created my own validation macro, as DataMapper makes this incredibly easy to do, but I just needed something to get my specs to pass and go green.

Being a long time ActiveRecord user I immediately brought up the API documentation and went looking for the validate method. In AR, if you have something that doesn’t fit into one of the validation macros you simply override the validate method and push your errors onto the errors array. After a bit of poking around in the docs I couldn’t find anything that would fit my needs, so I loaded up the DataMapper source code in TextMate and went searching. In order to make it easy for us AR old timers to switch over to DataMapper, the authors were kind enough to create an AR impersonation module that wraps a few DataMapper methods in AR-like methods. One of these is the save method, and that’s where I noticed that DM called the valid? method before persisting the database session.

Now I have no idea if this is the official or the best way to do it but seeing this I realised that simply overriding the valid? method in my model, pushing my errors onto the errors array then calling super seemed to work fine for me. Now I’ve got to warn you that this relies on the internal workings of DataMapper and this could break in the future so use with caution. Of course you’re already writing specs/tests so this shouldn’t be too much of a problem:). By the way when adding errors if you have no attribute to bind the error to then passing in the context of :general is the recommended way of doing this.
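
The pattern looks roughly like the sketch below. To be clear, this does not use DataMapper at all: FakeModel is a made-up stand-in whose valid? simply passes when no errors have been recorded, which is my assumption about the behaviour being relied on. The Booking class and its fields are also invented.

```ruby
require 'date'

# Stand-in for the ORM base class. This is NOT DataMapper's real code; it
# only mimics a valid? that passes when no errors have been recorded.
class FakeModel
  # Errors are grouped by the attribute (or context) they belong to.
  def errors
    @errors ||= Hash.new { |hash, key| hash[key] = [] }
  end

  def valid?
    errors.empty?
  end
end

class Booking < FakeModel
  attr_accessor :starts_on, :ends_on

  # Record custom errors first, then defer to the base implementation.
  def valid?
    errors.clear
    if starts_on && ends_on && ends_on < starts_on
      # No single attribute owns this rule, so file it under :general.
      errors[:general] << 'must end after it starts'
    end
    super
  end
end

booking = Booking.new
booking.starts_on = Date.new(2008, 5, 10)
booking.ends_on   = Date.new(2008, 5, 1)
puts booking.valid? # false
```

Because this leans on how the base class happens to implement valid?, a spec around the custom rule is the cheapest way to find out when an upgrade breaks it.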

Posted by jonathan 6 months ago

About


Online journal of Jonathan Conway a twenty something technologist, entrepreneur, husband, daddy of two, oh and lead architect at vzaar. Currently residing in London, UK.
