Five Things That Will Kill Your Site

A high availability website: It’s all about expensive redundant hardware with top of the line load balancers and an enterprise class SAN, right? Well, not necessarily.

There are several cheap or free steps you can take to ensure uptime before you splash your cash on lots of kit. If you’re not taking these steps you are wasting your money on your redundant hardware. Here are my top five things that bring sites down, in rough order of likelihood:

  1. Change
  2. Unexpected load
  3. Slow death
  4. Time related issues
  5. Hardware failure

Starting from the top, here is my explanation of each of these categories, together with ways to minimise the chance of it happening to your site.

#1: Change

This is the biggest category here. If your website was just fine yesterday, and today it’s bust, you probably changed something.

Maybe you just released a new version of your software? Upgraded a third party product? Or changed the network configuration? Even minor, apparently safe changes can have unforeseen side effects. So the number one measure you can take to maximise your site uptime is the unsexy, dull and constraining world of change control.

There are some very basic things to get right here: test your software before you release it. Test it properly even if you think it’s obvious it’s going to work. In fact you also need to test releasing it, which means you need development and test instances of your application, as well as your live one.

Develop your changes in the development environment, and work out the release plan – including how you would reverse the change if you needed to. Follow that release plan in the test environment and test the result. Test it really thoroughly because it’s much easier (not to mention much less stressful) to fix it here than once it’s live – you are not just testing the release, but also the release process itself, so the test environment should be as like the live one as possible.

This is pretty basic stuff, and there’s no excuse for not doing it, however small your company is, especially with virtualisation making it easy to run multiple environments on one physical server.

Change control is about three simple things:

  1. Giving some thought to changes you make to your live system and their potential risks
  2. Making changes one at a time so if something breaks you know what the last change was
  3. Writing down what you changed and why so that everyone can see.

Get this under control and you will have avoided the vast majority of your potential outages.
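
Writing changes down can be as lightweight as an append-only log file that everyone can read. Here is a minimal sketch in Python. The file name, the field names and the `record_change` function are all illustrative assumptions, not part of any particular tool:

```python
import json
from datetime import datetime, timezone


def record_change(log_path, who, what, why, rollback, when=None):
    """Append one change record, as a line of JSON, to a shared log file.

    All field names here are illustrative. `when` can be passed in so the
    function is easy to test; it defaults to the current UTC time.
    """
    when = when or datetime.now(timezone.utc)
    entry = {
        "when": when.isoformat(),
        "who": who,
        "what": what,
        "why": why,
        "rollback": rollback,  # how to reverse the change if needed
    }
    # Mode "a" only ever appends, so earlier records are never rewritten.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```

One record per change, written at the time of the change, is usually enough to answer the question "what was the last thing we touched?".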

#2: Unexpected Load

This is the Digg/TechCrunch effect. Someone writes something nice about your web app, the buzz spreads, and before you know it you have an order of magnitude more traffic to your website than you ever planned for and the whole thing melts.

So what can you do about it? Well of course you could buy racks of hardware just in case, but that’s not really practical. Here are some more realistic suggestions:

Know your capacity. This involves performance testing your application in advance. Set a level for what you consider to be an acceptable response time, and ramp up simulated users carrying out typical tasks until you exceed that threshold. From here you can establish performance bottlenecks and tune your application to increase that capacity (how to do this is a massive subject in its own right), but even without this tuning, you will at least know what kind of spike in load you can support.
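
The ramp-up procedure can be sketched as a simple loop. This is only an outline under stated assumptions: `measure_response_time` is a hypothetical hook that you would implement with your own load testing tool, and the step sizes are arbitrary.

```python
def find_capacity(measure_response_time, max_acceptable_seconds,
                  start_users=10, step=10, max_users=10_000):
    """Ramp up simulated users until the response time exceeds the threshold.

    measure_response_time(users) is a hypothetical hook: it should drive
    `users` simulated users through typical tasks against a test
    environment and return a representative response time in seconds.
    Returns the largest user count that stayed within the threshold,
    or 0 if even the first step was too slow.
    """
    capacity = 0
    users = start_users
    while users <= max_users:
        if measure_response_time(users) > max_acceptable_seconds:
            break
        capacity = users
        users += step
    return capacity


# Stand-in for a real load test: fast up to 150 users, slow beyond that.
def fake_load_test(users):
    return 0.5 if users <= 150 else 3.0


print(find_capacity(fake_load_test, max_acceptable_seconds=2.0))  # 150
```

The number it returns is only as good as the simulated tasks, so make them resemble what real users actually do.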

It is also worth investigating on-demand services such as Amazon EC2 to increase capacity at short notice either in anticipation of or in response to a big traffic spike.

Spread your PR. A big bang PR launch of your new site could leave your reputation in tatters just as everyone’s eyes are on you. But is the big bang really necessary?

If you have developed a major upgrade to your web application, then it may well be very newsworthy and you may want to tell your entire userbase about it as soon as possible. But the last thing you want is every single registered user hitting your site in the hour following your newsletter, and if it is exciting news for your users today, it will still be exciting news for them tomorrow or the day after.

Segment your user base, tell them the exciting news a segment at a time, and use the early data to determine whether you need to speed up or slow down subsequent mailings. The same goes for initial launches of your site: look at options to make announcements in one geography at a time or launch to a limited initial user base.
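
As a rough sketch of this approach, the two helpers below split a mailing list into segments and adjust the gap between mailings based on the load the previous one caused. The function names and the 80% and 40% thresholds are assumptions chosen for illustration:

```python
def split_into_segments(users, segment_count):
    """Split the user list into roughly equal segments, preserving order."""
    if segment_count < 1:
        raise ValueError("segment_count must be at least 1")
    size, extra = divmod(len(users), segment_count)
    segments = []
    start = 0
    for i in range(segment_count):
        # The first `extra` segments take one additional user each.
        end = start + size + (1 if i < extra else 0)
        segments.append(users[start:end])
        start = end
    return segments


def next_delay_hours(peak_load, capacity, current_delay_hours):
    """Adjust the spacing between mailings from the load the last one caused.

    The 80% and 40% thresholds are illustrative assumptions.
    """
    utilisation = peak_load / capacity
    if utilisation > 0.8:
        return current_delay_hours * 2  # too close to the limit: slow down
    if utilisation < 0.4:
        return max(1, current_delay_hours // 2)  # plenty of headroom: speed up
    return current_delay_hours
```

Here `capacity` is the figure you measured when you load tested the application, and `peak_load` is what your monitoring saw after the previous segment was mailed.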

Degrade gracefully. It is better to give full service to a percentage of your users, and a helpful static message to the rest, than for your entire site to be entirely unresponsive. You’ll need to specifically code for this.

In general a production web application will be constrained in the number of requests that it can process effectively at any one time, and you should have a good idea of what the upper limit is from following the “know your capacity” advice above.

For a fixed number of incoming requests per second, the number of requests in progress at any one time depends on how long they take to process: 10 requests a second that take 1 second each to process means that you will have 10 requests in progress at any one time. If it starts to take 2 seconds to process each, then you will have 20 requests in progress at any one time, which will probably make your requests take slightly longer still. This slowdown continues until you reach a tipping point where your service grinds to a halt, and any responses that do get returned take much longer than the browser timeout.
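
This arithmetic is an instance of Little's law: the average number of requests in progress equals the arrival rate multiplied by the average processing time. A tiny worked example (the function names are my own):

```python
def requests_in_progress(arrivals_per_second, seconds_per_request):
    """Average number of requests in flight (Little's law: L = rate * time)."""
    return arrivals_per_second * seconds_per_request


def max_sustainable_rate(concurrency_limit, seconds_per_request):
    """Highest arrival rate that keeps in-flight requests within the limit."""
    return concurrency_limit / seconds_per_request


print(requests_in_progress(10, 1))  # 10
print(requests_in_progress(10, 2))  # 20
print(max_sustainable_rate(20, 2))  # 10.0
```

Read the other way round, it tells you that if your application copes with 20 requests at once and each takes 2 seconds, anything above 10 requests a second will start to pile up.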

There are a few solutions to this, ranging from a simple restriction on simultaneous connections (pick the largest number you know you can handle) to queuing requests for asynchronous processing rather than tying up a thread for each (the detail of these is beyond the scope of this article). Do investigate and implement a suitable solution for your technology stack.
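
To make the simplest of these options concrete, here is a sketch of a limit on simultaneous requests using a semaphore from Python's standard library. The `ConcurrencyLimiter` class, the 503 status and the message text are illustrative choices; in practice your web server or framework may well provide an equivalent setting.

```python
import threading


class ConcurrencyLimiter:
    """Serve at most `limit` requests at once; turn the rest away quickly."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def handle(self, process_request):
        # Non-blocking acquire: if every slot is busy, degrade immediately
        # instead of queueing the request and slowing everyone down.
        if not self._slots.acquire(blocking=False):
            return 503, "We are very busy right now. Please try again shortly."
        try:
            return 200, process_request()
        finally:
            # Always free the slot, even if processing raised an exception.
            self._slots.release()


limiter = ConcurrencyLimiter(limit=50)  # use the limit you measured
status, body = limiter.handle(lambda: "full page content")
```

The users who get through receive full service, and the rest get a quick, friendly answer rather than a browser timeout.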

#3: Slow Death

This is the slow incremental use of a resource over time that goes unnoticed until one day you hit a critical limit. The obvious contenders here are disk space and memory (being eaten up over time by a slow memory leak).

This category of outage is easy to avoid with a little forward planning. You need to monitor, set alerts and watch trends. Have your system SMS you when available disk space gets below 20% for example.
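
A disk space check of this kind needs only a few lines. The sketch below uses `shutil.disk_usage` from the Python standard library; `send_alert` is a placeholder for whatever SMS or paging hook you use, and the function names are my own.

```python
import shutil


def free_fraction(path="/"):
    """Return the fraction of disk space still available at `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def check_disk(path="/", threshold=0.20, send_alert=print):
    """Call send_alert when free space drops below the threshold.

    send_alert is a placeholder: swap in your own SMS or paging hook.
    Returns True if an alert was sent.
    """
    fraction = free_fraction(path)
    if fraction < threshold:
        send_alert(f"Low disk space on {path}: {fraction:.0%} free")
        return True
    return False
```

Run it from a scheduler every few minutes, and record the fraction each time so that you can watch the trend as well as the alert.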

In theory modern garbage collected languages like Java and C# make memory leaks much less likely than the direct memory allocation of C and C++, but they are still possible: watch for memory allocated by static classes, caches that are not cleared, or more traditional memory leaks in third party middleware.

Track memory usage and watch for trends. If you see issues then the short term solution is to restart the offending component, but in the long term you need to track down, isolate and fix the problem.
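
Watching for a trend can be as simple as fitting a straight line through periodic memory readings. The sketch below assumes you already collect samples as (hours since start, megabytes used) pairs; how you collect them depends on your platform, and the function names are illustrative.

```python
def growth_per_hour(samples):
    """Least-squares slope of (hours, megabytes) samples, in MB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    covariance = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one point in time")
    return covariance / variance


def hours_until_limit(samples, limit_mb):
    """Estimate hours until memory reaches limit_mb; None if not growing."""
    slope = growth_per_hour(samples)
    if slope <= 0:
        return None
    latest_mb = samples[-1][1]
    return max(0.0, (limit_mb - latest_mb) / slope)
```

A straight line is a crude model, but it is enough to turn "memory looks a bit high" into "at this rate we hit the limit in about four days".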

#4: Time Related Problems

If your website was working fine yesterday but is broken today, one thing that has definitely changed, and is most definitely outside your control to stop, is the date and time.

The biggest threat is daylight saving. You code your new feature in the winter, test it thoroughly and put it live, only to find problems when the clocks change in the spring. Most of the time the bugs this creates are not ‘site-down’ problems (maybe time data in your app is out by an hour throughout), but I do remember one problem at lastminute.com where some badly written date calculation code went into an endless loop every night between midnight and 1 am once daylight saving started, bringing the hotel search down.

The number one rule here is to make sure that your code never gets the current system date and time directly, but calls a mockable alternative. The production implementation gets the date and time from the system clock, while your unit tests can check the behaviour for a sensible set of dates, times and time zones.
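
One way to arrange this is to pass a clock object into any code that needs the time. The sketch below is illustrative only: `SystemClock`, `FixedClock` and `seconds_until_midnight` are names I have made up, and the calculation is just an example of time-dependent logic.

```python
from datetime import datetime, timedelta, timezone


class SystemClock:
    """Production clock: reads the real system time."""

    def now(self):
        return datetime.now(timezone.utc)


class FixedClock:
    """Test clock: always reports the instant it was given."""

    def __init__(self, instant):
        self._instant = instant

    def now(self):
        return self._instant


def seconds_until_midnight(clock, tz):
    """Seconds from the clock's current time to the next local midnight in tz.

    Application code asks the injected clock, never datetime.now() directly.
    """
    local_now = clock.now().astimezone(tz)
    next_day = (local_now + timedelta(days=1)).date()
    midnight = datetime(next_day.year, next_day.month, next_day.day, tzinfo=tz)
    # Compare in UTC so a change of UTC offset in between is counted correctly.
    utc = timezone.utc
    return (midnight.astimezone(utc) - local_now.astimezone(utc)).total_seconds()


# In a unit test, pin the clock to any moment you want to check.
test_clock = FixedClock(datetime(2009, 3, 29, 22, 30, tzinfo=timezone.utc))
one_hour_ahead = timezone(timedelta(hours=1))
print(seconds_until_midnight(test_clock, one_hour_ahead))  # 1800.0
```

In production you pass `SystemClock()`. In tests you pass a `FixedClock` set to moments either side of a clock change, together with a real time zone for `tz`, so that daylight saving bugs show up long before spring.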

The other gotcha is licence expiry. Make sure you know when your crucial licences expire, and create a reminder in your task management system of choice to make sure you renew in plenty of time.
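
If you would rather have a script nag you than rely on a calendar entry, a check like the following could run daily. It is a sketch: the licence names, the dates and the 30 day warning window are invented for illustration.

```python
from datetime import date


def expiring_soon(licences, today, warn_days=30):
    """Return (name, days_left) for licences expiring within warn_days.

    licences maps a licence name to its expiry date. Licences that have
    already expired are included with a negative days_left. The result is
    sorted with the most urgent first.
    """
    due = []
    for name, expires_on in licences.items():
        days_left = (expires_on - today).days
        if days_left <= warn_days:
            due.append((name, days_left))
    return sorted(due, key=lambda item: item[1])


# Hypothetical licences and dates, purely for illustration.
licences = {
    "search-engine": date(2009, 9, 1),
    "database": date(2010, 1, 1),
    "reporting-tool": date(2009, 8, 1),
}
print(expiring_soon(licences, today=date(2009, 8, 10)))
# [('reporting-tool', -9), ('search-engine', 22)]
```

Note that `today` is passed in rather than read from the system clock, which keeps the check easy to test.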

#5: Hardware Failure

Things with moving parts break, and in the server world that means primarily disks and fans. Disks hold your data so are kind of crucial, over and above mere uptime, so make sure you have RAID redundancy (and appropriate monitoring so you can replace a broken disk before a second one breaks), and backups at least daily with copies offsite.

If you’ve done everything else on this list, and have some money to spend to reduce the risk of hardware failure, then go for it in this sequence:

  1. Add a load balancer and scale out the web tier, to give you both increased capacity and redundancy.
  2. Mirror or cluster your database on to a second DB server. Likewise for any critical files on the filesystem: use a SAN or replicate between servers.
  3. Set up in a second data centre – either as a DR fallback or operate in active-active.

Of course spending cash to ensure site availability is wasted if the failover doesn’t work when you need it, so plan carefully and test it. Once in place, redundant hardware will protect you from more than just hardware failure. It allows you to direct traffic away from an instance that seems to have locked up while you restart it, or to perform operating system upgrades that require a restart without site downtime.
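
To illustrate the idea of taking an instance out of rotation, here is a toy round-robin pool. A real load balancer does far more than this, and the `RoundRobinPool` class and its method names are invented for the example.

```python
class RoundRobinPool:
    """Round-robin over instances, skipping any taken out of rotation."""

    def __init__(self, instances):
        self._instances = list(instances)
        self._out_of_rotation = set()
        self._next = 0

    def drain(self, instance):
        """Stop sending traffic to an instance, e.g. before restarting it."""
        self._out_of_rotation.add(instance)

    def restore(self, instance):
        """Put an instance back into rotation once it is healthy again."""
        self._out_of_rotation.discard(instance)

    def pick(self):
        """Return the next instance in rotation, or raise if none is left."""
        for _ in range(len(self._instances)):
            candidate = self._instances[self._next]
            self._next = (self._next + 1) % len(self._instances)
            if candidate not in self._out_of_rotation:
                return candidate
        raise RuntimeError("no instances in rotation")


pool = RoundRobinPool(["web1", "web2"])
pool.drain("web1")   # web1 has locked up: restart it at leisure
print(pool.pick())   # web2
pool.restore("web1")
```

Testing the failover means doing exactly this on purpose, in the test environment first, and checking that users see nothing.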

In Conclusion …

This isn’t an exhaustive list of things that might break your site, but it represents a good summary of the outages I have come across. (The big category I have not included is failure of some service provided to you, like power in your data centre.) However, dig into what caused a failure and it’s probably one of the items on this list, most likely someone changing something.

As I mentioned at the very beginning, hardware failure is last on my list, so do not jump straight into spending money on it until you have dealt with the other categories. The single most important thing you can do is control change.

In my days running online development at lastminute.com, I was talking to the head of technical operations over the Christmas period, and I happened to comment that the site had been remarkably stable. He quickly replied, “Yes, that’s because most of your guys are on holiday, so no one’s been meddling. If I sent my team on holiday too we’d have 100% uptime”.

It’s true: IT systems that no one touches don’t break very often. Not never, but not very often. That said, change is clearly a major part of a successful website, so make sure you are confident of the changes you make and how to undo them, and you will be on your way to having a stable site.

Please share any tips you have for keeping your site live, in the comments below. Thanks!

Comments

  1. Good article. These points are especially true if your site is on a shared hosting service. One of the worst things that can occur is becoming popular overnight, and then your host craps out and you’re left with a big 404 error. Monitor your traffic at least weekly and determine if a self hosted solution is warranted.

  3. Great advice. Some additional technical tips: A good cheap failover solution is a DNS failover service like the one offered by dnsmadeeasy.com, whereby it will automatically transfer all traffic to another host. It can take up to about 5 minutes to take effect, but if you can afford 5 minutes of downtime it is far cheaper than costly load balancers, and it can also do monitored DNS round robin load balancing. Those writing apps in Rails should also consider using Capistrano to help with managing deployments. By default you can easily revert to an earlier release if needed, and you can configure automated backups on deployment etc.

    • That is very good advice indeed. Although I am not a huge fan of dnsmadeeasy, their services have saved me a few times.

      • Curious as to why you are not a big fan of dnsmadeeasy.com. Have you had any problems with their services? I think their control panels are pretty ugly, and occasionally awkward, but I’ve never had a problem with their service.

        Thanks for the article,
        Dom

  4. great post, loved the “mock time” suggestion, indeed a great approach.

    Re the Digg effect: always be sure to load test your apps first. Even if you don’t, and you use Apache, at least tweak its settings. Start servers, max connections and max servers are all very significant, while KeepAliveTimeout is the most important: drop it to 2 seconds and you can suddenly deal with 4-5 times more traffic per box :)

    regards!

  6. Regarding change management, I’m surprised you didn’t mention some sort of version control system. I often struggle getting clients to understand that such systems aren’t luxuries, but rather are imperative for any serious web development. They also make the question of rolling back a change a substantially easier one. With hosted Subversion and Git solutions, and GUI tools that allow even less technical people to use them, there’s little excuse not to incorporate them into any development process.

    • I assumed that using a version control system like Subversion was so obvious I didn’t need to mention it. From your experience with your clients, it seems that was a bad assumption to make. You are completely correct: version control is not a luxury; it is an absolute requirement from long before your web site goes live. There is no excuse for not setting this up before you write your first line of code.

  8. Thanks for the article. Good thoughts. For #2, using a Content Delivery Network (CDN, e.g. Amazon CloudFront, Akamai) for static files can be pretty significant to reduce load on your servers, even before doing on-demand scaling like EC2.

    httperf is one tool for load testing. peepcode.com has a screencast on its use.

    A few typos: las(t)minute.com, perform operating systems (upgrades).

    Cheers.

  9. Great stuff. Thanks.

    The only thing I’d like to see added is… Change (The Lack of). Certainly many sites suffer because of not being kept current as well, no? Security comes to mind.

  10. Of course you don’t have to resort to an “expensive hardware load balancer”, when you can use a much better value software one (with more functionality)!

    Deploy it on whatever hardware you want, use it in a cloud or virtual environment etc.

    All of your points (and others you have missed, such as attacks) can be overcome (or at least hidden from the users) using an intelligent traffic manager. Why write code in your app to cope with this (then have to replicate it across all your apps), when you can write code in your load balancer that can be used for all of your online services?

    Download our no-cost development license to find out how easy it is to do.

    http://www.zeus.com/downloads/zxtm.html

    Nick

  11. Indeed, great article, Jonathan. You’re perfectly right that hardware failures are not the #1 reason for failure. That’s why really good proactive monitoring is as essential as redundancy. You should probably add security / hacking to the top list.

    As we didn’t find a hosting provider doing all of that, we designed a specially monitored, redundant, high-availability setup with automatic failover, built as a cloud cluster, for our high-traffic site http://www.joomlapolis.com/, which has now run for over a year without any interruptions, despite 2 hard-disk failures, 1 configuration error by an admin, 2-3 disk and memory threshold alarms dealt with in time, and so on.

    After several requests, we ended up offering that same hosting on that same cluster as well, with the same redundancy and monitoring. We learned a lot on the way about how hosting should be done. The biggest lesson is that it’s lots of work to set it up right: from a bare-bones root box to redundant, secured, monitored hosting, it’s around a year of learning curve…a never-ending one in fact…LOL

  12. Aren’t the points listed kind of common sense for any IT pro that manages web based applications?
    Maybe I misread the audience of the post.
    I do agree that if you are ignoring any of these items then you should be prepared to fail at some point.

  13. I know this is going to sound like I’m from 1999, but my favorite websites are almost pure HTML. If possible (and it obviously isn’t always) avoid anything complicated. No databases, no scripts… nothing. Also, never make your site purely Flash. HTML is fast, clean, and easy. Most of the time you shouldn’t need or want more than that, even if it seems outdated from a design standpoint.

  18. Something I’d suggest: make sure your developers are friends with your operations people and that they communicate well.

    Get Operations to review feature plans/technical designs, and understand what the development team are trying to do. On the other hand, get developers to understand where operations are coming from – why they’re scared of uncontrolled change, and what it will take to get them on side.

  21. Here’s one more thing that can bring your site down. I have seen this one overlooked many times because it is so obvious. Make certain that your domain name has not expired. To a client or a visitor it will look as though “the site is down”.

  28. I want to know why my comments are not working here. I replied thoroughly to one of the posts but nothing happened. Are all the above comments fake?