
Rethinking Disaster Recovery

Disaster Recovery has been on the minds of companies ever since the early days of commercially available computing. Today's world of DR revolves around four acronyms - BIA (business impact analysis), RPO (recovery point objective), RTO (recovery time objective) and BCP (business continuity plan). The acronyms appear in a disaster recovery plan in roughly that order, the thinking being that you first analyse the impact on the business of systems being down, then figure out how far back in the past you are willing to turn the dial to recover from (last day, last hour, last millisecond). Next, focus on how long you can afford to be down. Finally - buy a boatload of hardware, software and services to convert all this into action.

Setting up DR is a hugely expensive affair that takes a significant amount of planning and effort, not to mention all those drills and tests every now and then. CTOs have followed this prescription since the late seventies (apparently the first hot site was set up in 1978). But should they? Forty years ago computers were the size of houses, Amazon was just a river and mobiles were visible only in Star Trek. Computing has undergone seismic change - are we sure these disaster theories are still valid?

The Business of Impact

As they say, the best place to start is the very beginning: the much-vaunted "business impact analysis". This is supposedly the foundation of DR - studying what impact the non-availability of computers is going to have on which business line, and what the expected financial loss is likely to be under various scenarios. The idea, of course, is that the cost of downtime should be known, so that mitigation can be sensibly budgeted for. Consultants take a lot of time and make a lot of money answering this question - usually with a hundred pages of detail - but I've rarely found the details used anywhere except in presentations.

Here's the bottom line: any well-managed company can tell you the most critical bits of all its businesses in a minute. Even a decade ago most companies had computers for only some lines of business; for other workflows they were not critical and basically supported processes that could be done manually (if slower). Today, computers run the show itself, not just track the data. If all your servers are down, the impact will be big - is it really important to know whether you will lose a hundred crores or a hundred and twenty? And if you think your HR systems are not critical, just hope that disaster knows to avoid salary day. At best, you can focus on the order in which systems are recovered. There too, be prepared for the unexpected - deadlines can loom anywhere just when most inconvenient, throwing the best laid sequences into disarray.

The other fallacy of a business impact analysis is to pretend that the disaster that struck the primary can be overcome in a few days - maybe a week. In practice, no major disaster is over so quickly; if a serious fire or earthquake or flood has happened, you may well see your primary data center unavailable for months, not days - meaning that all those "non-critical" bits suddenly start to bite. Another reason not to indulge the fantasy that only some applications are worth moving to DR.

The Disastrous Definition of Disaster

It's important to understand what we mean by disaster, in order to understand what we mean by recovering from it. Traditionally, disasters are divided into three buckets:
  1. Localized disasters such as equipment failure or accidental damage that is local to the computers or networks. 
  2. Site disasters that cover the entire site the computers are stored at, such as fires, explosions, riots and vandalism, building collapse, etc.
  3. Wide area disasters that cover a large geographic area, such as earthquakes or floods.
The easiest way to handle all these disasters, conventional wisdom goes, is to put computers in a different building (thereby covering both local and site disasters) in a different part of the world (thereby covering wide area disasters). The core reason for this is that prevention of these disasters is difficult, so it makes sense to create a backup.

All sounds firmly logical and reasonable; where's the problem? Here's where I have issues. First, localized and site disasters were quite a worry when servers were concentrated in office buildings, but a custom-designed Tier 4 data center is not particularly prone to the first two, and even the third - the wide area disaster - is reduced considerably by choice of location. My offices may have to be in cities prone to tsunamis or earthquakes, but data centers can be located any distance away in safer zones. Wal-Mart apparently uses a remote nuclear bunker as its data center. The requirement that a second data center be in a "second seismic zone" is not particularly meaningful if the primary is not in danger of earthquakes.

Then there's software damage. Many of the most likely catastrophic disruptions we face today are not physical. NotPetya caused Maersk a $300m loss worldwide by wiping out nearly 4,000 servers. Before this, Back Orifice, WannaCry and a host of other malware caused large scale outages. These are now far more likely and far more destructive than the fires or earthquakes feared in decades past. And most DR strategies focus only peripherally on them.

The final problem is the assumption that everything is (and should be) concentrated in one "primary" data center. Modern businesses can be everywhere: hundreds of servers in hundreds of locations. Businesses like Facebook or Google are already like that, but so are many less visible businesses such as Zara. High speed interconnectivity allows distant locations to be as "near" as the next computer used to be a decade ago, so centralization is a design choice, not a necessity. When tech says "it can't be done", they usually mean it can't be done without some redesign - usually a redesign of the way they use their relational databases.

Yes, today most businesses have "primary" data centers. I put to you that it's possibly cheaper (and probably no more time consuming) for most businesses to re-architect a system into a distributed model than to create and run a single full-fledged far DR site.

The Objectives of Recovery

DR strategies are dominated by the conversation around two objectives:
  • Recovery Point Objective - the elapsed time (or data points, or transactions) before the start of the disaster up to which you target your recovery; anything after that point is lost. An RPO of 10 minutes would mean that only transactions completed more than ten minutes before the disaster are guaranteed to be recovered - the final ten minutes may be gone.
  • Recovery Time Objective - the maximum time you target between declaration of disaster and resumption of operations. An RTO of 4 hours would mean four hours from the start of DR activity to going live again. Not the start of the incident, mind you; a fire may rage for some time before DR activity is initiated.
Why do I call these objectives a trap? They are sensible only if the underlying assumptions hold, and those assumptions are invisibly tied to the mainframe-centric view of computing that we CIOs grew up with.

One: A recovery point is a viable objective only if all transactions fall along a single timeline. If multiple parallel interconnected transactions are going on, you're in deep trouble. A decade ago, most businesses had a single thread of processing, activities were slow and linear (mostly batch) - and an RPO was easily defined. Today, businesses are different: a bank may have overlapping internet banking, funds transfer and debit card activity all done by the same set of clients - a perfect RPO would require the exact same cutoff across all the different activities, or lead to inconsistencies that take hours to resolve. A manufacturing company has manufacturing processes, goods receiving processes, and finance and accounting processes all running in parallel. If you have many different points of recovery for these many different systems (even if each individually meets the same RPO), a long, difficult, treacherous reconciliation process awaits.

Indeed, in any such system, there is only one valid recovery point - zero - that is, no transaction loss at all. This is theoretically impossible to achieve in a modern system with overlapping parts unless there's an orderly shutdown; the best a business can do is take advantage of shutdown windows and guarantee the quiescence required. It also means that most RPO targets are not actually achievable outside drills.
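
To make the reconciliation problem concrete, here is a minimal sketch - the system names, timestamps and transaction IDs are hypothetical, not drawn from any real deployment - of how three parallel systems, each individually meeting the same "10 minute RPO", can still recover to slightly different cutoffs and leave cross-system transactions half-recovered:

```python
from datetime import datetime, timedelta

# Hypothetical scenario: each system nominally meets a 10-minute RPO,
# but replication lag differs, so the actual recovered cutoffs differ.
disaster = datetime(2024, 5, 1, 10, 0, 0)
recovered_up_to = {
    "internet_banking": disaster - timedelta(minutes=10),
    "funds_transfer":   disaster - timedelta(minutes=12),  # a little more lag
    "debit_cards":      disaster - timedelta(minutes=9),
}

# A cross-system business transaction has legs in more than one system.
transactions = [
    {"id": "T1", "legs": {"internet_banking": disaster - timedelta(minutes=15),
                          "funds_transfer":   disaster - timedelta(minutes=14)}},
    {"id": "T2", "legs": {"internet_banking": disaster - timedelta(minutes=11),
                          "funds_transfer":   disaster - timedelta(minutes=11)}},
    {"id": "T3", "legs": {"debit_cards":      disaster - timedelta(minutes=9, seconds=30),
                          "funds_transfer":   disaster - timedelta(minutes=9, seconds=20)}},
]

for txn in transactions:
    survived = {sys: ts <= recovered_up_to[sys] for sys, ts in txn["legs"].items()}
    if all(survived.values()):
        state = "consistent"
    elif not any(survived.values()):
        state = "cleanly lost"
    else:
        state = "INCONSISTENT - needs manual reconciliation"
    print(txn["id"], state)
```

Every transaction flagged as inconsistent here is exactly the manual reconciliation work that a single-timeline RPO quietly assumes away.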

Two: A recovery point assumes a business operates in isolation. I can only recover to a point in the past if the world around me has not moved. If the ERP of a manufacturing plant shut down at 10am and came back at (say) noon, then an RPO of 1 hour would mean the ERP has no record of goods received after 9am. However, goods were received, and the suppliers of those goods do have proof that they delivered. What's more, goods continued to pile up even after 10am. A bank, a stock exchange, an e-commerce business will have similar problems; the rest of the world interacted with it in the hours or minutes or even microseconds that were lost, and a disaster in one place can't wipe the records of those other companies away. Even compensating them for lost transactions (through, say, insurance) would require some amount of reconciliation.

Today, in an age where people get upset over losing Facebook photos, why should any business tolerate or plan for anything but zero RPO? This means businesses may lose the odd uncommitted transaction but no committed transaction should be lost or in an inconsistent state. Given the challenges of quiescence, this means that the "fail and recover" model just won't work.

Recovering in Time

Recovery Time Objective is less of a trap, since there's no theoretical barrier to a reasonable recovery time. My problem, ironically, is with more time rather than less. The point of RTO is predictability - it allows a business to estimate the quantum of lost business hours and plan, insure against or otherwise mitigate it. If your train is not at the station, it's much better to know how long the delay is, so that you can come back later rather than just wait around.

However, I put to you that the longer the RTO, the less predictable it is.

If you took an RTO of minutes or seconds, this would rely on technology that can be tested even in real conditions, and would probably work. We have all had long experience of failing systems over from one to the other in near real time, and we routinely test it while in production - yanking a cable to watch a router fail over, or forcing a server to fail over to another in the cluster. Netflix's famous chaos monkeys are probably the most extreme example of just that, and it all works quite reasonably.
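
For illustration, here is a hedged, toy sketch in the same chaos-monkey spirit - the Instance class and the "web-N" cluster names are invented for this example and are not Netflix's actual tooling. It repeatedly kills a random instance in a small pool and checks that the survivors keep serving:

```python
import random
import time

# Toy in-memory model of a clustered service: several identical instances
# behind a load balancer, any one of which can serve a request.
class Instance:
    def __init__(self, name):
        self.name = name
        self.alive = True

    def serve(self):
        return self.alive

def chaos_round(instances):
    """Kill one random live instance, then check the cluster still serves."""
    victim = random.choice([i for i in instances if i.alive])
    victim.alive = False
    print(f"chaos monkey killed {victim.name}")

    # The test that matters: does the rest of the cluster keep serving?
    assert any(i.serve() for i in instances), "cluster down - no live instance left!"

    # In a real setup an orchestrator would spin up a replacement;
    # here we simply simulate it coming back before the next round.
    victim.alive = True
    print(f"{victim.name} replaced, cluster healthy")

if __name__ == "__main__":
    cluster = [Instance(f"web-{n}") for n in range(4)]
    for _ in range(3):
        chaos_round(cluster)
        time.sleep(0.1)  # small pause between chaos rounds
```

The point is not the toy code but the habit: small, frequent, survivable failures exercised in production keep short recovery times honest.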

Once things get into hours or days, however, such predictability goes out of the window. Firstly, you can't test such big outages in real conditions, since no business will risk hours of downtime to prove some reliability engineer's design. In the absence of repeated testing, I find it hard to believe that someone has designed a few hours' worth of recovery workflow and taken everything into account, without a single mistake, all at the first shot. Drills are fine and dandy, but everything there is neatly planned: no stress, no crisis, no unpredictable behaviour. A DR plan, like a battle plan, rarely survives the first gunshot. And businesses know this; one can easily find examples of companies dithering over switching to their DR sites for as long as possible, for fear that the process may go disastrously wrong.

Matters get even more complicated when taking into account the issues with synchronising state arising from the RPO problems described earlier. This is typically not part of the RTO calculation, since RTO is purely about technical system availability. In my experience, reconciling business transactions can take quite a while, and business cannot really restart till this is done. DR drills don't take this part into account, any more than fire drills take into account the time it takes to get your family photographs reprinted.

It is thus more reliable and realistic to plan for continuous availability rather than limited downtime - especially in conjunction with zero RPO. Remember, most modern businesses think of all significant downtime as disastrous, so continuous availability has great business value. If you're planning for limited downtime as a disaster response, be prepared for great disappointment when a real disaster occurs. Can (near) zero RTO really happen? It already does. Some companies have indeed got into the act - China Construction Bank, apparently the world's second largest bank by market capitalization, claims an RTO of 10 seconds or less. Newer businesses like Amazon or Facebook have no concept of RTO; instead they constantly recover automatically and unnoticeably from repeated partial failures. It's not that they have no outages, but the outages are all partial, never complete shutdowns. The most extreme example I know of remains the Netflix Simian Army mentioned earlier - a series of automated programmes, the "chaos monkeys", that deliberately disrupt the live Netflix setup as a way to test the resilience of Netflix's systems.

Mismatched Expectations

My primary issue with disaster recovery plans is that companies often expect something very different from them. Almost without exception, the CEO thinks of a disaster recovery plan as covering mitigations against any major business interruption. Traditional DR is quite emphatically NOT that - it only covers catastrophic physical disasters and says nothing about outages such as software failures, deployment problems and a host of other downtimes unrelated to physical disasters. The New York Stock Exchange did not invoke disaster recovery on July 8, 2015 even though all trading was suspended for over three hours - it was a software glitch. Neither did ICICI Bank in 2013 or NSE in 2017, when their respective systems went offline for hours. Maersk suffered a catastrophic cyberattack that wiped out all its computers and networks, and their DR plan was of little help then. This wasn't a mistake - it is because the current definition of DR is not targeted at these situations.

The overwhelming majority of loss due to IT failure in a company is generally not caused by low-probability/high-consequence catastrophes but by high-probability/low-consequence events that occur frequently, such as software bugs, local hardware failures or security breaches. Worse, as applications become more complex, involving an ever-larger tangle of code, data nodes, systems and networks, the exposure to these "smaller" events becomes more frequent and their impact more costly.

Indeed, substantial system outages have plagued every major company at some time or another, and the solution has never been DR. The difference is this: technology teams treat (external) disasters as one cause of outage among many, and DR as the response to that particular cause. Business, however, thinks the outage is the disaster, regardless of cause. To them this implies that DR should be invoked to solve the problem, and there's deep frustration when it isn't. Even regulators can get into this act - companies have received notices from regulators asking them to explain extended downtimes, as NYSE or DBS found out.

There's also the constant battle about business having to ask for a certain level of resilience as part of their requirements. This is another hangover from the days when resilience was hard to do, and therefore was not done unless asked for. While it is more resource-efficient to build resilience at the application level, it is also far more difficult and often leads to gaps. For a couple of core applications it may be worth the time and effort to carefully craft resilience into the application itself; for the others it is better to provide resilience at the infrastructure and platform level - all hardware redundant, all databases highly available, all web servers clustered, and so on. The public cloud is attractive in this regard - pretty much every regular VM on AWS shares the same resilience characteristics as every other, obviating the need for long study and detailed contracts for each individual deployment.

Also something to remember - for most companies, partial loss of business, even if somewhat extended, is much better than complete failure for any period. Banks can easily survive the failure of a branch or two, or the odd ATM being down - even for extended periods. Individual flights get delayed or cancelled all the time. With partial failures, business continuity and customer experience remain broadly in line with expectations.

Then there are regulatory restrictions. I've ignored them here, since the primary objective of disaster recovery is business exigency, not regulatory compliance. If the outcomes needed by a company are achieved, regulators can generally be convinced - or some amount of work allocated explicitly towards regulatory compliance while focusing on what's good for the business. I'm reminded of the simple chain and Godrej lock that a public sector bank put around its servers in an ultra-modern, access-controlled, biometric-protected data center, as a way to meet regulations that were obviously written some time ago.

My New World DR prescriptions

Let me sum up what I've said so far:
  1. Business impact analysis isn't all that useful. All parts of a modern business are dependent on computers, and you should look at being able to recover everything.
  2. In today's world, the only worthwhile RPO is zero. Pay a consultant to tell you that if it helps you sleep at night.
  3. RTO plans promising just a few hours of downtime look great on paper and work well in drills, but melt away at the first sign of real trouble. 
  4. What business thinks it is getting is insurance against all significant downtime, not "downtime only if caused by an earthquake".
My prescriptions are quite simple, though there is a mountain of work underlying their implementation. I put to you, however, that this mountain is no bigger than the effort consumed by a traditional DR implementation, and probably not much more costly given how expensive and wasteful traditional DR is. So here goes (roughly in order):
  1. Build the final defence first: A bulletproof, remote bunker site that stores all data, configurations, permissions and logs. This becomes your last line of defence against the most catastrophic of events such as devastating ransomware attacks. For all my talk of resilience, we should not ignore the black swan doomsday scenario. When in real trouble, reboot everything.
  2. Throw away RPO, RTO: Each individual failure should have zero transaction loss and a seamless, sub-30-second restart, which is well within the capabilities of current off-the-shelf products. Indeed, with a little design effort sub-second recovery is possible for many applications. If regulations force RPO and RTO numbers on you, zero and 4 hours are perfectly respectable and do not need a huge study to support them.
  3. Availability rather than Recovery: Forget recovery - aim for availability, where another instance of a process takes the load if one instance fails. Plan to keep everything running under all possible failure scenarios, not just disasters. 99.99% availability for core applications and 99.9% for all consumer interfaces will generally give you great results.
  4. Resilience is the best defence: Re-architect systems into multiple distributed processing clusters with no single points of failure. Most of today's systems and platforms are already cluster aware; it's just a matter of taking advantage of the implementations. A significant amount of resilience can be achieved even with relatively low numbers - a cluster of as few as four nodes, each with a very moderate 90% reliability, can give you better than four-nines availability overall (see the short calculation after this list).
  5. Prevention is better than cure: Use reliable cloud hosting when you can. Host on-premise applications in top tier data centers that are extremely well protected against site and wide area failures. Multiple highly reliable data centers connected with high speed data links to each other and possibly to one or two clouds can mitigate failure at incredible levels.  
  6. Separate Operations from Servers: It's surprising how many companies concentrate operations and servers in one location. Moving people is much harder than moving applications when buildings catch fire.
  7. Focus on Infra and Platform: Make underlying infra and platforms resilient, so that any application deployed on it automatically becomes resilient. Externalise resilience.
  8. Design for Incremental: Traditional DR designs are very stepwise, while today's technology allows resilience to be added progressively. Clusters can get larger, alternate redundant paths can keep being introduced, and things should keep getting better.
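
To back up the arithmetic in points 3 and 4, here is a minimal calculation - plain probability, no specific product assumed - for a small cluster where any one live node can carry the load and node failures are assumed independent:

```python
def cluster_availability(node_availability: float, nodes: int) -> float:
    """Probability that at least one node is up, assuming independent failures."""
    return 1 - (1 - node_availability) ** nodes

# Four nodes, each a very moderate 90% available:
print(f"{cluster_availability(0.90, 4):.4f}")  # 0.9999 -> "four nines"

# Contrast with the availability targets in point 3:
for target in (0.999, 0.9999):
    downtime_minutes_per_year = (1 - target) * 365 * 24 * 60
    print(f"{target:.4%} availability ~= {downtime_minutes_per_year:.0f} minutes of downtime a year")
```

The independence assumption obviously breaks down when failures are correlated - shared power, shared storage, the same bad deployment everywhere - which is exactly why the bunker in point 1 still matters.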
Today's DR model is not like insurance. Almost everyone knows someone who has claimed on insurance; but look about you and ponder - how many companies do you know that have invoked a move to their disaster recovery site during an actual outage (not a drill)? Outages come and go, drills move along nicely, but no one seems to have used it in a real life situation. It's like an expensive temple to the gods - we haven't seen it work, but why take the risk of angering any celestial beings?

Disaster Recovery as we know it today is a little behind the times; let's think of better ways.
