Leaving the Cloud

There’s much talk now about how the cloud is passé, and really just an expensive way to send people into space. Some time ago David Heinemeier Hansson published a blog post arguing that the cloud is largely new coats of paint on old stuff, and that he therefore wants to leave it behind at once. I’m a practitioner who has spent time on and off the cloud with companies large and small, and these are my comments.

(I got this image created by DALL-E-2)

Now David is no lightweight. He’s the man behind Basecamp, which millions of people use. He’s also the man behind the widely used Ruby on Rails framework, which tens of millions of people use. He’s a bestselling author. He’s even an actual, certified race car driver. I’ve never written a major open-source project, and only ride bicycles past racetracks, so let’s be clear: I’m sniping at giants. My only defence is that I’ve done some of this stuff at all kinds of scale. I’ve been born once on the cloud, spent time purely on the ground and been back and forth a few times, and this hopefully gives me a view worth looking at.

David’s core argument, that one should not rent if one can buy, is correct, but only under some circumstances. David himself points out two cases where he says the cloud is better: a new company with no customers yet, or an established company with highly volatile growth. However, his argument is that these are the ONLY two cases where cloud is better, which is where I think he stumbles over his racing car.

First, some background, because one has to understand David’s vantage point. 37Signals, the company he is CTO of, is small: $7.9m in revenues and about 95 employees. It has about 14 million users per month across some 150,000 paid customers. It’s a profitable, steadily growing company that is not VC funded and, unlike most other tech companies, values slow profitable growth over rapid scaling. They have repeatedly talked about this: their ideal is a grocery store rather than Slack or Netflix.

This makes 37Signals quite the outlier, and because of this what works for them is not always what works for others. We need more nuance.

Scaling without the Cloud

The article says Basecamp is already running a large and complex service, but that’s not really correct. As far as I can make out from Basecamp’s own site and from googling other sources, the company has 8–10m visitors a day and 130,000 accounts. This is not very large; even RBL Bank, where I currently work, has about 7–8m visitors a month. On top of that, Basecamp has no real-time transactions, relatively simple back-end processing and much easier security requirements.

David himself says: “We have a business model that’s incredibly compatible with owning hardware and writing it off over many years. Growth trajectories that are mostly predictable. Expert staff who might as well employ their talents operating our own machines.”

This difference is crucial. Basecamp’s loads are probably managed by a handful of servers, compared to over 3,000 servers for RBL Bank running 40+ lines of business. That means running a full infrastructure operation with hundreds of racks and thousands of servers, not to mention stacks of routers and switches, plus the need for 24x7 support. Can a handful of people manage this?

For most company CTOs, the answer is an immediate no. So much hardware usually spells an army of experts. There’s the server team with its Linux experts, the network team with its Cisco experts, the storage team with its EMC experts, the virtualisation team with its VMware experts, the database team with its Oracle experts and so on. Lots of experts, lots of supervisors, lots of complexity to face up to.

However, I agree with David; this is not today’s reality.

Today’s infrastructure, even on-premise, is usually already virtualised, and mature tools exist to manage every aspect in a largely automated fashion. A small (not particularly expert) team can do, with some help from OEMs, everything that used to require an army. Remember there was a time when a Xerox machine used to require a certified operator. Today a toddler can press the button and get perfect copies because the technology has progressed, interfaces have been simplified and moving parts reduced. Something similar has happened to servers and networks; most CIOs just haven’t noticed. All the easy integrated management stuff that’s available on the cloud today is easily (and reliably) available on-premise as well.

Amazon and Google have themselves helpfully open-sourced many of the tools they use, and the VMwares, Nutanixes and Dells of the world have copied and matured them. The hyperscalers today achieve server-to-people ratios of the order of 10,000, meaning one person manages 10,000 servers. At the somewhat smaller company scale, it’s still easily possible to manage a few thousand servers with no more than a cricket team. Remember we’re not talking about supporting application code (that’s the same effort cloud or on-premise) but rather about managing the underlying compute, storage and networking infrastructure.
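To put the staffing claim in numbers, here is a rough back-of-envelope sketch. It uses figures quoted in this post (a 10,000-to-1 hyperscaler ratio, and about 3,000 servers at RBL Bank as a stand-in for "a few thousand") and assumes a cricket team means 11 people. The function name is made up for illustration.

```python
def servers_per_person(servers: int, people: int) -> float:
    """Average number of servers each infrastructure engineer looks after."""
    if people <= 0:
        raise ValueError("people must be positive")
    return servers / people

HYPERSCALER_RATIO = 10_000  # servers per person, as quoted in this post
CRICKET_TEAM = 11           # assumption: a cricket team fields 11 players
FLEET = 3_000               # "a few thousand" servers; 3,000 is the RBL Bank figure

ratio = servers_per_person(FLEET, CRICKET_TEAM)
print(f"{ratio:.0f} servers per person")  # 273 servers per person
print(f"{HYPERSCALER_RATIO / ratio:.0f}x below the hyperscaler ratio")  # 37x
```

In other words, the on-premise team in this scenario needs a ratio dozens of times less demanding than what the hyperscalers already achieve.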

There’s no magic: just a simplified, virtualised hardware setup (everything of the same kind), multi-skilled people and excellent management tools, all of which most companies have within easy reach if only they tried. NSE does it in IFSC Gift City, Zerodha does it and startups do it, but established companies struggle to get out of existing paradigms to do it this way.

A caveat, though (and it might be a big one). Your hardware stack needs to be uniform and relatively new for this to work (which incidentally is indeed the reality in most companies). Most of the tooling and skills become complicated or impossible unless you’ve committed to simplified, modernised stacks. A bit like Southwest Airlines having only one kind of plane — a key strategy Southwest uses to literally take to the clouds.

Spiking without the Cloud

David’s statement that businesses are either steady or volatile is too simple. Any complex business entity always has multiple lines of business and multiple products — and thus a mix of steady and volatile. New businesses may launch small and scale quickly, or launch big and scale even faster, or have steady growth at either scale. Further, it’s hard to predict spikiness in the future.

Basecamp has the luxury of both steady growth and predictability of future spikiness. After years in the same business with the same product, it’s unlikely to get any unexpected spikes. This is not true of most complex businesses, where constant new launches and introductions make things more volatile. Even here, though, David himself talks about how much of a boon the cloud was when they launched Hey — 300,000 users in hours rather than the projected 30,000 in months.

Can this be achieved without the cloud? Here the answer is no (but with some escape clauses). On-premise scaling for a single company essentially means leaving capacity idle; there’s no other way to get capacity on demand. Some of the burden can be eased by dynamically reallocating resources from idle applications to busy ones (an ability that most modern on-premise stacks have today), but on-premise capacity is finite and ultimately limited.
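A rough sketch of why on-premise headroom means idle capacity. It assumes, purely for illustration, that capacity needs scale linearly with user count, and it reuses the Hey launch figures quoted earlier (30,000 projected users, 300,000 actual). The function name is hypothetical.

```python
def idle_fraction(provisioned_for: int, actual_load: int) -> float:
    """Share of provisioned capacity sitting unused at a given load.

    Assumes capacity scales linearly with users (an illustrative simplification).
    Returns 0.0 when the load meets or exceeds what was provisioned.
    """
    if provisioned_for <= 0:
        raise ValueError("provisioned_for must be positive")
    return max(0.0, 1 - actual_load / provisioned_for)

PROJECTED_USERS = 30_000  # Hey launch projection quoted earlier
ACTUAL_USERS = 300_000    # what actually showed up

# Sized for the spike: most of the hardware idles whenever load is at the projection.
print(f"{idle_fraction(ACTUAL_USERS, PROJECTED_USERS):.0%} idle at projected load")

# Sized for the projection: nothing idles, but the spike overwhelms the hardware.
print(f"{ACTUAL_USERS / PROJECTED_USERS:.0f}x over capacity during the spike")
```

Either the hardware sits mostly unused, or the spike cannot be served; that is the trade-off a single company faces on-premise.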

What can be done today is bursting into the cloud. Even traditional businesses have started doing this in limited ways, running massive data management jobs on the cloud instead of purchasing the hardware for it. Oracle and many other hardware OEMs provide a way to expand into the cloud on a pay-as-you-go basis beyond the core on-premise load. RBL Bank made great use of this during the first few days of the core banking upgrade, when many jobs were taking much more compute while optimisation was under way. Not all applications support this, but as applications get containerised it will increasingly become the norm.
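The bursting idea can be sketched as a simple placement rule: serve what fits on-premise and send only the overflow to pay-as-you-go cloud capacity. This is a toy model with made-up numbers and hypothetical names, not a description of how any particular vendor’s bursting product works.

```python
from typing import NamedTuple

class Placement(NamedTuple):
    on_prem: int  # units of load served by owned hardware
    cloud: int    # overflow units sent to pay-as-you-go cloud capacity

def place_load(demand: int, on_prem_capacity: int) -> Placement:
    """Fill on-premise capacity first; burst only the overflow to the cloud."""
    if demand < 0 or on_prem_capacity < 0:
        raise ValueError("demand and capacity must be non-negative")
    on_prem = min(demand, on_prem_capacity)
    return Placement(on_prem=on_prem, cloud=demand - on_prem)

# Hypothetical week: steady load with a two-day spike (units are arbitrary).
CAPACITY = 100
for day, demand in enumerate([80, 90, 160, 220, 95], start=1):
    p = place_load(demand, CAPACITY)
    print(f"day {day}: on-prem={p.on_prem} cloud={p.cloud}")
```

On the steady days the cloud share is zero, so the premium is paid only for the spike.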

Of course, if you’re in hypergrowth mode all this burstability won’t help; you’d rather stay on the cloud.

Shiny New Coats of Paint

David spends a lot of time talking about how people are fooled by the magic of AWS. This seems to contradict the fact that companies that are very technologically sophisticated (such as Capital One, Nasdaq, Slack or Netflix) continue to be on the cloud and continue to commit to it. Netflix spends about $12m per month today and has publicly committed to three times that figure by 2025. Slack has committed to spending $400m. It is unlikely that these companies have not done the math or are dazzled by new coats of paint.

So what does a Netflix know that a Basecamp does not? Nothing really; it’s just a difference of viewpoint. Whatever you do on the cloud you can do on-premise, and it’s not as difficult or people-intensive as it used to be, but it is more work than doing it on the cloud. Basecamp prioritises cost savings and is willing to put a few extra people into managing the service. Capital One (much bigger and also more complex) probably prioritises other tasks over the cost savings of leaving the cloud; they may, for instance, choose to deploy people into product development over infrastructure management. There’s also a question of scale: grow to hundreds of thousands of servers and it’s no longer a matter of a few open-source tools and some automation.

And then there are the services. The cloud can offer many services fully predigested; just take the service and don’t worry about either application or infrastructure. Analytics, logging, database, mobile backend, object storage, serverless web serving, and messaging are all available on tap, on demand, and with extreme reliability. Of course, they come at a premium but even people who own cars and houses rent from Hertz and Marriott. The trick is to use it appropriately and wisely and to know when to leave the rental.

Summary

Adoption of the cloud is about many things, and cost is only one of them. Businesses that need agility need to keep a foot in the cloud. Handling non-core concerns with non-core skills eventually bloats an organization’s workforce. Cloud is not a silver bullet; it’s one more powerful choice in an armoury of weapons.

Key Lessons:

Don’t blindly put things on the cloud. Consider continuously, for each product launch, whether you’re more Basecamp or more Netflix. New applications or highly volatile applications are obvious candidates but there are others too. And things change, so review this periodically to see where to move what.
Be less worried about not being able to hire or retain on-premise skills. You don’t need those expert armies anymore, so you can train across your entire workforce much more widely and more frequently. DevOps absorbs much of your expert workload (though this requires a significant cultural shift).
Clean up and simplify your stack. Most on-premise complexity and cost come from purchasing multiple variants, vendors and versions. Be Southwest.
Move in-house for the priciest bits. In AWS today that is RDS (managed databases such as Postgres), so I would strongly suggest moving away from it. Build skills in-house for Postgres management and stop being so scared of databases.
The time when IaaS was the goal is long gone. Move to the cloud only if you’re ready to be cloud native: containerised, DevOps, horizontally scaled etc.
Where it makes sense, make use of some of the extra value adds that the cloud offers for free, such as much better monitoring, autoscaled AI Operations, much finer-grained cost control (Netflix has done some amazing stuff here). If you’re not ready to do all that, replace your CIO rather than your infrastructure.
