Cloud bill shock and the quiet return of on-prem

Carl Heaton · Infrastructure Commentary

A few numbers, all from the last six weeks of reporting, sketch the same picture from different angles.

Railway, a developer platform that spent roughly $2 million a month with Google Cloud, was disabled without warning on 19 May. Its API, its virtual machines, and its database instance all went down. The outage lasted eight hours. The founder called it "beyond insane". Two years earlier, Google Cloud had deleted UniSuper's entire VMware Engine environment across two regions because a parameter was left blank in a configuration. The pattern is not new and the customers are not small.

Splunk's annual downtime report, published in May, puts the global cost of unplanned downtime for the largest 2,000 firms at around $600 billion a year, a 50% increase in two years. The average cost per minute is $15,000. The biggest single cause is SaaS and third-party application failures, which have tripled since 2024.

Uber, separately, burned through its entire 2026 AI budget by mid-April. One Dell employee racked up $3,400 of token charges in 24 hours. Per-token prices are falling, but consumption is rising faster. William Fellows at 451 Research framed the underlying problem cleanly: "do the same thing a few times, even on the same model, and each prompt will consume a different amount of tokens". The unit cost is going down. The bill is going up.

Put together, these three threads (vendor risk, downtime cost, and AI spend that nobody can forecast) are why "post-cloud" has gone from a Gartner slogan to something firms are actually doing. The lesson for an SME is not "move off the cloud". It is something a bit more careful.

What the post-cloud strategy is, and is not

The phrase is doing a lot of work. Jay Litkey at Flexera gave the most honest summary in the ITPro piece: "I do not think 'post-cloud' means moving away from cloud. I think it means moving away from the default cloud." The shift is from "everything in one hyperscaler because that was the easy decision in 2018" to "this workload here, that workload there, with an exit option on each."

Gartner expects 90% of organisations to be on hybrid infrastructure by 2027. The hybrid in question is not the old data-centre-plus-cloud version; it is the hyperscaler-plus-sovereign-cloud-plus-private-infrastructure-plus-edge version, with different workloads in different places for different reasons.

The drivers, in the order most firms encounter them:

Cost. The marketing line was always "you only pay for what you use". The real bills are dominated by what Jay Litkey calls "the layers underneath": data egress charges, inter-region traffic, premium support tiers, the storage class you forgot to set, the snapshots that nobody deletes. James Brooks at HPE called this out: "Secondary and consumption-based elements such as data egress charges are where costs quietly compound." For a small business the absolute numbers are smaller, but the same shape applies. The first cloud bill that doubles unexpectedly is usually the trigger for asking whether the workload could run somewhere cheaper.

Lock-in risk. The Railway shutdown is the case study to internalise. Railway was a paying customer at $2 million a month. They were running production workloads. They were given no warning. The reason was an "automated action". The recovery took eight hours, during which a non-trivial slice of the modern web that depended on Railway was also offline. Mark Duff at Mitel framed the response: "Security, compliance and digital sovereignty are becoming deciding factors in technology decisions." That is regulator-friendly language for "we want to be able to leave when we need to".

AI spend that nobody can forecast. The Uber and Dell numbers are the visible edge of a deeper problem. Per-call costs are nondeterministic. The same agent doing the same task can cost ten cents one day and ten dollars the next, depending on which path the model chose. Most finance teams have no mental model for this, because the rest of their cost base is forecastable. Dell's COO Jeff Clarke pitched on-premises hardware as "a free, unmetered token generator". That is marketing, but the underlying insight is correct: for a workload running constantly, the maths of buying the hardware can beat renting it.
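The buy-versus-rent maths can be sketched in a few lines. This is a minimal illustration, not a costing model: the function name and every figure below are hypothetical, and a real comparison would also need to account for depreciation, staff time, and utilisation.

```python
def breakeven_months(hardware_cost, monthly_running_cost, monthly_cloud_cost):
    """Months until a one-off purchase pays for itself, or None if it never does."""
    monthly_saving = monthly_cloud_cost - monthly_running_cost
    if monthly_saving <= 0:
        # Running it yourself costs at least as much per month as renting,
        # so the purchase price is never recovered.
        return None
    return hardware_cost / monthly_saving


# Illustrative figures only; not taken from any vendor price list.
months = breakeven_months(
    hardware_cost=30_000,       # one-off purchase
    monthly_running_cost=500,   # power, hosting, maintenance
    monthly_cloud_cost=3_000,   # current metered spend for the same workload
)
print(months)  # 12.0
```

The result only means something for a workload that runs constantly. If the metered spend swings from month to month, the monthly_cloud_cost input is itself a guess, which is the forecasting problem in miniature.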

Why an SME does not just decamp

Three reasons it is not as simple as "move off AWS".

The cloud is doing things you do not see. The hyperscaler runs the patching, the redundancy, the geographic failover, the DDoS protection, the certificate rotation, and twenty other operational concerns that nobody on staff has the time to do. The cost of replacing all of that on a small set of in-house servers is, for most SMEs, more than the cloud bill it would replace. The post-cloud move is sensible for the workloads where the firm can do those things itself or pay a specialist to do them; it is a bad move for the workloads where the firm could not.

Concentration risk does not disappear, it relocates. Moving from one hyperscaler to another is not diversification. Moving from a hyperscaler to a specialist UK provider is partial diversification, with new dependencies. Moving to on-premises adds a hardware-failure single point of failure that the cloud had spread across thousands of disks. The right question is not "who do we depend on", but "what is our plan when each of them fails", and answering it properly is the work.

The cloud bill is also the cloud features. Some of what you are paying for is a managed database, a managed queue, a managed identity service, a managed deployment pipeline. Replicating those on-premises is expensive. Replicating them poorly is dangerous. The honest comparison is the workload-by-workload one: this particular service costs us X on AWS and would cost us Y to run elsewhere, with Z in feature loss. The aggregate comparison is rarely the right one.
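One way to keep the comparison at workload level is to record X, Y, and Z per service and rank the results. The sketch below assumes a hypothetical WorkloadComparison record; the field names and sample figures are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class WorkloadComparison:
    name: str
    current_monthly_cost: float       # X: what the service costs today
    alternative_monthly_cost: float   # Y: estimated cost elsewhere, staff time included
    features_lost: list = field(default_factory=list)  # Z: what you give up

    def monthly_saving(self):
        return self.current_monthly_cost - self.alternative_monthly_cost


def summarise(workloads):
    """One line per workload, largest estimated saving first."""
    lines = []
    for w in sorted(workloads, key=lambda w: w.monthly_saving(), reverse=True):
        lost = ", ".join(w.features_lost) or "none"
        lines.append(f"{w.name}: saving {w.monthly_saving():.0f}/month, loses: {lost}")
    return lines


# Hypothetical workloads and figures.
for line in summarise([
    WorkloadComparison("reporting-db", 900, 400, ["managed backups", "point-in-time restore"]),
    WorkloadComparison("login-service", 150, 600),
]):
    print(line)
```

A negative saving is a useful answer too: it says the workload should stay where it is.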

What an SME should actually do

Three small pieces of work change the position without large structural change.

Tag the bill. Whatever cloud account you have, put cost tags on resources by project, team, or product line. AWS, Azure, and GCP all support this for free; the value comes from the discipline of using them. Once tagged, the monthly bill is comprehensible: this product cost us X, that one cost us Y. Without it, the bill is a single mysterious number and arguments about it are emotional.
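Once resources carry tags, the roll-up is simple. The sketch below assumes a cost export in CSV form with hypothetical "project" and "cost" columns; real provider exports use their own column names, so treat these as placeholders.

```python
import csv
import io
from collections import defaultdict


def cost_by_tag(rows, tag_column="project", cost_column="cost"):
    """Sum cost per tag value; rows with an empty tag are grouped as 'untagged'."""
    totals = defaultdict(float)
    for row in rows:
        tag = row.get(tag_column) or "untagged"
        totals[tag] += float(row[cost_column])
    return dict(totals)


# Made-up sample data standing in for a monthly export.
SAMPLE = """resource,project,cost
vm-1,shop,120.50
db-1,shop,310.00
vm-2,reporting,75.25
bucket-9,,42.00
"""

totals = cost_by_tag(csv.DictReader(io.StringIO(SAMPLE)))
for tag, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tag}: {total:.2f}")
```

The "untagged" line is the one to watch. If it is a large share of the total, the tagging discipline has not taken hold yet.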

Run an exit-test once a year. Pick one significant workload. Write down what it would take to move it to a different provider, or to a private host. Estimate the cost, the time, the data migration risk, and the feature loss. You will not actually move it. The point is to know whether you could, and to find out before a vendor surprise like Railway's makes the question urgent. We wrote about the same shape of problem in "Gov.uk Pay swapped Stripe for Adyen, read the exit clause".

Decide which workloads are price-sensitive enough to revisit. Three categories tend to be candidates for the post-cloud rethink. Heavy data egress workloads, where the bill is dominated by moving bytes out. Steady-state compute workloads that run at consistent load, where the cloud's "elastic" pricing buys flexibility you do not need. And AI workloads with predictable shapes, where a dedicated GPU box would amortise itself in months. None of these has to leave the cloud. All of them are worth asking the question about.
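The first two categories can be screened with rough arithmetic on numbers most firms already have. The sketch below is one possible heuristic, not an established method: the function names and both thresholds are assumptions chosen for illustration, and it expects a non-empty list of load samples.

```python
from statistics import mean, pstdev


def egress_share(egress_cost, total_cost):
    """Fraction of a workload's bill spent on moving data out."""
    return egress_cost / total_cost if total_cost else 0.0


def load_variation(hourly_load):
    """Coefficient of variation of load samples; a low value means steady load."""
    avg = mean(hourly_load)
    return pstdev(hourly_load) / avg if avg else 0.0


def revisit_reasons(egress_cost, total_cost, hourly_load,
                    egress_threshold=0.4, steady_threshold=0.15):
    """Reasons a workload is worth a second look. Thresholds are arbitrary defaults."""
    reasons = []
    if egress_share(egress_cost, total_cost) >= egress_threshold:
        reasons.append("egress-heavy")
    if load_variation(hourly_load) <= steady_threshold:
        reasons.append("steady-state")
    return reasons


# Hypothetical workload: half the bill is egress, and load barely moves.
print(revisit_reasons(egress_cost=500, total_cost=1000, hourly_load=[10, 11, 10, 9]))
```

An empty list does not mean the workload is well placed, only that neither of these two signals fired.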

A fourth, harder action is worth flagging. For any workload that would cost the business real money if it disappeared for a working day, write down what your fallback is. The Railway incident hurt because Railway had no fallback for "the provider switches us off". UniSuper's incident hurt because there was no fallback for "the provider deletes our environment". The honest fallback for most SMEs is "we restore from backup elsewhere", which is fine if "elsewhere" actually exists and has been tested in the last six months. If it has not, the fallback is not real.
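The six-month test is easy to check mechanically if restore tests are written down. A minimal sketch, assuming a hypothetical record of workload names mapped to the date of the last successful restore test (None if it was never tested), with six months approximated as 183 days:

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=183)  # roughly six months; an approximation


def stale_fallbacks(last_tested, today):
    """Workloads whose restore was never tested, or was tested too long ago."""
    stale = []
    for workload, tested_on in last_tested.items():
        if tested_on is None or today - tested_on > MAX_AGE:
            stale.append(workload)
    return sorted(stale)


# Hypothetical records.
records = {
    "billing": date(2025, 1, 10),
    "website": date(2025, 6, 1),
    "crm": None,
}
print(stale_fallbacks(records, today=date(2025, 9, 1)))  # ['billing', 'crm']
```

Anything this returns is a fallback that exists on paper only.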

The deeper change

The thing under all of this is that the cloud is now in its second decade and the customers are getting more sophisticated. The early arguments were "cloud good, data centre bad". The current arguments are about which workloads suit which model, and how to keep optionality through the choice. The hyperscalers know this, which is why all three are now selling sovereign-cloud editions, on-premises hardware that bridges to their cloud, and connectivity products that make moving workloads back and forth easier. The market is offering more granularity precisely because customers are asking for it.

For an SME the practical horizon is shorter. Tag the bill, run the exit-test, identify the candidates. Most workloads will stay where they are. The few that do not are usually the ones where the unit economics had quietly moved against you, and the cloud bill shock was the prompt to look.

How Steelwise can help

Tagging the cloud bill, running an honest exit-test on the workloads that matter, and writing the short fallback plan for vendor-shutdown scenarios are the kind of practical work we do with clients. Get in touch.
