Aren’t you glad you didn’t fly with British Airways this past week? If you had flown with Delta last August, you’d also have been stranded, and ditto if you flew American Airlines at some point in 2015. Southwest has suffered twice that many outages in the past five years, and the count grows considerably if we include other airlines in the US and Europe.
It’s not an airline, but when the new chief information officer arrived at Airports Company South Africa (Acsa) recently, he found the company had “no comprehensive technology strategy”, according to a Brainstorm interview. The aviation industry appears to have a serious IT deficit problem.
But back to BA’s incredible failure. Its main data centre went down due to a power problem. The backups didn’t kick in — actually, they may have, but we’ll get to that in a second — and all of BA’s services were unavailable.
Now, airlines operate on hideously complicated technology. Their systems need to coordinate ticket prices, passenger information, baggage details, load balancing data, flight schedules and much more across both internal and external systems, ranging from private companies to government agencies. They all operate on narrow and tightly managed schedules, so one problem can create a cascade effect.
When Delta went down, it was due to a few services being unavailable. It was down for several hours, resulting in an estimated US$150m in losses. With BA, far more went down, resulting in several days of chaos and who knows how much financial damage.
But what happened? Rather, let’s first ponder why it is important we ask that question. BA’s whole data centre should not have gone down and, when it did, it should have had a backup system ready to take over. Something went wrong, though, and the whole house of cards collapsed.
This is important to study, because data centres increasingly run the world. Yes, in theory cloud systems are making redundancy measures a lot simpler and faster. But reality is messier than theory and, as BA demonstrated, a small mistake can be disastrous.
BA’s CEO, Alex Cruz, eventually said it had something to do with the messaging system of its operations. That refers not to how BA staff communicate with each other, but to the communication between its IT services.
I’m not familiar with what messaging does in data centres, so I asked local cloud stack builder Wingu to explain what might have happened. Messaging handles queries between different and often unrelated services (look up Zaqar as an example). Wingu cautioned that this is an educated guess given the vague information from BA, but the airline likely runs a huge mix of old and new systems, using messaging services as the glue that keeps everything together.
Basically, messaging in this scenario sounds like a way to MacGyver legacy and new systems into cooperation.
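To make that idea concrete, here is a minimal sketch of messaging as glue, assuming the pattern Wingu described: a legacy system and a new system never call each other directly; each only talks to a shared queue. The service names and message fields are invented for illustration, and a real deployment would use a broker such as Zaqar or RabbitMQ rather than an in-process queue.

```python
import queue
import threading

# The message bus: the only thing both systems know about.
bus = queue.Queue()

def legacy_booking_system():
    # The old system publishes the events it has always produced.
    bus.put({"event": "ticket_issued",
             "passenger": "A. Smith", "flight": "BA123"})

def new_baggage_system(results):
    # The new system consumes those events without knowing,
    # or caring, which system sent them.
    msg = bus.get()  # blocks until a message arrives
    if msg["event"] == "ticket_issued":
        results.append("baggage record created for "
                       f"{msg['passenger']} on {msg['flight']}")

results = []
producer = threading.Thread(target=legacy_booking_system)
consumer = threading.Thread(target=new_baggage_system, args=(results,))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(results[0])
```

The point of the pattern is exactly what makes it fragile: because neither side knows about the other, the queue itself becomes the single thing holding the whole operation together.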
Taking that further, comments on The Register cast some more light. When a backup system comes online, it may have to take over and match the transactions the previous system was juggling. But if there is a total miscommunication between what came before and what must happen next, things stop working. So perhaps BA’s backup system did come online, but it failed to get on the same page and consequently didn’t do anything.
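That failure mode can be sketched in a few lines. This is a toy model of the hypothesis from The Register's commenters, not BA's actual failover logic; the function name and transaction numbers are invented. The backup only resumes work if its replicated state matches the primary's last known transaction; otherwise it comes online but refuses to process anything.

```python
def failover(primary_last_txn: int, backup_last_txn: int) -> str:
    """Decide whether a backup can take over from a failed primary."""
    if backup_last_txn == primary_last_txn:
        # States agree: the backup can safely resume processing.
        return f"backup active: resuming from txn {backup_last_txn}"
    # States disagree: the backup is online but cannot reconcile the
    # in-flight transactions, so it does nothing -- the "came online
    # but never got on the same page" failure mode.
    return (f"backup stalled: expected txn {primary_last_txn}, "
            f"have txn {backup_last_txn}")

print(failover(1042, 1042))  # a clean handover
print(failover(1042, 987))   # a mismatch: the backup stalls
```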
Why is this scary? I suspect that many data centre systems have a similar flaw and that the companies that own these have little appreciation of how quickly a relatively small mistake can snowball. Even new systems are not safe from innocuous errors: remember recently when a third of Amazon Web Services went down over a mistyped console command?
Airlines operate 24/7, so making wholesale upgrades is very tricky. They are further hampered by thin margins. Banks, which also operate 24/7, can at least throw money at the problem. Most industries can’t. But the AWS incident illustrates that even that may not be enough to stop a small spark from igniting a barrel of dynamite.
When Southwest Airlines’ systems went down, its CEO said it was like a thousand-year flood, a totally unforeseen event. But look at the data, at least in the airline world, and it seems the industry is experiencing a lot of floods.
As the world becomes more reliant on data centres, can we expect more such events? Will cybercrime soon be rivalled by the consequences of accidental stack collapses? Might there one day be an economic slump or a deathblow to a major “too-big-to-fail” institution, all because a glitch turned into a downward spiral?
We aren’t discussing this. Instead, headlines have been screaming about outsourcing and job cuts, which now appear to have had nothing to do with BA’s crash.
I think we should get ready for a lot more of these “thousand-year floods”.
- James Francis is a freelance writer whose work has appeared in several local and international publications