Airlines hit IT turbulence

Aren’t you glad you didn’t fly with British Airways this past week? If you had a flight with Delta in August last year, you’d also have been stranded, and ditto if you flew American Airlines somewhere in 2015. Twice that for Southwest in the past five years and considerably more if we start counting other airlines in the US and Europe.

It’s not an airline, but when the new chief information officer arrived at Airports Company South Africa (Acsa) recently, he found the company had “no comprehensive technology strategy”, according to a Brainstorm interview. The airline industry appears to have a serious IT deficit problem.

But back to BA’s incredible failure. Its main data centre went down due to a power problem. The backups didn’t kick in — actually, they may have, but we’ll get to that in a second — and all of BA’s services were unavailable.

Now, airlines operate on hideously complicated technology. Their systems need to coordinate ticket prices, passenger information, baggage details, load balancing data, flight schedules and much more across both internal and external systems, ranging from private companies to government agencies. They all operate on narrow and tightly managed schedules, so one problem can create a cascade effect.

When Delta went down, it was due to a few services being unavailable. It was down for several hours, resulting in an estimated US$150m in losses. With BA, far more went down, resulting in several days of chaos and who knows how much financial damage.

But what happened? Rather, let’s first ponder why it is important we ask that question. BA’s whole data centre should not have gone down and, when it did, it should have had a backup system ready to take over. Something went wrong, though, and the whole house of cards collapsed.

This is important to study, because data centres increasingly run the world. Yes, in theory cloud systems are making redundancy measures a lot simpler and faster. But reality is messier than theory and, as BA demonstrated, a small mistake can be disastrous.

BA’s CEO, Alex Cruz, eventually said it had something to do with the messaging system of their operations. This isn’t how BA staff communicate with each other, but rather how its IT services communicate with one another.

Messaging

I’m not familiar with what messaging does in data centres, so I asked local cloud stack builder Wingu to explain what might have happened. Messaging handles queries between different and often unrelated services (look up Zaqar as an example). Wingu said it’s an educated guess given the vague information from BA, but it’s likely the airline has a huge mix of old and new systems in operation, thus using messaging services as the glue to keep everything together.

Basically, messaging in this scenario sounds like a way to MacGyver legacy and new systems into cooperation.
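To make the "glue" idea concrete, here is a minimal sketch in Python. It is purely illustrative and not a description of BA's setup: the queue stands in for a real message broker, and the function names, the fixed-width record layout and the flight data are all invented for the example.

```python
import json
import queue

# Hypothetical stand-in for a message broker shared by old and new systems.
broker = queue.Queue()

def legacy_booking_system(record: str) -> None:
    """Old system: produces fixed-width text, wrapped here as a JSON message."""
    message = {
        "type": "booking",
        "flight": record[0:6].strip(),   # assumed layout: first 6 chars = flight
        "passenger": record[6:].strip(), # assumed layout: the rest = passenger
    }
    broker.put(json.dumps(message))

def baggage_service() -> dict:
    """Newer system: only understands JSON messages taken off the queue."""
    return json.loads(broker.get_nowait())

legacy_booking_system("BA0057A PASSENGER")
received = baggage_service()
print(received)
```

Neither system knows how the other works internally. They only agree on the message format, which is what lets decades-old software and new services cooperate, and also why trouble in the messaging layer can silence everything at once.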

Taking that further, comments on The Register cast some more light. When a backup system comes online, it may have to take over and match the transactions the previous system was juggling. But if there is a total miscommunication between what came before and what must happen next, things stop working. So perhaps BA’s backup system did come online, but it failed to get on the same page and consequently didn’t do anything.
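The takeover problem described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a reconstruction of what happened at BA: the `Node` class, the sequence numbers and the flight statuses are all hypothetical, and the backup is assumed to have missed the last message before the primary failed.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    last_applied: int = 0  # sequence number of the last message handled
    state: dict = field(default_factory=dict)

    def apply(self, seq: int, key: str, value: str) -> None:
        # Refuse anything that does not follow directly on what we have seen.
        if seq != self.last_applied + 1:
            raise RuntimeError(
                f"{self.name}: expected message {self.last_applied + 1}, got {seq}"
            )
        self.state[key] = value
        self.last_applied = seq

# Hypothetical message log: (sequence number, flight, status).
log = [(1, "BA0057", "boarding"), (2, "BA0117", "delayed"), (3, "BA0057", "departed")]

primary = Node("primary")
backup = Node("backup")

for seq, flight, status in log:
    primary.apply(seq, flight, status)
    if seq < 3:  # assumption: message 3 never reached the backup
        backup.apply(seq, flight, status)

# The primary dies; the backup comes online and is handed the next live message.
try:
    backup.apply(4, "BA0117", "boarding")
    took_over = True
except RuntimeError as err:
    took_over = False
    print(err)

# Replaying the missed messages from the log lets the backup catch up first.
for seq, flight, status in log:
    if seq > backup.last_applied:
        backup.apply(seq, flight, status)
backup.apply(4, "BA0117", "boarding")
```

In this sketch the backup is running, yet it does nothing useful until it has replayed what it missed. If no such log exists, or the two systems disagree about what was already handled, the backup is online but stuck.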

Why is this scary? I suspect that many data centre systems have a similar flaw and that the companies that own these have little appreciation of how quickly a relatively small mistake can snowball. Even new systems are not safe from innocuous errors: remember recently when a third of Amazon Web Services went down over a console command?

Airlines operate 24/7, so making wholesale upgrades is very tricky. They are further hampered by thin margins. Banks, which also operate 24/7, at least can throw money at the problem. Most industries can’t. But the AWS incident illustrates that even that may not be enough to stop a small spark from igniting a barrel of dynamite.

When Southwest Airlines’ systems went down, its CEO said it was like a thousand-year flood, a totally unforeseen event. But looking at the data, at least in the airline world, it seems they are experiencing a lot of floods.

As the world becomes more reliant on data centres, can we expect more such events? Will cybercrime soon be rivalled by the consequences of accidental stack collapses? Might there one day be an economic slump or a deathblow to a major “too-big-to-fail” institution, all because a glitch turned into a downward spiral?

We aren’t discussing this. Instead headlines have been screaming about outsourcing and job cuts, which now appear to have had nothing to do with BA’s crash.

I think we should get ready for a lot more of these “thousand-year floods”.

James Francis is a freelance writer whose work has appeared in several local and international publications
