How faulty software has left society on the edge of disaster

 (iStock/The Independent)

When a routine Air Canada flight came in to land at San Francisco on a July evening in 2017, it missed a lethal disaster by just 13 feet.

Unaware that one of the airport’s runways was closed, the pilot – exhausted after 19 hours awake – had attempted to land on a taxiway on which four other planes were lined up awaiting takeoff.

Had he not noticed his error and pulled up in time, the crash could have been more deadly than the 1977 Tenerife Airport disaster, in which two Boeing 747s collided on a foggy runway, killing 583 people.

Technically, the pilot had been warned about the closure of the runway. But the warning had been buried on page eight of a 27-page briefing written in all capital letters in a bizarre, Byzantine code via the US government’s Notice To Air Missions (NOTAM) system, a widely resented institution with some components 30 years old.

This incident illustrates just how much the developed world depends on fragile, outdated or just plain janky software systems that run our critical infrastructure behind the scenes – from air travel to water treatment plants to postal services.

The dangers of that situation were underlined last week when NOTAM spectacularly imploded, depriving pilots of vital information about potential hazards along their routes and consequently cancelling or delaying nearly 12,000 flights.

Nor was it the first such glitch. Over Christmas, the US air carrier Southwest Airlines suffered a multi-day meltdown that union leaders blamed partly on unwieldy software that collapsed under the weight of historic winter storms.

Outdated or inadequate software has also been blamed for an attempted cyber-attack on a water treatment plant in Florida in 2021, and road deaths due to defective electronic throttle systems in Toyota cars.

“These aren’t just teaching moments, these are entire university curriculum moments that need to be studied, examined and addressed,” says Henry Harteveldt, president of the Atmosphere Research Group consultancy, referring to the problems with NOTAM and Southwest.

“This does expose some of the vulnerabilities, and of course the biggest vulnerability everybody fears is a cyber-attack. What really concerns me is: do any of these events illustrate weaknesses within the systems that could be used to cause an absolutely catastrophic, almost doomsday-like scenario?”

‘Devastating effects on human life’

In 2011, the prolific Silicon Valley venture capital firm Andreessen Horowitz declared that “software is eating the world”. A decade and change later, the world has been thoroughly digested.

Now, as then, every major infrastructure service depends on software, and most depend on a complex network of interlocking systems, any one of which can go wrong.

The tech industry has a plethora of piquant terms for the problems that can afflict large coding projects: “spaghetti code”, “software rot”, or “dependency hell”, to name just a few.

One of the most dangerous for large institutions is “technical debt”, meaning the cost paid tomorrow for coding decisions taken yesterday. Organisations with heavy technical debt can be trapped in reliance upon ageing software that is no longer fit for its purpose, yet without the resources – or, more commonly, the determination by senior managers – to fix it.

“Technical debt often goes hidden, but there is no doubt it is having an impact on the reliability and quality of critical national infrastructure,” says Junade Ali, a British computer scientist and expert on technical debt who has worked on the UK’s road signalling network and Google and Apple’s Covid exposure notification system.

“Unmanaged technical debt can have devastating effects on human life, from miscarriages of justice to death. [It] also reduces the agility of a business by slowing its ability to test new features in the real world, get user feedback and iterate rapidly…

“As software is becoming increasingly complex and more of the world is dependent on software, the challenge is ever-growing.”

Consider, for instance, a computer accounting system called Horizon built for the British Post Office at the cost of around £700m in taxpayer money. Between 1991 and 2015, 918 employees were successfully prosecuted for supposed financial discrepancies recorded by the system, in some cases reportedly leading to bankruptcy, divorce and even suicide.

As early as the year 2000, however, there were allegations that Horizon was riddled with errors. A series of external reviews and court judgements backed that up, and today many of the prosecutions have been overturned or their targets paid compensation.

Another case of allegedly deadly code involved Toyota, which was forced to recall cars and settle a string of lawsuits after claims that its throttle software caused sudden and unintended acceleration that may have led to as many as 89 deaths and 57 injuries. In 2013, a jury in Oklahoma found it had shown “reckless disregard” for public safety, although Toyota settled that too without admitting responsibility.

Michael Barr, a software testing expert who undertook a confidential review of one Toyota throttle system, testified that it had multiple problems that could have caused a 2007 crash. In a later presentation, he said Toyota software had suffered from “spaghetti code” (which means just what it sounds like) and a lack of proper safety systems that could detect and prevent errors as they occurred.

Other cases cited by Barr include a computer glitch in a US Army Patriot missile launcher during the first Gulf War in 1991 that caused it to ignore an incoming Iraqi missile – leading to 28 deaths and 100 injuries – and errors that caused a radiotherapy machine in the 1980s to give out lethal overdoses of radiation.

That is not to mention the numerous occasions on which an inappropriate reliance on Microsoft Excel spreadsheets has caused crucial systems to break down, including at the bank JP Morgan Chase and at Britain’s public health agency during the Covid-19 pandemic.

Accidental errors can also make these systems vulnerable to cyber-attacks, especially in an age when state-sponsored professional hacking groups prowl the internet while winning attack strategies are bought and sold on the dark web.

“Much of the critical infrastructure that we rely on today was established long before the appropriate software – or even the concept of cybersecurity – came along,” says John Fokker, head of intelligence at the cybersecurity firm Trellix.

“Often based on legacy operating systems that were set up decades ago, these organisations are using software that is rarely updated – if at all. A successful attack could have a potentially devastating impact.”

How US air travel imploded this Christmas

It’s not clear exactly what the root cause of last week’s NOTAM outage was. The Federal Aviation Administration (FAA), which maintains the system, has said there is “no evidence” of a cyber-attack, instead blaming an engineering contractor who allegedly damaged a key data file by failing to follow procedures.

If so, the FAA has questions to answer about how one mistake was enough to disrupt the whole system to the extent that officials were forced to reboot everything to get it up and running.

What we do know is that the system has long been criticised for its obtuseness and fragility. The current iteration is a patchwork of older and newer software layers that must interact with each other, and prior to the outage it was not due to be upgraded for at least six years.

In fact, according to OpsGroup, a pugnacious grassroots association of air industry professionals, NOTAM still uses a text encoding format that dates back to 1924, designed for telegraph machines and incapable of displaying lowercase letters.

That is part of the reason why NOTAM messages are written in nigh-on inscrutable sigils such as: “A0290/21 NOTAMN. Q) VHHK/QNMAU/IV/NBO/AE/000/999/2219N11355E005. A) VHHH. B) 2105252130 C) 2105252329. E) SIU MO TO DVOR/DME ‘SMT’ 114.80 MHZ/CH95X NOT AVBL DUE MAINT.”
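For readers curious how such a notice breaks down, the following is a minimal Python sketch that splits the sample above into its lettered fields and decodes the two timestamps. It assumes the common ICAO layout, in which each field starts with a letter and a closing bracket, the B) and C) fields hold start and end times as YYMMDDHHMM in UTC, and the Q) line is a slash-separated summary. The function names (`split_fields`, `decode_time`, `parse_q_line`) and the two-entry glossary are illustrative inventions, not part of any official NOTAM tooling, and real notices contain variations this sketch does not handle.

```python
import re
from datetime import datetime, timezone

# The sample notice quoted in the article.
SAMPLE = (
    "A0290/21 NOTAMN. Q) VHHK/QNMAU/IV/NBO/AE/000/999/2219N11355E005. "
    "A) VHHH. B) 2105252130 C) 2105252329. "
    "E) SIU MO TO DVOR/DME ‘SMT’ 114.80 MHZ/CH95X NOT AVBL DUE MAINT."
)

# Assumption: a field begins with one letter (A-G or Q) followed by ")",
# at the start of the text or after whitespace.
FIELD_MARKER = re.compile(r"(?:^|\s)([A-GQ])\)\s*")

# Illustrative, deliberately tiny glossary of standard abbreviations.
GLOSSARY = {"AVBL": "available", "MAINT": "maintenance"}


def split_fields(notam: str) -> dict[str, str]:
    """Split a NOTAM into its header and lettered fields."""
    matches = list(FIELD_MARKER.finditer(notam))
    fields: dict[str, str] = {}
    if matches:
        fields["header"] = notam[: matches[0].start()].strip().rstrip(".")
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(notam)
        fields[match.group(1)] = notam[match.end():end].strip().rstrip(".")
    return fields


def decode_time(stamp: str) -> datetime:
    """Decode a YYMMDDHHMM timestamp, assumed to be in UTC."""
    return datetime.strptime(stamp, "%y%m%d%H%M").replace(tzinfo=timezone.utc)


def parse_q_line(q_line: str) -> dict[str, str]:
    """Name the slash-separated parts of the Q) summary line."""
    names = ["fir", "code", "traffic", "purpose", "scope",
             "lower", "upper", "position"]
    return dict(zip(names, q_line.split("/")))


def expand(text: str) -> str:
    """Replace any glossary abbreviations with plain words."""
    return " ".join(GLOSSARY.get(word, word) for word in text.split())


if __name__ == "__main__":
    fields = split_fields(SAMPLE)
    print("Airport:", fields["A"])
    print("From:", decode_time(fields["B"]).isoformat())
    print("Until:", decode_time(fields["C"]).isoformat())
    print("Region:", parse_q_line(fields["Q"])["fir"])
    print("Text:", expand(fields["E"]))
```

Run on the sample, the sketch reports a notice for the airport coded VHHH lasting just under two hours on 25 May 2021, warning that a navigation beacon is not available due to maintenance. Even this simple case needs a parser and a glossary before the hazard becomes legible, which is the nub of the complaint.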

Worse, everyone from OpsGroup to the then head of the US National Transportation Safety Board (NTSB) agrees that NOTAM – in theory reserved for essential updates about genuine hazards – is utterly clogged with superfluous or irrelevant notices, making it easy to miss actually important information.

The International Civil Aviation Organisation (ICAO), which is attempting to reform NOTAM, has said that around 20 per cent of the active notices are older than 90 days. In Albania, there is reportedly an active NOTAM from the year 2000 offering advice to pilots about the Millennium Bug.

OpsGroup has also documented examples of dueling NOTAMs issued by the governments of Turkey and Greece, disputing each other’s right to issue NOTAMs relating to territory rights claimed by both nations.

One former airline pilot has even claimed that on the day that Malaysia Airlines Flight MH17 was shot down over Ukraine, killing 298 people, there was a cryptic but critical NOTAM issued for that area, which might have averted disaster if it had been clearer.

No wonder NTSB head Robert Sumwalt said in a 2018 hearing that NOTAMs “are just a bunch of garbage that nobody pays any attention to”. This year’s NOTAM failure is now being investigated by Congress.

Other air travel technologies have also suffered outages in the last few years. An air traffic control system called ERAM has failed seven times since 2014, most recently on 2 January this year. In 2021, a private sector reservation system called SABRE suffered an outage too.

Then there is the Southwest meltdown, which the Southwest Airlines Pilots Association (SWAPA) has blamed partly on a custom-built automated crew scheduling system called SkySolver.

Southwest’s “point to point” flight network depends on a complex dance of planes and staff moving from city to city, being in the right place at the right time for their next assignment. When one flight is delayed or cancelled, SkySolver reportedly finds a way to resolve the problem and reassigns planes and staff as needed.

But SWAPA says that it can only handle 200 to 300 scheduling changes at a time, meaning it was completely overwhelmed when freezing weather blanketed much of the US, driving the number of individual pilot reassignments as high as 600 per hour.

Amid the chaos, SWAPA claims, SkySolver repeatedly created solutions that simply did not work in practice, and did not take into account the quickly evolving situation. The group says only 15 per cent of SkySolver solutions between 20 December and 29 December were actually flown, with 85 per cent made obsolete before they could fly.

Staff were left stranded in hotels while they waited for new assignments, invisible to the software system and unable to get through to the human schedulers at a call centre who were manually trying to fix the mess. “We have crews stuck, and scheduling doesn’t know where they are,” SWAPA head Casey Murray told The Wall Street Journal.

SWAPA also says this led to planes and crew being flown from city to city purely to put them in the right position, even though there were actually enough staff available to legally take passengers. Because the scheduling system did not know where these employees were, and they could not reach the scheduling team, they could not be reassigned, and the planes flew empty.

To add insult to injury, the group’s data shows more than 500 incidents where these “position ferries” were flown on the same routes where passenger flights were cancelled.

In response to questions from The Independent, a spokesperson for Southwest said that it has been spending roughly $1bn on IT upgrades and maintenance each year. He said the company replaced its reservations system in 2017, its technical operations record system in 2021, and its “human capital management system” in 2022.

Although Southwest chief executive Bob Jordan has apologised and accepted responsibility for the incident, he downplayed the role of software, telling The New York Times: “There’s been confusion over ‘well, your technology failed.’ The technology did not fail; it worked as designed. Our processes worked as designed; they just were all hit by overwhelming volume.”

He added that in 2022, eight new versions of SkySolver were released.

Hackers are probing America’s water treatment system

In February 2021, a water treatment plant worker in Florida noticed his mouse cursor dancing around the screen on its own.

Before his eyes, the cursor opened up various programmes that controlled the water treatment process and boosted the level of sodium hydroxide – a poisonous substance commonly known as lye, which is used in drain cleaner and in small amounts to remove metals from drinking water – to 100 times its normal level.

The sabotage was swiftly reversed, and the plant had physical safety systems that would have stopped lye-rich water from being piped into anyone’s home. Yet the incident illustrated how America’s roughly 50,000 community water systems, often run by local governments and without their own dedicated cybersecurity staff, could be tempting targets for hackers.

This was far from the first or the only incident. Between 2019 and 2021, cyber-attacks struck water and wastewater institutions in California, Maine, Nevada, New Jersey, Kansas and beyond, according to the US Cybersecurity and Infrastructure Security Agency (CISA).

Another study found 25 incidents reported by US water utilities in 2015 alone, noting that there may be others never reported.

CISA also warned that water treatment plants “commonly use outdated control system devices or firmware versions, which expose [them] to publicly accessible and remotely executable vulnerabilities.”

Outdated software was certainly to blame in the Florida case, where investigators found multiple off-site computers running old versions of Microsoft Windows, sharing a single password to access a remote access programme that had been replaced about six months beforehand but never actually removed.

As CISA’s then head Chris Krebs wrote, “Unfortunately, that water treatment facility is the rule rather than the exception.”

Trellix, the cybersecurity firm, says its research has found that many critical infrastructure institutions are “extremely vulnerable to attack” because they do not follow cybersecurity best practices such as keeping software up to date.

“Given the FAA outage last week, it is clear that outdated security systems and siloed legacy architectures are no longer fit for purpose,” says John Fokker.

“A successful attack could have a potentially devastating impact. It could halt operations which could have a far-reaching and widescale effect – not only on the organisation itself, but staff members, customers, and even on society as a whole.”

Why do software glitches go unfixed?

In many of these cases, there were ample warnings. OpsGroup and ICAO have been lobbying to fix NOTAM for years, while the FAA has long been working to modernise the system.

Meanwhile, SWAPA has referred to SkySolver as “a house of cards”, claiming that Southwest has ignored its entreaties about “numerous and ever-increasing meltdowns”.

So why do technical debt and other software hang-ups persist?

For water systems, the problem is simple: thousands of small institutions run by often under-funded local governments, usually sharing their IT staff with other departments.

“When an organisation is struggling to make payroll and to keep systems on a generation of technology created in the last decade, even the basics in cybersecurity often are out of reach,” wrote Krebs in 2021.

CISA also noted: “[Water] facilities are inconsistently resourced municipal systems – not all of which have the resources to employ consistently high cybersecurity standards… [they] tend to allocate resources to physical infrastructure in need of replacement or repair (eg pipes) rather than IT infrastructure.”

Harteveldt says there is a similar problem in the aviation industry, which suffers from an unusual combination of heavy tech dependence and poor tech investment. “When you talk to an airline CEO about an investment, they will tell you they’d rather buy an airplane, because they know that’s what makes them money – rather than take half the money an airplane would cost and invest it in IT, [which] could take years to start showing a return.”

Ironically, he argues, the industry launched essentially the first e-commerce business back in the 1960s when it created a computerised nationwide booking system. Today, however, its high costs and low profit margins mean companies tend to invest around one to two percentage points less of their annual revenue in IT than other sectors.

For the FAA and other state agencies, there are the traditional problems of taxpayer funding: budgets being used as a political football, bureaucratic inertia and, in the US, a persistent legislative gridlock that has left the FAA still run by a temporary acting administrator, with no permanent chief confirmed by Congress.

“If you are a hotshot IT professional and you want to work in an environment where you have state of the art technology, and leadership that appreciates the importance of technology, you probably are not going to seek out a career either at the FAA or at an airline,” says Harteveldt.

Junade Ali’s research has also found basic problems shared across various sectors. “I spent much of the early part of my career successfully dealing with egregious levels of technical debt. I’m afraid the road to addressing it requires persistence and the success rate for most is low,” he says.

Building genuinely resilient software – let alone unwinding past technical debt – is often slow, complex and expensive work, requiring serious investment, commitment from leadership and specialised practices such as building automated tests to catch errors and monitor software while it is running.

“Estimates vary, but most research converges on the statistic that only one-third of digital transformation efforts ultimately end up being successful,” he concludes.

For the air industry, Harteveldt is optimistic, saying: “I think the Southwest event, along frankly with the NOTAM event, will be a catalyst for a lot of airlines, not just in the US but around the world… if the outcome of this is a recognition by airline leadership that they need to do a much better job of investing in technology than they have, then ultimately these catastrophes will not have occurred in vain.”

If not, Harteveldt fears the consequences. “The FAA collapse is alarming because it illustrates the fragility of the FAA’s systems, and air travel in the US is mission critical to how our country functions,” he says.

“Imagine if an air traffic controller was giving incorrect information because someone had hacked the tower system, giving approval to one aircraft to do one thing and another aircraft to do something else, and it resulted in a collision… I think this is the thing that everybody who works with aviation technology fears most.”