TEST TEST TEST, Or Things Will Break Down

23/07/2024

IT Outage

THINGS break. That’s a well-known and fully understood fact of life.

Even the best-designed systems will fail at some point.

Look at my favourite spacecraft, Voyager¹ 1 and 2, launched by NASA (the National Aeronautics and Space Administration) in 1977. Both are now beyond the heliosphere and in interstellar space, and all the while they have kept transmitting scientific data back to us on Earth. They have had failures², but good engineering on NASA’s part recovered them well enough to keep them useful decades past their use-by date.

Sadly, we don’t see enough of such robust set-ups on Earth.

Not that there aren’t systems that are very reliable and dependable. But because some entities have spent years pushing certain perspectives and products, many places that could have deployed robustly engineered systems instead have a foundation of clay with Swiss cheese-like processes (hereafter referred to as Swiss Cheese OS, or SCOS).

In the tech world, this monoculture of insecurely designed systems, along with the convicted, monopolistic behaviour of one entity, has significantly dented many industries and businesses and created opportunities for wasteful work, aka anti-virus products.

Keeping Things Open

Having painted a doomsday scenario in the previous paragraphs, I should add that there is a very bright and significant movement that has stood the test of time: open source software, led by systems like Linux (Red Hat, SUSE, Debian and many others), which brings highly reliable and consistently high-quality products and services at very reasonable costs with zero vendor lock-in.

And, yes, you can confidently run them for yourself without a subscription with those vendors.

The recent global “IT” outage (what a silly phrase) that happened on Friday 19 July 2024, causing widespread disruptions to an estimated 8.5 million Windows devices, was triggered by exactly ONE file that was meant to patch one of the many holes in the SCOS to provide “end-point” protection.

So much for protection.

Swiss Cheese OS

From the picture above, the slices labelled “People” and “Process” are, obviously, highly prone to failure.

We all know that people make mistakes, fall for phishing attempts and fall for social engineering scams.

Processes also fail. No complex process has all corner cases covered. Processes are, after all, designed and run by humans (and sometimes assisted by AI, which makes things even worse when done without thinking).

The Technology slice is the last line of defence. We cannot afford to have those systems be so rampantly and consistently badly designed. And yet they are.

And because those systems are being acquired by entities without thinking, there is a whole slew of businesses (all proprietary, I must add), such as SolarWinds and CrowdStrike, that claim to help make the SCOS “secure”.

These entities exist because SCOS exists. Interestingly, the SCOS maker also sells security services. Which raises the question: why would you actually trust the SCOS maker to sell a security solution to protect its own systems?

Why couldn’t they have fixed the problem in the first place? Why not “do it right at the start”?

Just like Kaspersky and other “anti-virus” companies, these entities would be snuffed out if the SCOS maker got its act together.

Wide Reach

The SCOS maker has a huge footprint in many businesses including, sadly, government (but, thankfully, not defence).

Let’s contrast all of that with the Linux-based ecosystem. EVERY cloud service provider (yes, including SCOS’ Azure) is running Linux as their base OS as well as all of the services that need to be run — containers, SELinux, etc. No competent CIO/CTO would ever allow SCOS in their baseline cloud service if they want it to be secure and reliable (and keep their jobs).

Linux and thousands of high-quality open source software projects today power the global Internet (all your Wi-Fi access points, your network switches, your ISPs, your telcos, your mobile phones, your “smart” TVs, washing machines, coffee machines, electric vehicles, a gazillion IoT devices, etc.). Millions of lines of open source code running in millions of open source projects power the planet’s systems.

Playing It Safe

The open source world has to work with the proprietary vendors as one big ecosystem for the customer, but when failures happen regularly in these proprietary software stacks and systems, one has to ask — why do CIOs/CTOs go down that path?

CrowdStrike Microsoft — The blue screen of death popped up on 19 July 2024 during the IT outage.

Perhaps it has got to do with the old adage “No one gets fired for buying IBM/Microsoft³.”

One is reminded of 2008, when the London Stock Exchange (LSE) fell for technology built on Microsoft .NET and, on Day One of trading with the new system, the whole thing collapsed⁴. LSE’s CTO exited, and the exchange switched to a Linux-based system for all trading requirements.

Could the “IT outage” we saw on 19 July 2024 have been prevented with Linux as the baseline system?

YES.

Will the organisation that continues to push SCOS switch over to Linux systems and sunset SCOS? That’s a real possibility. But let’s talk about it, again, after the next outage pinned on SCOS. I can wait.

Will those who provide bandages and patches for SCOS cry foul? Of course. But those entities exist because of continued poor engineering, and even poorer security of SCOS.

The provider of SCOS prioritised profit over security⁵.

That itself should be a wake-up call to any CIO/CTO paying attention.

To be fair, in this classic case of failure of SCOS, the real issue was CrowdStrike’s poor software testing process.

They could have done A/B or Blue-Green testing. That does not seem to have been the case. And the only reason we all know of their process failure is that this “IT outage” caused a global embarrassment.

On 19 April 2024, exactly three months earlier, systems running Debian and Rocky Linux with end-point monitoring provided by CrowdStrike also crashed⁶ after an untested update from CrowdStrike was pushed out.

In the current “IT outage”, blame can’t be attributed 100% to the maker of SCOS.

CrowdStrike should carry the main load. They should have tested their changes in a tiny, known, and manageable subset of production systems. Systems that they can recover from, should things go kaput — which would happen every now and then.
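To make the "tiny, known, and manageable subset" idea concrete, here is a minimal sketch in Python. It is not CrowdStrike's process or anyone's real tooling: the function names, the `fraction` and `seed` parameters, and the `apply_update`, `is_healthy` and `rollback` callbacks are all hypothetical placeholders for whatever an organisation actually uses.

```python
import random


def pick_canaries(hosts, fraction=0.01, seed=0):
    """Choose a small, reproducible subset of hosts to receive the update first."""
    if not hosts:
        return []
    count = max(1, int(len(hosts) * fraction))
    # A fixed seed keeps the subset "known": the same hosts are picked every time.
    return random.Random(seed).sample(sorted(hosts), count)


def canary_rollout(hosts, apply_update, is_healthy, rollback, fraction=0.01):
    """Update only the canaries; undo everything and stop at the first failure."""
    canaries = pick_canaries(hosts, fraction)
    updated = []
    for host in canaries:
        apply_update(host)
        updated.append(host)
        if not is_healthy(host):
            # Recover every host touched so far, then report the failure.
            for touched in updated:
                rollback(touched)
            return False, canaries
    return True, canaries
```

Only when `canary_rollout` returns `True` would the update be allowed to go any further. When it returns `False`, the blast radius is a handful of machines that were chosen precisely because they can be recovered.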

Once the tests are done, then do A/B testing⁷ and Blue-Green Deployment⁸.
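For readers who have not met Blue-Green Deployment, the idea can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not a real deployment tool: the `BlueGreen` class and the `smoke_test` callback are hypothetical names, and "switching traffic" is reduced to changing one attribute.

```python
class BlueGreen:
    """Two identical environments; only one serves traffic at a time."""

    def __init__(self, live_version):
        self.versions = {"blue": live_version, "green": None}
        self.live = "blue"

    @property
    def idle(self):
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version, smoke_test):
        """Install on the idle side; switch traffic only if smoke_test passes."""
        target = self.idle
        self.versions[target] = version
        if not smoke_test(version):
            return False  # the live side was never touched, nothing to undo
        self.live = target
        return True

    def rollback(self):
        """Point traffic back at the previous environment."""
        if self.versions[self.idle] is None:
            raise RuntimeError("no previous environment to roll back to")
        self.live = self.idle
```

The point of the design is that a bad release fails on the idle side, where no customer sees it, and that a release which passes the checks but misbehaves later can be undone by flipping back rather than by repairing machines one at a time.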

Learn what broke and fix it. Re-run the tests.

Remember, TEST, TEST, TEST. Even when dealing with robust operating systems like Linux, TEST, TEST, TEST.

Just stop using SCOS, Mr/Ms CIO/CTO. How many more failures do you have to fret over? Your organisation will be thankful you moved to Linux.

This is already 2024 and not 1995. Open source technologies have won the game. We can wait for all of you to join us.

The views expressed in this article are those of the author.

ABOUT THE AUTHOR

Harish Pillay is a seasoned open source technologist and an established leader in technological innovation.

He currently leads the AI Verify Foundation focusing on community building to bring testability and accountability to AI solutions globally. He is concurrently the Chief Open Source Officer of TOOOPLE Pte Ltd.

Harish used to be the Chief Technology Architect at Red Hat Asia Pacific and served as an Adjunct Professor at Nanyang Technological University.

REFERENCES

¹ https://voyager.jpl.nasa.gov/

² https://www.wired.com/story/nasa-repair-voyager-1-spacecraft-data/

³ https://www.infoworld.com/article/2322970/no-one-gets-fired-for-buying-ibm.html

⁴ https://www.computerworld.com/article/1671758/london-stock-exchange-timeline-of-technical-problems.html

⁵ https://www.npr.org/2024/06/13/nx-s1-5003958/whistleblower-tells-propublica-about-microsofts-cybersecurity-lapses

⁶ https://news.ycombinator.com/item?id=41005936

⁷ https://en.wikipedia.org/wiki/A/B_testing

⁸ https://en.wikipedia.org/wiki/Blue–green_deployment