One of the biggest drivers for this redundancy is the harsh radiation environment of space, where high-energy particles can affect avionics and create "wrong answers" that must be filtered out of the flight solution.
Source: How NASA Built Artemis II's Fault-Tolerant Computer
This lede doesn't do justice to the incredible engineering behind Artemis II's flight control computer - one of the world's most redundant computer systems.
To ensure those wrong answers never reach the spacecraft's thrusters, NASA moved beyond the triple redundancy of traditional systems. Orion utilizes two Vehicle Management Computers, each containing two Flight Control Modules, for a total of four FCMs. But the redundancy goes even deeper: each FCM consists of a self-checking pair of processors.
Effectively, eight CPUs run the flight software in parallel. The engineering philosophy hinges on a "fail-silent" design. The self-checking pairs ensure that if a CPU performs an erroneous calculation due to a radiation event, the error is detected immediately and the system responds. … This level of redundancy is specifically scaled for the rigors of deep space. NASA anticipates transient failures during the Artemis II mission's transit through the high-radiation Van Allen Belts.
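The fail-silent-pair-plus-voting pattern described above can be sketched in a few lines. This is a toy illustration, not Orion's actual flight software: the function names, the equality check as the "self-check", and the majority vote are all my assumptions about how such a scheme works in general.

```python
# Sketch: a fail-silent self-checking pair, and voting across channels.
# Hypothetical illustration only - names and mechanics are assumptions.
from collections import Counter

def self_checking_pair(compute_a, compute_b, state):
    """Run the same computation on two lock-stepped processors.
    If their outputs disagree (e.g., a radiation-induced bit flip on one),
    the pair goes silent (returns None) instead of emitting a wrong answer."""
    a = compute_a(state)
    b = compute_b(state)
    return a if a == b else None  # disagreement => fail silent

def vote(channel_outputs):
    """Accept the majority answer from whichever channels spoke up."""
    spoke = [o for o in channel_outputs if o is not None]
    if not spoke:
        return None  # every channel went silent
    value, _count = Counter(spoke).most_common(1)[0]
    return value
```

The key property: a corrupted pair never contributes a wrong answer to the vote; it simply drops out, and the remaining channels carry the solution.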
A solid engineering solution succinctly explained is a work of art. I am intrigued by the other engineering challenges the team faced. They had to contend with other constraints - CPU cycles, latency, cycle time, power - in addition to larger constraints like total weight and total power impacting the Orion capsule itself. Here's the simplest challenge to imagine. If you want to add some additional loop check that adds 1s of software cycle time (with its own CPU and power costs), first you have to determine whether you can take the hit of it running on 8 CPUs in parallel. Then you potentially have other solutions on the table: is there a hardware relay from the Apollo era that might achieve the same thing? How do you manage those trade-offs?
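The trade-off in that paragraph can be framed as a back-of-the-envelope budget check. This is purely illustrative; the budget model (added time counts once per frame since the check runs in parallel, while power draw multiplies by the CPU count) and all numbers are my assumptions, not NASA figures.

```python
# Hypothetical budget check: does an extra software check fit the
# frame-time and power budgets when it runs on all 8 CPUs in parallel?
# Assumed model: parallel execution adds the check's time to every CPU's
# frame once, but its power cost is paid on every CPU simultaneously.

def fits_budget(check_time_s, frame_slack_s, check_watts, power_slack_w, n_cpus=8):
    time_ok = check_time_s <= frame_slack_s       # does it fit the cycle time?
    power_ok = check_watts * n_cpus <= power_slack_w  # 8x the power draw
    return time_ok and power_ok
```

If the check fails either budget, the alternatives the author mentions (e.g., a hardware relay) come onto the table.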
"Modern Agile and DevOps approaches prioritize iteration, which can challenge architectural discipline," Riley explained. "As a result, technical debt accumulates, and maintainability and system resiliency suffer."
Yes, the trade-off in modern software development is: keep punting the problem until it becomes the biggest problem, with the belief that we can field an army / special-ops team to solve it at that time. There is no army / special-ops in space. This is a very greedy approach in software (and while not intended to be about money, there is some commercial impact too).
While the four-FCM primary system is robust, NASA must still account for common-mode failures: software bugs or catastrophic events that could theoretically impact all primary channels simultaneously.
To mitigate this, Orion carries a completely independent Backup Flight Software (BFS) system. This is a prime example of dissimilar redundancy. It is implemented on different hardware, runs a different operating system, and utilizes independently developed, simplified flight software. … ==Even in a total power loss scenario, called a "dead bus", Orion is designed to survive. If power is restored, the spacecraft enters a safe mode, in which the vehicle first stabilizes itself and then points its solar arrays at the Sun to recover power. Then, it orients its tail toward the Sun for thermal stability before attempting to re-establish communication with Earth. During such a failure, the crew can also take manual action to configure life support systems or don space suits.==
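The dead-bus recovery sequence quoted above is a strict priority ordering, which can be modeled as a simple state machine. The step names are mine; only the order (stabilize, then power, then thermal, then comms) comes from the article.

```python
# Sketch of the dead-bus recovery sequence as an ordered state machine.
# Step names are hypothetical labels for the stages the article describes.

SAFE_MODE_SEQUENCE = [
    "stabilize_attitude",    # first, the vehicle stabilizes itself
    "point_arrays_at_sun",   # then recover electrical power
    "tail_to_sun_thermal",   # then orient tail sunward for thermal stability
    "reestablish_comms",     # finally, attempt contact with Earth
]

def next_step(completed):
    """Return the next recovery step given the set of completed steps,
    or None once the whole sequence is done."""
    for step in SAFE_MODE_SEQUENCE:
        if step not in completed:
            return step
    return None
```

Notice the ordering encodes a prioritization: survival (attitude and power) before comfort (thermal) before communication.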
As someone who's made a career out of prioritizing things, I want to take a moment and say that's one of the best descriptions of feature prioritization that I've seen. While I can admit that it can be aspirational, this is the kind of clarity that a good PM / problem definition can bring about.
This is also the kind of thinking in which I find modern software development with LLMs thriving. My own workflow goes like this:
brainstorm -> design -> review -> update design -> spec -> plan -> execute chunks -> review chunks -> review against plan and design and spec -> adopt changes and then deploy
I am talking about "simple" software that I write compared to truly complex architectures in large code bases. However, as in most technology development, that core loop is the key to developing maintainable, sustainable architectures with all the trade-offs well defined and decisions well documented, so that the context needed to improve the system is available to an LLM rather than it having to guess.
Because remember: it's during generation that you start trusting a model, and that is where your own judgment and experience will likely make the better impact.