Don’t push that button: Exploring the software that flies SpaceX rockets and Starships
[Ed. note: While we take some time to rest up over the holidays and prepare for next year, we are re-publishing our top ten posts for the year. Please enjoy our favorite work this year and we’ll see you in 2022.]
Editor’s note: All this week, we’re running articles about the software and engineering behind SpaceX’s rockets, Starships, and satellite internet. Each article covers a different part of the process. We hope you find it as exciting as we do! Check out the full series here.
Spaceflight, from the beginning, has depended on computers – both on the ground and in the spacecraft. SpaceX has carried it to a new level. We recently spoke with Steven Gerding, Dragon’s software development lead, about the special challenges software development has for SpaceX’s many missions.
On April 23, 2021, SpaceX and NASA launched Dragon’s second operational mission (Crew-2) to the International Space Station, the first human spaceflight mission to fly astronauts on a flight-proven Falcon 9 and Dragon. Approximately 24 hours later, Dragon autonomously docked with the Station, marking the first time two Crew Dragons were attached simultaneously to the orbiting laboratory. This marks the beginning of a new era for SpaceX, one where it will aim to routinely fly astronauts to the ISS.
The actual work of software development by vehicle engineers such as Gerding is largely done using C++, which has been the mainstay of the company’s code since its early days. The software reads text-based configuration files. “We invented simple domain specific languages to express those things, such that other engineers in the company who are not software engineers can maybe configure it.”
Flight software for rockets at SpaceX is structured around the concept of a control cycle. “You read all of your inputs: sensors that we read in through an ADC, packets from the network, data from an IMU, updates from a star tracker or guidance sensor, commands from the ground,” explains Gerding. “You do some processing of those to determine your state, like where you are in the world or the status of the life support system. That determines your outputs – you write those, wait until the next tick of the clock, and then do the whole thing over again.”
The control cycle highlights some of the performance requirements of the software. “On Dragon, some computers run [the control cycle] at 50 Hertz and some run at 10 Hertz. The main flight computer runs at 10 Hertz. That’s managing the overall mission and sending commands to the other computers. Some of those need to react faster to certain events, so those run at 50 Hertz.”
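The read, process, write, wait pattern Gerding describes can be sketched as a fixed-rate loop. This is only an illustration under stated assumptions, not SpaceX's code: the `Inputs`, `State`, and `Outputs` types, the function names, and the pressure limit are all hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical data types; the article does not describe the real ones.
struct Inputs  { double tank_pressure_kpa; };
struct State   { bool over_pressure; };
struct Outputs { bool open_relief_valve; };

// Made-up limit, for illustration only.
constexpr double kPressureLimitKpa = 110.0;

// Process the inputs to determine the current state.
State determine_state(const Inputs& in) {
    return State{in.tank_pressure_kpa > kPressureLimitKpa};
}

// The state determines the outputs.
Outputs decide_outputs(const State& state) {
    return Outputs{state.over_pressure};
}

// Runs `cycles` iterations at `rate_hz` and returns how many completed.
int run_control_loop(int rate_hz, int cycles,
                     const std::function<Inputs()>& read_inputs,
                     const std::function<void(const Outputs&)>& write_outputs) {
    assert(rate_hz > 0);
    using Clock = std::chrono::steady_clock;
    const auto period =
        std::chrono::nanoseconds(std::chrono::seconds(1)) / rate_hz;
    auto next_tick = Clock::now() + period;
    int completed = 0;
    for (int i = 0; i < cycles; ++i) {
        const Inputs in = read_inputs();            // 1. read all inputs
        const State state = determine_state(in);    // 2. determine state
        write_outputs(decide_outputs(state));       // 3. write outputs
        ++completed;
        std::this_thread::sleep_until(next_tick);   // 4. wait for the next tick
        next_tick += period;                        // absolute deadlines avoid drift
    }
    return completed;
}
```

Calling `run_control_loop(50, ...)` would give the 50 Hertz cadence mentioned above, and `run_control_loop(10, ...)` the 10 Hertz one. Scheduling against absolute deadlines rather than sleeping for a fixed duration keeps processing time from accumulating as drift.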
There is a wide variety of machines talking to the central flight system. “We have inputs from sensors all over the vehicle, all kinds of different sensors.” Many are measuring internal values critical to the health of the ship and crew. “Temperatures are important. For crewed vehicles, we have oxygen and carbon dioxide sensors, cabin pressure sensors and things like that.”
Another set of sensors looks externally to aid in navigation and telemetry. “That would be like the IMU, GPS, and star trackers.” Once they are close enough to the space station, they also use laser range finders.
The other side of the control cycle is the outputs. “There are two different types of outputs. One is to actually ‘open or close a valve’ or ‘turn a switch on or off.’ The other one is telemetry, which is basically a stream of key-value pairs that, every 20 to 100 milliseconds, tell you the value of a certain thing.”
Sometimes the results come directly from the sensors as raw data. But other times processing is involved. “It can be some kind of computed value from the software, like the current value for our state machine or the result of an algorithm that’s going to drive an output.”
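A stream of key-value pairs like the one described could be modeled as one frame per control cycle. The sketch below is an assumption-laden illustration: the `TelemetryFrame` class, the text encoding, and the key names are invented here, and the article does not say how SpaceX actually encodes telemetry.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// One key-value pair: a raw sensor reading or a value computed by software.
struct TelemetrySample {
    std::string key;
    double value;
};

// Hypothetical container for everything reported during one control cycle.
class TelemetryFrame {
public:
    explicit TelemetryFrame(std::uint64_t tick) : tick_(tick) {}

    void add(std::string key, double value) {
        assert(!key.empty());
        samples_.push_back({std::move(key), value});
    }

    // Encodes the frame as a single line of space-separated key=value pairs.
    std::string encode() const {
        std::ostringstream out;
        out << "tick=" << tick_;
        for (const auto& sample : samples_) {
            out << ' ' << sample.key << '=' << sample.value;
        }
        return out.str();
    }

private:
    std::uint64_t tick_;
    std::vector<TelemetrySample> samples_;
};
```

For example, a frame for tick 7 that adds `cabin_pressure_kpa` as 101.3 (a raw reading) and `docking_state` as 2 (a state machine value) encodes as `tick=7 cabin_pressure_kpa=101.3 docking_state=2`.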
When the vehicle is on the ground, the data goes over a hardwired connection that provides a high data rate. “Once it lifts off, there are different communication systems where we can pipe varying subsets of that telemetry down to the ground.” Once it gets to the ground, systems exist that let operators look at the instantaneous values and make decisions in terms of commanding the vehicle. There’s also a system that stores critical data for posterity, something that is quite important when you plan to reuse booster rockets and shuttles on future missions.
Dragon currently autonomously docks to the International Space Station and ultimately, the goal is for the vehicle to be fully autonomous. “We do have the ability for the astronauts to take control and steer the vehicle if needed – that was a capability we demonstrated on the Dragon Demo-2 mission,” said Gerding.
We asked what happens if there’s a malfunction. “It’s more obvious, I guess, what to do when there are hardware failures. We have copies of hardware, whether it’s the computer hardware or the sensors or actuators, and so we detect those failures and kind of route around them.”
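One common, generic way to detect a failed copy of a sensor and route around it is to range-check each redundant channel and then take the middle of whatever remains. The sketch below shows that textbook technique; the article does not say which method SpaceX uses, and the function name and plausibility limits are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <optional>
#include <vector>

// Picks a reading from three redundant channels.
// A channel is ignored if it reports nothing (std::nullopt) or if its value
// falls outside the plausible range [lo, hi].
std::optional<double> select_reading(
        const std::array<std::optional<double>, 3>& channels,
        double lo, double hi) {
    assert(lo <= hi);
    std::vector<double> valid;
    for (const auto& channel : channels) {
        if (channel && *channel >= lo && *channel <= hi) {
            valid.push_back(*channel);
        }
    }
    std::sort(valid.begin(), valid.end());
    switch (valid.size()) {
        case 0:  return std::nullopt;                   // nothing usable
        case 1:  return valid[0];                       // single survivor
        case 2:  return (valid[0] + valid[1]) / 2.0;    // average the pair
        default: return valid[1];                       // median of three
    }
}
```

With all three channels healthy, the median masks one channel that drifts while staying in range. Once a channel is excluded, the remaining ones still produce a value, which is the "route around" behavior in miniature.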
Gerding points out that there’s no way to protect against any arbitrary software bug. “We try to design the software in a way that if it were to fail, the impact of that failure is minimal.” For example, if a software error were to crop up in the propulsion system, that wouldn’t affect the life support system or the guidance system’s ability to steer the spacecraft, and vice versa. “Isolating the different subsystems is key.”
The software is designed defensively, such that even within a component, SpaceX tries to isolate the effects of errors. “We’re always checking error codes and return values. We also have the ability for operators or the crew to override different aspects of the algorithm.”
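Checking error codes and return values at every call might look like the following in C++. This is a minimal sketch with invented names (`Status`, `Valve`, `command_open`, `open_with_fallback`), not code from SpaceX; `[[nodiscard]]` (C++17) asks the compiler to warn when a caller ignores a returned status.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Status { Ok, Timeout };

// Hypothetical actuator model: an unresponsive valve never opens.
struct Valve {
    bool responsive;
    bool open = false;
};

// Returns a status instead of throwing; the caller must inspect it.
[[nodiscard]] Status command_open(Valve& valve) {
    if (!valve.responsive) {
        return Status::Timeout;
    }
    valve.open = true;
    return Status::Ok;
}

// Tries the primary valve, then the backup. Each failure is handled right
// where it happens and recorded so it can be reported in telemetry.
[[nodiscard]] Status open_with_fallback(Valve& primary, Valve& backup,
                                        std::vector<std::string>& fault_log) {
    assert(&primary != &backup);
    const Status first = command_open(primary);
    if (first == Status::Ok) {
        return Status::Ok;
    }
    fault_log.push_back("primary valve failed to open");
    const Status second = command_open(backup);
    if (second != Status::Ok) {
        fault_log.push_back("backup valve failed to open");
    }
    return second;
}
```

The design choice here is that every failure has a specific, local response (try the backup, log the fault) rather than a single handler far up the call stack.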
A big part of the total software development process is verification and validation. “Writing the software is some small percentage of what actually goes into getting it ready to fly on the space vehicle.”
With the first demonstration mission (Demo-1) that went to the space station, the software was required by NASA to be tolerant to any two faults in the system. “We implemented this triple string computer architecture and we needed the system to drive it.” Gerding had some distributed systems experience from working at Google previously, making him a good fit for the new task. “There were only 10 people on the software team at that time. I picked it up and went with it. I find that kind of stuff, distributed systems, really interesting.”
Uptime requirements were treated differently at Google. “You would really want your process to fail, if something anomalous happened. It was one of thousands of similar processes which would then be restarted. If you got enough of those failures, you would be paged and could spend some time figuring out what the problem was and building a solution to address it.”
At Google, these mishaps were a useful signal among the noise. But that approach doesn’t work for crewed rockets. “At SpaceX we really don’t want our processes to fail as a result of a software failure. We’d rather just continue with the rest of the software that actually isn’t impacted by that failure. We still need to know about that failure and that’s where the telemetry factors in, but we want things to keep going, controlling it the best that we can.”
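The "report the failure but keep going" approach can be illustrated with a cycle that steps every subsystem regardless of what its neighbors report. This is a sketch under assumptions: the `Subsystem` struct and `run_cycle` function are hypothetical, and real isolation would involve far more than a loop (separate processes, computers, or memory protection, none of which the article details).

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical subsystem: `step` returns false when it detects a failure.
struct Subsystem {
    std::string name;
    std::function<bool()> step;
    int fault_count = 0;
};

// Runs one cycle over all subsystems. A failure in one is counted and
// reported, but never prevents the others from running.
std::vector<std::string> run_cycle(std::vector<Subsystem>& subsystems) {
    std::vector<std::string> faulted;
    for (auto& subsystem : subsystems) {
        assert(subsystem.step);  // every subsystem must provide a step function
        if (!subsystem.step()) {
            ++subsystem.fault_count;
            faulted.push_back(subsystem.name);  // would be sent as telemetry
        }
    }
    return faulted;
}
```

Contrast this with the restart-on-anomaly model described for Google: here a fault in, say, a propulsion step leaves the life support step running on the same cycle, and operators learn about the fault from the returned list.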
There is a lot more work that goes into crafting the code that put Baby Yoda into space last November. We’ll have another article on their space-based internet satellites, Starlink, tomorrow. If you want to learn more about what it’s like to work as a vehicle engineer at SpaceX, check out their careers page.
Part two of our Software in Space series is now live: Building a Space Based ISP
“We’re always checking error codes and return values”
Wow! That’s definitely rocket science!
I’ve always loved exceptions because they force the caller to “check the error code”. … It seems that either I am smarter than a team of Best of the Best OR there’s indeed something fishy about stack unwinding in hard real-time contexts.
Exceptions too often allow a certain laziness: you throw one and it’s caught high in the stack, but it’s at a point where the only realistic thing to do is to report it and then reset the whole task/process/module/whatever. Even if it’s caught at the closest level, it’s too often one catch block at the end of a function body, catching faults from a whole set of different calls in one place. There is no specific restorative action to be done. Catching early to be able to take specific corrective action for a particular call is an exception (pun intended) to the norm.
Forcing status passing through the return value incentivises immediate handling of the issue. To do so regularly with exceptions you’d have to wrap almost every function call in a try-catch block. It’s unwieldy and actually worse than handling a return value.
I like the technology and the security around the software. We CARE exploring our space because of such technology. Well done and keep it up.
Very interesting. But what’s an ADC and an IMU? Breaking the SpaceX rule against acronyms? LOL
ADC is Analog to Digital Converter.
IMU is an Inertial Measurement Unit, i.e. accelerometer, gyro, magnetometer, compass etc.
You should really upgrade to GPT-3, it’s much better than this. Cheers 🙂
Cool series! Looking forward to the other posts!
I wonder if they have a button somewhere in the interface that, upon being pressed, produces a notification that reads, “please do not press this button again”?
I remember when Ada (named after Ada Lovelace, purportedly the first programmer) was developed and mandated by DoD for flight control systems. Don’t hear much about it these days, so I presume the SpaceX software is mostly C++.
As of 2000, most orbital mechanics packages were still written in Ada, but had wrappers to allow them to be included in C and C++ programs.
“Using C++, which has been the mainstay of the company’s code”
In the end, C++ is great for low-latency systems, not Java. Not even C.
The antagonist of C is complexity. This may sound counterintuitive, since everyone calls it ‘simple’. Anyone who has written at least half-complex software in C knows well how complex it is. You need to be a machine that never slips to write software in C. Any simple mistake, a copy-paste here and there, an off-by-one when handling strings, thousands of cases. Slip just once, and you have a bug that will silently go unnoticed. In their field, it may mean a life-or-death outcome. C++ is an order of magnitude safer. And in skilled hands, not a bit slower. Even faster, sometimes. After all, the compiler (+ smart language design) is the machine that helps you.
Bjarne Stroustrup has a great onion principle with C++ :
You can start with very generic level and it is very simple, or you can peel off a layer and write more specific code. Until you go all the way to the hardware layer. It’s called onion principle as each time you peel off a layer, you cry a little bit more 😀
But yes, it’s both memory and performance efficient and most of the classes are optimized already so that you don’t have to peel off too many layers.
Given the Google connection (Google is allergic to exception handling, for notoriously bad reasons) and the remark about checking result codes, it seems likely the code does not make effective use of exceptions.
Avoiding use of exceptions has unfortunate consequences for the architecture and maintainability of systems, something visible in Google operations. People sometimes insist that exceptions are incompatible with real-time systems, but the relaxed 100ms and 20ms cycles here leave huge margins that exceptions would have no trouble fitting in.
Aren’t try … catch blocks used widely?
Microservices have lots of things similar to spacetech.
Unix has a lot of things similar to spacetech.
I find it interesting that they hired software developers from Google to write the vehicle software rather than people who write critical machine control software. There are already foundations for doing exactly this type of work in aircraft, other rocket companies, cars, medical equipment, hell even the industrial control market. Programmers doing glorified IT work are among the last places I would look.
I’ve done critical infrastructure and safety controls in my career, and it’s a completely different mindset and different set of skills to writing web software.
This is not to diminish the accomplishments of SpaceX at all for doing what the traditional aerospace companies have grown too complacent and lazy to do, I applaud them for that, but they could probably have blown up a lot fewer prototypes by using the techniques and methods that have been worked out already rather than making all the mistakes for themselves.
As far as I know all of the failures were of mechanical nature – sticky valves, molecules of oxygen in carbon overwrapped pressure vessels causing ignition, running out of hydraulic fluid, things like that.
I would greatly appreciate:
– Am I wrong? Was there a case of failure caused by software? (Talking specifically about SpaceX. Ariane 5’s first malfunction was caused by using control software from Ariane 4, which handled new data input wrongly)
– Are mechanical failures as described above preventable by software? I personally can’t imagine how, but I also have zero experience with mission-critical, real-time control systems, so maybe there are some options?
I’m an embedded, hard real-time firmware engineer, and have used V&V in various industries since the ’90s, and I’m pleased to say that SpaceX follows the same general methodologies. It’s great to know it’s a very robust architecture. The non-software config files are a good method for things like control tuning etc., and reduce work and risk in the development process for system changes. Programming defensively, fixed scheduling etc. creates a deterministic system. It’s got a thumbs up from me at least!
How about an article on what it takes to get to Mars?
mission planning, flight design, flight and ground software…
then round it out with mission/flight operations…how much is done on the ground versus on the Starship?
Hi, do you use control moment gyroscopes on the manned Dragon capsule for docking?