Don’t push that button: Exploring the software that flies SpaceX rockets and Starships

[Ed. note: While we take some time to rest up over the holidays and prepare for next year, we are re-publishing our top ten posts for the year. Please enjoy our favorite work this year and we’ll see you in 2022.]

Editor's note: All this week, we're running articles about the software and engineering behind SpaceX's rockets, Starships, and satellite internet. Each article covers a different part of the process. We hope you find it as exciting as we do! Check out the full series here.

Spaceflight has depended on computers from the beginning, both on the ground and in the spacecraft, and SpaceX has carried that dependence to a new level. We recently spoke with Steven Gerding, Dragon’s software development lead, about the special challenges of developing software for SpaceX's many missions.

On April 23, 2021, SpaceX and NASA launched Dragon’s second operational mission (Crew-2) to the International Space Station, the first human spaceflight mission to fly astronauts on a flight-proven Falcon 9 and Dragon. Approximately 24 hours later, Dragon autonomously docked with the station; it was the first time two Crew Dragons were attached simultaneously to the orbiting laboratory. This marks the beginning of a new era for SpaceX, one where it will aim to routinely fly astronauts to the ISS.

The actual work of software development by vehicle engineers such as Gerding is largely done using C++, which has been the mainstay of the company’s code since its early days. The software reads text-based configuration files. "We invented simple domain specific languages to express those things, such that other engineers in the company who are not software engineers can maybe configure it."
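To make the idea of a text-based configuration concrete, here is a minimal sketch in Python. The article does not describe SpaceX's actual file format or domain-specific languages, so the `name = value` syntax, the `#` comments, the setting names, and the `parse_config` function are all hypothetical, chosen only to show how a non-programmer could adjust numeric settings without touching the flight code.

```python
from typing import Dict


def parse_config(text: str) -> Dict[str, float]:
    """Parse hypothetical 'name = value' lines into a dict of numeric settings.

    Blank lines and anything after '#' are ignored. This format is an
    assumption made for illustration, not a real SpaceX format.
    """
    values: Dict[str, float] = {}
    for line_number, raw_line in enumerate(text.splitlines(), start=1):
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name, separator, value = line.partition("=")
        if not separator or not name.strip():
            raise ValueError(f"line {line_number}: expected 'name = value'")
        values[name.strip()] = float(value)  # float() raises ValueError on bad numbers
    return values


example = """
# Hypothetical limits an engineer might tune
cabin_pressure_low_kpa = 95.0
co2_high_percent = 0.5   # alarm threshold
"""
settings = parse_config(example)
```

Rejecting malformed lines loudly, with a line number, is one way to keep a configuration mistake from silently turning into a wrong setting.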

Flight software for rockets at SpaceX is structured around the concept of a control cycle. “You read all of your inputs: sensors that we read in through an ADC, packets from the network, data from an IMU, updates from a star tracker or guidance sensor, commands from the ground,” explains Gerding. “You do some processing of those to determine your state, like where you are in the world or the status of the life support system. That determines your outputs – you write those, wait until the next tick of the clock, and then do the whole thing over again."
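The loop Gerding describes can be sketched in a few lines. This is an illustration of the read, process, write, wait pattern only, written in Python for brevity rather than the C++ the team uses; the function name `run_control_cycle`, its parameters, and the injectable `clock` and `sleep` hooks are assumptions made for this example and do not come from SpaceX.

```python
import time
from typing import Callable, Dict

Inputs = Dict[str, float]
Outputs = Dict[str, float]


def run_control_cycle(
    read_inputs: Callable[[], Inputs],
    compute_outputs: Callable[[Inputs], Outputs],
    write_outputs: Callable[[Outputs], None],
    rate_hz: float,
    cycles: int,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Run a fixed number of read/process/write cycles at rate_hz."""
    period = 1.0 / rate_hz
    next_tick = clock() + period
    for _ in range(cycles):
        inputs = read_inputs()             # sensors, network packets, commands
        outputs = compute_outputs(inputs)  # estimate state, decide what to do
        write_outputs(outputs)             # drive actuators, emit telemetry
        remaining = next_tick - clock()
        if remaining > 0:
            sleep(remaining)               # wait for the next tick of the clock
        next_tick += period                # schedule against the clock, not "now"
```

Advancing `next_tick` by a fixed period, instead of sleeping a fixed amount after each pass, keeps the cycle from drifting when the processing time varies. The `clock` and `sleep` parameters exist so the loop can be exercised without real delays.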

The control cycle highlights some of the performance requirements of the software. "On Dragon, some computers run [the control cycle] at 50 Hertz and some run at 10 Hertz. The main flight computer runs at 10 Hertz. That's managing the overall mission and sending commands to the other computers. Some of those need to react faster to certain events, so those run at 50 Hertz."
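Those rates translate directly into time budgets. The short Python sketch below works out the arithmetic; the helper names are made up for this example.

```python
def period_ms(rate_hz: float) -> float:
    """Length of one control cycle in milliseconds."""
    return 1000.0 / rate_hz


def fast_cycles_per_slow_cycle(fast_hz: float, slow_hz: float) -> float:
    """How many cycles the faster computer completes per slower cycle."""
    return fast_hz / slow_hz


fast_period = period_ms(50)                  # 20.0 ms per cycle
slow_period = period_ms(10)                  # 100.0 ms per cycle
ratio = fast_cycles_per_slow_cycle(50, 10)   # 5.0
```

So a 50 Hertz computer has 20 milliseconds to read, process, and write, and it completes five cycles for every one on the 10 Hertz main flight computer.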

There is a wide variety of machines talking to the central flight system. "We have inputs from sensors all over the vehicle, all kinds of different sensors.” Many are measuring internal values critical to the health of the ship and crew. “Temperatures are important. For crewed vehicles, we have oxygen and carbon dioxide sensors, cabin pressure sensors and things like that."

Another set of sensors looks externally to aid in navigation and telemetry. "That would be like the IMU, GPS, and star trackers." Once they are close enough to the space station, they also use laser range finders.

The other side of the control cycle is the outputs. "There are two different types of outputs. One is to actually ‘open or close a valve’ or ‘turn a switch on or off.’ The other one is telemetry, which is basically a stream of key-value pairs that, every 20 to 100 milliseconds, tell you the value of a certain thing."
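A stream of key-value pairs is simple to model. The Python sketch below is one possible representation, not SpaceX's telemetry format: the `TelemetrySample` type, the channel names, and the helper functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Union

Value = Union[int, float, str, bool]


@dataclass(frozen=True)
class TelemetrySample:
    time_ms: int   # when the value was sampled
    key: str       # channel name, e.g. "cabin_pressure_kpa" (made up)
    value: Value   # raw sensor reading or computed value


def emit_samples(time_ms: int, channels: Dict[str, Value]) -> Iterator[TelemetrySample]:
    """Turn one cycle's channel values into a run of key-value samples."""
    for key, value in channels.items():
        yield TelemetrySample(time_ms, key, value)


def latest_values(stream: Iterable[TelemetrySample]) -> Dict[str, Value]:
    """What an operator display needs: the most recent value per key."""
    latest: Dict[str, Value] = {}
    for sample in stream:
        latest[sample.key] = sample.value
    return latest


stream = list(emit_samples(0, {"cabin_pressure_kpa": 101.2, "state": "COAST"}))
stream += list(emit_samples(20, {"cabin_pressure_kpa": 101.3}))
display = latest_values(stream)
```

Because every sample carries its own key and timestamp, a downlink can carry any subset of channels and the ground side can still reconstruct the latest known value of each one.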

Sometimes the results come directly from the sensors as raw data. But other times processing is involved. "It can be some kind of computed value from the software, like the current value for our state machine or the result of an algorithm that's going to drive an output."

When the vehicle is on the ground, the data goes over a hardwired connection that provides a high data rate. “Once it lifts off, there are different communication systems where we can pipe varying subsets of that telemetry down to the ground.” On the ground, systems let operators look at the instantaneous values and make decisions about commanding the vehicle. There's also a system that stores critical data for posterity, something that is quite important when you plan to reuse boosters and spacecraft on future missions.

Dragon currently docks autonomously with the International Space Station, and the ultimate goal is for the vehicle to be fully autonomous. “We do have the ability for the astronauts to take control and steer the vehicle if needed – that was a capability we demonstrated on the Dragon Demo-2 mission,” said Gerding.

We asked what happens if there's a malfunction. "It's more obvious, I guess, what to do when there are hardware failures. We have copies of hardware, whether it's the computer hardware or the sensors or actuators, and so we detect those failures and kind of route around them.”
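One common way to route around a failed sensor when there are redundant copies is to discard readings known to be bad and take the median of the rest. The Python sketch below shows that general technique; the article does not say which scheme SpaceX uses, and `select_reading` and the sample values are invented for illustration.

```python
from statistics import median
from typing import Optional, Sequence


def select_reading(readings: Sequence[Optional[float]]) -> Optional[float]:
    """Pick one value from redundant sensors.

    None marks a sensor already detected as failed. The median of the
    remaining readings ignores a single wild value among three.
    """
    healthy = [reading for reading in readings if reading is not None]
    if not healthy:
        return None  # no usable sensor left; the caller must handle this
    return median(healthy)


one_wild = select_reading([20.1, 20.3, 95.0])     # the 95.0 outlier is outvoted
one_dead = select_reading([20.1, None, 20.3])     # falls back to the two survivors
all_dead = select_reading([None, None, None])
```

With three readings the median tolerates one sensor that is wrong without announcing it, which is harder to catch than a sensor that has simply stopped responding.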

Gerding points out that there's no way to protect against any arbitrary software bug. “We try to design the software in a way that if it were to fail, the impact of that failure is minimal.” For example, if a software error were to crop up in the propulsion system, that wouldn't affect the life support system or the guidance system's ability to steer the spacecraft, and vice versa. “Isolating the different subsystems is key.”
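As a rough illustration of that isolation, the Python sketch below steps several subsystems in turn and contains a failure in any one of them while still reporting it. Python exceptions stand in for whatever fault-containment mechanism the real C++ code uses, and the `step_all` function and subsystem names are assumptions for this example.

```python
from typing import Callable, Dict, List, Tuple

Telemetry = List[Tuple[str, str]]


def step_all(subsystems: Dict[str, Callable[[], None]], telemetry: Telemetry) -> None:
    """Step every subsystem once; a failure in one does not stop the others."""
    for name, step in subsystems.items():
        try:
            step()
            telemetry.append((f"{name}.status", "ok"))
        except Exception as error:
            # Contain the fault, but make sure it is visible on the ground.
            telemetry.append((f"{name}.status", f"fault: {error}"))


ran: List[str] = []


def propulsion() -> None:
    raise RuntimeError("valve timeout")  # simulated software error


def life_support() -> None:
    ran.append("life_support")


def guidance() -> None:
    ran.append("guidance")


telemetry: Telemetry = []
step_all(
    {"propulsion": propulsion, "life_support": life_support, "guidance": guidance},
    telemetry,
)
```

Here the simulated propulsion error is recorded as a fault, and life support and guidance still run in the same cycle.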

The software is designed defensively, such that even within a component, SpaceX tries to isolate the effects of errors. “We're always checking error codes and return values. We also have the ability for operators or the crew to override different aspects of the algorithm."
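The sketch below shows, in Python, what checking a status code and honoring an override might look like in miniature. Everything specific in it is hypothetical: the `Status` codes, the pressure limits, the vent valve, and the choice of "closed" as the fallback are assumptions made for illustration, not details from SpaceX.

```python
from enum import Enum
from typing import Optional, Tuple


class Status(Enum):
    OK = 0
    SENSOR_OUT_OF_RANGE = 1


def read_cabin_pressure_kpa(raw: float) -> Tuple[Status, float]:
    """Return a status code alongside the value instead of trusting the input."""
    if not 0.0 <= raw <= 200.0:  # hypothetical plausibility limits
        return Status.SENSOR_OUT_OF_RANGE, 0.0
    return Status.OK, raw


def choose_vent_valve_open(raw_pressure: float,
                           operator_override: Optional[bool] = None) -> bool:
    """Decide whether a hypothetical vent valve should be open."""
    if operator_override is not None:
        return operator_override       # operators or crew can override the algorithm
    status, pressure = read_cabin_pressure_kpa(raw_pressure)
    if status is not Status.OK:
        return False                   # assumed fallback when the reading is unusable
    return pressure > 110.0            # hypothetical over-pressure threshold


nominal = choose_vent_valve_open(105.0)
over_pressure = choose_vent_valve_open(120.0)
bad_sensor = choose_vent_valve_open(-5.0)
forced_open = choose_vent_valve_open(-5.0, operator_override=True)
```

The point of the pattern is that the caller never uses a value without first looking at the status that came with it, and a human decision takes precedence over the computed one.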

A big part of the total software development process is verification and validation. "Writing the software is some small percentage of what actually goes into getting it ready to fly on the space vehicle."

With the first demonstration mission (Demo-1) that went to the space station, the software was required by NASA to be tolerant to any two faults in the system. “We implemented this triple string computer architecture and we needed the system to drive it.” Gerding had some distributed systems experience from working at Google previously, making him a good fit for the new task. “There were only 10 people on the software team at that time. I picked it up and went with it. I find that kind of stuff, distributed systems, really interesting.”

Uptime requirements were treated differently at Google. "You would really want your process to fail, if something anomalous happened. It was one of thousands of similar processes which would then be restarted. If you got enough of those failures, you would be paged and could spend some time figuring out what the problem was and building a solution to address it."

At Google, these mishaps were a useful signal among the noise. But that approach doesn’t work for crewed rockets. "At SpaceX we really don’t want our processes to fail as a result of a software failure. We'd rather just continue with the rest of the software that actually isn't impacted by that failure. We still need to know about that failure and that's where the telemetry factors in, but we want things to keep going, controlling it the best that we can."

There is a lot more work that goes into crafting the code that put Baby Yoda into space last November. We'll have another article on SpaceX's space-based internet satellites, Starlink, tomorrow. If you want to learn more about what it’s like to work as a vehicle engineer at SpaceX, check out their careers page.

Part two of our Software in Space series is now live: Building a Space Based ISP
