Testing software so it’s reliable enough for space
We’ve talked about the engineers who write the code that operates SpaceX spaceships. Now let’s talk about the people who build and maintain the tools and processes that enable the developers and, ultimately, help accomplish the mission of flying astronauts to space.
Stack Overflow talked with Erin Ishimoticha, an engineer in the Software Delivery Engineering group from the Choctaw Nation of Oklahoma. Ishimoticha, a full-time engineer for 15 years, started her career with shell and Perl scripting and has now been at SpaceX for about two years.
Check, check, and check again
Software Delivery Engineering’s job, per Ishimoticha, is to coordinate good software development and testing practices across SpaceX, ensuring that everyone writing code for spacecraft uses proper version control methods and that the code undergoes automated and human testing managed in a continuous integration (CI) system.
“We develop and maintain our own CI system,” she said. “We have a web service that runs reports – it ingests telemetry from both software and hardware tests, builds graphs, and has its own assertions that it runs on the data, producing a report on how the software is performing.”
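As a rough illustration of what "assertions that run on the data" could look like, here is a minimal Python sketch. It is not SpaceX's reporting service: the `Assertion` class, the `run_report` function, the channel names, and the limits are all hypothetical, invented only to show the idea of evaluating named checks against ingested telemetry and summarizing the results.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Assertion:
    """A named check applied to one telemetry channel (hypothetical structure)."""
    name: str
    channel: str
    check: Callable[[List[float]], bool]


def run_report(telemetry: Dict[str, List[float]],
               assertions: List[Assertion]) -> Dict[str, str]:
    """Evaluate every assertion and return a name -> PASS/FAIL/MISSING mapping."""
    report = {}
    for a in assertions:
        samples = telemetry.get(a.channel)
        if not samples:
            # Missing or empty data is reported separately, never counted as a pass.
            report[a.name] = "MISSING"
        else:
            report[a.name] = "PASS" if a.check(samples) else "FAIL"
    return report


# Made-up telemetry and limits, purely for illustration.
telemetry = {
    "bus_voltage": [27.9, 28.1, 28.0],
    "cpu_temp_c": [41.0, 55.5, 90.2],
}
assertions = [
    Assertion("voltage within limits", "bus_voltage",
              lambda xs: all(26.0 <= x <= 30.0 for x in xs)),
    Assertion("cpu temperature below 85 C", "cpu_temp_c",
              lambda xs: max(xs) < 85.0),
    Assertion("tank pressure reported", "tank_pressure",
              lambda xs: len(xs) > 0),
]

report = run_report(telemetry, assertions)
for name, status in report.items():
    print(f"{status:8} {name}")
```

Treating a missing channel as its own status, rather than as a silent pass, is a design choice of this sketch. A real reporting service would presumably also store the results and render graphs, which is omitted here.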
This means that folks in Software Delivery Engineering are doing development, testing, and DevOps, with a team of about 15 engineers, including a dedicated Software Reliability Engineering (SRE) team.
The quality control for spacecraft software is different from that of normal enterprise or consumer applications. The requirements are pretty high. “You can look at NASA’s software classifications – there’s a lot of publicly available information about that. We do something similar. There are several different quality standards that apply to software ranging from CI tools to the flight software which provides safety critical functions for the crew.”
Because of that, “there’s a lot of extra stuff you have to do during the development process before you’re allowed to merge,” Ishimoticha says. “We have a lot of checking and double-checking. We have pull request review conditions that have to be met, even post-merge testing to make sure that changes that were made while a pull request was in flight don’t interfere with the change we just merged.”
Watching each other’s back
The development process seems to be agile in spirit, with test-driven design and multiple engineers involved in each code change.
“Most of the time we work with the concept of a responsible engineer, who takes a ticket – an issue – off the backlog.” The responsible engineer then works the issue from understanding the problem through to completion, possibly even deploying the new software.
“If I take a ticket, I will first understand the problem and try to reproduce it. Then I design the fix or feature, implement the fix, and ensure test coverage and functional testing in an isolated environment. Then I issue a pull request. It’s my responsibility to find someone to review the PR.”
The process isn’t finished when the review is complete. “When the PR is merged, it’s my responsibility to ensure the next test run on master passes in CI. And then we start on verification.”
The CI environment is based on HTCondor, a workload management system for compute-intensive jobs that originated with the High Throughput Computing Group at the University of Wisconsin-Madison. It’s prized at SpaceX for its powerful queueing, job prioritization, and resource management capabilities – particularly for the HITL testbeds (more on those later).
Condor manages workloads much like a traditional batch system, but can make better use of idle computer resources. Ishimoticha says, “We run about a million CI builds a month.”
Managing builds and rockets on the table
The platform is built around PostgreSQL to manage the metadata about the builds, test results, and other artifacts, along with a lot of Python and C/C++.
Docker is used heavily, along with a little bit of Kubernetes. “Docker is used a lot for ephemeral builds.” By using Docker, they can ensure each build runs in a clean environment. “Being able to put those jobs inside Docker means that we can throw away side effects every time, which is wonderful.”
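To make the "throw away side effects" idea concrete, here is a small Python sketch that assembles a `docker run` command for a disposable build container. It is only an assumption about how such a job might be launched, not a description of SpaceX's setup: the `ephemeral_build_command` helper, the image name, the workspace path, and the build command are all hypothetical. The `--rm`, `--volume`, and `--workdir` flags are standard Docker CLI options.

```python
import shlex
from typing import List


def ephemeral_build_command(image: str, workspace: str,
                            build_cmd: List[str]) -> List[str]:
    """Return a docker invocation whose container is deleted when the job exits."""
    return [
        "docker", "run",
        "--rm",                              # remove the container once it exits
        "--volume", f"{workspace}:/src:ro",  # mount the checkout read-only
        "--workdir", "/src",
        image,
        *build_cmd,
    ]


# Hypothetical image, path, and build command, for illustration only.
cmd = ephemeral_build_command(
    image="example/build-env:latest",
    workspace="/tmp/checkout",
    build_cmd=["make", "test"],
)
print(shlex.join(cmd))
# To actually run it (requires Docker): subprocess.run(cmd, check=True)
```

Because the container is removed on exit and the source is mounted read-only, anything the job writes inside the container disappears with it. A real build would need a writable scratch directory inside the container and a way to export the artifacts it wants to keep.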
We wondered about how Software Delivery Engineering works with the hardware. “We don’t work with the vehicle hardware as much as the data center hardware,” said Ishimoticha. “We rebuild worker systems and add hardware to the CI system all the time. We have about 550 small, medium, and large workers running different kinds of jobs. A small worker has two cores, while our large workers have 28 cores and 32 gigabytes of memory.”
There is one cool part of the job that involves playing with actual hardware. “We also have a ton of testbeds connected to the CI system. We call these tests HITLs, pronounced ‘hittles’, meaning hardware-in-the-loop. These are one-off, custom-built, literally all the guts of the rocket laid out on the table. It doesn’t have any fuel tanks and it doesn’t have any engines, but it’s got, you know, all the flight computers and sensors.” She gave a laugh. “We have one for the seats on Dragon 2 and it’s got the actuators, we can move the seats around.”
You don’t have to be a rocket scientist to become a test engineer, but it’s pretty fun being a test engineer who gets to work with actual, physical rockets.
If you want to learn more about working in software at SpaceX, check out their careers page. You can also check out the other blog posts in this series.
Tags: software in space, spacex, testing
This is about the dream, not the reality.
Hmm. I remember reading something about the quality control process for the space shuttle. A lot of it was about the attitude of the people, and the environment they worked in. Regular hours. Encouraged to take vacations. People who wanted structure and work/life balance. Content to work at the same sort of thing for years at a time.
I wonder to what extent their Python is type-annotated.
Please see that people at Boeing read this. The 737 Max issues, combined with their rocket delays, are a cause for great concern.
Given how important software and computing is to infrastructure (I’m looking at YOU Colonial Pipeline), why isn’t all software required to undergo this level of safety testing?
When are we going to start holding software companies responsible for the hacks their unsafe coding practices allow?
I hope they’re not just coding and testing to the “happy path” where all the data is valid, all functions return successfully and to spec. I’ve noticed in the past dozen years or so, naive programmers tend to trust documentation, trust the users to behave, and trust that data glitches never happen, and then testers assume if they test to the specifications and the tests pass, that the software is good to ship. I learned early on, if your code lets the user enter bad data or otherwise misbehave, they will, and if your code doesn’t do enough error checking (anathema to coders nowadays, a “waste of time and money” to bean-counting management) then when data drops or is corrupted by cosmic rays, or if the user enters “0” for a field that will be used as a denominator, and if the test team doesn’t test with bad data and bad behavior, you get things like the Boeing capsule failure. I’ve been developing software since the mid-80’s, so if you think I’m talking out my butt on this, I’m just reporting what I’ve seen with my own eyes on dozens of teams through the years.
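The point about testing beyond the happy path can be sketched in a few lines of Python. This is only an illustration of the commenter's zero-denominator example, not anything from SpaceX: `parse_denominator`, `rate`, and the list of bad inputs are hypothetical names and values chosen for the sketch.

```python
import math


def parse_denominator(raw: str) -> float:
    """Parse user input meant to be used as a divisor, rejecting anything unusable."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        raise ValueError(f"not a number: {raw!r}") from None
    if not math.isfinite(value):
        raise ValueError(f"not finite: {raw!r}")
    if value == 0:
        raise ValueError("denominator must be non-zero")
    return value


def rate(total: float, raw_denominator: str) -> float:
    """Divide total by a user-supplied value after validating it."""
    return total / parse_denominator(raw_denominator)


# Deliberately bad inputs: the tests exercise these, not just valid data.
bad_inputs = ["0", "", "abc", "nan", "inf", "-0.0"]
rejected = []
for raw in bad_inputs:
    try:
        rate(10.0, raw)
    except ValueError:
        rejected.append(raw)

print(rejected == bad_inputs)  # every bad input was refused
print(rate(10.0, "4"))         # the happy path still works
```

The design choice here is to reject bad data at the boundary with an explicit error, so that a test suite can assert on the rejection instead of discovering a crash or a silently wrong result later.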
If you found that interesting, you might be interested in following Erin on Twitter https://twitter.com/ErinIshimoticha
It blows my mind how different the building, testing, and maintenance of the custom systems they made for development, testing, and DevOps are. If you come from a startup environment, you know that most of the testing you do is fairly simple and you don’t need to build specific tools in order to do it. But here, you realize the importance of building the necessary tools to keep your system running in optimal conditions all the time. I’m also a big fan of TypeScript, so knowing they use it encourages me to keep using it 🙂
You might want to have a visit with Margaret Hamilton – the lead programmer on the Apollo flight software project.
She’s still around and has some very interesting things to say about writing ultra-reliable code.
Sounds like a very nice job of build & integration test automation for a critical system using best-in-class components. Standing up such a testing system is a very big job that is often ignored, short-changed, or bungled, so +100 for that. Most of the basic good practices are mentioned, including automated integration with HIL for components.
No discussion of the test design strategy, so I have to assume it varies from one dev to the next and probably does not achieve high coverage, either at a component or system level. I’d like to know more about what their “Software Reliability Engineering” team actually does, and to what extent it follows IEEE 1633.
“We have pull request review conditions that have to be met, even post-merge testing to make sure that changes that were made while a pull request was in flight don’t interfere with the change we just merged.”
If one implements a commit queue that all merges have to go through serially and reject all commits that do not pass the tests after being rebased on the tip of the primary branch, it’s possible to do all testing pre-merge and guarantee that every commit that gets merged is green.
It’s curious that a project where reliability is so highly valued would skimp on implementing a commit queue and rely on post-merge tests instead.