Testing software so it's reliable enough for space

We’ve talked about the engineers who write the code that operates SpaceX spaceships. Now let’s talk about the people who build and maintain the tools and processes that enable those developers and, ultimately, help accomplish the mission of flying astronauts to space.

Stack Overflow talked with Erin Ishimoticha, an engineer in the Software Delivery Engineering group who is from the Choctaw Nation of Oklahoma. Ishimoticha, a full-time engineer for 15 years, started her career with shell and Perl scripting and has been at SpaceX for about two years.

Software Delivery Engineering's job, per Ishimoticha, is to coordinate good software development and testing practices across SpaceX: ensuring that everyone writing code for spacecraft uses proper version control methods, and that the code undergoes automated and human testing managed in a continuous integration (CI) system.

"We develop and maintain our own CI system," she said. "We have a web service that runs reports – it ingests telemetry from both software and hardware tests, builds graphs, and has its own assertions that it runs on the data, producing a report on how the software is performing."

This means that folks in Software Delivery Engineering are doing development, testing, and DevOps, with a team of about 15 engineers, including a dedicated Software Reliability Engineering (SRE) team.
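
The reporting service Ishimoticha describes (ingest telemetry, run assertions on the data, produce a report) can be pictured with a small sketch. This is not SpaceX's code: the channel names, limits, and the `Assertion` and `build_report` helpers are all hypothetical, chosen only to show the shape of the idea.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# Hypothetical telemetry format: channel name -> samples from one test run.
Telemetry = dict[str, list[float]]


@dataclass
class Assertion:
    name: str
    check: Callable[[Telemetry], bool]


def build_report(telemetry: Telemetry, assertions: list[Assertion]) -> dict:
    """Run every assertion against the telemetry and summarize the results."""
    results = {a.name: a.check(telemetry) for a in assertions}
    return {"passed": all(results.values()), "results": results}


# Illustrative data and limits only; none of these values come from SpaceX.
telemetry = {
    "cpu_load_pct": [41.0, 44.5, 39.8],
    "loop_time_ms": [9.8, 10.1, 10.4],
}
assertions = [
    Assertion("mean CPU load under 80%",
              lambda t: mean(t["cpu_load_pct"]) < 80.0),
    Assertion("loop time never exceeds 10 ms",
              lambda t: max(t["loop_time_ms"]) <= 10.0),
]

report = build_report(telemetry, assertions)
print(report)
```

Here the second assertion fails because one loop-time sample is 10.4 ms, so the overall report is marked as not passed even though the CPU check succeeds.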

Quality control for spacecraft software is different from quality control for normal enterprise or consumer applications, and the requirements are pretty high. "You can look at NASA's software classifications – there's a lot of publicly available information about that. We do something similar. There are several different quality standards that apply to software ranging from CI tools to the flight software which provides safety-critical functions for the crew."

Because of that, "there's a lot of extra stuff you have to do during the development process before you're allowed to merge," Ishimoticha says. "We have a lot of checking and double-checking. We have pull request review conditions that have to be met, even post-merge testing to make sure that changes that were made while a pull request was in flight don't interfere with the change we just merged."
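
To see why post-merge testing matters, consider a minimal sketch of a merge gate. Everything here is an assumption made for illustration: the `PullRequest` fields, the `REQUIRED_APPROVALS` threshold, and the commit hashes are hypothetical, and SpaceX's actual review conditions are not public.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    approvals: int        # number of approving reviews
    ci_passed: bool       # CI result for the PR branch
    tested_against: str   # master commit the PR branch was tested on top of


REQUIRED_APPROVALS = 1  # assumption; the real threshold is not stated


def can_merge(pr: PullRequest) -> bool:
    """Pre-merge gate: CI is green and enough reviewers approved."""
    return pr.ci_passed and pr.approvals >= REQUIRED_APPROVALS


def master_moved(pr: PullRequest, master_head: str) -> bool:
    """True if master changed while the PR was in flight, meaning the
    pre-merge CI run never exercised the combination that now exists."""
    return pr.tested_against != master_head


pr = PullRequest(approvals=2, ci_passed=True, tested_against="abc123")
print(can_merge(pr))               # the PR satisfies the gate
print(master_moved(pr, "def456"))  # but master has moved on since CI ran
```

A pull request can pass every pre-merge condition and still break master if other changes landed in the meantime, which is the gap a post-merge test run closes.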

The development process seems to be agile in spirit, with test-driven design and multiple engineers involved in each incremental code change.

"Most of the time we work with the concept of a responsible engineer, who takes a ticket – an issue – off the backlog." The responsible engineer then works the issue from understanding the problem and follows it through, possibly even deploying the new software.

"If I take a ticket, I will first understand the problem and try to reproduce it. Then I design the fix or feature, implement the fix, and ensure test coverage and functional testing in an isolated environment. Then I issue a pull request. It's my responsibility to find someone to review the PR."

The process isn't finished when the review is complete. "When the PR is merged, it's my responsibility to ensure the next test run on master passes in CI. And then we start on verification."

The CI environment is based on HTCondor, a workload management system for compute-intensive jobs that originated with the High Throughput Computing Group at the University of Wisconsin-Madison. It’s prized at SpaceX for its powerful queueing, job prioritization, and resource management capabilities, particularly for the HITL testbeds (more on those later).

Condor manages workloads much like a traditional batch system, but can make better use of idle computing resources. Ishimoticha says, "We run about a million CI builds a month."
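
The combination of queueing, prioritization, and resource management can be illustrated with a toy scheduler. This is a simplified sketch, not HTCondor's matchmaking algorithm or API: the `Job`, `Worker`, and `schedule` names, the job names, and the priorities are all made up for the example.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int  # higher numbers run first
    cpus: int      # cores the job requests


@dataclass
class Worker:
    name: str
    free_cpus: int


def schedule(jobs: list[Job], workers: list[Worker]):
    """Place jobs on workers, highest priority first.

    Returns (assignments, waiting). Note that this updates each
    worker's free_cpus in place as cores are claimed.
    """
    # The submission index breaks priority ties in FIFO order and keeps
    # heapq from ever comparing Job objects directly.
    heap = [(-job.priority, i, job) for i, job in enumerate(jobs)]
    heapq.heapify(heap)

    assignments: dict[str, str] = {}
    waiting: list[str] = []
    while heap:
        _, _, job = heapq.heappop(heap)
        # Pick the first worker with enough idle cores for this job.
        worker = next((w for w in workers if w.free_cpus >= job.cpus), None)
        if worker is None:
            waiting.append(job.name)  # stays queued until resources free up
            continue
        worker.free_cpus -= job.cpus
        assignments[job.name] = worker.name
    return assignments, waiting


workers = [Worker("small-1", free_cpus=2), Worker("large-1", free_cpus=28)]
jobs = [
    Job("unit-tests", priority=1, cpus=2),
    Job("flight-sw-build", priority=10, cpus=28),
    Job("lint", priority=1, cpus=2),
]
assignments, waiting = schedule(jobs, workers)
print(assignments, waiting)
```

The high-priority build claims the large worker even though it was submitted second, the first low-priority job takes the small worker, and the remaining job waits in the queue until cores become idle.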

The platform is built around PostgreSQL, which manages the metadata about builds, test results, and other artifacts, along with a lot of Python and C/C++.

Python "is for backend test running, build orchestration, and all our web servers are Python-based. It's a lot of little scripts and a lot of big web services. Angular and JavaScript on the web services for the front end, and a little bit of TypeScript, which is great, I'm a big fan."

Docker is used heavily, along with a little bit of Kubernetes. "Docker is used a lot for ephemeral builds." By using Docker, they can ensure each build runs in a clean environment. "Being able to put those jobs inside Docker means that we can throw away side effects every time, which is wonderful."
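
As a rough illustration of how a containerized job throws away its side effects, the sketch below assembles a `docker run --rm` command from Python. The image name, checkout path, and `./run_tests.sh` script are hypothetical placeholders, and this is not a description of how SpaceX's CI invokes Docker.

```python
import shlex
import subprocess


def docker_build_command(image: str, checkout: str, build_cmd: list[str]) -> list[str]:
    """Assemble a `docker run` invocation for a throwaway build container."""
    return [
        "docker", "run",
        "--rm",                              # delete the container when it exits
        "--volume", f"{checkout}:/src:ro",   # mount the checkout read-only
        "--workdir", "/src",
        image,
        *build_cmd,
    ]


def run_ephemeral_build(image: str, checkout: str, build_cmd: list[str]) -> int:
    """Run the build and return its exit code (requires a local Docker daemon)."""
    return subprocess.run(docker_build_command(image, checkout, build_cmd)).returncode


# Hypothetical image, path, and script, used only to show the resulting command.
cmd = docker_build_command("python:3.12-slim", "/home/ci/checkout", ["./run_tests.sh"])
print(shlex.join(cmd))
```

Because the container is removed on exit and the source is mounted read-only, anything the job writes inside the container disappears with it, so the next build starts clean.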

We wondered about how Software Delivery Engineering works with the hardware. "We don't work with the vehicle hardware as much as the data center hardware," said Ishimoticha. "We rebuild worker systems and add hardware to the CI system all the time. We have about 550 small, medium, and large workers running different kinds of jobs. A small worker has two cores, while our large workers have 28 cores and 32 gigabytes of memory."

There is one cool part of the job that involves playing with actual hardware. "We also have a ton of testbeds connected to the CI system. We call these tests HITLs, pronounced 'hittles', meaning hardware-in-the-loop. These are one-off, custom-built, literally all the guts of the rocket laid out on the table. It doesn't have any fuel tanks and it doesn't have any engines, but it's got, you know, all the flight computers and sensors." She gave a laugh. "We have one for the seats on Dragon 2 and it's got the actuators, we can move the seats around."

You don’t have to be a rocket scientist to become a test engineer, but it’s pretty fun being a test engineer who gets to work with actual, physical rockets.

If you want to learn more about working in software at SpaceX, check out their careers page. You can also check out the other blog posts in this series.
