Editor's note: This is the second piece in a four-part series on the software that powers SpaceX. If you missed yesterday's piece, you can read it here.
There are requirements that make software engineers sweat. Massive distribution to thousands of nodes. High reliability and availability. Multiple distinct platforms. Rapid network growth.
Hardware platforms that change in days or weeks.
Now put your production platforms into space.
This is the world of SpaceX’s Starlink program, which has set a goal of providing high-speed broadband internet to locations where access has been unreliable, expensive, or completely unavailable.
Stack Overflow spoke with two Starlink software leads — Akash Badshah and Andy Bohn — about their development methods and practices. The software breaks down roughly into two parts: 1) software that flies and 2) software that supports the flying components, manages the networks, controls the Starlink satellite “constellation”—the Starlink satellites in orbit—and maintains the communication between the constellation and the ordinary terrestrial internet.
The current Starlink constellation consists of hundreds of small, low-cost satellites in low Earth orbit, and the company aims to scale this to thousands. The low altitude is necessary to provide low latency. Geostationary satellites orbit 26,200 miles from the center of the Earth, or 22,300 miles above the surface, meaning a signal takes roughly 0.240 seconds to travel up to the satellite and back down. Starlink currently orbits at 340 miles, cutting the speed-of-light lag to roughly 1/65 of that.
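The latency figures above are simple arithmetic on the altitudes. Here is a minimal sketch, assuming a satellite directly overhead and a signal traveling at the vacuum speed of light; the function name is illustrative, not anything from Starlink's code.

```python
# Speed of light in vacuum, in miles per second.
SPEED_OF_LIGHT_MILES_PER_S = 186_282.0


def round_trip_seconds(altitude_miles: float) -> float:
    """Time for a signal to go ground -> satellite -> ground.

    Assumes the satellite is directly overhead, so the path length is
    simply twice the altitude. Real paths are slanted and a little longer.
    """
    return 2.0 * altitude_miles / SPEED_OF_LIGHT_MILES_PER_S


geo = round_trip_seconds(22_300)  # geostationary altitude from the article
leo = round_trip_seconds(340)     # Starlink altitude from the article

print(f"GEO: {geo * 1000:.0f} ms")
print(f"LEO: {leo * 1000:.1f} ms")
print(f"GEO/LEO ratio: {geo / leo:.1f}")
```

Because both paths scale linearly with altitude, the ratio of the two delays is just the ratio of the two altitudes.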
Bohn, manager for the Network Software team, said, “We have a ground cluster of services figuring out who talks to whom in the network. What’s interesting about our satellites is they are very close. Because of this, a satellite may only be overhead for a few minutes. The antenna on a customer's roof therefore needs to change which satellite it talks with often.”
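The frequent satellite switching Bohn describes can be pictured with a toy scheduler. This is only a sketch under loud assumptions: the elevation cutoff, the satellite names, and the "pick the highest satellite" rule are all hypothetical, and the real system plans whole network topologies rather than one terminal at a time.

```python
from typing import Dict, List, Optional

# Hypothetical visibility cutoff in degrees; not a Starlink figure.
MIN_ELEVATION_DEG = 25.0


def plan_handovers(elevations: Dict[str, List[float]]) -> List[Optional[str]]:
    """Pick, for each time slot, the satellite highest in the sky.

    `elevations` maps a satellite name to its elevation angle (degrees)
    in each time slot, as seen from one user terminal. The result has one
    entry per slot: a satellite name, or None if nothing is above the cutoff.
    """
    slots = min((len(series) for series in elevations.values()), default=0)
    plan: List[Optional[str]] = []
    for t in range(slots):
        best_name: Optional[str] = None
        best_elev = MIN_ELEVATION_DEG
        for name, series in elevations.items():
            if series[t] >= best_elev:
                best_name, best_elev = name, series[t]
        plan.append(best_name)
    return plan


# Two made-up satellites: one setting, one rising.
passes = {
    "sat-a": [60.0, 40.0, 20.0, 5.0],
    "sat-b": [10.0, 35.0, 55.0, 70.0],
}
plan = plan_handovers(passes)
handovers = sum(1 for prev, cur in zip(plan, plan[1:]) if prev != cur)
print(plan, "handovers:", handovers)
```

With passes lasting only a few minutes, a plan like this has to be recomputed and redistributed continuously, which is the orchestration problem the network team owns.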
“Consider how your cell phone communicates with fixed communication towers. Occasionally your phone needs to switch from one tower to another, but the connection is usually stable,” says Bohn. “For Starlink, one of the main challenges is that our ‘towers’ are orbiting Earth, forcing your path to the internet to change very frequently. My team orchestrates this dance by computing desired network topologies, distributing this plan to the assets in the network, and configuring hardware to make it happen.”
Of course, the Starlink satellites present their own problems. Each satellite is responsible for maintaining connections to the ground stations that are in view and, unlike most satellites, largely navigates itself. When you’re launching hundreds of satellites, there’s no time to place each one into its own specific orbit; instead, ground control tasks each satellite with a place in the constellation, and the satellite steers itself into place. The Earth-side network then provides continuous updates on traffic conditions and constellation changes, while each satellite updates the ground on its planned trajectory.
“The combinatorics of the problem make scaling this system to many millions of people a challenge,” says Bohn. “When serving users, Starlink satellites need to paint the ground with data beams of different frequencies to avoid interference. We end up solving a global scale coloring and interference avoidance problem that is another one of our bigger challenges to do in real time.”
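The coloring problem Bohn mentions has the shape of classic graph coloring: ground cells are nodes, an edge joins two cells whose beams would interfere, and each color is a frequency channel. Below is a minimal sketch using a greedy largest-degree-first heuristic. The cell names and adjacency are invented for illustration, and nothing here is claimed to be Starlink's actual algorithm.

```python
from typing import Dict, Set


def assign_channels(neighbors: Dict[str, Set[str]]) -> Dict[str, int]:
    """Greedily give each cell the lowest channel no neighbor is using.

    `neighbors` maps each cell to the set of cells it would interfere with.
    Cells with the most neighbors are colored first; ties break by name so
    the result is deterministic. Greedy coloring is fast but not guaranteed
    to use the minimum number of channels.
    """
    channel: Dict[str, int] = {}
    order = sorted(neighbors, key=lambda cell: (-len(neighbors[cell]), cell))
    for cell in order:
        used = {channel[n] for n in neighbors[cell] if n in channel}
        candidate = 0
        while candidate in used:
            candidate += 1
        channel[cell] = candidate
    return channel


# Hypothetical layout: A, B, and C all border each other; D borders only C.
cells = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
channels = assign_channels(cells)
print(channels)
```

A greedy pass like this is cheap enough to rerun as conditions change, which matters when the assignment has to track moving satellites in real time; an exact minimum coloring is NP-hard in general.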
Starlink software, both in satellites and on the ground, is written almost exclusively in C++, with some prototyping development in Python. The software is developed in a continuous integration environment, with teams merging into the master development branch often and deploying to the fleet of satellites in space each week.
“We use C++ for most of the vehicle control software. There is a lot of heritage with it at SpaceX as it’s a very low-level language we can use on bare metal microcontrollers. This lets us use it on our embedded Linux computers that we use throughout all our different vehicles,” explains Badshah. “We learned a lot from Dragon and Falcon about how you can run a self-redundant architecture on triplicated computers that are sharing data and solving the same problems.”
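One common way to use triplicated computers that solve the same problem is to compare their outputs and keep the value a majority agrees on. The article does not say how SpaceX reconciles its computers, so treat the following as a generic sketch of majority voting rather than a description of their flight software.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value more than half of the computers agree on.

    Returns None when there is no strict majority, which a caller would
    treat as a fault needing some other recovery path.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


# Three hypothetical computers; the third has produced a corrupted result.
print(majority_vote([42, 42, 43]))
# No two computers agree, so there is no majority to trust.
print(majority_vote([1, 2, 3]))
```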
New code goes through an extensive testing cycle using many different test frameworks, from simple unit tests to massive simulations. The most interesting tests range from putting satellites in anechoic isolation from ground stations and testing their communication, to presenting the test platform with a simulation of the entire environment in which the satellites will operate. In essence, Starlink has built a simulation to mock up space-time, at least in the neighborhood of Earth.
"For development and test of these algorithms, we have a full-scale network simulation running in continuous integration on a high-performance computing cluster. This simulation is capable of running the C++ production code as well as running against prototype code written in Python,” says Bohn. “The Python version allows for rapid iteration during the design phase. Once we are happy with the results of an algorithm, we port it to C++ so it runs efficiently in production.”
One of the big challenges for Starlink is that the satellites themselves often change. Starlink says it has never had a launch in which the satellites going into the constellation were unchanged from the previous launch. In most environments, this would be a major problem (read: a recipe for disaster). Starlink has solved this problem by putting software developers directly into the manufacturing cycle.
Instead of new hardware being “thrown over the wall” to developers, the software developers are integrated into the manufacturing process to the extent of being on the actual manufacturing shop floor. To make sure that hardware and software stay in sync throughout the process, software is sometimes tested on satellites coming off the production line and on their way to orbit.
Once the satellite software is ready to fly, it’s packaged for transmission to the satellites. Releases are first deployed to a few satellites in orbit and tested in place. If there are failures, the software can be rolled back. If it’s judged satisfactory, the software is deployed in an exponential rollout to the remaining satellites.
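The canary-then-exponential rollout described above can be sketched in a few lines. The wave sizes, growth factor, and health check here are assumptions chosen for illustration; the article does not give Starlink's actual numbers or tooling.

```python
from typing import Callable, List, Sequence, Tuple


def rollout_waves(fleet: Sequence[str], canary_size: int = 2,
                  growth: int = 2) -> List[List[str]]:
    """Split the fleet into waves that grow geometrically in size."""
    if canary_size < 1 or growth < 2:
        raise ValueError("canary_size must be >= 1 and growth >= 2")
    waves: List[List[str]] = []
    start, size = 0, canary_size
    while start < len(fleet):
        waves.append(list(fleet[start:start + size]))
        start += size
        size *= growth
    return waves


def deploy(fleet: Sequence[str],
           healthy: Callable[[str], bool]) -> Tuple[bool, List[str]]:
    """Update wave by wave, stopping at the first unhealthy wave.

    Returns (success, updated). On failure, `updated` lists every satellite
    that received the release and would need to be rolled back.
    """
    updated: List[str] = []
    for wave in rollout_waves(fleet):
        updated.extend(wave)
        if not all(healthy(sat) for sat in wave):
            return False, updated
    return True, updated


fleet = [f"sat-{i}" for i in range(15)]
print([len(w) for w in rollout_waves(fleet)])

# Hypothetical health check in which one satellite rejects the release.
ok, touched = deploy(fleet, healthy=lambda sat: sat != "sat-3")
print(ok, len(touched))
```

Starting small limits how many satellites a bad release can reach, while geometric growth means the number of waves grows only logarithmically with fleet size.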
Another advantage of C++ is in the area of memory management. No matter how many times you check the code before launch, you have to be prepared for software corruption once you’re in orbit. “What we have established is a core infrastructure that allows us to know we are allocating all of our memory at initialization time. If something is going to fail allocation, we do that right up front,” says Badshah. “We also have different tools so that any state that is persisted through the application is managed in a very particular place in memory. This lets us know it is being properly shared between the computers. What you don’t want is a situation where one of the computers takes a radiation hit, a bit flips, and it’s not in a shared memory with the other computers, and it can kind of run off on its own.”
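The allocate-everything-at-startup pattern Badshah describes is easiest to see as a fixed-size block pool. The production code is C++; this sketch uses Python only to show the shape of the idea, and the class and method names are hypothetical. Python itself still allocates small objects behind the scenes, so this illustrates the pattern rather than delivering its guarantees.

```python
from typing import List, Optional


class FixedBlockPool:
    """A pool of equally sized blocks carved from one up-front buffer."""

    def __init__(self, block_size: int, block_count: int) -> None:
        if block_size <= 0 or block_count <= 0:
            raise ValueError("block_size and block_count must be positive")
        # The only large allocation happens here. If memory is short,
        # MemoryError is raised at startup instead of in mid-operation.
        self._storage = bytearray(block_size * block_count)
        self._view = memoryview(self._storage)
        self._block_size = block_size
        self._free: List[int] = list(range(block_count))

    def acquire(self) -> Optional[int]:
        """Return the index of a free block, or None if the pool is empty."""
        return self._free.pop() if self._free else None

    def block(self, index: int) -> memoryview:
        """Return a writable view of the block's bytes in the shared buffer."""
        start = index * self._block_size
        return self._view[start:start + self._block_size]

    def release(self, index: int) -> None:
        """Hand a block back to the pool."""
        if index in self._free:
            raise ValueError("block released twice")
        self._free.append(index)


pool = FixedBlockPool(block_size=16, block_count=2)
first = pool.acquire()
second = pool.acquire()
exhausted = pool.acquire()   # the pool is empty, so this is None
pool.block(first)[0] = 0xFF  # writes land in the preallocated buffer
pool.release(first)
print(first, second, exhausted)
```

Keeping all state in one known region also makes it simpler to reason about what has to be mirrored between redundant computers, which is the concern raised in the quote.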
If you want to learn more about what it’s like to work as a vehicle engineer at SpaceX, check out their careers page. If you’re interested in how code works at other parts of SpaceX, you can dive into the rest of our series.