Rewriting Bash scripts in Go using black box testing

When rewriting software in a new language, how do you test that your new and old programs do the same thing?


Testing is an integral part of any application, and writing automated tests is critical to ensuring the safety of your code. But what do you do when you’re rewriting a program in an entirely different language? How do you ensure that your new and old programs do the same thing?

In this article I’m going to describe a journey we took to change a collection of Bash scripts into a well-organized Go library, and how we made sure that nothing broke along the way.

In the beginning…

Here at Flipp we have our own microservice platform, which allows us to package up and deploy our code as part of our continuous delivery pipeline. We provide additional functionality as well, like validations for permissions (so we don’t deploy something that’ll fail because it doesn’t have permissions to read or write a resource).

The scripts behind this platform were written in Bash. Bash is a great way to interact with Linux executables. It’s really fast and doesn’t need any particular programming language installed since there is no compilation or interpretation past the shell itself. But Bash is finicky to work with, hard to test, and doesn’t have a “standard library” like most programming languages.

We probably could have kept chugging with Bash. However, the main pain point that kept coming back was the use of environment variables for pretty much everything. When you’re writing Bash, there is no way of telling whether a particular environment variable is an input to your script, whether it’s something that just happened to be set due to some external process, or if it’s a local variable that your script should “own.”

In addition, if you separate your scripts into files so your platform can use them, you have no way of stopping any user from calling little bits of them without telling you. If you want to deprecate a feature, it’s almost impossible to do since everything is open to the world.

When making any kind of significant change, our only choice was basically to put out a new version, “test it in production,” and promote it to the default version after those tests are complete. This worked fine when we had a couple of dozen services that used these scripts. Once we ballooned to several hundred, it became less ideal.

We had to rewrite these things so they were actually maintainable.


We decided to rewrite the scripts in Go. Not only was it a language we were already using at Flipp for high-throughput API services, it also allowed easy compilation to whatever target architecture we needed and came with some great command-line libraries like Cobra. In addition, it meant we could define the behavior of our deploys with config files as opposed to huge messes of environment variables.

The question remained, though: how would we test this thing? Unit and feature tests are typically used before refactoring so you can ensure that your inputs and outputs match. But we were talking about rewriting this in an entirely different language. How could we be sure we weren’t breaking things?

We decided to take a three-step approach:

  1. Describe the behavior of the existing scripts by having a test framework cover all existing cases.
  2. Rewrite the scripts in Go in such a way that all existing tests still pass, and so that the new program can take either environment variables or config files as its input.
  3. Refactor and update the Go library so that we can take advantage of the flexibility and power of a programming language. Change or add tests as necessary.

Describing behavior: Introducing Bats

Bats is a testing framework for Bash. It’s fairly simple — it provides a test harness, setup and teardown functions, and a way to organize your tests by file. Using Bats, you can run literally any command and provide expectations on exit codes, output, environment variables, file contents, etc.

Bats was the first step toward testing our existing scripts, but it doesn’t go quite far enough. One of the critical pieces of automated testing is the ability to stub or mock functionality. In our case, we didn’t want to actually call out to the docker or curl commands, or to do any real deploys, as part of our testing framework.

The key here is to manipulate the shell path to direct any invocations to “dummy” scripts. These scripts can inspect inputs to the command, as well as environment variables that could be set during test setup, and print their output to a file which can be inspected after the test runs. A sample dummy script might look like this:

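#!/bin/bash
# Record every invocation so the test can inspect it afterwards.
echo -e "docker $*" >> "${CALL_DIR}/docker.calls"

# Return a canned version string when asked.
if [[ "$*" == *--version* ]]; then
  echo "Docker version 20.10.5, build 55c4c88"
fi
# Simulate a failed build when the arguments mention docker-image-fail.
if [[ $1 == "build" && "$*" == *docker-image-fail* ]]; then
  exit 1
fi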

A sample output file after a script run might look something like this:

# docker.calls
docker build --pull -f systems/my-service/Dockerfile
docker push my-docker-repo.com/my-service:current-branch
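To make the path manipulation concrete, here is a minimal sketch of how a test’s setup and teardown could route docker invocations to a dummy script. The function names and the STUB_DIR and ORIGINAL_PATH variables are hypothetical and not taken from our actual harness; only CALL_DIR matches the dummy script above.

```shell
#!/usr/bin/env bash
# Sketch: shadow the real `docker` with a stub by prepending a directory to PATH.
# setup_stubs, teardown_stubs, STUB_DIR and ORIGINAL_PATH are illustrative names.

setup_stubs() {
  STUB_DIR="$(mktemp -d)"
  CALL_DIR="$(mktemp -d)"
  export CALL_DIR

  # Write a minimal dummy `docker` that only records its arguments.
  cat > "${STUB_DIR}/docker" <<'EOF'
#!/bin/bash
echo "docker $*" >> "${CALL_DIR}/docker.calls"
EOF
  chmod +x "${STUB_DIR}/docker"

  # The stub directory goes first, so it wins over any real binary of the same name.
  ORIGINAL_PATH="$PATH"
  export PATH="${STUB_DIR}:${PATH}"
}

teardown_stubs() {
  # Restore the original PATH and remove the temporary directories.
  export PATH="$ORIGINAL_PATH"
  rm -rf "$STUB_DIR" "$CALL_DIR"
}
```

In a Bats file, these would typically be called from the setup and teardown functions, so the script under test never sees the real executables.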

The last piece is to introduce snapshot testing to our main test harness. This means we save these dummy output files, plus the actual command output, to files that live inside the repo. The order of operations is something like this:

  • The new test is written. At this point, we don’t have any output files.
  • The new test is run with the UPDATE_SNAPSHOTS environment variable set. This saves the dummy output files and command output to an outputs directory for the folder we’re running in. These get committed to the repo.
  • When we re-run tests, the output is saved to a current_calls directory within the same folder.
  • After the command is done, we call a script that compares the contents of the outputs directory with that of the current_calls directory.
  • If the output is identical, it reports success and deletes the current_calls directory.
  • If there are differences, it reports failure and leaves the current_calls directory where it is so we can inspect it and use diff tools.
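The comparison step in the list above could be sketched roughly as follows. The directory names (outputs, current_calls) and the UPDATE_SNAPSHOTS variable come from the description; the function name is hypothetical, and the sketch assumes that every run writes to current_calls first and that update mode simply promotes that directory to outputs.

```shell
#!/usr/bin/env bash
# Sketch of the snapshot check. check_snapshots is an illustrative name.

check_snapshots() {
  local test_dir="$1"
  local expected="${test_dir}/outputs"
  local actual="${test_dir}/current_calls"

  if [ -n "${UPDATE_SNAPSHOTS:-}" ]; then
    # Record mode (assumed behavior): the current run becomes the committed snapshot.
    rm -rf "$expected"
    mv "$actual" "$expected"
    return 0
  fi

  # -r compares the two directories recursively; -u prints a unified diff.
  if diff -ru "$expected" "$actual"; then
    rm -rf "$actual"   # identical: report success and clean up
    return 0
  fi

  # Different: leave current_calls in place so it can be inspected with diff tools.
  echo "snapshot mismatch; inspect ${actual}" >&2
  return 1
}
```

Because diff also exits non-zero when the outputs directory is missing, a test that has never had its snapshots recorded fails rather than passing silently.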

Describing behavior: Gotchas

There were a couple of things we got bitten by that we had to fix in order to get this to work right on our continuous integration pipeline:

  • A Bash script runs on whatever computer invokes it, meaning that the current directory might be different between your machine and the CI pipeline’s machine. Because of this, we had to search for the current directory in all output files and replace it with %%BASE_DIR%%. This ensures that the output to be compared is always identical regardless of where it’s run.
  • Some commands output colored text using the \e escape sequence. This results in slightly different text saved to the output files on Mac versus Linux, so we had to do some find/replacing here as well.
  • We had to have a way to check if the called command actually failed — sometimes we expect it to fail, and the test itself should fail if the failure didn’t happen as expected. In our case we had to run quite a bit of code both before and after the command under test so the exit code was no longer available to the Bats test file. Because of this, we had to set an environment variable indicating whether we expected the current command to fail or not.
  • We wanted to keep the actual test files concise to avoid possible manual errors. In our shared test code, we determined the suite folder from the name of the test file and in many cases, the test was nothing but a single line with the name of the folder inside that suite to test.
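As a rough illustration of the first three gotchas, the helpers below normalize captured output and check an expected failure. All names here (normalize_output, check_exit_status, EXPECT_FAILURE) are hypothetical. Stripping color sequences entirely is one possible form of the find/replace and is an assumption, not necessarily what our harness does.

```shell
#!/usr/bin/env bash
# Sketch: make captured output machine-independent and verify expected failures.

# Reads text on stdin and writes the normalized text to stdout.
# Note: base_dir is used as a sed pattern, so unusual characters in it would need escaping.
normalize_output() {
  local base_dir="$1"
  local esc
  esc="$(printf '\033')"   # the literal ESC character that \e stands for
  # First strip ANSI color sequences (ESC [ ... m), then hide the absolute base directory.
  sed -e "s/${esc}\[[0-9;]*m//g" -e "s|${base_dir}|%%BASE_DIR%%|g"
}

# Succeeds when the exit status matches what the test expected.
check_exit_status() {
  local status="$1"
  if [ -n "${EXPECT_FAILURE:-}" ]; then
    # The command was supposed to fail, so a zero status is the error.
    [ "$status" -ne 0 ]
  else
    [ "$status" -eq 0 ]
  fi
}
```

A harness could pipe each output file through normalize_output "$PWD" before comparing snapshots, and call check_exit_status with the saved exit code of the command under test.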

Describing behavior: Devising the tests

The slog now began. Essentially, every if and loop statement in our Bash scripts represented another test case. In some cases it was obvious that a code branch was fairly isolated and could be covered by a single case. In others, the branches might interact in weird ways. This meant that we had to painstakingly generate test cases in a multiplicative manner.

For example, we needed to have a test for when a single service was being deployed as opposed to multiple services, and also for when only the main deploy step was being run as opposed to the full workflow. Covering every combination of these two choices meant four independent test suites.
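The multiplicative growth is easy to see when the combinations are enumerated. The suite names in this small sketch are made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch: enumerate suites as the cross product of two independent choices.
# The scope and workflow labels are hypothetical.

list_suites() {
  local scope workflow
  for scope in single-service multi-service; do
    for workflow in deploy-only full-workflow; do
      echo "${scope}_${workflow}"
    done
  done
}

list_suites
```

Every additional independent two-way branch doubles the number of suites, which is why this step needed so much care.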

This step probably took the longest! When it was done, we could safely say we had described the behavior of our existing deploy scripts. We were now ready to rewrite them.

The rewrite: Let’s go with Go!

In the first iteration of the Go library, we deliberately called out to Bash (in this case, using the go-sh library) every time we wanted to do something external, like web requests. This way, our first Go version was completely identical to the Bash version, including how it interacted with external commands.

Refactor and update: Realizing wins

Once all tests were passing using this version, we could start having it act more like a Go program. For example, rather than calling curl directly and dealing with its cumbersome way of checking HTTP statuses, it made far more sense to use Go’s built-in HTTP functions to make the request.
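To show what “cumbersome” means here, this is a sketch of the kind of status handling a Bash script typically needs around curl. It is a generic pattern, not our original code, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: checking an HTTP status with curl from Bash. http_status_ok is an illustrative name.

http_status_ok() {
  local url="$1"
  local status
  # -s silences progress output, -o /dev/null discards the body,
  # and -w '%{http_code}' prints only the numeric status code.
  status="$(curl -s -o /dev/null -w '%{http_code}' "$url")" || return 1
  case "$status" in
    2??) return 0 ;;   # any 2xx response counts as success
    *)
      echo "request to ${url} failed with HTTP ${status}" >&2
      return 1
      ;;
  esac
}
```

With Go’s HTTP functions, the status code is available directly on the response, so none of this output parsing is needed.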

However, as soon as we stopped calling a command directly, our existing tests no longer verified that behavior! The outputs depended on the existence of those dummy scripts, and we had stopped calling them.

We didn’t actually want our test outputs to be identical to the old ones — the curl outputs specifically were so convoluted that we wouldn’t have any “wins” if we wrangled the Go code into somehow outputting something that looked like it came from a Bash curl call.

To get past this point, we had to take three steps.

  • First, we wrote a library that wrapped the commands exactly as they were being called at the time: for example, a function that took a URL, a method, POST data, and so on. At this point, it still called the curl command.
  • We then changed it to use the Go HTTP functions instead of curl. We wrote unit tests around that library so we knew it at least called the HTTP functions correctly. We also had the “mock” version of this library write to its own output file, similar to how our “dummy” scripts worked.
  • Finally, we reran the snapshots for the tests. At this point, we had to do manual work to compare the outputs of the original curl.calls file and the new requests.calls file to validate that they were semantically identical. Using diff tools, it became pretty easy to tell visually when things were identical and when they weren’t.

In other words, although this step lost us our armor-plated certainty that nothing had changed, we were able to pinpoint our change so that we knew that all the diffs were related to this one change and could visually confirm that it worked.


We did still have to do some manual testing due to environment changes, but the end result worked really well. We were able to replace our deployment scripts with the Go version for all new services and were even able to develop a script to automate pull requests for all existing services (using multi-gitter) to allow teams to move to the new version when they were able to.

This was a careful and long journey, but it helped immensely to get where we were going. You can see an edited version of the scripts we used in this gist!
