Securing the data in your online code repository is a shared responsibility
Back when I started with computers, if I wanted to back up my code, I’d plug a hard drive into my computer and copy over all the files. In those days, all apps were desktop apps and all vital data lived on your desktop hard drive.
But today, a large number of apps and services live in the cloud; your desktop is just a thin shim to some service operating elsewhere. The code for these apps, and yours, likely lives in the cloud on a SaaS source control platform, like GitHub or GitLab. The data and the app are just abstractions now, and I’m not sure developers appreciate that while the data isn’t on media that they can touch anymore, they still need to do backups.
My co-founder knows this pain all too well. Once while on stage giving a presentation in front of a live audience, his Mac froze. When he rebooted it, he saw the Mac question mark icon appear; his hard drive had died and all his files, including his presentation, were gone forever. On that day, when he experienced his own data disaster, he became a backup advocate.
When he came to me with the initial thought to back up Shopify data, I was the first to say it was a dumb idea. Why would you do this? That data lives in the cloud, so data security must be the service’s responsibility, right?
Wrong. That’s because SaaS companies follow the shared responsibility model.
The shared responsibility model
Any SaaS company cares about data security, naturally. No one is going to use their service if the data that customers upload to it gets lost, corrupted, or stolen regularly. But these service providers care about data on a macro level: say a meteorite lands on one of their data centers, or a bad actor breaches their systems and corrupts their entire customer database. The service provider takes great pains to protect against those types of threats with data replication, strong infrastructure-level protections, and encryption.
But when an individual customer has a problem, tracking down the cause becomes a needle-in-a-haystack problem. Something trivial, say a change to a small attribute in a particular area of your data model or an integration screwing up your PR history, isn’t something that the provider can necessarily handle. You are responsible for ensuring that your individual account data is safe, recoverable, and secure.
The shared responsibility model has evolved with the introduction of a plethora of SaaS platforms, large and small. Service providers tend to focus on their business and protect its integrity. They do one thing, and they do it well. Account-level recovery is non-trivial to develop and typically lies outside their core competencies. That’s why they require each account holder to protect their own data.
It’s best practice in data security and business continuity to adhere to the 3-2-1 backup rule. This rule says that you should keep three copies of your data, on two different types of media, with one copy offsite. For instance, one copy in the SaaS service, one on a local hard drive or server, and a third in a separate cloud service. Your service may have a basic backup plan in place, but only for one snapshot of the data (and this snapshot could be outdated or incomplete). Backing up data manually is an error-prone and time-consuming process that doesn’t meet the needs of companies that are serious about protecting their own data.
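To make the rule concrete, here is a minimal sketch of a check for a backup plan. The `DataCopy` class, its fields, and the example plan are all made up for illustration; they are not part of any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    """One copy of the data. The fields are illustrative, not a real schema."""
    name: str
    medium: str    # e.g. "saas", "local-disk", "object-storage"
    offsite: bool  # True if stored away from the primary location


def satisfies_3_2_1(copies: list[DataCopy]) -> bool:
    """Return True if there are 3+ copies, on 2+ media, with 1+ offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# Hypothetical plan mirroring the example in the text above.
plan = [
    DataCopy("SaaS service (primary)", medium="saas", offsite=False),
    DataCopy("office server", medium="local-disk", offsite=False),
    DataCopy("separate cloud service", medium="object-storage", offsite=True),
]
print(satisfies_3_2_1(plan))
```

Dropping any one of the three copies from `plan` makes the check fail, which is the point of the rule: no single loss should leave you without a recoverable copy.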
The licenses for most SaaS providers include language indicating that the customer needs to ensure their own level of integrity, safety, and security for the data they upload and use. For developer tools like code repositories, every organization has different needs for security controls. Some may be bound by compliance schemes like PCI, SOC 2, ISO 27001, and GDPR.
But imagine that your service provider has a perfect backup strategy. You should still back up your code and associated metadata. When you’re faced with the possibility of your business being severely disrupted for a period of time, even irrevocably, you’ll want to have a plan to prevent that disruption without having to rely on your service provider’s backup snapshots. Nobody cares about your code like you do. Think about it as insurance for your software engineering organization.
Build vs. buy
You may think backing up your online repo is pretty easy; just run `git clone` and you’re done: backup accomplished. But your repo is more than just the source code. There’s a whole lot of metadata around the code that would be lost. A clone brings down your commits and their messages, but not the pull requests, review comments, and issues that live only on the hosting platform. Together, these serve as an audit trail for design decisions and change management. Without them, you lose important context around your projects.
Many of the online code repo providers have realized that their service is where the software engineering process occurs, so they have features beyond source control. They’ve added features that support CI/CD pipelines, build processes, and automated testing. These are mission-critical features and are not pulled down to your local system on `git clone`. In fact, the more you look at the issue, the more items need to be included in backups.
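As a small illustration of that gap, the sketch below builds the two separate requests a do-it-yourself backup of a GitHub repo would need at minimum: a `git clone --mirror` for the Git data, and a call to GitHub’s REST endpoint for listing pull requests. It only constructs the command and the URL; actually running them (with `subprocess`, an HTTP client, and an access token) is left out. The organization and repository names are placeholders, and other platforms expose different APIs.

```python
from urllib.parse import urlencode


def mirror_clone_command(repo_url: str, dest_dir: str) -> list[str]:
    """Build a `git clone --mirror` command, which copies all refs
    (every branch and tag), not just the default branch."""
    return ["git", "clone", "--mirror", repo_url, dest_dir]


def pull_requests_url(owner: str, repo: str, page: int = 1) -> str:
    """Build the GitHub REST API URL for one page of pull requests,
    open and closed. Pull requests are not part of the Git data."""
    query = urlencode({"state": "all", "per_page": 100, "page": page})
    return f"https://api.github.com/repos/{owner}/{repo}/pulls?{query}"


# "example-org" and "example-repo" are placeholder names.
print(mirror_clone_command(
    "https://github.com/example-org/example-repo.git", "example-repo.git"))
print(pull_requests_url("example-org", "example-repo"))
```

Pull requests are only one item. Issues, review comments, and pipeline settings each need their own API calls, pagination, and rate-limit handling, which is where a quick script starts to grow.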
You could try to roll your own backup script. Developers love building their own solutions; it’s why many of them got into programming in the first place. It’s how Rewind was started. We saw the 25-page help document that went through everything that one needed to do to back up an e-commerce store and said, “We can automate it.” It took us a month to create a prototype, but customers came to us, cash in hand, because the process of transforming that massive help file into a script is both cumbersome and error-prone. If something goes wrong and you lose data, your service provider is going to shrug and say, “We’d love to be able to help you, but we can’t.”
Automating backups and restores is not simple. The reality is that you are building something with the APIs that the service company provides; no longer can I back up my code using the old method of plugging in a fresh hard drive and dragging files over. APIs rely on the internet remaining stable. Building on those APIs means that you’ll be forced to update and maintain the script as the APIs change, and they will, sometimes with breaking changes. APIs also abstract away crucial business logic that you need to understand to make sense of the data when the inevitable data disaster strikes and it comes time to recover.
Companies thrive when employees focus on improving their core product and delegate the rest. SaaS companies thrive when they follow the shared responsibility model, and so should you. One of our customers came to us after trying to do this themselves. They spent a full week building something that could produce a backup of some of their online repositories. It cost a lot of time and money, and they still had to maintain it as an ongoing project. Do you want to fill up development cycles with something that is not the core competency of your business?
How we solve the problem
Rewind’s founding philosophy is that solutions should be simple. Automating a backup and restore process can be complicated, but our way of solving it makes it easy for the end user. By design, there are very few configuration options: what you back up and how often are built in. The initial backup can take a while; it is backing up everything, after all, and we want to make sure that we back up everything that is part of your repo. After that, it runs on a fixed interval, incrementally backing up changed parts of your repo; we want to do as little as possible to ensure as much coverage as possible.
We do backups like this for a couple of reasons. First, this incremental backup methodology is related to how we model the data in your repos. It’s also how git works—each branch points to the most recent commit, which lists the changes it includes, and each commit points to the previous commit. Code is updated incrementally.
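The commit chain described above is what makes an incremental pass cheap. As a rough sketch (not Rewind’s implementation), the function below walks back from a branch tip through parent pointers and stops as soon as it reaches commits that were already saved. The commit IDs and the `parents` mapping are invented for the example; a real tool would read them from the repository.

```python
def new_commits(branch_tip, parents, already_backed_up):
    """Walk back from a branch tip and collect commits not yet backed up.

    Assumes that if a commit is already backed up, all of its ancestors
    are too, so the walk can stop there.
    """
    pending = [branch_tip]
    found = []
    seen = set()
    while pending:
        sha = pending.pop()
        if sha in already_backed_up or sha in seen:
            continue
        seen.add(sha)
        found.append(sha)
        # A merge commit has more than one parent; follow all of them.
        pending.extend(parents.get(sha, []))
    return found


# Hypothetical linear history: c4 -> c3 -> c2 -> c1
parents = {"c4": ["c3"], "c3": ["c2"], "c2": ["c1"], "c1": []}
print(new_commits("c4", parents, already_backed_up={"c1", "c2"}))
```

Here only `c4` and `c3` need to be fetched; the older commits are never touched again.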
Second, we look at it as a way to play nice within the ecosystem of these services. Obviously, their APIs can handle a fair amount of traffic, but we don’t want to be the ones negatively impacting their ability to provide their services. By looking only at a repository’s deltas, we minimize the amount of back-and-forth that we have with the service.
When it comes to focusing on the one thing that a service does best, we practice what we preach. Our service is built on the cloud (AWS), and we use as many managed services as we can. That lets us focus on the business logic of our backup services. We aren’t ones to roll our own databases or bare metal to drive our APIs. If there’s a solid product that solves our problem without us having to sink developer hours into it, sold! That said, we care about data security more than most (our business is handling your data), so our storage is as durable as the AWS services that we use allow it to be.
For those of you wondering about our tech stack, we’re a Ruby on Rails shop. We love the quotes about how Ruby on Rails doesn’t scale; it drives us to prove them wrong. Based on the volume of data that we deal with on a day-to-day basis, I think we’re something of a poster child for what a SaaS company can do with Rails. We’re not the size of Shopify for example, but we’ve achieved impressive results using something that supposedly doesn’t scale.
As we’ve grown, we’ve found that the diversity of APIs has led us to adopt new approaches to working with them. The term REST refers to a general architectural style, and how companies implement that style can be vastly different. We’ve learned to abstract commonalities in order to make processes more efficient on our end. While many of the services that we support have thorough developer documentation, we’ve learned to never take for granted that those docs necessarily represent how those APIs work in practice. There can be a ton of business logic behind the scenes that won’t be obvious until you use the API, some of which the author might not have even intended.
As such, we are very hands-on with the APIs that we consume. It used to be that you could implement an API in a project and forget it. Now some services change their APIs quarterly, occasionally breaking old implementations. If someone wants to restore data from a year ago—four API revisions in between—we have to have a plan in place to migrate that data as it existed back then to the new version with all of the additional business logic.
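One common way to handle this kind of drift, shown here only as a sketch under invented assumptions, is to tag each backup with the API version it was captured under and chain small migration functions up to the current version. The version numbers, field names, and changes below are hypothetical and do not describe any real service’s API or Rewind’s internals.

```python
def v1_to_v2(record):
    # Hypothetical change: the "title" field was renamed to "name".
    migrated = dict(record)
    migrated["name"] = migrated.pop("title")
    return migrated


def v2_to_v3(record):
    # Hypothetical change: a new "draft" flag that defaults to False.
    migrated = dict(record)
    migrated.setdefault("draft", False)
    return migrated


# Each entry upgrades a record from that version to the next one.
MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3}
CURRENT_VERSION = 3


def upgrade(record, from_version):
    """Apply each migration in order until the record is current."""
    for version in range(from_version, CURRENT_VERSION):
        record = MIGRATIONS[version](record)
    return record


old_backup = {"id": 7, "title": "Fix login bug"}
print(upgrade(old_backup, from_version=1))
```

Because each step copies the record before changing it, the original backup stays untouched, which matters when the stored snapshot is the only record of how the data looked at the time.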
Sometimes, because of limitations in the repository service API, we can’t automatically restore every backup. Our approach is that having the backup data solves 90% of the problem. Restoring that data is of critical importance, but if our service can’t do that automatically, our customer service can get it done. We pride ourselves on our customer service; we bend over backwards to make sure that our customers will be able to recover from a disaster, even if the issue is a limitation in the service API.
What’s next for backups
Right now, we’re focused on growing the products that we offer for developers. As such, we’ve launched JIRA backups in an alpha form with our eyes firmly locked on other developer tools in the not too distant future. The modern software engineering process happens on GitHub and JIRA, so preserving a record of that process and the decisions therein can be key to your future success.
Data security and location have become more important lately with GDPR and related regulations. Compliance audits like SOC 2 can be onerous to go through, so we have centralized management of repositories to support the security controls that this audit requires. For those who want a copy of this data in their own cloud spaces, we’ll soon offer the ability to sync your backups outside our cloud storage to your own.
If you want to check out our backup service now, check out www.rewind.com.
“You may think backing up your online repo is pretty easy; just run `git clone` and you’re done: backup accomplished. But your repo is more than just the source code. There’s a whole lot of metadata around the code that would be lost.” That’s simply not true; `git clone` gives you a clone of the online repository with all of the metadata and history. That’s the entire point of distributed version control.
Pull requests would not be, though, which is why I always advocate for meaningful commit messages. I hate projects where the git commit messages are “Fixed the bug” and all the interesting info is in GitHub.