Privacy is an afterthought in the software lifecycle. That needs to change.
“Move fast and break things” sounds great on a t-shirt when you’re building a calculator and the greatest risk is a minor miscalculation. Consider a civil engineer working on bridges and tunnels—would we be comfortable if they wore a t-shirt with the motto “known shippable”?
No, because we want to be assured that those engineers whom we entrust to build these critical systems have designed them for structural integrity, safety of the users, and the longevity to serve a generation of people.
We software engineers are in the same boat now. The things our community builds are awe-inspiring and wonderful, and the world is eager to participate in our invention. But with that comes a duty of care to those that don’t understand the inner workings of software or data processing, to build systems that respect those individuals, and to ensure the integrity and safety of the systems we provide society. The term of art for this consideration is “data privacy,” but in practice it’s the “respect” concept that may be more apt. Respectful systems are ones where choices are presented transparently to users, and the information users elect to share with the system remains under their control at all times. It doesn’t seem like too much to ask, right? In our field, it’s becoming table stakes.
The good news is, it doesn’t mean we can’t move fast. To quote the endlessly thoughtful DJ Patil, we can “move purposefully and fix things.” The key to combining privacy and innovation, rather than pitting them against one another, is baking privacy into the SDLC. Analogous to application security’s (AppSec) upstream shift into the development cycle, privacy belongs at the outset of development, not as an afterthought.
Compliance vs. respect
There are two ways to look at the recent emergence of regulatory interest in our field: an oppressive tax on the agility of our industry (compliance), or an opportunity to build better software that respects its users first and foremost. I’m an optimist so I choose the latter, but this requires implementing fundamental engineering principles and tools. A great reference for this is Dr. Ann Cavoukian’s seven principles for Privacy by Design, which are:
- Proactive not reactive; preventive not remedial
- Privacy as the default setting
- Privacy embedded into design
- Full functionality—positive-sum, not zero-sum
- End-to-end security—full lifecycle protection
- Visibility and transparency—keep it open
- Respect for user privacy—keep it user-centric
Today, data privacy for most engineering teams is something tackled after software is shipped to production. It’s a little like application security a decade ago, where you’d complete implementation, and then security would happen after the fact by pen testing and fixing issues as they were identified.
However over the last few years, AppSec has “shifted left” to be part of the SDLC. Well-built software is secure before it leaves the repo you were working in, and security is achieved through a change in engineering mindset and a suite of awesome tools for static code analysis, application security testing, and threat modeling that ensures the code you ship is as secure as it can be. Engineers are increasingly proud to have a security mindset and understand defensive design and zero trust principles.
But again, today privacy is after-the-fact. For example, if you’re in an early-stage team, you iterate on an MVP until you find product market fit. That lean approach makes business sense, but as soon as you have traction—and customers—you’re collecting data. Because you didn’t plan for this part, you’re instantly exposing your users to risk and accumulating a form of data planning debt that’s nearly impossible to climb out of. The more successful your trajectory, the faster your growth, the more data you accumulate, and the more acute this problem becomes. You can’t go backwards and unpack the debt, so now you’re building unscalable mechanical turks to mitigate risk but unable to fix the underlying problem.
The problems of privacy
Let me give some real-world technical problems this yields:
Real world problem #1: Orphan data sets
In large tech companies, “orphan data sets” refer to datasets that exist in their infrastructure that are still in use but there’s no clear understanding of what is in there, where the data comes from, and how it’s used.
We’re all familiar with this story: you build your first product quickly, so perhaps your documentation isn’t perfect or your data model evolves organically. You find success and start growing, so more engineers join the team. Team hierarchy is pretty flat and can spin up a database and start working—awesome, we’re the essence of agile!
Then you’re concerned about system availability and scaling so you start to chunk things into a more elegant service-oriented architecture: different engineers pick pieces off as you add functionality to meet customer demand. Some of your early engineers leave the company, new ones join, the cycle repeats, and before long you have a platform of 500+ microservices and hundreds of databases without any consolidated annotations or metadata. You have no certainty in what serves which purpose and there’s lots of grey areas.
How’s this solved today? A team of dedicated data engineers sift through these systems, starting from the dataset and working back forensically through the services layer to understand what data is being stored, what the services are doing, and what business function is being performed.
And because you haven’t changed your engineering practices—you’re still shipping code the way you were—it’s a Sisyphean problem. Your data engineers are constantly untangling systems to try and understand where data is and how to govern it safely on behalf of your users.
Real world problem #2: Enforcing a user’s rights
In order to respect your users, you must respect their rights and, increasingly, these are backed by legal requirements, like with Europe’s GDPR, California’s CCPA, and Brazil’s LGPD.
These rights vary by location, but the basic principle is that when a user of your system wants to, they can access all the information you have about them, they can update it, or they can delete themselves completely. In addition to this, users have rights on how you are permitted to use their data. This creates requirements to be able to suppress an individual user’s data from being used in a specific process in your systems if they disagree with it.
Imagine you have a really elegant, service-oriented architecture of 400 services, each responsible for a distinct piece of business logic. If user A in your platform opts out of processing for 27 of the things you do, you need to be able to identify that individual, a single entity in your dataset, and enforce a constraint that may only apply to specific attributes of their entity recordset on a single process—just one services request.
At a 10,000 ft view, this is all possible if you’ve designed your architecture and data model to support this level of granularity and policy enforcement from the start within the SDLC. But that’s not what usually happens.
In a production reality, you’re left with manual policy enforcement, new policy tables being joined to entity records at runtime in an ad hoc fashion to start to enforce specific constraints. It becomes a Rube Goldberg machine of your users submitting their consent preferences and checking boxes. It looks good in the UX at least, but in the backend it’s a bunch of unscalable manual processes, workflows and runbooks to update users’ preferences and ensure they’re excluded from certain pipeline jobs. It also generates reports for legal teams to ensure that when a user has said “no” to something, their consent preference was enforced across your complex data plane. For a successful business, there’s the possibility that they’ll have to do this thousands, even millions of times.
If these are the symptoms, what’s the cause?
As I wrote earlier, privacy cannot be truly fixed with a bandaid after a system is built. The cause of data privacy failures happen upstream in the SDLC, long before most of us think of privacy.
Like application security, the source of problems and the locus of best prevention is where software is being designed, implemented, and tested. This is where instituting design decisions, implementation practices, and privacy devtools can mitigate privacy risks like the ones we so commonly see.
To put it simply: if privacy was part of the software development process, the example of trying to unpack orphan datasets would never materialize, because code merged in a privacy-centric SDLC will describes its purpose, the types of data it uses, and its data I/O so that you can easily observe across the data plane. You’d have a complete map of the data passing through every piece of code—because of steps like thorough annotation at the code level, you know explicitly the substance, source, and purpose of the data in your systems; every dataset has a home.
An SDLC with privacy baked in results in a system that respects its users rights out of the box without bandaids or manual processes. The system identifies data flows clearly such that data discovery and mapping efforts are redundant; your product stack self-describes its data operations immediately upon deployment.
Treating privacy as an afterthought results in a vast waste of engineering resources in production-related privacy and data compliance operations, which could be obviated with better tooling in the CI/CD pipeline.
A proposed solution to data privacy
The core privacy problem of our age is that systems aren’t built to be respectful from the bottom-up. Some of the reasons for that are cultural, sure, but some are related to the tools at engineering teams’ disposal. Simply put, up to now they haven’t been fit for the task. I’ve been considering this problem at length, and have some ideas for how this problem could be solved.
1. Low-friction privacy dev tools
The first step is ensuring engineers have a form of ‘privacy grammar’ or a shorthand, low-friction way to describe what their code does and what types of data it uses and make this part of their editor and CI process. Code comments and documentation are a good start, but a thorough tooling solution is both human and computer readable, integrated into both manual and automated workflows.
When implementing a given functionality in their project with integrated tools, developers can easily describe the context of what their code is doing and on what types of data it’s acting.
Also, in the same way that security/SAST tools are now an accepted component of a well-built CI/CD process, tools that provide CI hooks that enable the privacy grammar and data classifications to be evaluated on each commit and MR, providing easy to apply nudges to developers on how to ensure their work is privacy respecting.
2. Automated risk evaluation
At least for the foreseeable future, I don’t believe we can or should remove humans from the loop. However, the current process for most companies creates a lot of friction between engineers trying to ship code and the manual privacy review process, while also occupying valuable bandwidth.
Automation in the commit and merge phases can better groom high-risk and low-risk issues for human oversight and remove the need for engineers to completely async privacy review forms. Privacy risk evaluation can be an elegant part of the CI pipeline.
3. A global personal data taxonomy
So much of privacy starts with understanding the data your software uses. One of the most common issues with healthy privacy is agreeing to a standard for the types of data in a system. Too often, data types are created by the teams that use them, leading to a hodgepodge of formats. Legal requirements in different geographies make standardization a challenge but this can be achieved by taking the ISO27701 PIMS standards, specifically ISO/IEC DIS 19944-1, and building a single view of personal data classification that can be modeled by every engineering team.
Using a single standard to classify data across a system, we can simplify interoperability of systems for privacy operations and users’ rights requests as well as achieve uniform inter-system governance.
Today policy and governance enforcement requires extending an existing Role-Based Access Control (RBAC) or Access Control List (ACL) type system to roll your own fine-grained governance solution, and it still typically acts at the abstract dataset or system level.
A fine grained data policy model should be designed from day one to enforce data access at the individual field level for each entity based on policies managed at the entity level. This granularity is the only way to ensure that privacy rights can be truly enforceable. For this to be possible, however, it needs a scale of interoperability, so integration with a standard policy enforcement model like Open Policy Agent is a key step.
5. Make consent part of validation in front-end frameworks
Today consent is generally dealt with through cookie banners or individual flags set at runtime in a form flow. Consent can and should be a part of the validation module of all front-end frameworks. If we already know at runtime what data is being collected (because we’re validating for it) these same validators could be used to check for context. Tools like the fine-grained data policy model described above would permit any engineer to extend these validations to sensibly check for data type and consent.
We already use validation against field types to generate business logic checks for data type, structure, etc. This must be extended to ensure that field labels ascribe whether consent should be collected automatically and provide a central consent tracking service at application runtime so that user consent is unified throughout application usage. These steps make it harder for systems to accidentally fail to register consent choices or communicate a user’s rights.
6. An immutable privacy log
Every application should persist a hashed record of privacy related actions. Such a log would enable engineers to reliably track user rights and privacy activities through application and user lifecycle and demonstrate the decisions made in the privacy CI pipeline.
Trust through transparency
Transparency and privacy might seem strange bedfellows, but they’re actually closely linked. Think back to Dr. Cavoukian’s sixth principle of privacy by design: visibility and transparency. A commitment to building trustworthy systems applies to the very tools building those systems. Then every engineer, even without deep training, can do a better job of baking privacy into their code while also reducing friction between engineering and legal teams.
As software engineers, we build the systems that billions rely upon for their finances, safety, health and wellbeing. Our tools and processes must reflect the seriousness of our responsibility to respect users’ data. A post-hoc approach to privacy is labor-intensive and risky. It is fundamentally reactive, unable to scale with the rapid growth of the systems that power our world.
By equipping engineers with the tools to implement privacy by design within the SDLC, we build better, safer and more respectful systems. On an operational level, we reduce the pain of fixing issues in production, enabling our teams to focus on innovation instead of scrambling to achieve after-the-fact compliance. On a cultural level, we deliver systems that genuinely earn users’ trust, which are the only systems worth building in the 2020s.
For years, engineers left AppSec as an afterthought. Today, it’s almost impossible to think of the typical SDLC without AppSec embedded into it. We can and must follow a parallel approach with privacy.Tags: privacy, sdlc, software development
It is interesting how more things keep being moved to “the outset of development”. Don Norman said that understanding the users should come before any project definition whatsoever. But if everything is important and should be moved to the beginning, don’t we just end up back at Big Design Up Front? It sounds like a generic call to become smarter and more knowledgeable.
Don Norman argues that this goal has become impossible due to how knowledge has expanded in our lifetimes. We have outgrown the brain, and the ability to design automated assistance for it. We brought this on ourselves.
Good point @SR that we’ve brought his on ourselves, however I’d love to help clarify that I’m not proposing that we move everything to a large design sprint pre-implementation in the BDUP way you’re thinking. If you compare to security, even in a secure-SDLC paradigm, it’s impossible to mitigate all risks in upfront design, but what has happened over the last ~10 years is an acceptance that “doing security” at the end of the development process is a missed opportunity to better consider security in technical design, during implementation and throughout the CI process on each commit/PR.
The same is true of privacy, there are many areas we could improve on with consideration in upfront design for sure; there are also insertion points in text editors/IDEs, in Git and during code analysis that would make it easier for every developer to implement basic privacy practises – so back to your point – we have brought this on ourselves but I believe we can make this a far lower friction problem that results in better software by providing better privacy integrations in each phase of the SDLC.
Long live monoliths!
I’ve been waiting for an article like this to appear for a long time. Microservices were supposed to be the next golden hammer, the follow-on architecture from SOA that freed developers from the shackles of constraint. Microservices, yaay! Oh, wait. It all needs to be secure, integrated, privatized, internationalized and coordinated. How do we herd all of these cats?
Unfortunately, by the time the scenario described above has been realized, it’s too late.
I’ve long felt that microservices weren’t suitable as an overall architecture. Microservices do have their place, but like all tools, they must be used where they are appropriate. As the article clearly explains, the problem is coordination. As I see it, the only way you can corral microservices like this is to have a strong project manager with veto authority over the salespeople and the entire development team, someone who will see to it that these things are implemented from the beginning of the project.
Does it take time? Yes. Does it take money? Of course. But would you rather attempt this after you already have 500 microservices in production?
In other news, 500 microservices seems like a lot, even for a very large company. If that’s how companies are doing things today, I suspect they are confusing microservices with classes.
The “strong project manager” you describe sounds way smarter than the average bear. It is like the mythical 10x programmer. (powdered sugar?) Projects are too big these days, you can’t use basic reasoning to handle meta-complexity.
Thanks @Robert – interesting observation on architectural design patterns and approaches and how they affect this.
If I’m honest, from my perspective I’m not sure it’s an either/or. For me, this relates to how you can ensure that whatever architectural model you adopt, you ensure that you can respect the processing of a users data. You’re right, you could argue that a heavily fragmented SOA style product is harder to enforce upon. You could probably equally argue that making changes to enforcement or policy in a monolith could necessitate a level of refactoring that makes it impossible or hard to justify for a company.
If anything the variety of architectural designs a team might make, to me, reinforces why privacy needs to be a design and implementation consideration at the outset; in that way you could assure that whether it’s monolothic or service based you’ll be able to impose changes in policy or data use without having to rip the guts out every time.
My little wisdom is: “don’t expect care from people who use abbreviations in the first sentence”.
Where is the upvote button? Back on SO in 2015.
The “Compliance vs. respect” really hit home for me. It’s frustrating whenever you’re a developer and you **know** the implications of some of the data collected about you. You know what analytics are going on under the hood and being acted on by 3rd party libraries. You know that ad-block and incognito don’t block the fingerprinting and other shenanigans that happening. It’s infuriating how much of a game it becomes where you try so hard to not be tracked and try to spell it out but the product you’re using doesn’t care.
The frustration doesn’t end at the data being collected though. Again, as a developer, I know this is happening. Lay users are often oblivious. It’s the difference between being followed around in a store by a clerk a few feet away writing notes about everything you are and do vs the same thing happening but by a video camera. I feel that if more people actually understood the actionable data being collected about them, and how often it gets shared and misused, they would be more cautious.
I get sad every time I have to adjust uMatrix to visit a site and see 100 different 3rd party libs that I have to enable just to read an article. I get even more sad whenever I get asked to add in 6 different analytics libraries to some application I’m working on because I know that I’m basically contributing to the problem.
When you say “I don’t want to be tracked”, you want that respected. The “compliance” level isn’t something that you should be striving to hit, it’s a level that you should be trying to surpass by miles.
Agree with everything here @Stephen K, such good points.
I believe we are in a transition period though for developers in relation to privacy.
By that I mean today, privacy == compliance for the most part and actually this is not correct. By thinking of it as compliance, it’s in the same frame as SOC2-II and ISO-27001; meaning observing controls and checking boxes to prove that a minimum bar has been achieved in order to avoid a fine/legal issue.
Privacy should not be seen as compliance, in the same way that security is not just compliance. These are fundamentals of how you built durable technology that’s fit for purpose. Because it’s currently ‘compliance’ it’s lead by legal teams, creating friction for developers to check boxes, but we’re quickly moving to a place where privacy is a subject matter for developers, just like security to ensure the code that’s shipped is not just ‘compliant’ and checks some boxes, but performs in a way that is suitable for, and respectful of it’s users.
As this transition happens we’ll go from checking boxes in tasks on Jira to implementing standard libraries the ensure certain functionalities are built into every system.
In America at least everyone has the right to privacy. Until the Government enforces the law, privacy is a joke.