LaunchDarkly is a feature management and experimentation platform that allows you to decouple software feature rollouts from code deployment so you can manage features safely and securely.
Connect with Tom on LinkedIn.
This episode's shoutout goes to user Boris Gorelik, who won a Great Question badge for asking 'Removing handlers from python's logging loggers.'
TRANSCRIPT
[Intro Music]
Ryan Donovan: Hello everyone, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. My name is Ryan Donovan. We're gonna be talking about the most dangerous shortcuts ever taken, and my guest to discuss that is Tom Totenberg, head of release automation for LaunchDarkly. Welcome to the show, Tom.
Tom Totenberg: Ryan, thank you so much for having me, and thank you for being willing to talk about these most dangerous shortcuts. I don't know if it was intentional to make a parallel to the most dangerous game where it's the humans who get hunted here by these shortcuts, but couldn't help but think about the parallel.
Ryan Donovan: So, before we get to our topic, we like to get to know our guests. How did you get into software and technology?
Tom Totenberg: I think I took a bit of a non-traditional path, which is that I got my formal education in music and in being a teacher, which, surprisingly, you know, there's a lot of crossover. There's a good number of people in software who came from, I'll say, creative arts. And I think that there's a lot of crossover in terms of the types of discipline, daily practice, breaking down big, daunting musical pieces into individual notes, right? And then carrying those same sorts of practices into software. So, in my personal life, I've always been a big nerd. I've been building my own gaming computers since I was in high school, and so really that was my entry point into technology. And it turns out that a lot of those same neural pathways that lend themselves to music also lend themselves to tech, and being able to talk about it, and communicate with teams, and work together in an ensemble, whether you're rehearsing in an orchestra, or trying to build complex enterprise software, a lot of the same skillset is needed.
Ryan Donovan: You're in charge of release automation. I'm sure you've seen some sketchy shortcuts. What are the sketchy shortcuts that you've seen?
Tom Totenberg: To be clear on what my role is, LaunchDarkly helps with release automation strategies for our customers, and what I'm helping oversee is partially the product direction and strategy for release automation: how we can standardize the change management process and put observability guardrails in place, so that we know which metrics we care about, and so that as we're releasing and standardizing this practice of exposing something, we have the signals we need to automatically respond in an appropriate way, like pause it, roll it back, that sort of thing. So, my perspective on this comes from digging deep into a lot of different organizations' change management processes, which is where my whole technical background is from, and then also helping them improve that process. Overall, there are a lot of themes that we'll be getting into, but one of the consistent themes I always think about is that some of the best engineers out there are fundamentally lazy, in that they will take the path of least resistance. You know, if you are measuring something, they will try to gamify that measurement, and as long as the reports look good, as long as it's easy and they're not getting bothered, they're gonna take whatever shortcut they want. So, yeah. Happy to get into specifics there. But it's like water, right? It's gonna flow wherever it's easiest to go.
Ryan Donovan: I think there's a possibly apocryphal quote attributed to Bill Gates that if you want something done efficiently, give it to a lazy man.
Tom Totenberg: A hundred percent. And I appreciate that too. You know, there's the sort of thing where if you can save yourself five seconds of annoying clicks with an hour of upfront work, honestly, that's a great quality, 'cause that pays off forever. You know, as long as you don't go too far overboard with it.
Ryan Donovan: Right. And I think that too far overboard is what we're getting into today. So, the shortcuts that you see, where do they happen most of the time?
Tom Totenberg: Part of the fundamental reason, in addition to the structure surrounding the people who are taking the shortcuts, is the business pressure. There's a trend: we want to go faster, and faster, and faster. It's not good enough to release twice a year. We have to release either on an agile two-week cadence or even, you know, continuous deployment, continuous delivery, continuous release. All of this business pressure then puts pressure on the technical folks who are supposed to deliver value to their end customers in some way. And as long as we can keep that in mind, a lot of the time some of the technical details start to get missed. An example of that, which I'd say will be exacerbated in the coming years, is stuff like home-brewed tooling. This sort of duct tape connector that wasn't necessarily intended to be load bearing, but someone had a solution that they made and then all of a sudden one of their teammates said, 'uh oh. Hey, I think I can use that.' And then someone else said, 'oh, I can use that too.' And all of a sudden, this little duct tape connector is something that multiple people are using, and they get away with it until it reaches some sort of scale, causes some sort of incident, and 'bang,' now you're in trouble. So, a concrete example of that is stuff like configuration management. People have recognized the value of decoupling deployment, which is the technical act of putting code into production, or wherever you're deploying it, from release, which is the business decision to expose new features and functionality to end users. Those are two separate things, deployment versus release. And so, I have seen so many weird, cobbled-together, arcane, strange configuration utilities out there that operate at runtime and allow you to deploy darkly, but then, oh my gosh, if anything goes off the rails, it's an incident. It's so tough to navigate and figure out the underlying root cause. So, that's just one example. There are a lot of those sorts of things.
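To make that deployment-versus-release distinction concrete, here is a minimal sketch of the kind of runtime flag check Tom is describing, the thing a feature management platform formalizes and whose home-brewed version tends to become load-bearing duct tape. Everything here is an illustrative assumption: FlagStore, the flags.json file, and the checkout functions are hypothetical stand-ins, not any vendor's API.

```python
# Minimal sketch of decoupling deployment from release with a runtime flag.
# FlagStore, flags.json, and the checkout functions are hypothetical.
import json
from pathlib import Path


class FlagStore:
    """Reads flag state from a config source that can change after deploy."""

    def __init__(self, path: str):
        self._path = Path(path)

    def is_enabled(self, flag_name: str, default: bool = False) -> bool:
        try:
            flags = json.loads(self._path.read_text())
            return bool(flags.get(flag_name, default))
        except (OSError, json.JSONDecodeError):
            # If the flag source is unreachable or malformed, fail safe.
            return default


def legacy_checkout(cart: list) -> str:
    return f"processed {len(cart)} items with the existing flow"


def new_checkout(cart: list) -> str:
    return f"processed {len(cart)} items with the new flow"


flags = FlagStore("flags.json")


def checkout(cart: list) -> str:
    # The new code path is already deployed; it is only *released* to users
    # when someone flips "new-checkout-flow" to true in the flag source.
    if flags.is_enabled("new-checkout-flow"):
        return new_checkout(cart)
    return legacy_checkout(cart)
```

Because the flag is read at runtime, flipping it changes what users see without another deployment, which is also why a change to that flag source is still a production change and deserves the same scrutiny.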
Ryan Donovan: With vibe coding, we're about to enter a golden age of home-brewed tooling.
Tom Totenberg: Oh, absolutely. Yeah. The cost of entry, the barrier to entry has never been lower, but as with everything, the same sorts of shortcuts exist, and notably, if you're just handing over the vibes to whatever agent you want to code this out, then you are personally more disconnected from the actual code that's making its way out to production, or that you are relying on to safely deliver software. So, everything gets slower as a result.
Ryan Donovan: It is a sort of secretly important thing, 'cause that's gonna prevent you from leaking API keys, leaking secrets, leaking a whole bunch of stuff into production, right?
Tom Totenberg: Oh, sure. Or even just think about the AWS outage that, I'm sorry for anybody still listening for whom that's in recent memory, but that was a DNS config change, right? And so, often what we don't think about is that a lot of these connectors can bypass our otherwise secure SDLC, right? There's governance and reviews and everything put into place, but then if you've got a configuration management tool that's circumventing that, that's still a change. Maybe it doesn't have those same controls, the same scrutiny, the same sets of eyes, right?
Ryan Donovan: Right. Does that go through your build process, your test process?
Tom Totenberg: When you've got the duct tape solution, it's not necessarily there anymore. An even more egregious example I saw pretty recently: by working with customers, I get to see a wide variety of approaches to AI code generation. Some of them are cautious and say, 'no, nothing in production.' And some of them say, 'hey, yeah, pop it through the SDLC. We've already got the reviews in place, let's see what happens.' So, something that I saw at a customer was that they didn't modify their peer review process. So Ryan, you could go tell Claude, 'hey, go generate some code, work on this thing,' and then, great, it's gonna get a peer review, and you can be the reviewer. You tell the AI to create something, it creates the code, you review it, and no other human ever sees it. That circumvents the whole point of the review. So even in a forward-thinking organization, the fix is saying, okay, if it's AI-generated code, we need at least two humans, because that makes sure one person isn't the one who shoves something out the door with the assistance of Claude, or Gemini, or whatever LLM you're relying on.
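As a rough illustration of that 'two humans for AI-generated code' policy, here is a sketch of what such a merge gate might look like as a CI check. The data shapes (labels, review records, bot author names) are assumptions; in practice you would pull them from your code host's API.

```python
# Sketch of the "two humans for AI-generated code" policy as a merge gate.
# Labels, review records, and bot author names are hypothetical inputs.

def can_merge(pr_labels: list[str], reviews: list[dict], bot_authors: set[str]) -> bool:
    """Require two distinct human approvals when a PR is AI-generated."""
    human_approvals = {
        r["author"]
        for r in reviews
        if r["state"] == "APPROVED" and r["author"] not in bot_authors
    }
    required = 2 if "ai-generated" in pr_labels else 1
    return len(human_approvals) >= required


# Example: one human approved an AI-generated PR, so the gate blocks it.
print(can_merge(
    pr_labels=["ai-generated"],
    reviews=[{"author": "ryan", "state": "APPROVED"}],
    bot_authors={"claude-bot"},
))  # False
```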
Ryan Donovan: Yeah, yeah. I mean, I think people are gonna be rediscovering the importance of code review. With a lot of small stuff, there's rubber stamping, just get this through. But if it's AI generated, it's like, 'well, what is this?' Let's get a couple people to look at this.
Tom Totenberg: Right. And well, there's of course the old joke that if you want your PR to fail, make it 10 lines long; if you want it to succeed, make it 10,000 lines long, 'cause nobody's gonna review that. As we know, AI is really good at generating a lot of lines of code, and so that sort of shortcut, that fundamental laziness that we were talking about – if it's taken too far and you start saying, 'nah, looks good to me. Works on my machine. You know, just ship it. Go right ahead.' No amount of automated testing is going to catch every edge case, or think as deeply about the end user flow, and the problem space, and the actual value that you're trying to deliver to the end customers, like, why are we building this thing? None of that is gonna make its way into the context with the same amount of depth as a traditional human developer, at least not yet.
Ryan Donovan: You talk about shortcuts. What is the difference, if any, between a shortcut and tech debt?
Tom Totenberg: So, tech debt, I think, is a result of shortcuts, right? Shortcuts can absolutely lead to tech debt, and tech debt comes in multiple forms, right? Whether this is something like wrapping code in some extra debugging observability, wrapping some code in a feature flag, or tech debt can also absolutely be the concept of someone trying to get the MVP out the door without thinking about the underlying fundamental platform that maybe should support that feature. So, let's say you're coming up with a new workflow platform or something like that, right? Okay, great. We have some really easy things like, I need to go from phase one to phase two, but did you create the APIs? Did you make this extensible? Have you thought about eventually putting in branching logic? These are the sorts of things that might happen in the future, and if we are not taking the time, if we are just vibing out, or just taking the shortcut as a human and saying, 'yeah, I'll get the MVP out the door without really considering where this could grow in the future,' that absolutely is tech debt, because it means that you're going to cause yourself and your team more work down the road to refactor everything that you did, which is so much tougher to unwind later, rather than just building it right in the first place.
Ryan Donovan: You know, it sounds like a lot of these are sort of pressures from the business, right? Like: move fast, get it done, get an MVP out the door. And I'm sure every engineer listening has experienced that, right? Do you have any recommendations for pushing back against that?
Tom Totenberg: When we think about the SDLC, a lot of emphasis tends to get placed on, you know, what's our lead time between when we get a feature request and when we're able to build something, right? And our automated tests, are they succeeding? Do we have any sort of flakiness? How long does it take us to deploy or redeploy? But what tends to get missed in that conversation is the planning phase. Now, say what you want about waterfall methodology here, but one thing it has going for it is that there's a lot of really intentional, very thorough, comprehensive planning. I'm not saying that we need to go back to the 90s and early 2000s and, you know, create these massive waterfall Gantt charts or anything. But what we should be doing is taking the time upfront to define not just what is this individual thing that we are building, but where does this fit into the overall picture? At least do that so that we're not doing things like duplicating functionality that another team may have already built. Is this something that could be reused? Great. That is time well spent. If it takes you one and a half times as long to build this in an extensible way, that's better, from a business perspective, than the 2x time it would've taken to build it twice. The same idea then applies to the feature itself: knowing upfront how we are going to define success and how we are going to define failure. So, what is the risk of failure? What types of failure do we care about? Is it things like performance degradation, or is it business conversions that are gonna go way down? If you're processing payments, you gotta protect that core flow. You can't introduce an API or something that could introduce failures or massive latency, anything like that. So, we need to be able to define the failure modes, but we also should be able to define the success criteria of these individual features. And that, in my mind, is something that should happen during the planning phase. So, we not only have this clear picture of where it fits in, we also have a clear picture at the feature level of what the failure modes are and what the success criteria are. And so, as we're building this, we're thinking about those eventual pathways and those eventual signals and how they build up into the greater picture.
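One way to act on that is to write the success criteria and failure modes down as data during planning, so the release automation can read them later. A small sketch, where the feature name, metric names, and thresholds are purely illustrative assumptions:

```python
# A planning-time artifact: success criteria and failure modes captured as
# data, so rollout tooling can check them automatically. The feature name,
# metrics, and thresholds below are illustrative assumptions.

release_plan = {
    "feature": "new-payment-routing",
    "success_criteria": {
        # What "good" looks like, defined before any code ships.
        "checkout_conversion_rate": {"min_relative_to_control": 0.98},
    },
    "failure_modes": {
        # What "bad" looks like, and therefore what we must monitor.
        "p95_latency_ms": {"max": 400},
        "payment_error_rate": {"max": 0.002},
    },
    "on_failure": "rollback",  # e.g. pause | rollback | page-on-call
}


def breaches(observed: dict) -> list[str]:
    """Return the names of any failure-mode metrics outside their limits."""
    return [
        name
        for name, limit in release_plan["failure_modes"].items()
        if observed.get(name, 0) > limit["max"]
    ]


# Example: latency is fine, but the error rate is over its limit.
print(breaches({"p95_latency_ms": 250, "payment_error_rate": 0.01}))
# ['payment_error_rate']
```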
Ryan Donovan: I haven't heard anybody rep waterfall in a while. I like the focus on planning, and something I've been thinking about is why is it so hard to define what good is? Why is it so hard to define what the success metrics are?
Tom Totenberg: A couple reasons, and this is a big shift in the industry. Reason number one is that we have made a big push over the past 10, maybe even 15 years at this point, toward smaller, more nimble, agile teams that have complete ownership. The complete ownership part, great. I love the idea of flattening responsibilities, and this is one area where AI is 100% changing people's job descriptions over the next few years: making sure that you've got a more comprehensive view of the impact of what you're building. You don't just get to build and not consider the impact. You don't just get to test without thinking about security considerations and only think about functional testing. So, this sort of flattening is great. But the idea of small, nimble teams means sometimes you end up shipping your organization chart, which is, if you are on a small team, you're going to ship small solutions, you're going to answer small problems, and then sometimes you conflict with another team, or you had two different ideas of how to build parallel things. That is where, suddenly, more tech debt is going to come into play, because there is less emphasis on this big overarching plan. The pendulum swings back sometimes; maybe we overcorrected in a lot of institutions toward these small, nimble teams. We have to have some clear, top-down direction. Even if that means sacrificing a little bit of that individual ownership, it means that we're coordinating better as a team. There is a balance there. There's no one right answer for any given solution, or platform, or whatever it is that you're building, but you've gotta find that right balance and make sure there is this clear North Star coordinating all of this. That is what enables you to continuously move fast and not eventually run into all the annoying stuff like, 'oh crap, they built a similar thing over here as we did, and now we gotta go back to the drawing board.' Right?
Ryan Donovan: I've heard of microservices being primarily an organizational feature and not an engineering feature. I remember, you know, at my last job, the first time I saw a microservices chart of everything, the 100 or so services doing things, it was like, there's a lot of services here that don't do anything anymore. Who owns the whole of this?
Tom Totenberg: Exactly. Yeah. And a controversial thing that happened: Elon Musk took over Twitter and then, you know, there were all sorts of instability issues because he fired a bunch of people and said, 'slash all of these other services.' Honestly, I've never had a Twitter account. I don't actually know, but the website is still going, and I have no idea what the operational costs look like. But there was some element of him being able to say, 'you know what, that doesn't seem useful.' And sure, maybe people weren't able to authenticate for a day, and they scrambled. Again, I am speaking from a place of ignorance on this, but it seems like they slashed a bunch of stuff, and that's not entirely a bad thing. The other example that I always think about is, this would've been the early 2000s, the famous Jeff Bezos mandate saying that everything needs to be a platform. Everything needs to have clearly defined inputs and outputs. And this was basically just an online bookstore at that point, right? Barely a general retailer. But that sort of focus on interoperability, on clearly defined inputs and outputs, on clear measurements, on making sure that as an organization we know how things fit together and we're not just building something in isolation only for our own team, that sort of focus has real long-term benefits. I mean, it eventually became AWS, right? And we all know that.
Ryan Donovan: It's interesting 'cause I think it's built for scale, right? It lets you replace things very easily. I saw a video a while back saying that almost every large software company has rebuilt their software from the ground up to get 10-15% performance increases. At some point, it pays to plan ahead and say, 'we're gonna make each of these very easy to kill.'
Tom Totenberg: Not just easy to kill, but easy to switch. I think a pretty good recent example of this is OpenTelemetry. If you're not too familiar, OpenTelemetry is a sort of open, standard framework and language around observability: metrics, errors, logs, traces. All of this is now a standard that most observability providers have adopted, right? They can all speak OTel. So something we could do as we're defining our observability strategy is ask: should we just use the native methods, or can we spend a little bit of time upfront and build a wrapper that will allow us to shift, that will allow us to change strategies as new methods are introduced and new vendors come onto the scene? That will let us plug them in if they're filling functional gaps in our current observability. Make sure you've got your own internal platform around whatever it is, in this case observability, so that you can be nimble in how you depend on these external providers.
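Here is a rough sketch of that wrapper idea in Python, using the vendor-neutral opentelemetry-api package. The thin Telemetry facade and its method names are assumptions for illustration, not part of OpenTelemetry; the point is that application code calls our own interface, so swapping exporters or vendors later is a configuration change rather than a rewrite.

```python
# A thin internal facade over OpenTelemetry, so application code depends on
# our own interface rather than directly on any one vendor's client library.
# The Telemetry class and its method names are illustrative; only the
# opentelemetry-api calls underneath are the actual standard.
from opentelemetry import metrics, trace


class Telemetry:
    def __init__(self, component: str):
        self._tracer = trace.get_tracer(component)
        self._meter = metrics.get_meter(component)
        self._events = self._meter.create_counter(
            "app_events_total", description="Count of named application events"
        )

    def span(self, name: str):
        """Context manager wrapping a traced unit of work."""
        return self._tracer.start_as_current_span(name)

    def count(self, event: str, **attrs):
        """Increment a counter, tagged with the event name and attributes."""
        self._events.add(1, {"event": event, **attrs})


telemetry = Telemetry("checkout-service")


def handle_request(order_id: str):
    with telemetry.span("handle_request"):
        telemetry.count("request_handled", region="us-east-1")
```

If a new backend or vendor comes along, only the exporter configuration (or at worst this one facade) changes, and the call sites stay put.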
Ryan Donovan: Again, that's another feature, another bit of engineering work that isn't immediate, that isn't serving the MVP. I wonder if you've seen instances where people were successful at making that 'go slow to go fast' argument, other than it coming from the top down, like you talked about with Bezos.
Tom Totenberg: There are a couple of interesting patterns that tend to happen. Number one, you can have sort of central, top-down buy-in, and then there might be a platform team, right? Central ownership. I have absolutely seen success with that. I'm thinking of a large banking, financial, retail-investing sort of customer that I've worked with, surprisingly tech forward, where they recognized, 'okay, we've got thousands of applications, we've done mergers, we've gotta try to onboard people, and there is no one-size-fits-all process. However, we are going to have a golden path, a supported set of tools, and techniques, and concepts, and a central COE, a Center of Excellence, that will actually be able to support this.' Then you can take it and configure it, or customize it, based on this golden path, but there are some standard metrics that everybody gets measured on just to make sure everybody's performing well. For that sort of scale, that sort of size, it is a clear winner to standardize and go slow to go fast, because it's a force multiplier on everybody else who's using it, right? And they have the mandate, and the time and space and resources, to actually do the research and implement some of this stuff. If you are a smaller team, or if you are an individual tiny branch of a mega-corporation conglomerate, you don't have a budget, you don't have funding, you just have your tickets and you gotta work on them, then the other pattern that I've seen emerge is this sort of grassroots groundswell of common practices. The other example I'm thinking of is in the health insurance space, where I have seen, even through lots of different reorgs, and shifts, and restructurings, and whatever other nice words there are for layoffs, the opposite pattern, which is that eventually some common practices and standards tend to emerge. It still takes someone sufficiently high up to realize, 'oh, hey, they're all doing this. They all say that this is the right thing to do. It's time. Now we should actually invest in this.' And that is similar to what I was talking about with the shortcuts before, which is that at this point you might have common practices, but different vendors, different processes, different accounts that would need to be migrated or merged together, right? That sort of thing. And so, that's a larger scale of the refactoring that we talked about a little bit earlier, but you can eventually still get there. Eventually, there does have to be centralized ownership, centralized maintenance for this sort of observability platform, or release management approach, deployment approach, testing approach, all of this sort of tooling. There are economies of scale there, but how we get there can be a little different: top-down, or groundswell going up.
Ryan Donovan: It's almost like there's this folk engineering going on, where everybody at the bottom is just sort of piecing together their little parts to make the thing, and somebody has to put together the canonical version.
Tom Totenberg: You know, that could be a hackathon, and a principal engineer says, 'you know what? I'm gonna solve this problem. I'm going for distinguished this year, this is my pet project.' And that could be a thing. Or it could just be someone from management and leadership who recognizes, 'oh, we better do this right.' You know? To avoid that duct taping that we talked about earlier.
Ryan Donovan: That is one of the benefits of being at the top of the food chain, right? I used to call it the 'executive laser beam,' right? That's gonna make things move.
Tom Totenberg: It could also be personnel changes in leadership, too. I've seen a lot of times a leader will come on board and say, 'you know what? We did this at the last place. I'm gonna make a big splash, a big impact in my first year. Let's onboard exactly the same setup as we had at the last place.' Consequences be damned at the new place, where it might not make as much sense. But yeah, so for better or for worse, that laser beam, it has a big impact. Right?
Ryan Donovan: That's interesting. We talked about trying to convince folks to 'move slow to move fast,' but how do you then resist changes from somebody coming in with a wrecking ball and saying, my way or the highway?
Tom Totenberg: This is where I think flattening the responsibilities helps, and to be clear, by 'flattening' what I mean is, yes, you can still have specialization, you can be narrow and deep in a particular area in which you have ownership and expertise and time and can oversee this. But the expectation, as businesses evolve and become more AI-friendly, which is the trend, this will happen, this is the next big wave, right? Like it or not, these are the skills that we're gonna have to learn. So, to answer your question, everybody should have a good understanding of the impact of their work and the impact of their processes. There's a concept called 'value stream management' out there, and if you're not familiar, there's a really evocative image that I have in my brain from a book that I read: I think it was a car manufacturer in Germany or something like that. They were talking about how everybody, if you're sweeping the floors, if you're in accounting, if you're doing maintenance on the machines, if you're in security, whatever it is, everybody can see the product line of the cars that are going out the door. Everybody knows this is why we are here. If I'm in HR, cool. Ultimately, the business is here to deliver cars to our end customers. That is the value, that is what we are delivering here. So, when you get the question of, hey, this leader is coming in and they want to take a wrecking ball to the safe, stable thing that has been working, if you can provide business metrics and say, 'you know what, we currently have the data, we have a good signal that what this is doing is maintaining great uptime and MTTR, our change management is going smoothly, and we're able to release quickly according to these industry benchmarks,' then you have something to stand on. Sometimes you take a look at yourself and you're like, 'how did I end up here? Why do I care about business metrics? Who have I become? What happened to the child who turned into me?' But it's important to be able to protect your day in and day out, including from leadership, because you know what is always a defensible position? Protecting your customers, protecting your end users. So, then if they come in and say, 'we want to change this,' you can start to have a leadership-level conversation and say, 'hey, this is my concern. If you change it, this could degrade these metrics we're currently measuring. We don't want to degrade that. Let's talk about it.' Then you can have an honest conversation rather than just rolling over and saying, 'but I don't wanna.' That's not a compelling argument.
Ryan Donovan: Yeah. In the initial pitch, you talked about creating a sort of sustainable release process, right? Getting to a point where things are good and can continue that way. But as we've seen, a lot of the business pressure is toward getting more. How do you argue for your sustainable release process against the forces of more?
Tom Totenberg: Well, that is where all of the smart automation and surrounding processes need to be built up, right? This is where 'go slow to go fast' really meets the road, because it's a bigger version of setting up a macro on your computer to type a common string, so you don't have to retype it every time. It's macros, password managers, that sort of thing: sure, we gotta set it up once, but then we can use it forever. For release processes, it's exactly the same thing. We should be able to have an honest internal conversation about what the various categories of releases are that we have. What are the different risk levels that we need to consider? What are the different architectural areas and the responsible teams overseeing them? Then, how can we come up with automation practices, not just in terms of how we deliver the software: who's in wave one, who's in wave two? Are we doing, you know, blue-greens? Are we doing waved canary rollouts? For each of these different categories of releases, we should have not just the control strategy, but the measurement strategy. How do you actually know? How will you confirm that this thing is going well? And there are likely some pretty good reusable metrics. This is actually one of the reasons I was talking about a centralized observability strategy before. There are high-level things that you care about, like is the overall system actually up or not: we've got status pages, and we should be able to see whether that microservice is responsive. We should be able to reuse these, but they're not gonna be the same signals for every single change category, right? So, the control and the measurement together form this concept of automation, which is: via control, we should be able to determine, based on the qualities of the change, which path it's going down. Is this a fast, aggressive, non-breaking change? Is it something cosmetic, like we're updating some wording on the front end? Oh, okay, cool. Whatever. Or is this something that's a lot riskier because it's a change to a data schema, or a change to something that involves PII, or we're introducing new APIs? Those sorts of changes, which are riskier not just technologically but to the business, should go through a slower, more controlled exposure and release process. So, why shouldn't we be able to release to 1% of our beta users, and then measure the effect that it has on that 1% of the beta users? That requires some correlation between who's exposed and how that change is actually impacting the people who are exposed to it. There's a lot of interesting stuff you can do there based on that sort of release strategy to make sure that, again, the blast radius is contained and you are not getting woken up at 2:00 AM, which is something we've all maybe done once and hope to never do again. And so, this automation really is those two concepts, the way that I think about it: control on the front end, measurement on the back end, so that we're not just relying on pre-prod automated tests, but can actually get the signals we need to confirm that this is doing the job that's expected of it and is not introducing regressions, so that we can confirm the quality and really validate what's going out the door.
That's really what I mean by this standardization. It takes a little bit of setup, but then look at all these paved paths we can go down, with the guardrails already up, like the bumpers in the bowling alley. Thinking conceptually rather than about specific practices and specific tooling is, I think, going to be more important than ever. Things are changing so fast, we all know that. So, being able to take a step back, think about the concepts, think about the first principles of all this, is how we all survive the upcoming AI apocalypse, or whatever we wanna call it.
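As a back-of-the-napkin sketch of that control-plus-measurement loop, here is what routing a change by risk category and gating each exposure step on its metrics might look like. The categories, thresholds, and helper functions are all illustrative assumptions, not a real implementation.

```python
# Sketch of "control on the front end, measurement on the back end":
# classify a change by risk, pick a rollout path, and gate each exposure
# step on the metrics defined for that category. Every category, threshold,
# and helper here is an illustrative assumption.

ROLLOUT_PLANS = {
    "cosmetic": {"steps": [100], "guard": None},
    "standard": {"steps": [1, 10, 50, 100],
                 "guard": {"error_rate": 0.01, "p95_latency_ms": 800}},
    "high_risk": {"steps": [1, 5, 25, 100],
                  "guard": {"error_rate": 0.001, "p95_latency_ms": 400}},
}


def classify(change: dict) -> str:
    if change.get("touches_pii") or change.get("schema_change"):
        return "high_risk"
    if change.get("frontend_copy_only"):
        return "cosmetic"
    return "standard"


def within_guardrails(observed: dict, guard: dict | None) -> bool:
    if guard is None:
        return True
    return all(observed.get(metric, 0) <= limit for metric, limit in guard.items())


def expose(change: dict, percent: int) -> None:
    # Stand-in for flipping a flag or adjusting a traffic split.
    print(f"exposing {change['name']} to {percent}% of users")


def roll_out(change: dict, read_metrics) -> str:
    plan = ROLLOUT_PLANS[classify(change)]
    for pct in plan["steps"]:
        expose(change, pct)
        if not within_guardrails(read_metrics(pct), plan["guard"]):
            expose(change, 0)  # automatic rollback
            return f"rolled back at {pct}%"
    return "fully released"


# Example: a schema change whose error rate spikes at 5% exposure.
result = roll_out(
    {"name": "orders-v2-schema", "schema_change": True},
    read_metrics=lambda pct: {"error_rate": 0.0005 if pct < 5 else 0.02},
)
print(result)  # rolled back at 5%
```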
Ryan Donovan: Well, it's that time of the show again where we shout out somebody who came onto Stack Overflow, dropped some knowledge, shared some curiosity, and earned themselves a badge. Today, we're shouting out the winner of a Great Question Badge, somebody who asked a question that earned a hundred or more points. So, congrats to Boris Gorelik for asking, 'Removing handlers from Python's logging loggers.' So, if you're curious about that, we'll have it in the show notes. My name is Ryan Donovan. I edit the blog, host the podcast here at Stack Overflow. If you have comments, concerns, topics to cover, please email me at podcast@stackoverflow.com. And if you wanna reach out to me directly, you can find me on LinkedIn.
Tom Totenberg: Ryan, thank you so much again for having me. I'm Tom Totenberg from LaunchDarkly. You can find us at launchdarkly.com, and we're also, of course, on LinkedIn, and Twitter, and YouTube, you name it. So, come find us and say hi. Thanks again for having me.
Ryan Donovan: All right. Thank you for listening, everyone, and we'll talk to you next time.
