This episode was recorded live at AWS re:Invent. Listen to our other episodes from the floor with the Stack Overflow team and Corey Quinn.
Keep up with the latest AWS updates, including what they’re doing with AI, at their site.
Connect with David on LinkedIn and Twitter.
TRANSCRIPT
[Intro Music]
Ryan Donovan: Hello, everyone, and welcome to the Stack Overflow podcast, a place to talk all things software and technology. I am your humble host, Ryan Donovan, and today we are recording from AWS re:Invent. My guest is no stranger to AWS – a longtime builder of the cloud platform, David Yanacek, Senior Principal Engineer at AWS. So, welcome to the show, David.
David Yanacek: Thanks for having me.
Ryan Donovan: Of course. So, before we get started, we'd like to get to know our guests. How did you get involved in software and technology?
David Yanacek: I got involved just thinking that it was pretty cool that you could write a program to quickly get something to happen that was actually useful. And so, I was always chasing things that were making people's lives easier. Wherever I was, I was thinking, 'oh, how could I help somebody do that thing more simply?' In high school, I worked at a bank as a teller, you know, doing transactions for those who maybe haven't visited a bank in a while. I used to do deposits, and withdrawals, and everything like that.
Ryan Donovan: Right. The real person teller.
David Yanacek: Yes, that's right. And in doing so, I saw a lot of people working around me where I thought, I could actually make your job easier. I could help you with that tedious task that you don't like working on, like sending out schedules for everyone. And I just found, you know, I could make a quick little application, something that would be easier for you to maintain over time – not a program that you're not gonna understand, but I could wire up Excel with a ton of VLOOKUPs to make schedules that will just save you all the time of having to fax schedules around to people. And so, I just became very excited by the amounts of happiness and relief I could bring other people by making their lives easier, removing the tedious stuff.
Ryan Donovan: Nice. And then, obviously about 20 years ago, you got started at a little startup called Amazon.
David Yanacek: That's right. I thought it was a really fascinating place with a ton of scale and data problems to solve. I found it was really interesting, 'cause before I started, Amazon SQS, the Simple Queue Service, was already out there in a beta, and I thought that was very interesting: okay, I've been writing these applications to help people, but they're always needing to operate something. I can give them a solution, but that also comes with the tax of always having to operate that thing, keep it running. It has an underlying database or server that it's running on that needs to be upgraded and everything. I saw, immediately, the kind of toil that would come from something like that. Oh, it helps you, but at what cost? So, I saw this SQS idea, this Simple Queue Service, in beta out there. And I was just fascinated that we could provide some kind of a building block, a low-level building block, as a queue. Why do I need that as a building block? It's like, 'well, actually, I'd otherwise have to do a bunch of operations, have a server running all the time. That's hard to deal with.' And so, if instead I can just call an API – an easy shortcut – I found that really interesting. And so, then I joined Amazon to just be a part of that and make everybody's lives as easy as possible.
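The appeal David describes – a queue as an API call instead of a server you have to operate – comes down to a few simple semantics: send a message, receive it with a visibility timeout so two consumers don't process it at once, and delete it only after successful processing. Here's a toy in-memory sketch of those semantics; the class and method names are illustrative, not the real SQS API.

```python
import time
import uuid

class ToyQueue:
    """Toy illustration of SQS-style semantics: send, receive
    (with a visibility timeout), and delete after processing."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # receipt handle -> (body, invisible_until)

    def send_message(self, body):
        handle = str(uuid.uuid4())
        self._messages[handle] = (body, 0.0)
        return handle

    def receive_message(self):
        now = time.monotonic()
        for handle, (body, invisible_until) in self._messages.items():
            if invisible_until <= now:
                # Hide the message so other consumers don't also get it.
                self._messages[handle] = (body, now + self.visibility_timeout)
                return handle, body
        return None

    def delete_message(self, handle):
        # A consumer deletes only after successfully processing, so a
        # crash mid-processing means redelivery after the timeout.
        self._messages.pop(handle, None)

q = ToyQueue()
q.send_message("resize-image-42")
handle, body = q.receive_message()
q.delete_message(handle)
```

The delete-after-processing step is what makes the queue a reliable hand-off: if the consumer dies, the message simply reappears after the visibility timeout instead of being lost.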
Ryan Donovan: Sure. So, I want to fact-check a story I heard about the origin of AWS, possibly apocryphal. I heard it secondhand from some other engineers. Amazon Web Services – did it get started because there was a huge Black Friday capacity ramp-up, and then the rest of the year there was no use for that capacity, and they decided, 'let's productize this'? Is that true?
David Yanacek: I think there's a ton of factors to what led to AWS being formed. We had a lot of experience in automating server operations. We had built a bunch of tools and realized, hey, this is actually something that, when you have teams structured to own all of the development and operations like we do with DevOps at AWS, there are so many things that they're responsible for. We realized we were building a bunch of tools for ourselves, and with that expertise of having built tools for ourselves, we said, 'well, we could actually build those tools for everybody, too.'
Ryan Donovan: Right.
David Yanacek: I mean, it is also true that scaling for peak was an interesting thing. I actually started on Amazon.com when I joined Amazon for a couple years, and I was responsible for figuring out how many web servers we would actually need to physically land and get all wired up and everything. That peak prediction calculation is extremely stressful with nearly no reward, 'cause if you just choose too many, and buy too many, then why did you waste our money buying too many? And if you buy too few, well, that's a huge problem.
Ryan Donovan: Right, right.
David Yanacek: So, I scrambled to figure out how to get everything more efficient in time. So, that peak provisioning thing is a real problem dear to my heart, because I was having to do a bunch of automation around that forecasting for so many different sites. The good news is there was so much growth at the time—and remains—that if you do overbuy, then there's always some other use for those things, so.
Ryan Donovan: Right. That was the better place to miss. Did you come up with better heuristics or algorithms to manage that peak estimation?
David Yanacek: There were a bunch of smart people who knew a lot more math than me. When I showed up, I was handed a book on forecasting 101. So no, I didn't do anything particularly smart, but I figured out how to bring in those different tools and run a bunch of experiments on the data, and it was a pretty interesting project to be able to learn a lot more about forecasting and to build systems to automate it.
Ryan Donovan: Yeah. You mentioned things like SQS, and building tools for yourselves for automating servers. What were the other pieces that you all built that were fundamental to AWS as an eventual product?
David Yanacek: Well, I mean, AWS is an ongoing project, so really, there's no point in time where we thought, 'oh, this is actually what we need.' Right?
Ryan Donovan: Well, I mean, there was an MVP, right? There was the first time where you're like, 'we're gonna get a customer, we're gonna turn the lights on. Here's all the things that make it.'
David Yanacek: Compute and storage. Storing things durably, securely – it's really hard. So, that's why S3, the Simple Storage Service, was one of the first ones: it solves a discrete problem of, 'how do I store all of my objects, my files, or whatever, and be able to access them whenever I need them?'
Ryan Donovan: And reliably and consistently.
David Yanacek: At any scale. Exactly. You know, when it comes to figuring out which services to build and everything to go after, it's something that I built my career around, and so many of us at AWS just have this singular focus: how can I make every developer's life easier? And we look at people having problems storing things. Databases are actually the one that I resonated with most, I think. When you have a problem that you're solving – an automation thing you're building, or a product you're making for customers – you need a database. You need indexed data so that you can quickly retrieve things, and store things that are queryable by the things that you need to query.
Ryan Donovan: For customer data, right?
David Yanacek: Yeah. For anything you're storing. If you're storing something about customers where they need to retrieve data, then yes, you need to index that data.
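The point about needing indexed, queryable data can be made concrete with a minimal item store in the spirit of a DynamoDB table: items retrievable directly by a partition key, plus a secondary index so another attribute is also queryable without a full scan. This is a toy sketch; the names are illustrative, not the DynamoDB API.

```python
class ToyTable:
    """Toy item store: primary lookup by partition key, plus a
    secondary index so another attribute is also queryable."""

    def __init__(self, key_attr, index_attr):
        self.key_attr = key_attr
        self.index_attr = index_attr
        self._items = {}   # partition key -> item
        self._index = {}   # indexed attribute value -> set of keys

    def put_item(self, item):
        key = item[self.key_attr]
        self._items[key] = item
        self._index.setdefault(item[self.index_attr], set()).add(key)

    def get_item(self, key):
        return self._items.get(key)

    def query_index(self, value):
        # O(1) lookup on the secondary attribute instead of scanning
        # every item -- this is what "indexing your data" buys you.
        return [self._items[k] for k in self._index.get(value, ())]

# Illustrative access patterns: fetch one order, or all of a customer's.
orders = ToyTable(key_attr="order_id", index_attr="customer_id")
orders.put_item({"order_id": "o-1", "customer_id": "c-9", "total": 40})
orders.put_item({"order_id": "o-2", "customer_id": "c-9", "total": 15})
```

Designing the table around the queries you actually need – rather than querying around whatever table you have – is the access-pattern-first mindset behind key-value stores like DynamoDB.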
Ryan Donovan: Yeah.
David Yanacek: Databases are a really difficult problem, though, because scaling databases used to be notoriously difficult. Keeping them patched was difficult. We found that with earlier database models, you really had to have a whole team of people who were specialized in running a database. And so, when we were building tools internally, if you were building a tool and needed a database for it, you had to hopefully make friends with somebody who could run a database for you. But then, we ran our databases ourselves – 'oh, how hard can it be to run MySQL, or something like that?'
Ryan Donovan: Yeah, I mean, in the early days, I'm sure you ran into sharding and replication issues where it was just like, how do you maintain a data chain of command?
David Yanacek: Exactly. All the sharding, cluster management, backups – which actually, as an aside: as an industry, we call the feature of being able to restore databases 'backups.' You think of it as a backup, when you really need to be thinking of it as a restore. How do I get things back into service at a certain point? That's the hard part; the backup is the easier part. But anyway, when I was running databases on my own, I'd have to shard, scale, and patch all the time. At the time, I was on a team that was doing web server operations – automating our web server fleet ops – and we needed a database to keep track of all of these things. This is before RDS, this is before DynamoDB, this is before any of these services to make this easier. And I found that – great, now my server fleet operations are easier because I have this nice tool that automates them, but now I have this database where I keep having to deal with failover and everything. It would page me more than the servers. And so, when I heard later on that we were building this highly available, scalable database, DynamoDB, I joined that team. Now, it's funny, 'cause I didn't like server operations or our database operations. It wasn't enjoyable. It was frustrating, I guess, because it was so distracting from the thing that I was actually trying to do. But the truth is, I also love database operations; I just wished I could spend all of my time on that, and just solve it.
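The sharding David mentions usually starts as something like this: hash each record's key to pick a shard, so every writer and reader agrees on where a record lives. A minimal sketch of that routing, using a stable hash (Python's built-in `hash()` is deliberately randomized per process, so a cryptographic hash is the safer illustration):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Route a record to a shard by hashing its key. A stable hash
    keeps the mapping identical across processes and restarts."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Every client computes the same shard for the same key:
shard = shard_for("user-12345", num_shards=4)
```

This simple version also shows why self-managed sharding was so painful: changing `num_shards` remaps most keys, so growing a cluster means migrating data – one of the operational burdens managed databases took on.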
Ryan Donovan: Do you think that's possible?
David Yanacek: That's why we built DynamoDB. Ever since then, I've had this selfish interest of why I've worked on each team in AWS.
Ryan Donovan: I hear so many companies with the massive scale database is trying to solve that problem. Right? The huge amount of data that a lot of these modern enterprises need.
David Yanacek: That's right. We've built so many managed, special-purpose databases to help you – whether you're storing graphs, or you're storing indexed low-latency data, or you need to store– now, there's actually DSQL, which is a distributed, globally replicated, highly available database.
Ryan Donovan: Okay.
David Yanacek: So, I don't think we are ever going to stop building the next database, because there are just such a wide range of problems to solve and things to optimize for.
Ryan Donovan: You know, in addition to the data, there's the compute and functionality part, right? And I see AWS has a lot of different looks at compute, right? You have different CPU scalings, you have GPU, you have Lambdas. What was the pathway in understanding compute there?
David Yanacek: Well, it started with general-purpose compute. Then we realized that along with compute, we need storage, like file system storage. Having the storage co-located with the compute on the same actual physical machine, or even the same part of the data center, made elasticity hard. If you wanna scale up or down – add more servers, remove servers – and you also have a data movement problem, then it can be very slow to replicate the data, and it's just a much harder problem. And so, what we did is we built the Elastic Block Store, EBS. It was interesting, kind of abstract. When I heard about this I was like, 'what is this going to be?' You just separate the disks, essentially, onto a whole different part of the data center, and mount them as block devices from the actual servers. Each server has a block device, but you can add and remove servers in a much more elastic way without having to actually provision the amount of storage on that compute. And that's been such a big unlock over time, the fact that we could separate the storage. The second thing that helped compute move along quite a bit was when we built Nitro. We'd been doing all of the virtualization the normal kind of hypervisor way, but when customers wanted to bring new operating systems, and we wanted to support more and more hardware types, we realized this was gonna be a bottleneck around adding more instance types, more compute types, more operating systems. It was getting in the way, actually adding overhead to each piece of compute that we were serving. And at the scale we were at, we realized we needed to do something. So, we built Nitro, which is a separate actual server attached to each server, if you will – a separate card that does the virtualization, the access to the network, the access to the block devices, and everything. So, that's been such a huge evolution in compute.
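The elasticity win from separating storage can be shown with a toy model: a volume is an object with its own lifetime, a server is another, and "attach" is just a pointer between them. Data written through server A survives A's termination and reappears when the volume is attached to server B. This is purely illustrative, not how EBS is implemented.

```python
class Volume:
    """Toy network-attached volume: a fixed array of blocks that
    lives independently of any server."""
    def __init__(self, num_blocks):
        self.blocks = [b""] * num_blocks
        self.attached_to = None

class Server:
    def __init__(self, name):
        self.name = name
        self.volume = None

def attach(volume, server):
    assert volume.attached_to is None, "volume already attached"
    volume.attached_to = server.name
    server.volume = volume

def detach(volume, server):
    volume.attached_to = None
    server.volume = None

# Write data on server A, retire A, and reattach the volume to B:
vol = Volume(num_blocks=4)
a = Server("a")
attach(vol, a)
a.volume.blocks[0] = b"hello"
detach(vol, a)        # server A can now be terminated freely
b_srv = Server("b")
attach(vol, b_srv)    # the data is still there
```

Because the volume's lifetime is decoupled from any one machine, scaling the fleet up or down no longer drags a data-migration problem along with it – which is the unlock described above.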
Ryan Donovan: Does that help with virtualization across machine and resources, like the hypervisor, virtualization– like [it] treats compute across all servers as part of a single machine? Is that right?
David Yanacek: The hypervisor is more about subdividing physical compute into different kinds of guest instances, if you will. So, it's about chopping up one server and making it so that you can do a fair share of the underlying resources between those guest operating systems.
Ryan Donovan: Is it possible to have an instance that is compute across multiple servers?
David Yanacek: Well, yeah. So, this is where it's really interesting – there are at least a few ways to go about this when it comes to high-performance computing. But one thing that I think is interesting is we found people wanted to treat compute more abstractly. They said, 'well, okay, I'm having to think about servers, think about provisioning them.' We had services that would help you, like autoscaling, that would do target tracking, like, 'I don't know how many servers I need at any instant in time, but I know that I should be running them around 60% CPU.' But yeah, you still have servers. You have to add and remove them. You have to distribute work through a load balancer – which we also offered, to make your life easier there, too. But we realized, okay, this is still a lot of work. It's something that the industry is pretty good at – people are provisioning servers, running things on them – but how could we make it easier? And so, that's why we built Lambda. We invented this serverless concept. What does it mean to not have a server? Of course, there is a server–
Ryan Donovan: Of course.
David Yanacek: But not having to deal with it – the abstraction made it so that now I can just say, 'here's some code, and here are some triggers that need to run the code' – the input to the code, and then the output of whatever that function, that program, is that runs. And so, we built Lambda. You can think of it as a single computer, in a way. There's no shared memory or anything. If you think of the abstraction as, okay, here's some compute that's just gonna run, and I don't need to think about keeping headroom, I don't have to think about patching the OS, then it's actually a really nice abstraction.
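The "here's some code, here are some triggers" model boils down to a single handler function: the trigger supplies an event, the function returns a result, and nothing in the code mentions servers, fleets, or operating systems. A minimal Python Lambda-style handler looks like this (the event shape here is a made-up example, not a specific trigger's format):

```python
def handler(event, context):
    """Lambda-style entry point: takes an event, returns a result.
    No server, OS, or fleet management appears anywhere in the code."""
    name = event.get("name", "world")
    return {"statusCode": 200, "body": f"Hello, {name}!"}

# Locally you can invoke it directly; in Lambda, a trigger (an HTTP
# request, a queue message, a schedule) supplies the event instead.
result = handler({"name": "re:Invent"}, context=None)
```

That the whole deployable unit is one pure-looking function is exactly what makes the "single computer" mental model work: inputs in, outputs out, and the platform owns everything else.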
Ryan Donovan: Is it just a big job scheduler behind the scenes?
David Yanacek: There's a lot behind the scenes. We've been evolving the architecture of it continuously to make it scale better, to make it more responsive to changes in your workloads, to reduce the cold start overhead. There have been very big architectural shifts over the years in how Lambda works. There is a scheduler component that needs to schedule and place workloads in less than a millisecond, ideally, because every amount of time that we're spending deciding where to run your compute and provision things is time that's adding latency for your customer. A lot of the time, it doesn't matter if you're doing asynchronous workloads, but a lot of the time it does, if you're doing things like rendering a webpage.
Ryan Donovan: And I'm sure there's a big durability component behind the scenes, too?
David Yanacek: To Lambda? It's interesting with Lambda. For a customer to use Lambda for an application – again, we separated EBS, the block storage, from compute. With serverless, you're building your application to store data somewhere else: in a database like DynamoDB, or a search index, like OpenSearch. So, you're actually storing your customer data, your application's data, in a different place than within the Lambda execution environment. But internally, for durability, it's very important for us to keep track, behind the scenes, of what compute we have provisioned for you and what its current free/busy utilization is, in terms of being able to route your request to the best place to avoid cold starts, have low latency, and have high utilization. So, that's all hidden. That's all the fun part, frankly. That's why I worked on the Lambda service for a while – to help with that part of it, that placement.
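The placement problem David describes – route each invocation to a warm, idle execution environment when one exists, and pay a cold start only when it doesn't – can be sketched as a toy pool. This is a deliberately simplified illustration of the trade-off, not Lambda's actual placement algorithm.

```python
class Placer:
    """Toy placement: route an invocation to an idle warm sandbox if
    one exists, otherwise pay a cold start to create a new one."""

    def __init__(self):
        self.warm_idle = []     # sandbox ids that are warm and free
        self.cold_starts = 0
        self._next_id = 0

    def acquire(self):
        if self.warm_idle:
            return self.warm_idle.pop()     # fast path: no cold start
        self.cold_starts += 1               # slow path: new sandbox
        self._next_id += 1
        return f"sandbox-{self._next_id}"

    def release(self, sandbox_id):
        self.warm_idle.append(sandbox_id)   # keep it warm for reuse

p = Placer()
s = p.acquire()      # first invocation: cold start
p.release(s)
s2 = p.acquire()     # second invocation: reuses the warm sandbox
```

The real system also has to track utilization across a huge fleet and make this decision in under a millisecond, which is why the placement layer is where so much of the interesting engineering lives.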
Ryan Donovan: Right. Making the actual work of it invisible behind the scenes.
David Yanacek: Exactly.
Ryan Donovan: And obviously, now we have increased abstraction of the compute and hardware with containers and Kubernetes. Did you have any hand in the Kubernetes containerization services?
David Yanacek: Only in adjacency to Lambda, the serverless compute. One building block that is shared across them: we realized that customers wanted good isolation between these containers, and one thing that we always like to remind ourselves is that containers are a really useful way of divvying up resources, but they are not a security boundary. And so, what we built was actual VM isolation. The nice thing about containers is that they're lightweight – they don't have a lot of per-container overhead. And so, we thought, 'well, could we have the best of both worlds?' and then apply that to different compute environments, like Fargate did with its container management service, or Lambda with the serverless compute. So, we built something called Firecracker. It's a micro VM technology – it's actually open source – that has very low overhead per VM, but still gives you that actual hard VM isolation between workloads. So, if you have a container inside a micro VM, kind of a one-to-one relationship there, then you have the best of both worlds: low overhead per execution environment, per container, and you get the security properties and everything else that comes with having a nice micro VM boundary.
Ryan Donovan: I remember years ago working on some cloud computing stuff. There was a lot of concern about multi-tenancy. Does this solve the multi-tenancy concerns?
David Yanacek: Yeah. The micro VM is all about multi-tenancy, because it's one tenant per micro VM. Having that VM boundary per one of your tenants is actually really important. And actually, AWS is all about, can we build you– we have different tenancy models for different services, but as a lazy developer, I'm drawn to multi-tenant services. That kind of goes hand in hand with serverless, in a way, where I don't have to provision things and I'm just participating in this pool, but still having that isolation. And so, yeah, for multi-tenant systems, things like Firecracker micro VMs really help you, because you just provision one per tenant.
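The "one tenant per micro VM" rule is simple enough to state as code: each tenant maps to its own isolation boundary, and two tenants never share one. A toy sketch of that invariant (names are illustrative; this is not the Firecracker API, which is driven over a REST socket):

```python
class MicroVMPool:
    """Toy model of 'one tenant per micro VM': each tenant's work
    runs in its own isolation boundary, never a shared one."""

    def __init__(self):
        self._vm_for_tenant = {}
        self._next = 0

    def vm_for(self, tenant_id):
        # Lazily provision a dedicated micro VM the first time a
        # tenant shows up; always reuse that tenant's own VM after.
        if tenant_id not in self._vm_for_tenant:
            self._next += 1
            self._vm_for_tenant[tenant_id] = f"microvm-{self._next}"
        return self._vm_for_tenant[tenant_id]

pool = MicroVMPool()
```

Because micro VMs are cheap to create, enforcing this invariant per tenant becomes affordable – which is the "best of both worlds" point: container-like overhead with VM-grade isolation.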
Ryan Donovan: Lazy developers are the most efficient developers, right?
David Yanacek: I think so. And this has actually continued – this multi-tenancy journey has extended into the world we're in of agentic AI. With agents that are invoking LLMs that are deciding what to do next, you can reduce the security properties that you need of an agent to, 'well, what if it is running arbitrary code from whatever input happened into the system?' And you want that multi-tenant isolation then, at almost a 'per customer of the agent' level. And so, that's why we built this thing within Bedrock, AgentCore Runtime, that gives you that: you can pass in a session ID, whatever represents your end tenant, and it runs and provisions, in a serverless way, per-tenant isolated compute.
Ryan Donovan: I wanna get to the agentic stuff, but before that, I think I wanna talk about the GPU and the LLM, because I think that has been an additional complication, right? It's another piece of compute, and LLMs have to reside in a lot of memory. Have you done work to better utilize GPUs, and reduce the load that any given LLM takes?
David Yanacek: I think one of the key unlocks– I haven't worked a lot in that space, but one of the key things is that the Nitro card has made it so that we can do our virtualization and all of the kind of tenancy controls outside of whatever underlying type of compute we're dealing with. I think that's been a big enabler for us to be able to do GPU scheduling.
Ryan Donovan: Now we have compute across regions. How many regions does AWS have?
David Yanacek: If you're including the edge caches, it's even more. Gosh, it's so many.
Ryan Donovan: Yeah, just the big ones. The, you know, US West.
David Yanacek: 30 or something? You lose count.
Ryan Donovan: What was the sort of unlock to let that sort of traffic pass seamlessly between two vastly different places in geographic space?
David Yanacek: It's interesting. You talk about the communication between regions, but the key thing that we obsess over all the time when it comes to regions is that, when it comes to our services that we offer within those regions, is that they don't talk to each other. It's that they are isolated. We don't want to have shared fate between regions. We offer many different building blocks so that you can build highly scalable, resilient applications. Availability zones are one. So, we want to offer that as an abstraction of a thing that has correlated failures, and regions are another example of that. So, we want these things to operate separately, but then we also need to provide services that help customers run across them. And so, things like I was mentioning earlier, DSQL, a distributed database that lets you have your state across. There are features that we've added to storage. When it comes to what makes it hard to operate across regions as a customer, and how to build applications that take advantage of more than one region, it's the state; and so, we built things like DynamoDB Global Tables with a bunch of new features that replicate your data across. S3 has replication of objects that you can have buckets that replicate for you. So, we do that in a very controlled, safe way so that we're not gonna have that shared fate between regions. And so, we can offer you those nice building blocks. Another really hard one is, which is actually surprising, is how do you actually shift your traffic from one region to another in a very reliable way? Because at the end of the day, you're going to wanna be doing that, you know, maybe when your application is having problems or maybe when a region is having problems. And so, we need a very reliable button. It actually sounds so funny, but how do you make a button?
Ryan Donovan: There's just a big button in the office.
David Yanacek: Yeah. How do you make it just a big button? You know, put it on the wall, maybe like a lever that they can pull. But really, how do you make a very reliable button that operates on very reliable signals that span the idea of a region? So, we built something called the Application Recovery Controller that helps you move between regions very reliably. At the end of the day, it's implemented behind the scenes with DNS health checks. It's the most reliable thing that's happening all the time. We have this idea that you can make your system scale and be very reliable when it does constant work. It's just always doing the thing that it needs to do, not only when it's called into action. Otherwise, all of a sudden, you have to get all this machinery going – but if it's always doing the thing, then when you need it to happen, it's just going to do the thing that it's always been doing. So, that's the idea with DNS health checks, plus some machinery behind the scenes to decide how to respond to those health checks. That's what Application Recovery Controller ultimately is.
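The "constant work" failover idea can be sketched as a tiny DNS-style resolver: health is evaluated continuously, and every answer simply reflects the current health state, so nothing special has to spin up at failover time. One common refinement, shown here, is to fail open – if every record looks unhealthy, answer with all of them rather than nothing. This is an illustration of the pattern, not Route 53's implementation.

```python
def resolve(records, health):
    """Toy failover resolver: return IPs of healthy regions only.
    Because health is checked constantly and every answer derives
    from it, 'failover' is just the same code path as normal ops."""
    healthy = [r for r in records if health.get(r["region"], False)]
    # Fail open: an all-unhealthy signal is more likely a broken
    # health checker than two simultaneously dead regions.
    return [r["ip"] for r in (healthy or records)]

# Example records (IPs are documentation addresses, purely illustrative):
records = [
    {"region": "us-east-1", "ip": "192.0.2.1"},
    {"region": "us-west-2", "ip": "198.51.100.1"},
]
health = {"us-east-1": True, "us-west-2": True}
```

Flipping one region's health bit changes the next DNS answer, and that's the whole "button": no new machinery starts at the moment of failover.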
Ryan Donovan: Yeah, I'm glad you brought up DNS. I think anytime there's been a sort of large internet outage – I think y'all had one recently, Cloudflare had one – DNS has usually been pointed at as the culprit. Why is DNS so complicated and hard in a cloud, in a sort of distributed situation?
David Yanacek: DNS is interesting because of its scope of impact. It lets you have an endpoint with a name, and it's just a very simple thing: 'oh, I'm going to talk to some endpoint.' Okay, so where do I go to talk to that endpoint? Well, look it up by its name. And so, DNS makes the news because so often a system is designed around a name. We've been doing a bunch of work that's pretty interesting to make it so that even a multi-tenant service can give you your own name for that multi-tenant service. That inherently reduces the blast radius, that scope of impact: if everybody has their own name, then we can change names independently. And that's just a really useful tool that we're building.
Ryan Donovan: And I'm sure with very disposable instances, [to have] a reliable IP address is more difficult.
David Yanacek: Yeah. IP addresses, well, IPv6 that we've brought on certainly makes having dedicated IP addresses easier, but at the end of the day, you can have an IP address for an instance but nobody hard codes an IP address into it. There's always this bootstrapping problem. You think of, 'how do I know what IP address to use?' Okay, well, there's a DNS. It's like 'the turtles all the way down' idea.
Ryan Donovan: Right, right, right.
David Yanacek: At the end of the day, you need to program your program to look up who to talk to. Service discovery.
Ryan Donovan: I know you all are moving pretty hard in the agentic area. I mean, it makes sense that you have the platform for it, right? Can you tell a little bit about the agentic stuff that you're working on?
David Yanacek: This is very in line with my singular focus for my entire career is to make developers' lives easier, and we've been making a bunch of tools over the years that offload everything that I need to do, from running servers to patching them. But there's always still something that remains. When it comes to DevOps, I still have to upgrade software. I still have to do things that are distracting from my main–
Ryan Donovan: Right. The tech debt sort of stuff.
David Yanacek: Exactly. There's all this infinite backlog. Every team has an infinite backlog of either features they'd like to build, or improvements they want to make, or chores that they have to do. So, I'm very excited about what we're building. So, I guess, why haven't we been able to fully tackle that problem before? It's because each customer has a very unique environment and setup. Everything's just a little bit different – the tools that they're combining: 'where do you store your code? What do you use to do CI/CD? What exact flavor of everything are you combining together? How do you define your infrastructure?' Everything's different. With AI, suddenly there's this highly adaptable thing, and we've realized that agents are very good at solving this kind of problem. We built Kiro, an agentic AI IDE, that helps you write code. I've been able to get so much done, get further into my backlog.
Ryan Donovan: We've talked about Kiro on this program.
David Yanacek: Yes, exactly. Yes. And so, that's been a good start, but we've realized that we can make these agents even more autonomous – learn and have memory over time, improve over time, learn your environment better. And so, we've built what we call a new type of agent, called 'Frontier Agents.' These can learn, can scale, can deal with ambiguous tasks, can run autonomously for hours or days. We've started by releasing three of them in this software development lifecycle area: one that helps you build software; one that helps you with security; one that helps you with DevOps, the operations side. These three agents are what I've been working on and am most excited about. They can reach further into your backlogs and operate autonomously to do things like, okay, well, now we have a bunch of code coming through. We have to make sure that we're following our organization's security policies, and checking things to make sure that they are secure. Doing pen tests. You know, when you have a higher velocity of change, there's a lot more that needs to happen. And so, these agents are the answer to keeping up with everything: 'okay, well, now I need to do more load tests or operational checks on everything to make sure we have the right instrumentation, that we're not dealing with any kind of production alarms.' With all of these, we're going after making our jobs so much easier, so that we can focus on what matters to our customers.
Ryan Donovan: You're making developers' lives easier by keeping them away from InfoSec and SREs.
David Yanacek: So, it's interesting, at Amazon, I think AWS is pretty well positioned to do this. It's really interesting, because we're talking to customers all the time learning about the problems.
Ryan Donovan: I mean, you're the place where your code runs in production, right?
David Yanacek: It's also just how we've done it over these decades. We've done DevOps, and so now we're gonna get into a little bit of labeling of different development models, which is gonna be, you know, really interesting – of, 'well, does this mean that?' But to me, DevOps means that there is no separate DevOps function. The developers do the ops. You can just chain all these together: DevSecOps, Dev-PM-everything-ops. The thing that drew me to work for Amazon is that we have this model where developers just wear all of the hats. You're doing all of these things, some amount of everything. I'm talking to customers, I'm doing the operations, I'm securing my services and making sure they stay that way, I'm writing code. And so, I really like this model. This is, to me, the DevOps model. And so, we've been doing that as these independent teams doing all of the things together, not just assuming that somebody else has our back. We were making sure that, at the end of the day, we're accountable for everything. And so, it's interesting that now agentic AI is making that even more possible for teams. We're building what we've learned over the years about how to operate that way into agents, to make it so that you can have these agents that operate as an extension of your team.
Ryan Donovan: Thank you very much for listening. I have been Ryan Donovan. I edit the blog, host the podcast here at Stack Overflow. If you have questions, comments, concerns, topics to cover, if you wanna yell at me on the internet, email me at podcast@stackoverflow.com, and if you wanna reach out to me personally, you can find me on LinkedIn.
David Yanacek: And it's been great chatting about the history of AWS. Thanks for having me. My name is David Yanacek, I'm a Senior Principal Engineer at AWS, and you can find me on the various socials as D Yanacek, or David Yanacek. And happy to chat. I love getting into stuff with people.
Ryan Donovan: All right. Thanks for listening, everyone, and we'll talk to you next time.
