Speeding up the I/O-heavy app: Q&A with Malte Ubl of Vercel

We recently published an article exploring what sort of infrastructure is needed to run edge functions. For that piece, we talked to several industry experts in wide-ranging conversations. While we weren’t able to use all of each conversation in the article, we wanted to share them in full so you can get all the interesting bits that we couldn’t include. Below is Ryan’s conversation with Malte Ubl, CTO at Vercel.

This conversation has been edited for clarity and content.

-----

Ryan Donovan: For the actual physical server structure behind edge functions, do you have servers that you own or do you have a partner behind that?

Malte Ubl: For the actual edge functions in production, they're running on Cloudflare's worker infrastructure. However, it's very different from the product you can buy from Cloudflare. Cloudflare's worker product is the thing that's terminating traffic.

It takes on this primary role as the reverse proxy. The way we use them is as a backend, right? Because we are terminating traffic in our own infrastructure. So we use them very similar to a serverless function implementing a route—we get the route, we determine, okay, we need to route it to this function, and then ask it to perform this request.

Importantly, this is something that we call framework-defined infrastructure. Edge functions aren't a primitive like Lambda or Workers that you program directly, right? Where you make a Terraform file and you say like, I want the Lambda, and this is the file that I compile, that I upload there, blah, blah. It's not like that. Instead, in your framework, you just use the idiomatic way of making your pages or API routes. It doesn't really matter, because you use the language of your framework.

We will take that and say, okay, we take the thing that worked on your local machine and wrap it such that when we deploy to the infrastructure that we use, it behaves exactly the same way. That makes our edge functions product this more abstract notion, because you don't use it so concretely. Next.js has a way of opting into edge functions, right? And you never have to think about Vercel in that moment. It's just that it works.
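To illustrate what opting in from the framework side can look like, here is a minimal sketch of a Next.js App Router route handler. The `runtime = "edge"` export is the Next.js route segment option for selecting the Edge runtime; the file path, the `name` query parameter, and the response shape are made up for this example and are not from the interview.

```typescript
// app/api/hello/route.ts (hypothetical path in a Next.js App Router project)

// Route segment config: ask Next.js to run this handler on the Edge runtime.
export const runtime = "edge";

// The handler only uses the standard Web Request/Response APIs;
// nothing in it refers to the hosting platform.
export async function GET(request: Request): Promise<Response> {
  // "name" is an invented query parameter for this sketch.
  const who = new URL(request.url).searchParams.get("name") ?? "world";
  return new Response(JSON.stringify({ greeting: `Hello, ${who}!` }), {
    headers: { "content-type": "application/json" },
  });
}
```

The same handler code runs locally in the framework's dev server, which is the point Malte makes: the opt-in lives in the framework's own vocabulary rather than in infrastructure configuration.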

RD: About the data rollbacks…does that require a lot of replication across these servers or is there a central place?

MU: This is a good question. The way our system works is that by default we maintain activity within certain generous time limits—you wouldn't wanna roll back to something six months ago, right? Hopefully. Because everything we do is serverless in this abstract notion, as in there isn't this physical infrastructure, so it doesn't actually apply to edge functions. But with our more traditional serverless product, which is based on AWS Lambda, we will archive your function if it hasn't been called for two weeks.

But when, against all odds, we get another request to it, we will unarchive it on the fly so that it behaves absolutely transparently. It's just a little bit slower. Again, in practice, this actually never happens on the production side. Static assets are in a way more interesting, because there are a lot of them and they are accelerated through pull-through caches. If you roll back to something old, it could be temporarily a little bit slower.

But in the instant rollback case, this actually isn't the case because it's actually much more likely that you're rolling back to something that maybe had traffic 30 minutes ago. Likely, it's still very hot.

RD: And just to be clear, the functions are serverless, right? No state stored on them, right?

MU: They're serverless—you don't manage anything about their lifecycle. I think what's really interesting is how our edge function product is priced versus a traditional serverless function and how they behave over time.

But one of the decisions that has been prevalent in the serverless space is that you don't actually do concurrency on the individual function. Honestly, I think that's kind of weird. Especially as people use Node.js on our platform a lot, where the founding idea was that you could handle multiple requests concurrently on the same core, right? On the traditional serverless stack, that actually doesn't happen. That's not particularly efficient if your workload is I/O-bound.

Most websites that do real time rendering, the pattern will be, you have a bit of a CPU-bound workload, which is rendering the page, but you're also waiting for the backend. Almost always. There's always gonna be some backend request and that's gonna take some amount of time and during that time that CPU could do other work. On the edge functions product, that's absolutely the case.
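A rough way to see why this matters is a toy timing model, sketched below. It assumes each request does one burst of CPU work and then waits on a backend, and that a single core handles CPU bursts back to back. The `RequestCost` shape and the 5 ms / 200 ms figures are invented for illustration, not measurements from the interview.

```typescript
// One request in a simplified model: a burst of CPU work, then a wait on a backend.
interface RequestCost {
  cpuMs: number; // time the core is actually busy (e.g. rendering)
  ioMs: number; // time spent waiting for a backend response
}

// One request at a time per instance: the core sits idle during every wait.
function oneAtATimeMs(requests: RequestCost[]): number {
  return requests.reduce((total, r) => total + r.cpuMs + r.ioMs, 0);
}

// Concurrent handling on one core: CPU bursts run back to back, and each
// request's backend wait overlaps with the other requests' CPU work.
function interleavedMs(requests: RequestCost[]): number {
  let cpuClock = 0; // when the core finishes the current request's CPU burst
  let finish = 0; // when the last backend response has come back
  for (const r of requests) {
    cpuClock += r.cpuMs;
    finish = Math.max(finish, cpuClock + r.ioMs);
  }
  return finish;
}

// Ten identical requests with made-up costs.
const batch: RequestCost[] = Array.from({ length: 10 }, () => ({ cpuMs: 5, ioMs: 200 }));
console.log(oneAtATimeMs(batch)); // 2050
console.log(interleavedMs(batch)); // 250
```

In this model the core is busy for only 50 ms either way; the difference is whether the waiting time is stacked end to end or overlapped.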

I think what's very unique about our product—even compared to workers on which it's based—is that our pricing is entirely based on net CPU. What that means is that you only pay for when the CPU is busy. It's entirely free while waiting for the backend.

It's very attractive if your backend is slow. This is hyper-top-of-mind for me because of all the AI APIs, which are incredibly slow. Very common use cases are that you call an API, you run almost no compute. Maybe you do some munging on the JSON, right? But it's almost nothing, we’re talking two milliseconds. That's the net CPU. But OpenAI takes 40 seconds to respond. So on this pricing model, you pay for two milliseconds, and that is actually incredibly attractive.
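The arithmetic behind that example can be written down directly. This sketch compares billed duration only, under two deliberately simplified models: one that counts the whole lifetime of an invocation and one that counts only busy CPU time. It uses no real prices and ignores memory size; the `Invocation` type and function names are hypothetical.

```typescript
// Billed duration under two simplified billing models (durations only, no prices).
interface Invocation {
  cpuMs: number; // time the CPU is busy
  waitMs: number; // time spent idle, waiting on a backend
}

// Wall-clock style billing: the whole lifetime of the invocation counts.
function wallClockBilledMs(inv: Invocation): number {
  return inv.cpuMs + inv.waitMs;
}

// Net-CPU style billing: only the busy time counts; waiting is free.
function netCpuBilledMs(inv: Invocation): number {
  return inv.cpuMs;
}

// The figures from the interview: 2 ms of compute, a 40 s wait on the AI API.
const aiProxyCall: Invocation = { cpuMs: 2, waitMs: 40000 };

console.log(wallClockBilledMs(aiProxyCall)); // 40002
console.log(netCpuBilledMs(aiProxyCall)); // 2
console.log(wallClockBilledMs(aiProxyCall) / netCpuBilledMs(aiProxyCall)); // 20001
```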

It's possible because of very tight packing. The traditional serverless pricing model is based on gigabyte-hours, so you basically pay for RAM. The way this whole product is designed is that you can afford having many concurrent invocations that don't use incremental RAM. That's why this works out for both of us. It basically enables the AI revolution, because you can't afford running these apps on traditional serverless and, because they're all going viral like crazy, you also can't really afford running them on servers.

So it's a really good time for this.

RD: Speaking of that, do you have any infrastructure, or does Cloudflare have any infrastructure on these edge workers, that supports AI/ML processing? Any sort of GPUs or TPUs on the backend?

MU: Not on the edge function side, but that use case is basically the I/O-bound use case, which they're ideal for, where you outsource the model and inference to somewhere else and you just call that.

What we are investing in is on the serverless side on the CPU-bound workloads. Because that's what they're good for anyway. You just need the flexibility.

For example, edge functions have limitations on what kind of code they can run. It's primarily JavaScript and Wasm, which gets you pretty far. But you can't run LangChain. You can't do any real inference. You also can't run Python, which we do support in our primary serverless product.

What's really popular there is, in the same application, building a Next.js or Remix application for your front end but implementing the API routes, which traditionally would also be written in JavaScript, in Python. People like doing that because they get access to specialized libraries that just aren't available in the JavaScript ecosystem.

RD: So back to the initial proxy, how do you get that firewall call to be so fast worldwide? Is there one server somewhere that's waiting for the call?

MU: I think the fair answer is many layers of caching, right? And lots of Redis. It's definitely not one server, it's many servers. There are three primary layers involved.

One does TLS termination and is the IP layer of the firewall. It’s a little bit of an HTTP layer firewall, but primarily looks agnostically at traffic and tries to filter out the bad stuff without paying the price of knowing really what's going on. That's layer one. Absolutely optimized for high throughput HTTP serving.

Going one layer down is the layer that has the biggest footprint. That one understands who the customers are, what their deployments are, and so forth. That's driven by substantial caching. We have something we call our global push pipeline. When stuff changes, it pushes it into all data centers so that actual origin hits, where you go back all the way to some central database, stay off the hot serving path of user traffic, especially for sites that have any form of nontrivial traffic. You can always produce a case where you make one request per day.
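The push-on-change idea can be sketched in a few lines. This is not Vercel's pipeline, only a toy model of the pattern described: writes go to a central store and are pushed into every data center's cache, so lookups on the serving path rarely need the central store. All class names, the hostname, and the deployment ID are hypothetical.

```typescript
// Hypothetical deployment metadata the routing layer needs per hostname.
interface DeploymentConfig {
  deploymentId: string;
}

// Stand-in for the central database; counts how often it is actually read.
class CentralDatabase {
  hits = 0;
  private configs = new Map<string, DeploymentConfig>();

  write(host: string, config: DeploymentConfig): void {
    this.configs.set(host, config);
  }

  read(host: string): DeploymentConfig | undefined {
    this.hits += 1;
    return this.configs.get(host);
  }
}

// Per-data-center cache, filled by pushes, with a central fallback on a miss.
class DataCenterCache {
  private cache = new Map<string, DeploymentConfig>();
  private central: CentralDatabase;

  constructor(central: CentralDatabase) {
    this.central = central;
  }

  receivePush(host: string, config: DeploymentConfig): void {
    this.cache.set(host, config);
  }

  lookup(host: string): DeploymentConfig | undefined {
    const cached = this.cache.get(host);
    if (cached) return cached; // hot path: no round trip to the central store
    const fetched = this.central.read(host); // cold path: go all the way back
    if (fetched) this.cache.set(host, fetched);
    return fetched;
  }
}

// On every change, write centrally and push to all data centers.
function publish(
  central: CentralDatabase,
  dataCenters: DataCenterCache[],
  host: string,
  config: DeploymentConfig
): void {
  central.write(host, config);
  dataCenters.forEach((dc) => dc.receivePush(host, config));
}

const centralDb = new CentralDatabase();
const dataCenters = [new DataCenterCache(centralDb), new DataCenterCache(centralDb)];
publish(centralDb, dataCenters, "shop.example.com", { deploymentId: "dpl_123" });
dataCenters.forEach((dc) => dc.lookup("shop.example.com"));
console.log(centralDb.hits); // 0
```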

Then the last layer is our edge function invocation service. It's still something that runs in our infrastructure. This service has multiple roles, but primarily it acts as a load balancer. One thing we're really happy with is when you use the Cloudflare product directly in this traditional role, it can feel really good on your machine. Because it takes your H2 connection and keeps assigning the same worker. It's very fast.

Because we have the luxury of having a layer in front, we basically emulate the same behavior, where we do the load balancing ourselves and say, okay, we have here a worker that can take a little bit more traffic, and then multiplex another request on the same connection.

Not sure how much you know about HTTP, but it's basically just HTTP `Keep-Alive`. It utilizes the same connection to communicate with the Cloudflare backend.
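The load-balancing behavior described here can be modeled abstractly. The sketch below is an assumption-laden toy, not Vercel's invocation service: it prefers a worker that already has an open connection and spare capacity, and only opens a connection to a new worker when all warm ones are full. The `maxInFlight` budget, the class name, and the worker IDs are hypothetical.

```typescript
// Hypothetical view of one warm worker behind a kept-alive connection.
interface WorkerConn {
  id: string;
  inFlight: number; // requests currently multiplexed on this connection
}

class InvocationBalancer {
  private workers: WorkerConn[] = [];
  private nextId = 0;
  private maxInFlight: number; // assumed per-worker concurrency budget

  constructor(maxInFlight: number) {
    this.maxInFlight = maxInFlight;
  }

  // Prefer a worker with an open connection and spare capacity; only open
  // a connection to a new worker when every warm one is full.
  acquire(): WorkerConn {
    let worker = this.workers.find((w) => w.inFlight < this.maxInFlight);
    if (!worker) {
      worker = { id: `worker-${this.nextId++}`, inFlight: 0 };
      this.workers.push(worker);
    }
    worker.inFlight += 1;
    return worker;
  }

  // Called when a response has been fully sent back.
  release(worker: WorkerConn): void {
    worker.inFlight = Math.max(0, worker.inFlight - 1);
  }

  get openConnections(): number {
    return this.workers.length;
  }
}

const balancer = new InvocationBalancer(2);
const first = balancer.acquire(); // opens worker-0
const second = balancer.acquire(); // reuses worker-0
const third = balancer.acquire(); // worker-0 is full, opens worker-1
console.log(first.id, second.id, third.id, balancer.openConnections);
// worker-0 worker-0 worker-1 2
```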

RD: So just keeping that connection to the one worker and not going through the same firewall path.

MU: Exactly. We have this invocation service, which is also not just one machine or one data center, right? But it needs substantially fewer machines than you would need workers. Because first of all, they're multi-tenant. And also this is a very simple service in the end, which only does high-performance HTTP proxying.

RD: You talked about it being good for I/O-bound workloads. What are the sort of applications that this sort of edge function is really good for?

MU: Well, I mean, they don't have to be I/O bound—I think I/O heavy is actually the right word, because you can use CPU. That's not really what it's about. They do I/O and you wait for some backend, right? If you only do CPU, then it's not necessarily the right platform.

We don't really see that heavily. I was mentioning the AI use case as the kind of extreme one where it’s particularly valuable. For the typical dynamic webpage, we also support other rendering modes, like incremental static regeneration, which serves a static page and updates it asynchronously in the background. But when I'm serving a live dynamic asset, that's effectively always an I/O-heavy operation.

Because why would I make this dynamic? You have the trivial example of responding with something random, but that's not really what you want, right? You wanna talk to a database, you wanna talk to your API gateway. Something like that is what you wanna do. So you're always gonna be waiting for that to return, because you're offloading processing to some other system, even if it's very fast.

That's every web-facing workload: give me your shopping cart, render the product detail page, render the recommendation for a blog post based on the user. All of these things that people do on Vercel fall into this workload.

The ones that are truly CPU-bound are really an exception. I wouldn't say it's unheard of, but the edge function model is good to default to because the typical dynamic web workload fits really well into this pattern.

RD: The way we got this interview was that you are getting into the database game with serverless databases?

MU: Yeah, absolutely.

RD: How does that fit into edge functions?

MU: We actually offer three different databases, plus one that’s not really a database, but what we call our data cache.

They're all related. The KV (key-value) product that we launched is an edge-first global-replicated KV product. So that straight-up fits into edge functions.

Our Postgres product that we launched really isn't serverless. But you pick a region, that's where the database is. We do support reading from the edge, but you have to be careful, right? If you do waterfalls, then you're gonna have to go through a long, high-latency connection to do it. So that's something people have to be aware of, but on the other hand, people just want a Postgres database even though maybe it's not the perfect thing.

There are two ways to mitigate it. If you do a heavy workload against such a database, then the data cache that we ship is perfect for this. It comes with the trade-offs of caching, like that you have to invalidate things.
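To make the invalidation trade-off concrete, here is a small read-through cache with tag-based invalidation. It is a generic sketch of the caching pattern, not the API of Vercel's data cache; `TaggedCache`, the key format, and the `cart` tag are all invented for the example.

```typescript
// A tiny read-through cache with tag-based invalidation (not any real product API).
class TaggedCache<T> {
  private entries = new Map<string, { value: T; tags: string[] }>();

  // Return the cached value, or run `load` (the expensive query) and remember it.
  get(key: string, tags: string[], load: () => T): T {
    const hit = this.entries.get(key);
    if (hit) return hit.value;
    const value = load();
    this.entries.set(key, { value, tags });
    return value;
  }

  // The trade-off: after a write, every entry carrying the tag must be dropped,
  // otherwise readers keep seeing stale data.
  invalidateTag(tag: string): void {
    this.entries.forEach((entry, key) => {
      if (entry.tags.includes(tag)) this.entries.delete(key);
    });
  }
}

// Count how often the (pretend) database is queried.
let dbQueries = 0;
const loadCart = (): string[] => {
  dbQueries += 1;
  return ["socks"];
};

const cartCache = new TaggedCache<string[]>();
cartCache.get("cart:42", ["cart"], loadCart); // miss: queries the database
cartCache.get("cart:42", ["cart"], loadCart); // hit: served from the cache
cartCache.invalidateTag("cart"); // e.g. after the cart was modified
cartCache.get("cart:42", ["cart"], loadCart); // miss again
console.log(dbQueries); // 2
```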

But that's the key thing that we spent time on to make it really, really nice. The other thing that we support is that users can opt into invoking the edge functions in a region. It's a bit of a weird thing, but basically they're always deployed globally, and then through our internal proxy and infrastructure you can say, invoke it next to my database.

That's for the case where you wanna have that regional system. Could be a database, could be that your backend is in your on-prem company data center. Because of that, we’ve gotta support this, which obviously doesn't give you the global low latency anymore. But once you talk to your database twice, it's always cheaper from a latency perspective.

So that's why we decided to ship this feature. It's super helpful for this case. You still get the benefits basically of the pricing model and cold start performance and so forth.
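The "talk to your database twice" rule of thumb falls out of a simple latency model, sketched here. Every round-trip time is invented for illustration, and the model assumes queries run sequentially (a waterfall) and ignores compute time; the `Latencies` fields and function names are hypothetical.

```typescript
// Rough latency model; all round-trip times below are invented for illustration.
interface Latencies {
  userToNearestEdgeMs: number; // user to the closest edge location
  userToDbRegionMs: number; // user to the region where the database lives
  edgeToDbMs: number; // that edge location to the faraway database
  inRegionToDbMs: number; // function running next to the database
}

// Function runs at the nearest edge; every sequential query crosses the long link.
function globalEdgeMs(l: Latencies, sequentialQueries: number): number {
  return l.userToNearestEdgeMs + sequentialQueries * l.edgeToDbMs;
}

// Function runs next to the database; the user pays the long link once.
function regionalMs(l: Latencies, sequentialQueries: number): number {
  return l.userToDbRegionMs + sequentialQueries * l.inRegionToDbMs;
}

const example: Latencies = {
  userToNearestEdgeMs: 20,
  userToDbRegionMs: 150,
  edgeToDbMs: 130,
  inRegionToDbMs: 2,
};

console.log(globalEdgeMs(example, 1), regionalMs(example, 1)); // 150 152
console.log(globalEdgeMs(example, 2), regionalMs(example, 2)); // 280 154
```

With a single query the two placements are roughly even in this model; from the second sequential query on, running next to the database wins because the long link is crossed only once.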
