AI agents for your digital chores

Yutori is building AI agents that can reliably handle everyday digital tasks on your behalf on the web.

Connect with Dhruv via his website.

Congrats to the winner of today’s Populist badge, user Don Kirkby, who earned it with their answer to Find all references to an object in python.

TRANSCRIPT

[Intro Music]

Ryan Donovan: Hello everyone, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host, Ryan Donovan, and today we are talking about AI agents. Now, I know we've talked about it before, but today we're talking about proactive agents instead of reactive agents, and we're joined today by Dhruv Batra, who is co-founder and chief scientist at Yutori. He's gonna tell us all about what a proactive agent is. So welcome to the show, Dhruv.

Dhruv Batra: Thanks, Ryan. Happy to be here.

Ryan Donovan: So, top of the show, we like to get to know our guests. Tell us a little bit about how you got into software and technology.

Dhruv Batra: Sure. Again, thank you for having me. I'm an AI researcher, and I've been in the field coming up on 20 years at this point. People think of the current wave as marked by an epochal event of ChatGPT launching. We had a similar event about 12 years ago at this point, which in the community, we refer to as the 'AlexNet moment,' or the 'deep learning wave.' I got into AI in 2005, which is significantly before that. Back then, it was not respectable to use the phrase AI or AGI; you were not considered a serious scientist if you used those phrases. So you said you were working on 'machine learning' with applications to domains, like computer vision and other aspects of AI. Over the years I've worked in, you know, core computer vision problems like recognizing objects in images, building chatbots, which are core NLP problems, or natural language processing, or understanding problems. I was a professor at Georgia Tech. In 2016, when I got there, I created their deep learning class. I'm coming off of spending eight years at Meta. I was a senior director leading FAIR embodied AI – FAIR is Meta's Fundamental AI Research Division, and 'embodied AI' refers to AI for robotics and AI for smart glasses. So, one of my teams developed the image question-answering model, that, in the earliest days, they collaborated with the product team and shipped it on the Ray-Ban Meta sunglasses. Other teams of mine developed the world's fastest 3D simulator, called Habitat, for training virtual robots in simulation, deploying them on the Boston Dynamics Spot robot. And that team took it to a White House Correspondents' Dinner to show the congressional staffers the technology that's coming. So, over the years, I've spanned the spectrum of all areas of AI. I've seen, at this point, two completely distinct epochal waves of technology coming in, and it's been a fascinating journey. 
You know, most people who've been in the area this long would tell you that we didn't think we'd be at this point, and it's simultaneously true that we have made tremendous progress, but there is still plenty to be done. I am not one of those people who, I think, are playing word games around the phrasing 'AGI'. I think the original visions from the 1950s of, you know, an intelligent agent that can interact with the world and accomplish goals is still significantly far ahead of us.

Ryan Donovan: Yeah. It's interesting– just talk about spanning the whole gamut. I remember my first AI programming class was in 1997, so that was even older. A lot of Bayes, trees, a lot of genetic algorithms, and even a neural net there. So.

Dhruv Batra: Yeah, this is a question my now former colleagues, now that I've resigned from Georgia Tech, my now former colleagues at Georgia Tech—and I'm sure this is happening at other universities as well—they're having to grapple with the phrase AI, and there's a course called Intro to AI that is typically taught to undergraduates, and it today does not teach the methods, or at most places, it needs to be completely revamped. Because the set of ideas that we thought, in the 80s, and the 90s, and the 2000s, that would lead us to developing general-purpose intelligence systems have not, so far, panned out. And a set of ideas that are the most promising are—[audio drop]—[f]eatured in that course, that is today called 'Intro to AI.'

Ryan Donovan: Yeah. Well, you know, you talked about how AI, AGI, were dirty words– do you think we're on a track that'll pan out now? Do you think it's still something we should say with a little more care?

Dhruv Batra: So, the way I think about that is, I think what's been happening is we've gone through two phases. Phase number one: it is certainly true that, in the 2010s, when a renewed emphasis on the phrase 'AGI' came about, it was trying to fight against a pattern in literature about focusing on narrow-purpose problems. So, you know, it is certainly true that if you go back 15-20 years ago, the computer vision community was focused on one set of methods; the natural language understanding community was focused on one set of methods; the robotics community was focused on another set of methods; and it was exceedingly hard to cross over discipline boundaries. You had to learn entirely new things, and even within those discipline boundaries, people were developing hyper-specialized methods. Like, if you wanted to build a chess-playing bot, or a Go-playing agent, as DeepMind did, you focused on one specific set of techniques that told you nothing about building chatbots, that told you nothing about recognizing objects in images, and progress in one domain did nothing for other domains. And so an emphasis on generality was needed, that 'hey, we are here to solve the bigger problem, not to solve these narrow problems.' But what I noticed in the last, you know, handful, two to three years, is a certain sleight of hand happening where we are redefining AGI to mean, well, 'robotics' is outta scope. Like, physical intelligence is not AGI, tactile sensing is not AGI, maybe all of video understanding is not AGI either. So, we've sort of defined AGI to mean the set of things we can do today, like digital environments, largely language-based interaction, where language is defined more generally than just English or commonly spoken languages, to mean even programming languages and any tokenized sequences that follow a particular pattern and a grammar; but that's not AGI.

Ryan Donovan: Right. It's a specific form of intelligence, right? It's an amazing natural language interface at worst, right?

Dhruv Batra: Yeah. It is fascinating how much progress has been made, and it is economically valuable, it is intellectually interesting, and I'm not one of those who says that this is a dead end or that this is headed towards a wall. I'm intellectually humble enough to realize that, I don't know, I can't predict that far out, but I can say that there's a certain opportunistic redefinition happening here.

Ryan Donovan: I do love the wet blanket on the hype because we are going into a very hyped topic. You know, everybody is talking about AI agents, but, you know, most of the agents are kicked off by a prompt, right? Pretty much the same as ChatGPT, but you're talking about today– proactive agents. Can you talk about how that works in practice?

Dhruv Batra: So, at Yutori, we're building proactive agents. Before I dive into proactive agents, let me just, you know, tell you about the company and why we started this line of work. The phrase or the word 'Yutori' is a Japanese word for a sense of mental spaciousness. It literally translates to when you have elbow room or leeway in your mind. It's the opposite of a feeling of mental fragmentation. Yutori is the feeling that you experience when you feel you have time and space to do the things that are important to you, whether that be stepping into a state of flow, spending time with loved ones, you know, pursuing a particular activity, whatever, when you have the time and space to be able to do that. And we named the company that because we wanna build AI agents that can deliver Yutori to our users. And where we're starting is... we think that the web is an extremely, you know, it's– one of humanity's greatest inventions and a clunky mess due for an overhaul. We spend hours of our time on these mundane tasks, filling forms, tracking appointments, buying and returning things, tracking information online, coordinating events, securing information from webpages, and ultimately, we're all capped by our own bandwidth. How many times can we go through those same workflows? How many times can you check a webpage? How many different webpages can you check? And what we imagine is a future where, in some sense, no human has to interact with a webpage again, no human has to interact with the web again – that every human on the planet has a team of AI assistants that are executing workflows on the web, coordinated by your own personal digital concierge, or an AI chief of staff that you talk to, that understands your context, that understands what you are working through, and then executes and coordinates workflows for you on the web. You know, the analogy would be riding horses; you can probably do it for entertainment, but you're not going to do it for utility. 
Those days are gone, and that's where we have to get to. But no human should have to interact with the web. So that's the vision. In terms of answering your question about proactive assistance, we've launched our first product, which is an instantiation of proactive agents. Our first product is called Scouts. Scouts is a team of agents that monitor the web for you, for anything that you may care about. So, you can come to a scout and say, 'hey, I'm interested in campground reservations in this national park, the dates become available on a certain day, let me know whenever those dates become available.' It could be a certain product whose price you're tracking. It could be a, you know, trip that you wanna take with a certain configuration. It could be an extremely hyper-specific news event that only you are interested in, that you know, 'let me know whenever there's news about this event.' Or a band coming into town. It could be, you know, things related to your work. So, maybe you're tracking the announcement of AI agent startups that are announcing raising funding, maybe because those can be potential customers. Maybe you're tracking people announcing that they're quitting their positions because those can be potential hires. So, I think the abstraction is– there are lots of digital workflows where you go to execute them. There is a piece of information that is not yet available, it will be available, and what you'd like to do is for an agent to monitor them. And so the proactive nature comes in in the monitoring aspect of it. And the description is completely in natural language. You can think of this as a Google Alerts for the AI era, if you will, or an RSS feed for the web, described in natural language.

Ryan Donovan: You know, you dream of the webless future, but we still have the webpages now, and it sounds like this is a lot of 'visit and report back.' I've heard of a lot of folks complaining about the sort of load that AI agents are putting on websites. Is there any thought to how to mitigate that?

Dhruv Batra: Absolutely. So, you made a couple of points. One, that the web, today, is designed for human consumption, and in some sense, agents are having to consume information the way humans would consume information, because there isn't a parallel pathway for the entire web. In the Scouts product that we've built, actually, we allow both. Anytime there is an API surface that agents can talk to directly through, let's say, an MCP interface, which is the Model Context Protocol, so that the AI agents can absorb information through APIs, we use that. So, your web search, and your weather, and your finance APIs – there is no need for an agent to spin up a browser and type things literally into a Google.com homepage. You can get that information. However, today there is a long tail of the web that is just not designed for agentic flow. You know, your indie developer who wrote a tennis court reservation system in the Golden Gate Park – there's not going to be an API available for that. You just have to access it like a human would, and for that, we do. We have in-house browser-use agents that operate browsers like humans would, perceive webpages through screenshots, you know, click buttons, operate those forms. Now, it is the case that when you do that, you are accessing this information and contributing to the load on that website; however, I think it's important to note that in this case, this is exactly what a user wanted, and I would distinguish this. Classically, there is this historical understanding of automated systems on the web as adversarial – that the value exchange is only one way. That bots come to your website, they scrape your content. They do not contribute to anything valuable on your website; they just contribute to your traffic and your bills. I think we have to rethink this going forward. If we imagine a world in which most traffic on the web is user-issued agentic traffic, in that world, the value exchange can be much fairer. 
Somebody arrived at your webpage because a human told them to. They would've done it themselves, they would've opened the browser on their laptop; instead of doing that on their laptop, they asked an AI agent to do it on a remote browser. Functionally, there is a difference, but I think intentionally, there isn't a difference, and in that world, new economic incentives have to be created. Today, the incentives are the way they are because the web world is set up for advertisements to be served to human eyeballs.
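
The "use an API when one exists, otherwise drive a browser" routing described above can be sketched roughly as follows. This is a minimal illustration, not Yutori's implementation: `SourceRouter`, `register_api`, and the fetcher callables are hypothetical names, and the fetchers stand in for whatever API client or browser-use agent would actually do the work.

```python
from typing import Callable, Dict
from urllib.parse import urlparse

# A fetcher takes a URL and returns the content found there.
Fetcher = Callable[[str], str]


class SourceRouter:
    """Hypothetical router: prefer a registered API, else fall back to a browser."""

    def __init__(self, browser_fetch: Fetcher) -> None:
        self._api_fetchers: Dict[str, Fetcher] = {}
        self._browser_fetch = browser_fetch  # fallback for the long tail of sites

    def register_api(self, domain: str, fetcher: Fetcher) -> None:
        """Record that `domain` exposes an API surface (assumed, e.g. via MCP)."""
        self._api_fetchers[domain] = fetcher

    def route(self, url: str) -> str:
        """Report which path a URL would take: 'api' or 'browser'."""
        host = urlparse(url).hostname or ""
        return "api" if host in self._api_fetchers else "browser"

    def fetch(self, url: str) -> str:
        """Fetch through the API when one is registered, else through the browser."""
        host = urlparse(url).hostname or ""
        fetcher = self._api_fetchers.get(host, self._browser_fetch)
        return fetcher(url)
```

In this sketch, a weather or finance domain would be registered once with `register_api`, while an unregistered site such as a small reservation system would fall through to the browser path.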

Ryan Donovan: It's an attention economy, right?

Dhruv Batra: Exactly, and when there aren't human eyeballs visiting your webpage, there can still be a value exchange because somebody is sending an agent. You can think of that agent as a buyer's agent. They are there because they're representing an actual human with a very high degree of intent. And then you can talk about value exchange. Maybe, in some cases, the agent pays for access to the website. Maybe, in some cases, the website pays for attracting the agent to that website because you want to offer something that is relevant to the intent.

Ryan Donovan: Maybe the agent, or the underlying LLM, licenses the data. We've seen that too.

Dhruv Batra: That would fall under the paying for access that I mentioned, yeah. Absolutely.

Ryan Donovan: So, I wanna talk about the sort of functional nature of this, right? Traditional agents are at rest, not spinning up EC2 instances until a prompt comes along. What are the proactive agents doing?

Dhruv Batra: Yep. So, in Scouts today, the proactive agents are proactive after you've told us what you care about, what you want to monitor. So, what they are doing, you know, in purely technical terms, you can think of it as agentic search wrapped in a cron job, meaning that, you know, we are going to go out to the world, you were interested in some piece of information, we're going to go out with some frequency. That's the simplest way of understanding it. There are more technically challenging and interesting ways in which you can optimize this, because depending on the query, you should decide intelligently how often you want to go out into the world. This is where there are unique challenges that lie in this sort of product. If someone told you that they're interested in a particular piece of data that is correlated with the markets, that only happens 9:00 AM to 4:00 PM, then you shouldn't go out outside of that window, but they're not going to write all of this down. You should be intelligent enough to figure this out yourself. And you know, in other queries, depending on what you're finding in the world, maybe you're tracking whether there's a band that has come into town, you don't need to check every hour. You can check every day, you can check every week. And you can tell based on what you are getting as feedback from the world when you did go out the last time. So, there's proactivity in that. That's the product as it is today. It is a read-only product that does not go past authwalls and does not buy, book, or reserve anything on your behalf. However, where we are headed is exactly that world, because the reason why you issued this monitoring query is because you care about something. The reason why you're monitoring a band coming into town is because you want a ticket. 
The reason why you're monitoring a tennis court reservation system is because you want an appointment, and the next time these agents are going to come to you and say, 'hey, we found that time slot you were looking for. Do you want us to just buy it for you?' That is an escalation of trust. Still, you're in control, but it is proactivity in the sense that they're going to then go ahead and make that booking.
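
One way to picture "agentic search wrapped in a cron job" with an adaptive frequency is the sketch below. It is an assumption-laden illustration rather than how Scouts works: the `Scout` record, the halve-or-double backoff rule, and the interval bounds are all hypothetical, and the active window ignores weekends and time zones for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional


@dataclass
class Scout:
    """Hypothetical record for one monitoring query."""
    query: str
    interval: timedelta                          # current gap between checks
    min_interval: timedelta = timedelta(hours=1)
    max_interval: timedelta = timedelta(days=7)
    window_start: Optional[time] = None          # e.g. 9:00 AM for market data
    window_end: Optional[time] = None            # e.g. 4:00 PM for market data


def adapt_interval(scout: Scout, found_change: bool) -> timedelta:
    """Check sooner after the world changed; back off when nothing was new."""
    proposed = scout.interval / 2 if found_change else scout.interval * 2
    return max(scout.min_interval, min(scout.max_interval, proposed))


def next_check(scout: Scout, now: datetime) -> datetime:
    """Return the next run time, moved into the active window if one is set."""
    candidate = now + scout.interval
    if scout.window_start is None or scout.window_end is None:
        return candidate
    if candidate.time() < scout.window_start:
        # Too early: wait until the window opens the same day.
        return datetime.combine(candidate.date(), scout.window_start)
    if candidate.time() > scout.window_end:
        # Too late: wait until the window opens the next day.
        return datetime.combine(candidate.date() + timedelta(days=1), scout.window_start)
    return candidate
```

In the conversation's terms, a "band coming to town" scout would drift toward daily or weekly checks as long as nothing changes, while a markets-related scout would never be scheduled outside its 9:00 AM to 4:00 PM window. In a real system the window and starting interval would presumably be inferred from the query rather than set by hand.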

Ryan Donovan: So, for some of these write actions for AI agents, I've seen some organizations use things other than LLMs for that. Obviously, LLMs have some hallucination built in. Are you using or planning on using something other than LLMs, or do you think this is something LLMs can do by themselves?

Dhruv Batra: We are using LLMs—multimodal LLMs—because, as I mentioned, websites are laid out for human consumption, and so you have to see the website like a human would. I think the thing you're referring to is for any specific narrow use case and workflow – like if all you care about is this one particular tennis court and you're looking for a 7:00 AM reservation–

Ryan Donovan: Just fill in my address, right?

Dhruv Batra: Yeah. Fill in my address. Click this button. You don't need intelligence. All of the intelligence lies in the head of the programmer that writes this particular scraper out. And you just run it unintelligently in a cron job. That is the world we live in today. That is what people have been doing. There are entire communities of people writing, you know, scrapers and bots for, you know, restaurant reservations and catching shoe drops–

Ryan Donovan: Comments on websites.

Dhruv Batra: Comments on websites. We are taking an intelligence-first and a completely general approach. Anything on the web, if there is a piece of information out there, if you can do it as a human with a browser, we should be able to do it. It's not there yet uniformly. We– it is generally this phenomenon in AI that we tend to have jagged surfaces. There are some things that we are going to be superhuman at, there are other things that we're going to be worse than human at, which is why our first product is a read-only product. Mistakes are less costly. When you go to write actions, certain mistakes are going to be far more costly than other mistakes, and so you're going to have to sequence that. That's a product decision. From a technology perspective, what that means is we are going to have to create sandboxes where we can practice those things. These agents are trained with a set of techniques, for example, reinforcement learning, which is learning by interacting with the world and learning from feedback. And what you often need in those cases are sandboxes so that the mistakes aren't costly. This actually refers back to something that I said at the top of the show, which is– this is how we train robots as well. This is why there are 3D simulators of physical worlds. Because you don't break yourself, you don't harm others and yourselves in simulation.

Ryan Donovan: Right. You don't want the accidental terminators.

Dhruv Batra: You don't want that. But also, the methods that we have today are, you know, extremely data-hungry, and it is easy to generate that data in simulation.

Ryan Donovan: You know, a lot of this you're talking about– something I've sort of posited to folks is that there's gonna be one entry point to everything in the future. You have one piece of interface. And we had a writer write something that was– the AI is the UI. It's not the program itself; it's the UI. What do you think that that one final interface will be?

Dhruv Batra: Wonderful question. I don't have a pithy short answer to that question, but internally, at Yutori, this is exactly how we think about it: that we are reimagining humanity's interface with the digital world and the web. There are two key components we need: intelligence and generative user interfaces.

Ryan Donovan: Mm-hmm.

Dhruv Batra: Today, a human sits down, and it's typically a designer, thinks about that workflow that a consumer goes through. Where are the friction points? What makes sense? What is natural? What is aesthetically pleasing? What is frictionless? And they design that workflow. Tomorrow, that's not going to be the case. You're going to have interfaces generated for you. You're going to talk to intelligent systems. They are going to fan out and, you know, secure information from multiple websites and sources, so there isn't a single address that you're going to.

Ryan Donovan: Mm-hmm.

Dhruv Batra: When you say, 'tell me about a band coming into town,' or, 'tell me about, you know, my meetings today,' or, 'tell me about something,' you want an interface that compactly represents that information that you can interact with, you maybe want to zoom into. It is going to be a visual medium because, just the way we are wired up, it is a high-bandwidth pathway into our brains. You know, pixels are much higher bandwidth than trying to talk to humans. That is going to happen, but it's going to be an interface generated for you and your query – ‘what did you ask us to do?’ This is a dream that many people have thought about, and it's going to be a front-end to the web. What does that look like?

Ryan Donovan: What's the front-end to everything? Yeah. It sounds a little bit like the UI in 'Minority Report,' if you've seen it.

Dhruv Batra: I have, yeah. You know, it actually goes much longer before that. There's this– Doug Engelbart of Xerox PARC gave a talk – I think in 1967 or 1963 – which is now retroactively known as the mother of all demos. In that one talk, this man and that team introduce basically the fundamentals of what we today consider modern knowledge work. In this one talk, this man introduces graphical user interfaces, the mouse, a collaborative document editing system, a video calling interface– two people get on a call, they are editing the same document simultaneously. Each one of those features, each one of those ideas, over the next 50 years, becomes its own hundred-billion- to trillion-dollar company. Today, we think of that as knowledge work. This group of people imagined this interaction in the late 1950s, early 1960s. And I do think, with AI, we have that ability now. We are going to imagine, 'what does knowledge work, or interaction with a digital surface, look like?' It's not going to look like what we think of it today. This is a culmination of the last 50 years – the paradigm of the last 50 years. But there is a new paradigm coming, which is... talk to it.

Ryan Donovan: Yeah. So, I think this is– you promised a little wet blanket, but this is some good idealism for the AI era. But, I think, you know, there's gonna be a bit of a resistance from the folks who've built businesses on this. I've seen, you know, a lot of the larger enterprise companies who are building AI agent stuff – they want it to stay within their world, their ecosystem. You know, obviously, we know why they're resisting, but how can they get on board, you know, become part of the one world soup?

Dhruv Batra: I think we both understand why entrenched interests resist change. And it's not even– I generally don't ascribe to malice. This is the classic innovator's dilemma mixed with the principal-agent problem that, you know, you're a large enterprise, you have distribution, you have an existing product, it's a mature market, you have to serve your existing customers, you have optimized your product to the market and the customers. It's hard from that point to do a fundamental rethink that is going to immediately cannibalize your existing revenue. And understood. That's hard. That, coupled with the fact that with, you know, large existing bureaucracies where there are fractured interests, where you're trying to optimize, as a middle manager, your local pathway as opposed to the bigger picture... hard problem. That idea is not new. People have understood that this is what smaller players have an advantage at. I think what really matters is, is there a change that is beyond both the bigger and the smaller player coming? Is there a fundamental rethink that can trigger this? Is there a new technology? Is there a new regulation? Is there something new that can actually change behaviors? I do think we are in that moment. You know, this goes back to some of our earlier conversation – we're not at AGI, as broadly defined by the original thinkers of AI, but we are at something special. We have created general-purpose interaction machines with digital content, not with physical content yet, but with digital content. We have not yet productionized it to a degree that every problem is solved, but we have line of sight. And what this means is that we can actually rethink our relationship. The last 20, 30 years, it's been – the consumer has basically been – at the mercy of this economic incentive. That the way you pay for things is, you know, you are served advertisements, and that you get access to free services. 
I do think that incentive is ripe for change; that the consumer has demonstrated they're willing to pay for subscriptions and services. The arrival of AI systems means we can actually build personal assistants and AI chiefs of staff that serve you, and that you're willing to pay for, because they're delivering value to you. I agree with you that your data today, because of historical reasons, is locked into various services. The existing incumbents have a few choices: you can either start putting up walls, as some incumbents have–

Ryan Donovan: Sure.

Dhruv Batra: 'I do not let you take your data out even though you want to.' I think these moves will play poorly. You have users that want that value. I do not think that you can trap people for long. You may be able to do it for a short period of time into a service that they feel that they are no longer in control.

Ryan Donovan: Right. And until that business threat becomes existential, and then maybe it's too late at that point.

Dhruv Batra: Maybe it's too late. And that is how change happens.

Ryan Donovan: Yeah, hopeful and dire.

Dhruv Batra: I think one thing that we, understandably so, didn't get a chance to cover is the unique and interesting nature of these kinds of agents. So, our scouts, for example– our product has only existed at this point for 10 weeks.

Ryan Donovan: Okay.

Dhruv Batra: But I have scouts that have been running for those 10 weeks. That is an extremely long-horizon reinforcement learning problem. I have agents that have been interacting with the world and keeping me updated. I created a scout 10 weeks ago, or maybe just over that period when Meta had recently announced its acquisition, or acquihire, of Scale AI's co-founder. This predated the term 'Meta Super Intelligence.' At that point of time, I created a scout– 'hey, let me know if there's any future news about this acquisition.' That scout has, for the last 10 weeks, gone on this narrative arc. It interacts with the world frequently. It discovered that following the acquisition from Scale, there is a new lab that Meta created called Meta Super Intelligence. Then it began tracking Meta Super Intelligence: who's getting hired at the MSL lab, what places those are hiring from, what is happening to the labs that these people are moving from, what is happening to the startups that these people are moving from, and most recently, what is happening now to the departures from MSL of these people who in the last two and a half months have decided to leave. This is an extremely long-running agent. This is not how we typically build agents. Most coding agents, most LLMs, they're extremely short-lived. One interaction, a few turns in a chat, you know, a few hundred lines of code. We are moving towards a world where there are going to be persistent 'always on' entities that are tracking the evolution of something that's happening in the world. That's an interesting world.

Ryan Donovan: Yeah. It's the semantics you're tracking, not just having the keyword search. It's like you said, it's that AI search, LLM search, applied to a more proactive agent, right?

Dhruv Batra: Yep.

Ryan Donovan: That's awesome.

Ryan Donovan: All right, everyone, it's that time of the show where we shout out somebody who came onto Stack Overflow, dropped a little knowledge, shared some curiosity, and earned themselves a badge. Today, we're shouting out the winner of a Populist badge – somebody who came to a question and dropped an answer that was so good, it outscored the accepted answer. So, congrats to 'Don Kirkby' for answering 'Find all references to an object in python.' If you're curious about that as well, we'll have a link for you in the show notes. I am Ryan Donovan. I host the podcast, edit the blog here at Stack Overflow. If you have questions, concerns, topics, comments, et cetera, et cetera, email me at podcast@stackoverflow.com. And if you wanna reach out to me directly, you can find me on LinkedIn.

Dhruv Batra: Thank you for having me, Ryan. And I'm Dhruv Batra. I can be found at dhruvbatra.com, and my company is Yutori. We can be found at yutori.com. Thank you for having me.

Ryan Donovan: All right, everyone. Thanks for listening, and we'll talk to you next time.
