Every ecommerce hero needs a Sidekick

We spoke with Shopify about how they’re building developer-focused AI products last May; you can check it out here.

Sidekick is Shopify’s AI assistant that combines commerce knowledge with advanced reasoning. Learn more about how Shopify is using AI agents to evolve their product taxonomy at scale on their blog.

Connect with Vanessa on Twitter.

Congrats to user Erwin Brandstetter for winning a Great Answer badge for their answer to How to convert empty to null in PostgreSQL?.

TRANSCRIPT

[Intro Music]

Ryan Donovan: Tired of database limitations and architectures that break when you scale? Think outside rows and columns. MongoDB is built for developers, by developers. It's ACID compliant, enterprise-ready, and fluent in AI. Start building faster at mongodb.com/build.

Ryan Donovan: Hello everyone, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ryan Donovan, your host, and today we are talking about AI as a renaissance and what else is going on in the world of e-commerce. And my guest today is Vanessa Lee, VP of Product at Shopify. So, welcome to the show, Vanessa.

Vanessa Lee: Thanks for having me, Ryan.

Ryan Donovan: So, before we get into the details here, we'd like to get to know our guests. Tell us a little bit about how you got into software and technology.

Vanessa Lee: I am a robotics engineer. We call it mechatronics, for those who are really hardcore. But I did engineering in university here in Waterloo, in Canada. I've always loved building things, and so it was a very natural post-secondary education for me to have. And I think from there, I did a couple startups. I think this is back in the time before startups were that cool, and so it was not that in vogue to do startups. So, I felt like a little bit of a lone wolf. But whenever I was part of a team, whether it was in school and we were building a robot, you end up picking a specialization, because in robotics, someone does the sensors, someone does the hardware, and I was always the software person. And so, I really learned how to code in C++ and some Java, because that was the language of the day with Arduinos, and Raspberry Pis, and those kinds of things. And so, I came out, did my startup, in which case I was the only person coding, and then I found myself at Shopify nine years ago. And I started here as a senior PM working on our app platform. So, back then, we were really just getting started as a platform. We had some great APIs. There was actually a burgeoning ecosystem of app developers already, because the opportunities were so, so vast building for entrepreneurs, but we hadn't put a lot of deliberate effort at that time to build our platform. We didn't version our APIs when I joined, which was wild. We didn't have extensions, we didn't have functions, we didn't have a lot of the stuff that we have now. And so, almost a decade later, it's amazing to see how much our platform has grown in terms of capabilities, in terms of what you can build on Shopify as a developer. And then, yeah, I've expanded my role, but that's where it all started.

Ryan Donovan: Okay. So, last time we talked to the fine folks at Shopify, we had Glen Coates on as Head of Product. Obviously, you've been there for nine years, so you were immersed in the sort of philosophy of whatever Shopify thinks. How do you see this role, having newly ascended to it?

Vanessa Lee: We do so much as a company, I find. Our scope has increased from where you go to build your online store over the last decade, to where you go to get a point of sale, where you come to connect your store with agentic surfaces. We've just grown and grown and become truly the operating system of merchants' businesses. I've worked on quite a few parts of our platform, online store, and Liquid. Some of the Horizon updates that you chatted with Glen about I had worked with the team on quite a lot. So, it has been a fire hose over the last six months, but perhaps one that I had already, in some places, dabbled in. So, it's been, yeah, it's been a fun six months.

Ryan Donovan: Yeah. Let's get into the details today, the topics today. You think of AI as a renaissance for technology. We talk about AI pretty regularly on this program.

Vanessa Lee: I'm sure like all in 2024.

Ryan Donovan: Yeah, absolutely. And we get a healthy amount of pushback on it. We've found some skepticism in our developer survey. Basically, the more people use AI, the more skeptical they become. How do you see AI as a renaissance in that space?

Vanessa Lee: Yeah, that's a really good question. Tobi had, very early on, put forward a video, which kind of shared our ambition for Sidekick, right? It showed Sidekick being able to work on the platform to create products, to create collections, create all the primitives inside of Shopify, all the resources in Shopify, and do it, basically, with you alongside. So, you'd be able to review everything, but it's essentially able to draft a whole bunch of resources in Shopify. And that was really the start of our Sidekick journey. This was back in 2024, I believe. The last couple of years have really been an exercise of how do you build an AI agent at scale, right? Which, for those who have done it, it's not an easy feat, especially when you are starting from scratch. And so, the last couple of years we've been working a lot on Sidekick. When we came out earlier this year with a new architecture of Sidekick, we started seeing Sidekick be a lot more successful in most conversations. So, I purposely waited and held back the team from talking and shouting too much about Sidekick until I thought that it really drove some value.

Ryan Donovan: For folks who don't know, Sidekick is what?

Vanessa Lee: It's our, essentially, AI assistant. So, very similar to what a lot of platforms are doing, it lives alongside in the UI, you can ask it questions, it can create products for you, it can create collections for you if you're in Shopify, it can help you edit your online store. And so, it's able to help you traverse the entire platform. And so, when you're building something like that, we had to make sure that every question you threw to it, it would be relatively valuable.

Ryan Donovan: Right.

Vanessa Lee: And so, earlier this year, after we launched that architecture, we had finally seen, okay, now merchants, our users, are starting to truly demand Sidekick in more places. And so, I'd say the last seven months since then have been super fun for us. So, after building a ton of foundations over the last two years, now is the time where we get to really stretch our legs and say like, 'okay, in what places in the admin can we also deliver value using AI?' And so, 'renaissance' just captured, I think, our approach to where we're at. We put in a lot of hard work to get to this point and to make AI something that wasn't just a great demo feature, but something that actually repeatedly would deliver value.

Ryan Donovan: Yeah, and I think when getting a lot of pushback on AI in general, it's because of the sort of non-deterministic hallucination aspect of it. And you're talking about having it answer specific questions very well, or any given question – how do you prevent it from going off the rails and selling a car for a dollar, or something like that?

Vanessa Lee: There's a lot you do to make sure that it answers properly on the topics, but one thing that I don't think is very obvious for folks who haven't gone through this journey is how important your evaluation set is. You had written a little bit about it, as well. It's such a creative process where you use LLMs to grade other LLMs, which is such a fascinating thing. You also use LLMs to generate synthetic data that you can then use to form a ground truth set. So, one of the ways that we did that: we put a lot of work into the foundations of evaluations, but you also make sure that you have enough variety in that judge's training set, where you also have, essentially, negative cases. Right? And grading it negatively so that the judge is also able to spot when it answered and tried to sell you a car, which we absolutely do not want. Right? So, honestly, it's a lot of grunt work, it's a lot of time investment, but it's also being super creative about how we build this data set of evaluations and how we build that judge. But I think we're finally at the place where you put in that work, and then you do start to see the development internally of AI features get faster and faster, as a result.
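A minimal sketch of the judge-calibration idea described here, assuming a small human-graded ground truth set that includes negative cases. The names `GroundTruthCase`, `stub_judge`, and `judge_agreement` are hypothetical, and the keyword rule only stands in for a real LLM judge call so the example can run on its own.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GroundTruthCase:
    """One human-graded exchange; `human_pass` is the reviewer's verdict."""
    question: str
    answer: str
    human_pass: bool


def stub_judge(question: str, answer: str) -> bool:
    """Stand-in for an LLM judge (hypothetical).

    A real judge would send the exchange and a rubric to a model. This stub
    fails any answer that mentions an off-topic term.
    """
    off_topic_terms = {"car", "mortgage"}
    words = {w.strip(".,!?") for w in answer.lower().split()}
    return not (words & off_topic_terms)


def judge_agreement(
    cases: Sequence[GroundTruthCase],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of cases where the judge's verdict matches the human verdict."""
    if not cases:
        raise ValueError("ground truth set is empty")
    matches = sum(judge(c.question, c.answer) == c.human_pass for c in cases)
    return matches / len(cases)


GROUND_TRUTH = [
    GroundTruthCase("How do I add a product?",
                    "Open Products and choose Add product.", True),
    GroundTruthCase("Can you create a collection?",
                    "Yes, here is a draft collection to review.", True),
    # Negative cases: answers a human reviewer graded as failures.
    GroundTruthCase("What discounts can I run?",
                    "I can sell you a car for one dollar.", False),
    GroundTruthCase("How do I edit my theme?",
                    "I am not sure, try something.", False),
]

agreement = judge_agreement(GROUND_TRUTH, stub_judge)
print(f"judge/human agreement: {agreement:.2f}")
```

The stub catches the off-topic car answer but passes the vague answer that the human failed, so agreement lands below 1.0. That gap is the signal that the judge, or the ground truth set behind it, needs more work before it can be trusted.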

Ryan Donovan: Mm-hmm.

Vanessa Lee: And so, that's what's enabled us to really build a ton more features into Sidekick, and capabilities into Sidekick, reliably over the last six months.

Ryan Donovan: Glad you brought up the evals. I think those are becoming increasingly important. And somebody did some research showing that, at the top end, the LLMs aligned with human preferences about 80% of the time. Do you have a human in the loop to evaluate that extra 20%, or any way to identify that extra, 'this might be a bad eval?'

Vanessa Lee: Yeah. So, how we incorporate people into this, when you're looking at evals, I like to say to all R&D teams that work on AI, your evals are your new spec. Right? We're so used to, okay, we're gonna have a requirements document depending on which org you're in, whether requirements document or spec; and then you take that and then you build software, which is, in the traditional sense, very easy: rule-based systems, APIs, you're able to just build according to spec.

Ryan Donovan: Mm-hmm.

Vanessa Lee: In the world of AI, where you have to be able to assume a variety of input, your spec is actually your evaluation. Right? And that is the thing, right? So, if you think about how do we make sure that people are actually people on Shopify's side? Like, our opinions of what Sidekick should be, how it should act, those are all actually embodied in the ground truth set. Right? And so, that's how you go from human– it's not just LLMs making LLMs. We use LLMs creatively to help us scale, but for example, I would have a team say, 'okay, go and generate a bunch of conversations between Sidekick and let's say an LLM.' And you take those conversations, you just have to edit them, right? And so, that is the human in the loop. Now, the 20% that you're talking about that don't have the human alignment–

Ryan Donovan: Right.

Vanessa Lee: At the end of the day, there's too many. We hit a hundred million conversations on Sidekick. When you're looking at that scale, you can't have human in the loop for every conversation, but what you can do is you could take some of the sampled conversations that people are having that you don't align with, and you can grade them and say, 'this, we do not align with, put it in the ground truth set.' So, that 80% will continually get better and better, but I always tell folks, especially on the product side, 'cause I think this is how product is really changing as a craft – if you are building AI features, funnel all of your opinions, how you think this agent should work, into ground truth set. That's now the new spec for building AI, and that's the human in the loop. So, it's not like there's not someone behind the scenes helping to shepherd Sidekick. There's definitely a lot of humans.
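The sampling loop described here could be sketched roughly as follows, assuming conversations are sampled for human review and only the ones where the reviewer disagrees with the judge are added to the ground truth set. Every name below (`Conversation`, `GradedExample`, `sample_for_review`, `fold_into_ground_truth`) is hypothetical, and the judge and human grader are passed in as plain callables.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Conversation:
    conversation_id: str
    transcript: str


@dataclass(frozen=True)
class GradedExample:
    conversation_id: str
    transcript: str
    human_pass: bool


def sample_for_review(
    conversations: Sequence[Conversation], k: int, seed: int = 0
) -> List[Conversation]:
    """Pick up to k conversations for human review (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(list(conversations), min(k, len(conversations)))


def fold_into_ground_truth(
    ground_truth: List[GradedExample],
    sampled: Sequence[Conversation],
    judge: Callable[[Conversation], bool],
    human_grade: Callable[[Conversation], bool],
) -> Tuple[List[GradedExample], List[GradedExample]]:
    """Return (new ground truth, added examples).

    Only conversations where the human verdict differs from the judge's
    verdict are added, since those are the cases the judge has not learned.
    """
    known = {ex.conversation_id for ex in ground_truth}
    added: List[GradedExample] = []
    for conv in sampled:
        if conv.conversation_id in known:
            continue  # already graded in an earlier round
        verdict = human_grade(conv)
        if judge(conv) != verdict:
            added.append(
                GradedExample(conv.conversation_id, conv.transcript, verdict)
            )
    return ground_truth + added, added
```

In practice `human_grade` would be a review tool rather than a function call; the point of the sketch is only that human opinions enter the system as graded examples, not as per-conversation supervision.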

Ryan Donovan: Yeah. It's interesting the spec-driven evals. I've heard of spec-driven development with the AI agents.

Vanessa Lee: Yes.

Ryan Donovan: Is there a case for using both? And if you use both, is there an issue with contaminating the data sets?

Vanessa Lee: For us, we really use just ground truth sets and judge evaluation. That's really what we find works the most. And so, that's been the basis of almost every AI feature that we have launched. It's hard to say if there's a difference for both, but we definitely internally have a preference for a robust ground truth set, a judge that has a good respect for grading in the same way that a human would grade a conversation, has the same overlap, and then using that judge to accelerate all the other development. Then, the devs working, let's say in this case, on Sidekick, have something reliable that they can run for every PR to say, 'okay, how have we changed? If I run 10 test conversations across my PR or my branch of Sidekick, how much does that match? What is the LLM judge that I now can trust? What is it spitting out as an improvement because of this PR in comparison to main?' That's been our focus.
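A rough sketch of the per-PR check described here: run the same test conversations through the branch and through main, score both with the trusted judge, and report the difference. The agents and the judge below are stubs, and every name (`compare_to_main`, `concise_judge`, and so on) is an assumption for illustration rather than anything Shopify has published.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

Agent = Callable[[str], str]         # user message -> agent reply
Judge = Callable[[str, str], float]  # (message, reply) -> score in [0, 1]


def mean_judge_score(agent: Agent, judge: Judge, prompts: Sequence[str]) -> float:
    """Average judge score for one agent version over the test prompts."""
    return mean(judge(p, agent(p)) for p in prompts)


def compare_to_main(
    branch_agent: Agent, main_agent: Agent, judge: Judge, prompts: Sequence[str]
) -> Dict[str, float]:
    """Score both versions on the same prompts and report the delta."""
    branch = mean_judge_score(branch_agent, judge, prompts)
    main = mean_judge_score(main_agent, judge, prompts)
    return {"main": main, "branch": branch, "delta": branch - main}


# --- Stubs so the sketch runs; a real setup would call models here. ---

def concise_judge(message: str, reply: str) -> float:
    """Hypothetical rubric: reward replies of eight words or fewer."""
    return 1.0 if len(reply.split()) <= 8 else 0.0


def main_agent(message: str) -> str:
    return ("Thanks for asking. There are many ways to approach this, "
            "so let me walk through all of them.")


def branch_agent(message: str) -> str:
    return "Open Discounts and choose Create discount."


TEST_PROMPTS = ["How do I create a discount?", "Where do I set up a sale?"]

report = compare_to_main(branch_agent, main_agent, concise_judge, TEST_PROMPTS)
print(report)
```

A positive `delta` means the judge prefers the branch. That number is only as meaningful as the judge's agreement with human graders, which is why the judge has to be calibrated first.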

Ryan Donovan: And in running these test conversations, how much do you rely on simulated environments, simulated sort of user conversations?

Vanessa Lee: Less and less so, right? That, I think, was a big part of how we had to get started.

Ryan Donovan: Mm-hmm.

Vanessa Lee: When you have no conversations and no real user interactions to go off of, you need to use the synthetic data generated, and it was hilarious. There were some times where we didn't quite tune the merchant LLM in our case (we call it the merchant LLM) quite well. And so, they would just agree with each other, and they would just go off into like a never-ending conversation, 'cause that's what LLMs obviously are trained to do. And so, it's actually not that easy to generate synthetic tests. I'd say nowadays, now that we've actually gone live and we have real user conversations that we can then grade, and make sure that we align with, our judge, we use that more than we do synthetic.
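One way to guard against the never-ending agreement loop described here is to cap the number of turns and stop early once the simulated merchant only agrees. The sketch below assumes both sides are plain callables; the phrase list, the thresholds, and names such as `simulate` and `is_pure_agreement` are hypothetical choices, not a description of Shopify's setup.

```python
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]      # (speaker, message) pairs
Speaker = Callable[[Transcript], str]   # transcript so far -> next message

AGREEMENT_PHRASES = ("sounds good", "great idea", "i agree", "thank you")


def is_pure_agreement(message: str) -> bool:
    """True when a message agrees and asks nothing new."""
    text = message.lower()
    return "?" not in text and any(p in text for p in AGREEMENT_PHRASES)


def simulate(
    merchant: Speaker,
    assistant: Speaker,
    max_turns: int = 6,
    max_agreements: int = 2,
) -> Tuple[Transcript, str]:
    """Run a synthetic conversation and return (transcript, stop_reason)."""
    transcript: Transcript = []
    agreements = 0
    for _ in range(max_turns):
        merchant_msg = merchant(transcript)
        transcript.append(("merchant", merchant_msg))
        # Count consecutive agreement-only merchant turns.
        agreements = agreements + 1 if is_pure_agreement(merchant_msg) else 0
        if agreements >= max_agreements:
            return transcript, "agreement_loop"
        transcript.append(("assistant", assistant(transcript)))
    return transcript, "max_turns"


# --- Stub speakers standing in for the two LLMs. ---

def agreeable_merchant(transcript: Transcript) -> str:
    if not transcript:
        return "How do I create a discount?"
    return "Sounds good, thank you!"


def polite_assistant(transcript: Transcript) -> str:
    return "Great idea, happy to help with that."


conversation, reason = simulate(agreeable_merchant, polite_assistant)
print(reason, len(conversation))
```

A conversation that ends with `agreement_loop` is a hint that the merchant persona needs a goal or some friction in its prompt, rather than a transcript worth adding to a test set.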

Ryan Donovan: Yeah. It feels like, when people would rely on the synthetic and the evals too much, it almost feels like you're getting towards a sort of model collapse, right?

Vanessa Lee: Yeah. It becomes a bit recursive. Yeah. I remember there being a point where like, what exactly, how did we get here? How are we using the LLMs to talk to other LLMs? But you always just have to remember the human in the loop. You have to find the place in the development cycle where you are going to insert your opinion. If you don't have that, then yes, that is a little bit– are you sure that there is a ceiling? The humans always bring the ceiling, right? So if two LLMs are talking, and they've gotten the answer correct, but you're like, 'it's not quite how I'd wanna answer. I would like it to be more concise,' or whatever your opinion would be. It's you who's then going in, correcting it, and raising the ceiling. Because yeah, if not, you're just gonna peter out with, okay, this is what the LLMs are doing. I guess you could change your prompt a little bit, but the human in the loop is what really raises the bar. And to be honest, that work never really ends. We still have people today in Sidekick, every month, putting in and refreshing our ground truth set.

Ryan Donovan: Yeah. A while back, we talked to one of your engineers, distinguished engineer, Ilya Grigorik, and he was talking about micro front-ends and components. I've talked to other folks who are tokenizing at the component-level for software. Are you thinking about that at all? Sort of Sidekick bringing up pre-vetted components?

Vanessa Lee: We tried that. We aren't launching anything yet. We're playing a lot with UI. Okay, so one of the things that we've talked about internally is, how does Sidekick come out of just a text shell, right? So, right now you speak with Sidekick, you have conversations, but one of the things that is fascinating that I'm super curious to see where we go in the next year, is how do LLMs help to change the way that we interact with UI, as well?

Ryan Donovan: Mm-hmm.

Vanessa Lee: Right? So, not just text-based, but also, okay, let's say, especially in our case, where we have millions of businesses, and every business has a different workflow, a different need, how does Sidekick, or let's say another LLM that we build, how does it build UI to fit a merchant's specific needs? Which is not something that was ever possible before.

Ryan Donovan: Right.

Vanessa Lee: Right? Without LLMs. And I think that's a really exciting way to think about software. In the past, software is limited by the pixels that are on the screen, and so you try, as elegantly as you can, to put in as much functionality without overwhelming your user, and that's perpetually something that's really hard to do. And we still want to make sure that UI is really phenomenal out of the box. But there is room for some customization that can be done by a merchant. So, for example, it's not quite what Ilya is talking about, but it's something that we're launching: basically, the ability for Sidekick to generate apps for you, custom applications for your business. So, if you, for example, manage the tags of your product metadata in a different way than how our UI represents it, and that is a frontline thing that you want your merchandisers to edit (we put it at the bottom of the page, you want it at the top of the page), you can then say, 'okay, I want to create a merchandising application where my merchandisers can go in, and it has tags at the top, it manages certain metafields, which are custom fields over here,' and then it becomes a new way for merchants to work with Shopify. So, that's been a pretty fun thing to go and offer merchants, which it probably would've taken them a while, or at least cost them a lot to build for themselves.

Ryan Donovan: It's almost like building in vibe coding, yeah?

Vanessa Lee: It is. Vibe coding for us, for not the average merchant, has been something that I think we've all done in our own day-to-day work, but I think how do you take that, and then how do you give that power to a user that's not as technical?

Ryan Donovan: Mm-hmm.

Vanessa Lee: If we don't, if we're not fearful about it, if we just explore that optimistically for a second, that is something that we would've never been able to do without AI. And I do think that's a really exciting way to think about user interfaces in the next decade – how are user interfaces highly personalized to what Ryan wants? What Vanessa wants? And that's pretty cool.

Ryan Donovan: Yeah, so I think we've all seen those demos of on-the-fly interfaces per person. Are you actually thinking about that level of customization?

Vanessa Lee: I think we're still in the early days of it. So, there's also latency concerns, and we have to make sure that UI isn't changing on you every two seconds. There's still some fundamental like user behaviors. If things were to change on you every time you log into Shopify, that's too jarring. But this felt like the right first move for us where we're saying, hey, this business, one, we've actually had the luxury of investing in our app platform now for almost a decade. So, we have all these tools, we have the right GraphQL APIs, we have the right front-end components in terms of what we've done with Polaris. And then now we can give all of those platform tools to an LLM and say like, 'okay, now create something that is bespoke for this business,' and then they can install it and use it over and over again.

Ryan Donovan: Mm-hmm.

Vanessa Lee: I think we're still a bit of ways from real-time generating UI for our user, but this felt like the right slight shift for us to start to see whether this is gonna be valuable.

Ryan Donovan: Yeah. On the couple of e-commerce API platforms I've worked on, I was surprised to see how much of that was just storing data on products. Does Sidekick help out or affect the data side of the house?

Vanessa Lee: Yeah. So, when you're talking about data, are you talking about for a user, like it's actually able to generate data on the backside for you?

Ryan Donovan: Yeah, for the products, the t-shirt sizing– like, I know sometimes products will have very complicated and very specific requirements on what data they store.

Vanessa Lee: Yeah. So, one of the things that we've worked on for the last couple years is actually looking at the data model of Shopify and understanding, hey, we have now products across millions of merchants, and that is a fantastic position for us to be in, but also makes it very hard for platforms who connect to us to understand, this t-shirt from Merchant A and this t-shirt from Merchant B, all of the metadata is stored in different ways, right? So, you have the product description, which might have some details. You have the details of the size and fit in a metafield in a different merchant's store. And so, one of the things that we did, actually starting a couple years ago, was use LLMs to start to properly categorize products, and properly create attributes. So, this is where I'm super proud of one of these launches. We've worked on it behind the scenes over the years, but last year we actually basically embedded these predictions into Shopify. So, if you started and said, 'okay, I'm creating a new product. Here's my sweater,' right? Upload an image of the sweater and then write some, like, hey, this is the 'Vanessa sweater,' it would start to be able to say, 'hey–' an LLM would run in the background, say, 'I know the category of this is apparel, tops, sweaters,' let's say (we have a standardized taxonomy that we've created), and then the attributes are sleeve length, material, color, right? And so, these are then, based on the images that were uploaded, then also automatically suggested for you. So, it just makes your life a little bit easier and nudges you a little bit into, okay, yes, I agree that it's colored black, and the sleeve length is X, and then it allows us to actually create better, more standardized product listings, not just for their shop, but also for all merchants as a whole.
We're able to then work with partners like OpenAI and say, 'hey, we have a product catalog that you can plug into,' so that our merchant's products are actually surfaced in these surfaces, and all of the products are actually categorized, and have the right attributes. Right? So, this is work that's ongoing, but it has been something that we've really worked hard on over the last couple years.
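As a rough illustration of the suggest-then-confirm flow described here, the sketch below models a category path plus suggested attributes that a merchant can accept or ignore. The data shapes, the `stub_predict` keyword rule, and the sample taxonomy path are assumptions for illustration; the real predictions come from a trained model and Shopify's own standardized taxonomy.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class CategorySuggestion:
    """A predicted taxonomy path plus predicted attribute values."""
    category_path: Tuple[str, ...]
    attributes: Dict[str, str]


@dataclass
class ProductListing:
    title: str
    category_path: Tuple[str, ...] = ()
    attributes: Dict[str, str] = field(default_factory=dict)


def stub_predict(title: str) -> CategorySuggestion:
    """Stand-in for the background model (hypothetical keyword rule)."""
    if "sweater" in title.lower():
        return CategorySuggestion(
            category_path=("Apparel", "Tops", "Sweaters"),
            attributes={
                "sleeve_length": "long",
                "material": "wool",
                "color": "black",
            },
        )
    return CategorySuggestion(category_path=("Uncategorized",), attributes={})


def apply_suggestion(
    listing: ProductListing,
    suggestion: CategorySuggestion,
    accept_category: bool,
    accepted_attributes: Iterable[str],
) -> ProductListing:
    """Copy over only what the merchant confirmed; nothing is applied silently."""
    if accept_category:
        listing.category_path = suggestion.category_path
    for name in accepted_attributes:
        if name in suggestion.attributes:
            listing.attributes[name] = suggestion.attributes[name]
    return listing


listing = ProductListing(title="Vanessa sweater")
suggestion = stub_predict(listing.title)
apply_suggestion(listing, suggestion, accept_category=True,
                 accepted_attributes=["color", "sleeve_length"])
print(listing)
```

Keeping the suggestion separate from the listing until the merchant confirms it mirrors the "nudge" described above: the model proposes standardized values, and the merchant stays in control of what is saved.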

Ryan Donovan: Yeah. I talked to somebody in machine learning at Etsy, and talking about how they are trying to categorize products, and I'm sure you all have a similar issue where it's like, could be anything, could be 'custom handshake', 'cursed mannequin', whatever. Right? How do you categorize those?

Vanessa Lee: We have a pretty robust taxonomy tree that we continually add to, because you're right, there are so many different types of underwater cameras that you had no idea, and so many attributes of them that people need to be able to understand which one to buy. So, I think that's just an ever-growing task. We started this actually, about a year and a half ago at this point, and it's just something that we've continually invested in. I don't think there's a secret sauce to it other than you need to train a model. You need to create a bunch of labeled data sets. And this is just, it can be just a large ML model. It doesn't need to be an LLM, necessarily, but I think it just takes a lot of work, to be honest. But it is an important task. Everyone who has product data across many sellers will be very familiar with this problem.

Ryan Donovan: Right, right. Yeah. Yeah. So, when you're thinking about new features for this, how do you weigh the needs of somebody who doesn't know anything about it, they're creating a little store for whatever, for the wedding registry or something, to the developer who is coding in, soup to nuts, everything in the e-commerce platform?

Vanessa Lee: When you're building your own brand—and this is something that we've kept true to ourselves throughout—we always underestimate how much merchants care about their brand, and how much it's about expression. Right? And so, one of the things that we've been really passionate about is making sure that, whether you're a developer or you are a mom and pop who has no developer on staff, you can come and create something, let's say an online store in this case, that feels native to your brand.

Ryan Donovan: Mm-hmm.

Vanessa Lee: Right? And feels honest. And so, a lot of the times that could mean, okay, I'm able to hire a developer. But in the case of no code, you're able to go to the theme store, find a theme that feels close to what you want, and then be able to customize it, and build it in a way that, 'okay, now it really is my brand.'

Ryan Donovan: Right.

Vanessa Lee: And so, I think when it comes to building yourself, and we've had a lot of conversations over the years, especially during the 2020 era where there's a lot of folks building headless, especially if they did have developers on staff. I think that no matter what, there will be, always, different architectures, different constellations of services that you might have to bring together, especially if you're in the larger category where we're always gonna have that escape hatch. Our approach has always been, we wanna be with you no matter if you choose to develop your own, let's say headless storefront, or if you are coming in and installing a theme. But one thing that has always been true throughout, I know the last decade, is: I've always observed merchants to be extra efficient. They live and die by how efficient they are in their day, how productive their teams are. And so, I think that no matter what they wanna achieve, their questions are always coming back to, 'okay what's the most efficient way for me to achieve the brand that I have in my head, the customer experience that I wanna create?' And so, I think that we offer both, but I think at the end of the day, we've seen a lot of folks just say like, 'you know what? I can do a lot, even without needing to go headless.' But we're never one shop to say, 'okay, you can only go a certain way.' I think we always have to acknowledge that developers will always have needs and wants, and brands will always have things that they wanna do that's unique to them.

Ryan Donovan: So, Sidekick is now out in the wild. What are you excited about for the future of this AI renaissance?

Vanessa Lee: As user behavior changes from just working in our UI to now working more and more increasingly with Sidekick, one of the things that was really important to me was that we made sure that there was a way that our ecosystem could come with us, right? We're never gonna be a platform that builds every piece of functionality across millions of businesses, across all verticals and all sizes. That's just always been our belief, and so, one of the questions that I get a lot from our ecosystem is, 'when will Sidekick be able to work with my app?' If a merchant says, create me a discount, how can Sidekick then go create the discount? But also say, 'and let me draft that in an email to this customer segment for you,' and let's say I use an app for my email, how does that app participate in that conversation? And so, one of the things that we started releasing in a developer preview, 'cause we wanna develop out in the open, is our ability for Sidekick to essentially launch what we call 'App Intents,' which are ways for you to register tools for Sidekick to be able to then use that in their conversations in workflow so that merchants can actually access your app from conversations. So, that's one that I'm probably– it's a developer preview, so it's early, but I'm excited to see where the next 12 months goes.

Ryan Donovan: I think the last time we talked, you all had an MCP server. Does this use MCP or anything like that?

Vanessa Lee: So, it's very MCP-like in the way that we've architected and built the schema. So, you define the schema very much like MCP. It's not MCP exactly, because there's some stuff that we wanted to make sure you didn't have to do. You didn't have to stand up a server yourself. But it is very MCP-like. Our MCP tool that we launched earlier this year, we also upgraded that. So, if you are a developer building an app on Shopify, not only did the MCP tool that I know you spoke to Glen about, not only can it do GraphQL like it did midyear, halfway through 2025, it can also now basically use the Shopify CLI. So, we put a lot of work into our CLI, which makes it easier for you to create a test environment, create an application, create the TOML file where you specify what your app is supposed to do in terms of metafields, and everything. And then now, it's able to do that holistically. So, this MCP tool can actually create an application from start to finish with all the tools that we offer in our platform, which has been fun to see what people can do very quickly in Cursor, and they can now build an app very quickly. So yeah, I'd say both of those things are two big launches for the developer community that I'm hoping will make lives easier for developers to participate.
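To give a feel for what "MCP-like" schema registration can look like, here is a small sketch that is not Shopify's App Intents format. It borrows the shape of an MCP tool definition (`name`, `description`, and a JSON Schema under `inputSchema`); the `ToolRegistry` class, the `draft_campaign_email` tool, and its fields are hypothetical, and the validation is deliberately minimal rather than full JSON Schema.

```python
from typing import Any, Callable, Dict


class ToolRegistry:
    """Holds tool definitions an assistant could discover and call."""

    def __init__(self) -> None:
        self._tools: Dict[str, Dict[str, Any]] = {}

    def register(
        self,
        name: str,
        description: str,
        input_schema: Dict[str, Any],
        handler: Callable[..., Any],
    ) -> None:
        if name in self._tools:
            raise ValueError(f"tool already registered: {name}")
        self._tools[name] = {
            "name": name,
            "description": description,
            "inputSchema": input_schema,  # same key MCP uses for tool inputs
            "handler": handler,
        }

    def call(self, name: str, arguments: Dict[str, Any]) -> Any:
        """Check required and unknown keys, then dispatch to the handler."""
        tool = self._tools[name]
        schema = tool["inputSchema"]
        missing = [k for k in schema.get("required", []) if k not in arguments]
        if missing:
            raise ValueError(f"missing arguments: {missing}")
        unknown = [k for k in arguments if k not in schema.get("properties", {})]
        if unknown:
            raise ValueError(f"unknown arguments: {unknown}")
        return tool["handler"](**arguments)


def draft_campaign_email(segment: str, subject: str) -> Dict[str, str]:
    """Hypothetical handler an email app might expose to the assistant."""
    return {"status": "draft", "segment": segment, "subject": subject}


registry = ToolRegistry()
registry.register(
    name="draft_campaign_email",
    description="Draft an email to a customer segment for merchant review.",
    input_schema={
        "type": "object",
        "properties": {
            "segment": {"type": "string"},
            "subject": {"type": "string"},
        },
        "required": ["segment", "subject"],
    },
    handler=draft_campaign_email,
)

result = registry.call(
    "draft_campaign_email",
    {"segment": "repeat-customers", "subject": "A discount just for you"},
)
print(result)
```

The declared schema is what lets an assistant decide when a tool applies and which arguments to fill in from the conversation, which is how an email app could take part in the discount example from earlier in the interview.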

Ryan Donovan: All right. It's very exciting. Very exciting.

Ryan Donovan: It's that time of the show again where we shout out somebody who's gone onto Stack Overflow, dropped some knowledge, shared some curiosity, and possibly earned themselves a badge. Today, we're shouting out a Great Answer badge winner, somebody who dropped an answer that was so good, it scored a hundred points or more. Congrats today to Erwin Brandstetter, who answered 'How to convert empty to null in PostgreSQL?' So, if you're curious about that, we'll have the answer for you in the show notes. I'm Ryan Donovan. I edit the blog, host the podcast here at Stack Overflow. If you have topics, questions, concerns, comments, you can email me at podcast@stackoverflow.com. And if you wanna reach out to me directly, you can find me on LinkedIn.

Vanessa Lee: Thanks for having me, Ryan. I'm Vanessa, and you can find me on X, v.laurenlee. And yeah, excited to have had this chat, Ryan.

Ryan Donovan: Thanks for listening, everyone, and we'll talk to you next time.
