What (un)exactly do you mean by semantic search?

Qdrant offers high-performance vector search at scale with any deployment model.

Connect with Brian on LinkedIn or email the Qdrant team at support@qdrant.io.

Congratulations to user Brad Larson for winning a Populist badge for their answer to Find the tangent of a point on a cubic bezier curve

[Intro Music]

Ryan Donovan: Hello, and welcome to the Stack Overflow podcast, a place to talk all things software and technology. I am your host, Ryan Donovan, and today we are talking about the difference between vector databases and Lucene architectures, when they're appropriate to use, one or the other, and if there is a composable, portable way to use them instead. So, my guest for that is Brian O'Grady, Head of Field Research and Solutions Architecture at Qdrant. Welcome to the show.

Brian O'Grady: Hi, Ryan. Thanks for having me on.

Ryan Donovan: Before we get into the database topics today, can you tell us a little bit about how you got into software and technology?

Brian O'Grady: When I graduated from university back in 2016, I initially went into finance, actually. I was working at Fidelity Investments. After some time there, I got an opportunity at Goldman Sachs, and I entered their technical organization and worked there for four years as a data scientist. I then moved over to Shopify, where I was a data scientist in MLE for about two and a half years. And after that, I moved into the world of databases. So, I moved on to DataStax, and once that was acquired by IBM, I jumped ship and moved over to Qdrant, which was a quasi-competitor at that point because DataStax had its own vector search offering. So, I saw Qdrant a lot in the field and came to appreciate its product begrudgingly.

Ryan Donovan: Yeah. It seems like today every database has a vector offering, right? And today, we're talking about when it's appropriate to use the vector database and the sort of older technologies. For folks who may be unaware, what is the Lucene database?

Brian O'Grady: Apache Lucene is a text search engine that was developed in the late 90s. Very mature, very rich feature set for text search. It is the underlying index that powers Elasticsearch, Solr, OpenSearch, and even if you are like me, working with Cassandra, even powers the text indexes in Cassandra. So, for a long time, if you were ever on a website and searching for red Nike shoes, and you saw the search results pop up underneath, more than likely that was being powered by Lucene.

Ryan Donovan: Obviously, vector databases have become very popular because of AI, but Lucene, and Elastic, and things that are powered by Lucene are still very popular. What's the comparison, the way of thinking about when each one is appropriate and what the trade-offs are?

Brian O'Grady: It's definitely going to depend on what you're trying to achieve with your application. So, if you think about when do people use Elasticsearch, OpenSearch, what kind of general use cases are they looking for, a lot of times it's these live applications where you're surfacing e-commerce results to an end user. A lot of what actually drives their revenue are logs and analytics. So, this is just, "hey, I have a dump of all of my security logs." You dump them in Elastic, and you just occasionally search through to find whatever happened on this specific day, right? And this is where a text search is really good because typically, in security events logging, you're looking for exact terms, right? You're saying, "I want to know exactly where this error appeared."

Ryan Donovan: Some other trace UUID or whatever.

Brian O'Grady: Yeah. And the issue with if you try to do vector search for the same thing is you wouldn't get exact matches because vector search is approximate, and you lose information. So, if you tried to search for this exact UUID and embed it as a vector, and then try to search that vector against other vectors: number one, you're losing information when you're doing the embedding because that's a natural flaw, like natural process loss; and then number two, you're only going to be getting approximate results because vector search at scale is an approximate search, whereas text search is always an exact search, right? You're always getting exact results back. So, that's, I would say, a key use case for Elastic and OpenSearch. People doing logging analytics where you need that exact search functionality, but when you start diving into you have maybe [a] user-facing application that needs to service a lot of requests, you don't necessarily need to have exact matches. Maybe you want people to– when people search iPhone, you want to also be able to surface them other types of phones out there. You don't just want to surface iPhones, you want to give them a bunch of options. Text search will fail here because text search will only look for pieces of text that include iPhone, whereas, quote-unquote, semantic search, which is really representing text as embeddings, tends to preserve this idea that different phone types are related to each other. So, you can surface non-exact results to people that may actually still be relevant to what they're looking for. At scale, this is typically where at Qdrant, we see Lucene-based architectures start to fail.
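
The contrast Brian draws here can be sketched in a few lines of Python. This is a toy illustration only: the catalog and its three-dimensional vectors are hand-assigned assumptions, not the output of any real embedding model. The vector search below is brute force over every item, so it is exact. Production systems typically swap in an approximate index for speed.

```python
import math

# Hypothetical catalog. The 3-d vectors are hand-assigned for illustration
# and do not come from any real embedding model.
CATALOG = {
    "iPhone 15 case": [0.9, 0.1, 0.0],
    "Pixel 8 smartphone": [0.8, 0.2, 0.1],
    "Galaxy S24 smartphone": [0.85, 0.15, 0.05],
    "Red running shoes": [0.0, 0.1, 0.9],
}


def keyword_search(query, docs):
    """Exact term match: keep only documents whose text contains the query."""
    q = query.lower()
    return [d for d in docs if q in d.lower()]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def vector_search(query_vec, catalog, top_k=3):
    """Brute-force nearest neighbours by cosine similarity (exact, not approximate)."""
    ranked = sorted(catalog, key=lambda d: cosine(query_vec, catalog[d]), reverse=True)
    return ranked[:top_k]


# The query vector stands in for an embedding of the word "iphone".
keyword_hits = keyword_search("iphone", CATALOG)
vector_hits = vector_search([0.9, 0.1, 0.0], CATALOG)
print(keyword_hits)
print(vector_hits)
```

The keyword search surfaces only the item that literally contains "iphone", while the vector search also ranks the other phones ahead of the shoes because their vectors point in a similar direction.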

Ryan Donovan: A lot of database types today, they have a vector add-on, right? Is that sort of vector add-on useful in conjunction? Does it help mitigate those failures, or are you better off with a sort of pure vector database for those sorts of things?

Brian O'Grady: Yeah. What I'll say is that not all vector indexes are created equal. And if we look at who are the, what we call 'vector natives' in the space, we think, okay, there's Qdrant, there's Milvus, there's Pinecone, right? Pinecone's a big one. And I think that's the main three or four that you see people going out after there. Then, you contrast with, okay, what offerings are there that are what we call in the industry bolt-on vector search indexes. The main one that people tend to be using is—it's either two, it's either they have an existing Elastic or OpenSearch cluster, they want to add semantic search, AKA vector search capability, they bolt on a vector index to their existing cluster, and voila, they blow up, they run out of memory, and then they have to completely resize and think about how to go about their workload. The other common one outside of Elastic would be, and probably the most well-known example, is Postgres. Postgres has their pgvector. Very common. People are using it all the time. And what's interesting is that I think a lot of other vector database companies out there see someone using pgvector and they'll write a lot of, "no, pgvector is bad, it doesn't scale." And I'm a bit like– I don't see it as much of a threat. I actually see someone using pgvector as an indicator that they will eventually use Qdrant. Yeah, I see this so frequently that people start out on pgvector because it's super simple to use. You know it's just gonna work. People can stand it up in Docker locally, and they can just add on pgvector extension, and suddenly they have everything. They have transactional data, they have vector search running in the same workload, right? The issue becomes when they actually hit scale, and typically what I see is around– they'll have 10 million rows in their database, and they'll have their vector search running, and suddenly their latencies are spiking to 60 seconds for a single request. 
And by the way, also their traditional SQL transactional workload is failing because the vector search index in the background is taking up so much computational power that they have to eventually separate them and go to a dedicated service. I think a lot of people are afraid of pgvector, but again, to me, it's more like– I've called it almost like a gateway drug to vector search. As soon as you use pgvector, you're in.

Ryan Donovan: Yeah. Postgres almost seems like the new MySQL, right? It's the new starter database. But that sort of answers an interesting question that I've had, where it's like, now I see so many specialized databases popping up. I've also seen some that are like, "you only need this one database." What is the advantage of just focusing on a vector database?

Brian O'Grady: Yeah. I think it's like anything else in technology, that there are a lot of use cases where, yes, your single database or your single monolithic architecture will work up until a certain point, right? And it's not just databases, right? I go on to a lot of GitHub repos, and there are those monolithic repos where they have everything under the sun, right? And maybe Qdrant is a part of it, and I'm trying to update the version, and suddenly I can't just update the version because it's a monolithic repo, now the build time is astronomically large. I'm running into weird dependency conflicts because this version of LangChain Qdrant is different than the LangChain version that comes with the LangChain community; they have installed some other part of the repo. So, what I think it comes down to is Qdrant's take on this is that we want to follow the Unix philosophy, where it's do one thing and do it extremely well. So, you can imagine a scenario where people who follow that philosophy, they tend to find specialized tools for their different tasks, and then just coordinate them properly, rather than trying to work with a more monolithic architecture that tries to handle everything, which can become untenable with time and scale, right? So, I think it becomes easier to maintain. It allows for better separation of concerns. To say you want to have a dedicated vector database, maybe some people say there's some added overhead of it now having to coordinate different services, but we're already living in a very microservices-centered world, so it fits in.

Ryan Donovan: Yeah. I've heard people talk about microservices as being more of an organizational ordering philosophy, but this sort of focus on separation of concerns is interesting because that makes every piece of your stack replaceable, or swappable, composable, if you will.

Brian O'Grady: There you go.

Ryan Donovan: So, I don't usually hear about databases being part of that composability. Can you talk about how that sort of works, [and] how you think about that at Qdrant?

Brian O'Grady: Yeah. So, the general idea is that our opinion is that no matter where you're working with vector search, you should have a single API that does it for you. And this means you have vector search that's running on EC2 instances in the cloud. You have a use case where you need to deploy vector search on an edge device, like we have. We just released a demo showing how to do vector search on device for anomaly detection of real-time street cameras and identifying violent events using vector search. So, this is possible with Qdrant. If you want to run like Qdrant just in the cloud, completely abstracted from you, you want to stand it up in Docker on your local Mac, you can do all of these things, and the API is the same, right? So, you can use it anywhere you want. And the kicker is that they're even running Qdrant on supercomputers. So, you can run it literally anywhere, from the smallest edge device, provided there's enough storage on it, to a supercomputer. You can run Qdrant and have the same API, unified API.

Ryan Donovan: For a specialized database like Qdrant, is it easier to scale to do all the sharding, duplication, all the sort of things that people talk about with a scalable database?

Brian O'Grady: Yeah. These are all things that are natively baked into Qdrant's API. And obviously, you can do these things on your own with open source, but these things become easier in Qdrant Cloud. We don't have anything proprietary in Qdrant Cloud from the perspective of the engine, right? So, this means we want people who are using it locally to be able, again, to have the same API. They don't have to change their API going to the cloud. There's not some search functionality that's only available in the cloud that's not available locally. But what we do is we just coordinate the orchestration of Qdrant, and have solved some of the harder problems around its coordination that were you on open source, you're probably dealing with some of those pains.

Ryan Donovan: Sure. Yeah, no, I've talked to a sharding group, and that is incredibly complicated when you get down into it. You talk about the vector search for a bunch of things that aren't necessarily text. Are there sort of complications to that with a database?

Brian O'Grady: From Qdrant's perspective, no. Since these things, these elements, these entities are representable as vectors, they're just more vectors to search over. So, Qdrant doesn't necessarily care whether or not the embedding that you're dealing with is a text embedding, an image embedding, a video embedding, what have you. The search is the same, which is a bit of a different topic, right? It's quite amazing to me that researchers have figured out a way to essentially take all these human-level entity abstractions and represent them mathematically. That in and of itself is a very interesting topic, and the ability to represent anything, quote-unquote, in a way so that it can be interpreted, so to speak, by a computer is super interesting.

Ryan Donovan: Yeah. It's just next token prediction. I've seen it for gestures and movement, as well as steps in a process, like a job search or something, and that is super interesting.

Brian O'Grady: Yeah. So, to reiterate, from Qdrant's perspective, there's nothing necessarily different about it. The only difference would be each embedding model that you use has different characteristics. And this, again, comes down into the hard math of it in that the underlying vector space of where the embeddings are, quote-unquote, living in hyperspace might be, quote-unquote, shaped differently, which might lead to different search characteristics. So, what's really interesting is that with text search, historically, you do a search result, you have a certain corpus size, you know how fast things are going to be. But with vector search, depending on the topology of the underlying vector search space, you might have the same configuration, same sort of infra setup, but get very different results in terms of search latency because [of] how, quote-unquote, hard it is for the approximate nearest neighbors mechanism to search through the vector space could be different depending on what the embedding model is.

Ryan Donovan: That sort of makes me wonder – what is the most complicated vector space you've seen or heard of?

Brian O'Grady: I know what ones have looked cool–

Ryan Donovan: Yeah.

Brian O'Grady: So, one thing that I often do is I'll take datasets that customers are working with, and I'll take a sample, and I'll apply a dimensionality reduction technique called UMAP that allows you to map out the shape, the topology of their data. And what I'll often see is that (and this is a podcast, so I can't show it) they sometimes form these really nice looking clusters, these really nice visualizations, where there's a cluster here. [It] almost looks like one of those graphs of a social network that they have; those images where it's, oh, here's the nexus, and here are the connections that are going over to this other node, right? And I tend to see these with a lot of the newer text embedding models. They tend to have this nice shape to them. But then, there's some older embedding models [that] people are using, and it just looks like a giant blob, where it's, oh, the vector space doesn't really have much useful information in it, right? It's all semi-random. It's interesting.

Ryan Donovan: Those sort of nice clusters of almost meaning or semantics there, is that deliberate in the embedding model? Yeah?

Brian O'Grady: Absolutely.

Ryan Donovan: Makes the space a little easier to search.

Brian O'Grady: For– and this is the caveat, for certain algorithms, specifically, for this one algorithm called Hierarchical Navigable Small World, HNSW, it's a state-of-the-art vector search algorithm. It solves the, quote-unquote, "curse of dimensionality problem," where you have a big vector space, meaning your embedding has a large number of dimensions; and what happens when you have this issue is that calculating distances between points becomes more difficult because, with so many dimensions, everything is far from everything else. It's very sparse, right? But what these, what we call 'neural embeddings,' have done (these embeddings from OpenAI, and Jina, and Cohere, and everything) is represent this high-dimensional space, but essentially encode it such that most of the information lies on this, what we call, 'manifold,' where there's this nice curve in hyperspace, where most of the "semantic information" is. And then, there's these outliers that are out there inhabiting these very sparse, remote regions. But HNSW is able to solve this traversal very well compared to prior approximate nearest neighbor algorithm iterations.
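
The "everything is far from each other" effect can be seen with a small experiment. This is a minimal sketch on uniformly random points, which is an assumption made for illustration. Real embeddings, as Brian notes, tend to concentrate on a manifold rather than fill the space uniformly. The function name `contrast` and the point counts are hypothetical choices.

```python
import math
import random


def contrast(dim, n_points=200, seed=0):
    """Ratio of the nearest to the farthest distance from a random query point.

    A ratio close to 1 means the nearest and farthest neighbours are almost
    equally far away, so distance carries little information.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return min(dists) / max(dists)


low = contrast(2)     # 2 dimensions: nearest point is far closer than the farthest
high = contrast(512)  # 512 dimensions: nearest and farthest are nearly the same
print(f"dim=2:   nearest/farthest = {low:.3f}")
print(f"dim=512: nearest/farthest = {high:.3f}")
```

As the number of dimensions grows, the ratio drifts toward 1 for random data, which is one way of stating the curse of dimensionality for nearest-neighbor search.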

Ryan Donovan: In other spaces, the sort of mapping or scheduling these sort of traversing a logical space, and that feels like they're applying it to a more familiar pattern, right? This is the Monte Carlo mapping. Do you think that's true, or is it just [that] they learned the lessons from the vector space and were like, "this is just hard to navigate if it's just all random"?

Brian O'Grady: Yeah. I think that what happened when they were training these models is that these patterns are just inherent in human speech and human text, right? So, because what ends up happening is this transformer neural network is trained in the background, and its job is just to recognize these patterns, essentially. And that's why you mentioned it's like next token prediction. It is able to encode these patterns mathematically, and that's how we get this nice shape, is because when we think about meaning, if we think about meaning geometrically, yes, there are these kind of clusters of meaning, right? And this is– the example I like to give is that 'dry' and 'arid' are very similar words, but again, to contrast with text-based search, you search for 'arid whatever' in text-based search, it doesn't look for 'dry'. But you search for 'arid' in a search space that has preserved meaning, and it's able to say, "okay, arid, dry," right? "These are similar concepts. I'm going to return results that are in that area." So yeah, I think it's just a consequence of the technology that we developed, the technology being transformers, merging with patterns inherent to the way humans communicate meaning.

Ryan Donovan: That is interesting, and I think one of the sort of struggles I had first learning about vectors and embeddings is trying to get my head around the dimensions, and I don't think dimensions mean anything towards humans, right? Has that understanding changed at all?

Brian O'Grady: So, for me, the way I think about it is, I think it's hard for people to think about dimensions meaning anything, right? But actually, if you take a step back, and we've been talking about text search this entire time, but have you ever thought to wonder what text is? Have you ever asked that fundamental question, 'what is text?' What does it actually represent, right? And you think about it and you say, 'okay, text represents words,' right? And we think of it, let's constrain ourselves to English and the Latin alphabet. We say, 'okay, I write down this sentence that I'm speaking right now,' and that sentence that's written as text, it's just a collection of symbols that represent what Ryan said at 1:36 PM Eastern Time, right? It's a representation of what he said, right? And if you just think about the vectors themselves simply being another symbolic representation of that exact same text, that's all it is. It's just another representation. Instead of in the Latin alphabet, it's a representation in terms of FP32s that are concatenated together. That's what an embedding is, for text embeddings, at least.
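
The "FP32s that are concatenated together" description maps directly onto bytes. A minimal sketch, assuming little-endian 32-bit floats; the four-value vector and the 1,536-dimension figure are made-up examples, not values from the conversation.

```python
import struct


def to_fp32_bytes(vector):
    """Pack a list of floats as consecutive little-endian 32-bit floats."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_fp32_bytes(raw):
    """Unpack consecutive little-endian 32-bit floats back into a list."""
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


# A made-up 4-dimensional "embedding"; these values are exactly representable in FP32.
embedding = [0.25, -0.5, 0.125, 1.0]
raw = to_fp32_bytes(embedding)

print(len(raw))              # 4 dimensions * 4 bytes each
print(from_fp32_bytes(raw))  # round-trips to the original values

# Storage grows linearly with dimension: a hypothetical 1,536-dimension
# vector occupies 1536 * 4 bytes before any index overhead.
print(len(to_fp32_bytes([0.0] * 1536)))
```

Each dimension costs four bytes, so the raw size of a collection is roughly the number of vectors times the dimension count times four, before any index structures are added.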

Ryan Donovan: English and human language is notoriously imprecise in its meaning. I feel like these embeddings are a much more precise machine-readable version of the semantics of it.

Brian O'Grady: And the way I like to think about it is I—I'm talking to you right now. There's gonna be a transcription of this conversation afterwards, but there's still a loss in that transcription, right? There's– you don't hear the tone, right? You don't hear necessarily the pauses between everything we're saying. So, there's already some information loss if you were to just read the transcription. If you then go further and take an embedding model and create an embedding from the entire transcription, there's going to be some further loss of information. At every point in time, it's just a, what I call, a 'dimensionality reduction,' at every point in time.

Ryan Donovan: So, it's like a quantization all the way down, right?

Brian O'Grady: There you go.
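
Ryan's quantization analogy has a literal counterpart in vector storage. The following is a minimal sketch of uniform scalar quantization to 8-bit codes, assuming a simple min/max scaling; it is an illustration of the general idea, not a description of any particular database's implementation, and the sample values are made up.

```python
def quantize(vector, levels=256):
    """Map each float to an integer code in [0, levels - 1] using min/max scaling."""
    lo, hi = min(vector), max(vector)
    # Fall back to a step of 1.0 when every value is identical.
    step = (hi - lo) / (levels - 1) or 1.0
    codes = [round((x - lo) / step) for x in vector]
    return codes, lo, step


def dequantize(codes, lo, step):
    """Reconstruct approximate floats from the integer codes."""
    return [lo + c * step for c in codes]


original = [0.12, -0.83, 0.5, 0.031, -0.27]  # made-up values
codes, lo, step = quantize(original)
restored = dequantize(codes, lo, step)

# The reconstruction error per value is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(original, restored))
print(codes)
print(max_error <= step / 2 + 1e-12)
```

Each value shrinks from four bytes to one, and the price is a small, bounded reconstruction error, which is the same trade of fidelity for compactness that the speech-to-transcript-to-embedding chain makes at every step.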

Ryan Donovan: So, I wanna get back to the composable aspect of this. So, you talked about the sort of portability, but are there use cases that are enabled by having a composable vector database?

Brian O'Grady: I think a really good one would be code search. This is one that we're actively working on with some prospects, and the idea being, I'm going ahead, and I'm like– let's say I'm an enterprise and I can't buy Cursor, because I'm a highly regulated entity, and buying Cursor would trigger three years of procurement that I just don't wanna go through. But what I do know is that my company already has, let's say, a contract with Claude, or OpenAI, and I can use Codex or Claude to do code search for me, right? One idea would be, what if I'm building, essentially, a Cursor-like replica for my company, and I want to enable code search on our code bases. When I think about this issue, how a lot of companies treat code search right now is they'll take the user's code base that's sitting locally, and they'll embed it, and index it in the cloud. And then, every time a user wants to do a search over their code base, they have to query the cloud. But to me, that's a bit of an anti-pattern because the code is sitting locally, right? So, why pay like a network tax to search over data that's existing on device? What we're doing with Qdrant Edge is enabling use cases where the search completely happens locally, and you don't have to make an API request to conduct semantic search over your own local code base. But when you do a commit, and you do kinda like a Git commit, you do a similar thing with your vector index. Your vector index also gets committed to a hosted centralized Qdrant instance, where all of these different indexes live for people in your own organization. So, this allows you to not only search over your own code base, but if you so desire, to then search remotely against other users in your organization who are working where their committed vector indexes are living in this kind of centralized database.

Ryan Donovan: So, what is the overhead of a local vector database? I mean if you have a hundred megs of code files, how big is that vector database?

Brian O'Grady: LlamaIndex released a tool called Semtools a while back, and they recently replaced their vector search capability with Qdrant Edge. Whereas previously, they said, their binary size was several gig, after moving to Qdrant Edge, it shrank to 300 meg. So, you can run vector search very efficiently with very little overhead. So, we're talking– it's a couple of hundred meg, but for most people's Macs, that's nothing, right?

Ryan Donovan: Yeah. For sure. So, I think most organizations have been playing around with vector databases for a couple years now, two, three years. What does the future hold for vector databases and embeddings?

Brian O'Grady: What we'll see is, we'll see, number one: an increase in the number of entities that are representable in embeddings, and I think we'll also see a big kind of explosion in video embeddings in particular. I think that moving forward, that's actually going to be one of the primary use cases, and this is where you would most likely choose a Qdrant-type vector native. You wouldn't choose a Lucene-based engine because you don't need necessarily all of the functionality of Lucene's text search just to do search over video embeddings, which number one, tend to be quite large, and then there also tends to be quite a lot of them, because there's an enormous amount of video data out there, and every chunk of the video you can imagine having its own embedding. So, I see this as growing in particular. Already, I'm seeing that a lot of the use cases that people pick, where they pick vector natives without even considering a bolt-on alternative, are for the text-to-image search use cases, where people have a set of images that are proprietary that they want to allow their users to search over. They don't even consider other alternatives. Oftentimes, they'll just start on Qdrant open source and just eventually migrate to our cloud solution. So, I see those two as actually becoming bigger for vector search than the kinda typical RAG. But the other use case that I see related to LLMs and agents, so to speak, are these kinda local agent syncing to remote cloud workflows, where you need to keep track of some sort of context across different devices. And the one that I was thinking about recently is everybody's talking about OpenClaw and everything, and I was like, what if I had a family-level OpenClaw that was for my family, and it was executing tasks on my wife's computer, on my computer, all across, like all over the place. 
But I needed it to sync its, quote unquote, context and its vector index, and I was like, I could stand up a server in my house that had Qdrant running that was syncing the context across all the devices, and OpenClaw's out there remembering things that happened on other devices, quote unquote, right? So, I thought that'd be a funny– but actually, and then I was like, why not throw in one of those Tesla robots to it, as well? And that has OpenClaw installed on it.

Ryan Donovan: Oh man, that's the full YOLO stack right there.

Ryan Donovan: All right, folks, it is that time of the show where we shout out somebody who came on to Stack Overflow, dropped a little knowledge, and earned themselves a badge. So today, we're shouting out the winner of a Populist badge – someone who dropped an answer that was so good, it outscored the accepted answer. So, congrats to @Brad Larson for answering, "Find the tangent of a point on a cubic bezier curve." Curious about that? We have an answer for you in the show notes. I'm Ryan Donovan. I edit the blog, host the podcast, here at Stack Overflow. If you have questions, concerns, comments, topics to cover, et cetera, please email me at podcast@stackoverflow.com, and if you want to reach out to me directly, you can find me on LinkedIn.

Brian O'Grady: Thanks, Ryan, for having me. Thanks, everyone, for listening. Feel free to reach out to me [on] LinkedIn. My name is Brian O'Grady, and if you are interested in Qdrant, please come to our website and fill out our contact form. Our email is support@qdrant.io, or of course, feel free to reach out to us on LinkedIn, or Twitter, or wherever you typically haunt. But yeah, thank you.

Ryan Donovan: All right. Thank you for listening, everyone, and we'll talk to you next time.
