
Even GenAI uses Wikipedia as a source

Ryan is joined by Philippe Saade, the AI project lead at Wikimedia Deutschland, to dive into the Wikidata Embedding Project and how their team vectorized 30 million of Wikidata’s 119 million entries for semantic search.

Credit: Alexandra Francis

They discuss how this project helped offload the burden that scraping was creating for their sites, what Wikimedia.DE is doing to maintain data integrity for their entries, and the importance of user feedback even as they work to bring Wikipedia’s vast knowledge to people building open-source AI projects.

Wikimedia.DE announced the Wikidata Embedding Project with MCP support in October of last year. Check out their vector database and codebase for the project.

Connect with Philippe on LinkedIn and his Wiki page.

Today’s shoutout goes to an Unsung Hero on Stack Overflow—someone who has more than 10 accepted answers with a zero score, making up 25% of their total. Thank you to user MWB for bringing your knowledge to the community!

TRANSCRIPT

[Intro Music]

Ryan Donovan: Hello everyone, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I'm Ryan Donovan, and today we're talking about the grand vectorization and MCP project that happened at Wikimedia Deutschland. My guest today is Philippe Saade, who is the AI project lead at Wikimedia Deutschland. So, welcome to the program.

Philippe Saade: Thank you so much.

Ryan Donovan: So, before we get started, we'd like to find out a little bit about our guest. How did you get involved in software and technology?

Philippe Saade: So, for me, I started my bachelor's in computer science.

Ryan Donovan: Mm-hmm.

Philippe Saade: And jumped into my Master's here in Germany. And it was more general computer science, but then I took a few more classes around AI and machine learning, and it was a little bit before what AI means for people now, like before chatbots.

Ryan Donovan: Yeah.

Philippe Saade: So, I was more intrigued by the simpler AI techniques, like how you can teach a model to do some classifications, or things like that. So, I delved right into it.

Ryan Donovan: Is that the sort of more classical deep learning, unsupervised, or is it a more labeled data training?

Philippe Saade: Mostly both. I was more obviously interested in the deep learning stuff, 'cause it's just super interesting how you can just from data create this model that does really cool things.

Ryan Donovan: Mm-hmm.

Philippe Saade: From classification or even unsupervised learning. Reinforcement learning is super interesting. So, I just wanted to go into this topic and learn more how things work.

Ryan Donovan: Yeah. Well, today we're talking about a very big classification project, right? So, obviously Wikimedia, and Wikipedia, and all the Wiki sites are a very important part of the internet. I'm sure with all of the new AI real-time data, just-in-time data, you guys were getting pretty hammered, right?

Philippe Saade: I mean, yeah, for sure. We've been having a lot of discussions on scraping and how it's been affecting the infrastructure. I assume they're probably RAG applications collecting all this data to provide answers, but also companies gathering the information for training, and that's a huge aspect; and we're hoping to find better solutions.

Ryan Donovan: Yeah. It seems like the better solution is to cooperate and not resist, right?

Philippe Saade: Yeah, of course. I mean, you can't really stop this type of scraping. Better to find solutions to either provide the data in a simpler way instead of having multiple calls on the API, or the SPARQL query service, that do a lot of computations. So, it's better to just either provide the data or find a solution to make it simpler, and with less resources.

Ryan Donovan: Yeah, and I think, you know, that's sort of the conclusion we've come to here at Stack Overflow, as well. The blog post you wrote about this came to my attention from some of our engineers who said, 'oh, this is great. This is such an interesting project at scale.' Can you give us a little sense of the scope of this project?

Philippe Saade: Yeah. So, basically, at Wikimedia Deutschland, we're working mostly on Wikidata, which is the sister project of Wikipedia and the knowledge graph behind the Wikimedia projects. Basically, the project that we're working on is a vector database on top of Wikidata, and we're hoping, by developing this vector database, to provide a simpler access point to Wikidata and enable semantic search; and at the same time, encourage the open-source AI community to build projects with Wikidata.

Ryan Donovan: I think the numbers were something like 1.78 terabytes of text with 119 million entries. Is that about right?

Philippe Saade: Yeah, that's true. So, Wikidata currently has around 119 million items in it. And you know, items that are found in the knowledge graph. Currently, with the vector database, we're embedding around, I would say, 30 million items, and those are filtered based on current testing, seeing what makes the most sense. The 30 million are the ones that are currently linked to a Wikipedia page.

Ryan Donovan: Wow, okay. So, there's some dead pages in there.

Philippe Saade: Not necessarily dead pages. So, a lot of these other items, for example, are linked to or are about articles. So, for scientific articles, it would be super interesting to have a vector database for that. But then I would imagine if we add more and more items to the vector database, it would explode, and it would be a bit more difficult. And for now, for testing purposes, it's best to keep it to the more general knowledge information.

Ryan Donovan: Mm-hmm. I definitely wanna get into those limits at some point, but you talked about it as a knowledge graph. How much sort of organization and structure did you have to work with [in the] beginning?

Philippe Saade: In terms of the knowledge graph, how it's internally structured?

Ryan Donovan: Yeah, yeah. How much information and how much of that knowledge graph were you able to embed in the project?

Philippe Saade: Oh, for the vector database, basically, it's not as straightforward to transform an item on Wikidata from the knowledge graph into an embedding, because most embedding models now work with textual information. Currently, we're using more of a pre-trained embedding model, not one we trained ourselves.

Ryan Donovan: Mm-hmm.

Philippe Saade: So, there was a process of transforming each item into a textual representation to then transform it into an embedding. And that was a little bit of a challenge, I would say, because we had to pass through the data dump multiple times to gather all the information from the connected links in the graph, and aggregate everything into one text to then pass it to the embedding model.

Ryan Donovan: Yeah. I think you have a very curated data set, right? You have a lot of metadata associated with it. In those multiple passes, what was the sort of additional embedding information that you were gathering?

Philippe Saade: So, per item, we're getting the first, I would say, obvious general information. So, the label, description, and aliases of a certain item, and then on Wikidata, you have the statements. Per statement, per basically connection in the graph, we get the label of the property of the edge, the label of the connected item, and we create basically one sentence, you can call it, which we include in the textual representation to then embed it. And then you have a lot of data types on Wikidata, not only other items. They could be date and time, URLs, all sorts of information. Even connections to external databases, which can be IDs, and those, for example, we excluded from the embedding because IDs wouldn't really make sense to an embedding model.
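
To picture that aggregation step, here is a minimal sketch of turning one Wikidata-style item into a single text for the embedding model. The field layout is simplified and the helper names are illustrative, not the project's actual schema; only the property and item IDs (P31, Q5, P19, Q64) are real Wikidata identifiers.

```python
# Minimal sketch: build one text blob from an item's label, description, aliases,
# and statements. The dict layout is illustrative only; the real pipeline reads the
# Wikidata dump and resolves IDs to labels in a separate pass.
def item_to_text(item: dict, labels: dict) -> str:
    parts = [item["label"], item["description"], ", ".join(item["aliases"])]
    for prop_id, value_ids in item["statements"].items():
        prop_label = labels.get(prop_id, prop_id)             # e.g. P19 -> "place of birth"
        value_labels = [labels.get(v, v) for v in value_ids]  # e.g. Q64 -> "Berlin"
        parts.append(f"{prop_label}: {', '.join(value_labels)}")
    return ". ".join(p for p in parts if p)

# Hypothetical example data:
labels = {"P31": "instance of", "Q5": "human", "P19": "place of birth", "Q64": "Berlin"}
item = {
    "label": "Jane Example",
    "description": "fictional researcher",
    "aliases": ["J. Example"],
    "statements": {"P31": ["Q5"], "P19": ["Q64"]},
}
print(item_to_text(item, labels))
# Jane Example. fictional researcher. J. Example. instance of: human. place of birth: Berlin
```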

Ryan Donovan: Right. Nobody's searching for that ID.

Philippe Saade: Yeah. I mean, it's interesting to have some sort of flag to let the embedding model know that it exists, and we do that at the moment. So, this property exists, but the item, the ID itself, wouldn't really- there's no pattern to it to then have the embedding model make sense out of it.

Ryan Donovan: Yeah. And say you're vectorizing 30 million entries and going over it multiple times, did you find ways to sort of speed that up? Did you run into issues tokenizing that amount of entries?

Philippe Saade: So, the process of doing that, basically we worked with the data dump. So, we tried to make everything local with the processing, gathering all the labels that are found on Wikidata to then do the aggregation of all the information. One thing that we did, at the moment, we also pushed a version of Wikidata on Hugging Face with all the labels of the connected items. On one side, this was a little bit easier for me to just read from Hugging Face and push it to the vector database.

Ryan Donovan: Uh-huh. Yeah.

Philippe Saade: But also, it was a solution hopefully for scrapers, for example, instead of scraping Wikidata, just going to Hugging Face and using the information from there, 'cause it's just easier. It's already processed, you have all the labels, and it's just easier to pass through the data for either some training or whatever [is] needed for the scraping.

Ryan Donovan: Yeah, and they're not putting that additional load on the Wikimedia servers, right?

Philippe Saade: Yeah. Yeah, exactly.

Ryan Donovan: It is interesting, where it's like, it's easier for you to put it there. What made that easier? Does Hugging Face do processing on it first?

Philippe Saade: So, Hugging Face, I believe, they use a format called Parquet, which is very easy to have a script from your side read each row on its own, to then process and push. Otherwise, you'd have to save terabytes of data to be able to push to the vector database.

Ryan Donovan: Is Parquet a columnar format? Is that right?

Philippe Saade: I believe so, yeah.

Ryan Donovan: It's sort of like a simple one-to-one correlation, right? Did that make it easier to process than having to go through, you know, a SQL database or even objects?

Philippe Saade: Yeah, for sure. With this one, particularly because we included in one row all the labels of the connected items, so all the information that is needed was available in one row. So, it was just easier to process row by row, to then push to the vector database.
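
As a rough illustration of that row-by-row flow, here is a sketch that streams a Parquet-backed dataset from Hugging Face and embeds each row. The dataset name and column names are placeholders, and embed() stands in for the real embedding call and vector-database push.

```python
# Sketch of row-by-row processing from a Parquet dataset hosted on Hugging Face.
# "example-org/wikidata-with-labels" and the column names are placeholders.
from datasets import load_dataset

def embed(text: str) -> list:
    # Placeholder: the real pipeline calls an embedding model/API here.
    return [0.0] * 512

# streaming=True iterates the Parquet shards one row at a time, so you never have
# to materialize terabytes of data locally before pushing to the vector database.
rows = load_dataset("example-org/wikidata-with-labels", split="train", streaming=True)
for i, row in enumerate(rows):
    vector = embed(row["text"])   # "text" = pre-aggregated labels + statements per item
    # ... push (row["qid"], vector) to the vector database here ...
    if i >= 2:                    # demo only: stop after a few rows
        break
```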

Ryan Donovan: And then, you know, you're tokenizing, you said, using a pre-trained model. Was that Jina AI?

Philippe Saade: Yeah. Correct. So, we're using the Jina embedding V3 model.

Ryan Donovan: So, that's a pretty huge amount of tokens to process. Did you run into any challenges there?

Philippe Saade: So, currently, we're using the API that is provided by Jina. It's just easier for us instead of hosting the model ourselves, using the API, and it's way faster. They have a really good infrastructure there. And in terms of tokens, we had a way of chunking the items of Wikidata, and we chunked them into, I believe, 1,024 tokens. So, on average, an item would be– like, the highest number of chunks would be maybe four per item. The Jina infrastructure has a lot of parameters to do some type of, either for improving the resources, or improving the way you embed the item. So, for example, they're using Matryoshka embeddings.

Ryan Donovan: Oh yeah.

Philippe Saade: I hope I'm pronouncing it.

Ryan Donovan: Matryoshka. Yeah. It's like the Russian doll embedding. Yeah.

Philippe Saade: Yeah, exactly. It's very cool. It's like a flexible embedding size. The highest on this particular model is 1,024.

Ryan Donovan: Mm-hmm.

Philippe Saade: But we tested multiple sizes, and we saw 512 was enough accuracy for us, and it's really good for resources, so we decided to do that.
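
For listeners who have not met Matryoshka embeddings: the model is trained so that the first N dimensions of a vector are themselves a usable embedding, which lets you truncate from 1,024 down to 512 dimensions and re-normalize, trading a little accuracy for half the storage. A small sketch of that truncation, using a random stand-in vector rather than a real jina-embeddings-v3 output:

```python
# Matryoshka-style truncation: keep the first 512 of 1,024 dimensions, then re-normalize.
# The input vector is random noise here, standing in for a real model output.
import math
import random

def truncate_and_normalize(vec, dim=512):
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

full = [random.gauss(0, 1) for _ in range(1024)]   # stand-in for a 1,024-dim embedding
small = truncate_and_normalize(full)
print(len(small))   # 512, ready to store in the vector database
```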

Ryan Donovan: Yeah. And when you said you did four chunks per item, each item is what, a Wikipedia page? Or is it smaller than that?

Philippe Saade: We're using only the information from Wikidata, from the graph itself. Basically, the way we're chunking is we're giving each chunk the label description and aliases. So, all of them have the general information of an item, but then we're doing the chunk on the statement level.

Ryan Donovan: Okay. So, a per-sentence level?

Philippe Saade: Yeah. Per statement. Statements could be very long with qualifiers and a lot of information. Let's say a person educated at, and then a list of universities, and then per information you have date start, date end, degree, all of these things. So, all of them would be one statement, which we can consider as one sentence.
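
A rough sketch of that chunking strategy: every chunk repeats the item's label, description, and aliases, and statements are appended until a budget of roughly 1,024 tokens is reached. Token counts are approximated by word counts here, whereas the real pipeline would use the embedding model's own tokenizer; the example strings are illustrative.

```python
# Chunk an item's statements so each chunk starts with the same general header
# (label, description, aliases) and stays under a rough token budget.
def chunk_item(header, statements, budget=1024):
    chunks, current = [], [header]
    used = len(header.split())                 # crude token estimate: whitespace words
    for statement in statements:
        cost = len(statement.split())
        if used + cost > budget and len(current) > 1:
            chunks.append(". ".join(current))  # close the current chunk
            current, used = [header], len(header.split())
        current.append(statement)
        used += cost
    chunks.append(". ".join(current))
    return chunks

header = "Douglas Adams, English writer and humorist, aliases: Douglas Noel Adams"
statements = [
    "educated at: St John's College (academic degree: Bachelor of Arts, end time: 1974)",
    "occupation: novelist, screenwriter",
]
print(chunk_item(header, statements, budget=1024))
```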

Ryan Donovan: Interesting. Did you consider other chunking strategies? Did you consider an entire, you know, page, multiple statements together?

Philippe Saade: I tested different– I was confident about wanting the label and description in each chunk. That made the most sense logically for me. And I didn't wanna test every single idea because then I would never stop. But I tested multiple filtering types of properties, like all the statements that I would wanna include or not include, or types of information – whether to include, for example, the external IDs. Whether to include that property of it existing or not.

Ryan Donovan: Mm-hmm.

Philippe Saade: That's something that we tested until we got a good solution.

Ryan Donovan: Yeah. And that is interesting, you know, [inaudible] be like, 'you could've tested this forever.' How did you determine what was good enough?

Philippe Saade: In terms of testing?

Ryan Donovan: Yeah. When you said you got to a solution that was good, how did you define good in this case?

Philippe Saade: Yeah, it's not very easy for me because I tend to just continue testing and testing until I never stop.

Ryan Donovan: Right.

Philippe Saade: Yeah, someone has to tell me, 'no, you should stop. You got enough.' But I kind of set a type of deadline for myself, and had multiple ideas that I [was] not very confident about, whether one works versus the other. So, I tested those only. The thing is, with the vector database, it's easy to know which solution is better. When you have an evaluation data set, you test it on both, and then, 'okay, this one's giving better results, so this one is the best.' But in terms of whether the final solution is overall good or not, that's a little bit more difficult to know, and that's why now we're doing some user testing. We launched our alpha version. We're hoping for people to test it, give us some feedback, and let us know if it's actually currently a good solution or not.

Ryan Donovan: How is that testing going?

Philippe Saade: It's going quite well. We released our alpha version in October, so it's been 2-3 months now. We're reaching out to people, understanding use cases. 'Cause vector databases can be versatile in terms of use cases. I would like to understand in terms of Wikidata, particularly what people would like, and understanding features, and if we're communicating the vector database well or not. These are a few things that we would like to figure out to then improve further.

Ryan Donovan: Yeah, and my understanding is that this has an MCP server attached to it.

Philippe Saade: Yeah, that's true. That's a small, little project that we wanted to include, as well.

Ryan Donovan: Yeah.

Philippe Saade: Something that, because the vector database is more for AI developers who want to build something on top of it. But the MCP can be used by maybe end users, or editors of Wikidata, who would like to explore Wikidata in an easier way.

Ryan Donovan: Yeah.

Philippe Saade: And it's quite fun.

Ryan Donovan: Yeah. I'm curious about that. I've heard various reports of what building an MCP server is like. Some people say it's super easy, some people say it's a little bit of a janky solution; but I'm curious what your experience is, because that seemed to be a very public, popular MCP server, I believe.

Philippe Saade: Yeah, it's interesting. I developed it with the idea of, well, I had one use case in mind. I wanted to be able to write SPARQL queries in an easier way with generative AI, because SPARQL can be difficult with its syntax. And now, if you ask an AI model to build a SPARQL query for you, particularly for Wikidata, it's very rare that it gets it correct.

Ryan Donovan: Mm-hmm.

Philippe Saade: Not only because you have to know the IDs, but also you have to know, within a specific domain of Wikidata, how items are connected to each other. So, the MCP would help you, or basically would help the LLM, to search Wikidata with the vector database, and go through the graph, explore edges, see how things are connected, to then, after gathering all this information, be able to write a proper SPARQL query.
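
For a sense of what such a server can look like, here is a minimal sketch built on the official MCP Python SDK's FastMCP helper. The tool names and stubbed return values are illustrative, not Wikimedia Deutschland's actual implementation.

```python
# Hypothetical MCP server exposing two tools an LLM could call while exploring Wikidata:
# a semantic search over the vector database and a lookup of an item's outgoing edges.
# Both are stubbed; a real server would query the vector database and the graph.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("wikidata-explorer")

@mcp.tool()
def search_items(query: str, limit: int = 5) -> list:
    """Semantic search over the Wikidata vector database (stubbed)."""
    return [{"qid": "Q42", "label": "Douglas Adams", "score": 0.87}][:limit]

@mcp.tool()
def get_statements(qid: str) -> list:
    """Outgoing edges of an item, so the model can learn how things connect (stubbed)."""
    return [{"property": "P31", "property_label": "instance of", "value_label": "human"}]

if __name__ == "__main__":
    mcp.run()   # serves the tools over stdio by default
```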

Ryan Donovan: I'm not familiar with SPARQL, so maybe some of our listeners aren't. Can you tell us what SPARQL is and how that helps write queries?

Philippe Saade: Yeah, of course. In simple terms, it's basically SQL, but for knowledge graphs, for Wikidata. So, you have a particular structure, and it helps you filter information through properties, or through 'instance of' and other properties in Wikidata, do any type of complex aggregation and things like this, to then get the information that you wanna look for on Wikidata. So, it's very precise, and you get exactly what you're expecting. But you have to know the syntax, and you have to know how things are structured, or connected to each other, on Wikidata; whereas the vector database is very easy to query, 'cause you just need to write a sentence, and it helps you in a more exploratory way.
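
As a small, concrete example of SPARQL in practice, here is a query sent to the public Wikidata Query Service from Python. It asks for ten people (instance of: human, Q5) whose place of birth (P19) is Berlin (Q64); the IDs are real Wikidata identifiers, and the rest is a straightforward requests call.

```python
# Query the public Wikidata Query Service with a short SPARQL query.
import requests

query = """
SELECT ?person ?personLabel WHERE {
  ?person wdt:P31 wd:Q5 ;      # instance of: human
          wdt:P19 wd:Q64 .     # place of birth: Berlin
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "sparql-example/0.1 (podcast listener demo)"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["personLabel"]["value"])
```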

Ryan Donovan: Hmm. So, with the knowledge graph and the vectorization, are there sort of 'unlocks' of things you can do? Are there things you could do with both of them that you couldn't do with either alone?

Philippe Saade: With the knowledge graph and the vector database?

Ryan Donovan: Yeah, yeah.

Philippe Saade: Yeah. That's pretty interesting, what you can do with both there. I would imagine there are many exciting, creative ways of making use of both. For example, a graph RAG, where you can use the vector search as an exploration phase to find items that might be useful for your search, and then use the graph to explore and get even more information. So, it's basically balancing the exploration aspect of vector search, which is not super precise but helps you as an initial start, and then SPARQL, which is very difficult– I don't wanna say very, but it's difficult in terms of syntax and writing exactly what you wanna do with it, but then it's super precise.
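
A rough sketch of that graph-RAG pattern, with vector_search() and get_statements() as stand-ins for the project's actual search endpoint and graph access; only the overall shape is the point here.

```python
# Graph RAG in miniature: vector search finds candidate items (broad but fuzzy),
# then the knowledge graph supplies precise facts about them for the LLM's context.
def vector_search(question, k=5):
    return ["Q42"]   # stand-in: QIDs of the top-k semantically similar items

def get_statements(qid):
    return [("occupation", "novelist"), ("place of birth", "Cambridge")]  # stand-in edges

def build_context(question):
    lines = []
    for qid in vector_search(question):          # exploration step
        for prop, value in get_statements(qid):  # precise expansion step
            lines.append(f"{qid} | {prop} | {value}")
    return "\n".join(lines)                      # this text would go into the LLM prompt

print(build_context("Who wrote The Hitchhiker's Guide to the Galaxy?"))
```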

Ryan Donovan: Obviously, [inaudible], Wikimedia has very curated knowledge. Are there any issues with the vector database polluting the knowledge graph? Do you check the ingested data?

Philippe Saade: So, in terms of the knowledge graph specifically, the vector database is its own entity. So, it's more of a way to read the knowledge graph, but not necessarily to edit it. But in terms of making sure, for example, the vector database is not polluted by information in the knowledge graph, we have some rules to make sure an item is embedded in the vector database. For example, if it's connected to a Wikipedia page, or very simple ones, like if it has a label or has a description; otherwise, we wouldn't include it. But at the same time, it's kind of a bit tricky, because in a way, this vector database is not only used to have a RAG application on top of it. It's used to explore Wikidata at the same time. So, if you have items on Wikidata that are not as well maintained, maybe, then maybe it's actually better to add them into the vector database and make them searchable so that the community can go to that item and add the missing information.

Ryan Donovan: Yeah. Are there additional challenges with those highly edited pieces, like keeping up with the updates on those?

Philippe Saade: Yeah, that's a good question. Since we're in an alpha release phase, we're not constantly updating the vector database.

Ryan Donovan: Mm-hmm.

Philippe Saade: We're using data from September 2024, just for testing purposes, to see how people react to it and use cases around it. Eventually, we would like to have a solution to do constant updates to the vectors, but at the end of the day, the vectors represent more of a general knowledge of an item. So, I would assume if there are tiny edits on an item, it wouldn't make a huge impact. But at the same time, you want to include new items that have been introduced to Wikidata. So, for now, we're considering maybe a periodic update of the vectors. Yeah. We'll test that out and see how that goes based on how useful it is, and resource-wise, how much it makes sense.

Ryan Donovan: Yeah. I imagine it's gonna be a pretty significant lift to revectorize the entire thing every time. I imagine you probably have strategies to just find the changed ones, but what does a diff of Wikidata look like in that case? Have you explored doing that efficiently?

Philippe Saade: I believe there are, at the moment, ways to detect if an item has changed or not. There are ways, like for example, when it was last edited, or checking if the property that has been edited is actually one that we use in the vector or not. So yeah, it's possible to detect that, it's just something that we would need to explore.
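
One possible way to detect changed items, not necessarily the project's plan, is to poll the MediaWiki recent-changes API on wikidata.org and re-embed only the item QIDs that show up. A small sketch:

```python
# Collect QIDs of recently edited Wikidata items via the MediaWiki recent-changes API.
# A periodic job could re-embed only these instead of revectorizing everything.
import requests

def recently_changed_qids(limit=50):
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "query",
            "list": "recentchanges",
            "rcnamespace": 0,     # namespace 0 on Wikidata = items (Q...)
            "rclimit": limit,
            "format": "json",
        },
        headers={"User-Agent": "reembed-example/0.1 (podcast listener demo)"},
    )
    changes = resp.json()["query"]["recentchanges"]
    return {c["title"] for c in changes if c["title"].startswith("Q")}

print(sorted(recently_changed_qids(10)))
```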

Ryan Donovan: This is an alpha, like you said. What are you hoping to learn, and what are the next steps once you've learned it?

Philippe Saade: From the users, mostly, we would like to understand what use cases they would like to explore with the vector database, whether it's more interesting for RAG applications, or maybe even not necessarily generative AI applications, but things like classification, clustering, things like this. And with that, we hope to understand what kind of features we can include, what things are missing, and maybe whether the accuracy is enough for the users or not, or if we need to maybe fine-tune the model, or things like this.

Ryan Donovan: Yeah. And you know, obviously, as a representative of another user-generated content system, we're always curious: what sort of performance and cost-benefit learnings have you had?

Philippe Saade: Yeah, so basically, when I first started the project, we didn't have any big plan of, 'this is what we want, and for sure we're gonna build this to push to production. This is for sure gonna stay forever.'

Ryan Donovan: Mm-hmm.

Philippe Saade: So, it made sense for us to experiment with partnerships, partnering with DataStax, for example, which is providing the infrastructure. So, they're helping us explore this aspect, understanding if it's actually something helpful for the user or not. And partnering with Jina as well, who are providing the API for us. Otherwise, that would have been difficult to do, as we don't have big resources to be able to host this huge vector database ourselves, and it wouldn't make sense to have a vector database on a small subset of the data, for example. It's not enough. If you want to test it out, you have to do it on all of the data.

Ryan Donovan: Alright, everyone, it is that time of the show again where we shout out somebody who came on Stack Overflow, dropped some knowledge, shared some curiosity, and earned themselves a badge. Today, we're not shouting out a question, we're shouting out an individual—an unsung hero—somebody who has more than 10 zero-score accepted answers, and it's more than 25% of their total. So, get those upvotes in, people. So, congrats to MWB for doing the hard work. We appreciate you. I'm Ryan Donovan. I edit the blog and host the podcast here at Stack Overflow. If you have questions, concerns, topics to cover, comments, anything at all, email me at podcast@stackoverflow.com. And if you wanna reach out to me directly, you can find me on LinkedIn.

Philippe Saade: And I'm Philippe. I'm the AI project manager at Wikimedia Germany, and if you want to test out the vector database and give us your feedback, you can find it by looking up the Wikidata vector database.

Ryan Donovan: Alright. Thank you for listening, everyone, and we'll talk to you next time.
