With all the talk about the power of AI and the productivity gains you may (or may not) get, few are talking about improving what underpins those AIs: data. If your data is low quality, your AI will be too—garbage in, garbage out. I spoke with Satish Jayanthi, CTO and co-founder of Coalesce, to find out what it takes to ensure your data is good enough to support your AI program.
Ryan Donovan: Tell me a little bit about who you are, how you got involved in data and the company that you're CTO at.
Satish Jayanthi: Over the last 20 years, I’ve worked in all aspects of data. I started off my career as a programmer before I became an accidental database administrator. While working for a startup, I was tasked with supporting all of the data requests, and that’s when I first realized no one person could handle that many requests.
That’s when I knew that we needed a different approach, and I started looking into dimensional modeling and Kimball architecture. This marked my entrance into the data space from an analytic standpoint, and since then, I've been doing that work as an architect and a leader—managing teams, consulting, etc.
RD: Obviously, data is the foundation of AI, especially for the big machine learning models. Can you kind of give a sense of how foundational data is to AI?
SJ: Data is crucial for AI, but I believe the quality of the data is even more important. In this new phase we're entering, with all these AI programs and AI functions, data quality matters at a whole different level. I believe that data is far more important than ever before.
That is primarily because you're training the AI models with your input. You know the famous saying: garbage in, garbage out. As much as we're excited about this AI functionality, which is fantastic, we've got to be very, very careful what we’re feeding it so it will give us proper output.
RD: When we talk about AI, we're talking about both the generative AI that's really popular right now, but also other uses of it, more specific kinds of uses, right?
SJ: A hundred percent. AI is basically a model and algorithm that you're using behind the scenes, and you have to train it to do certain things.
To train AI, you need to feed it inputs, which will determine the quality of the output—there’s a direct relationship between the two. Whether it's an LLM where you're interfacing with natural language, or it's just some kind of model sitting there that is trained for a particular function, like fraud detection, it always has to be trained on high-quality real world data.
RD: When you talk about quality data, what does that mean? Most people have databases that are key-value stores, or they'll have a bunch of documents in SharePoint or something. What makes that data high quality?
SJ: There are a lot of aspects to data quality. There is accuracy and completeness. Is it relevant? Is it standardized? There are several dimensions to this, and all of them have to be taken into account when you're ensuring data quality.
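To make those dimensions concrete, here is a minimal sketch of what automated checks for completeness, accuracy, standardization, and uniqueness might look like on a hypothetical table of customer records. The column names, patterns, and thresholds are illustrative, not something Jayanthi described.

```python
# A minimal sketch of basic data quality checks on a hypothetical pandas
# DataFrame of customer records. Column names and thresholds are illustrative.
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    report = {}

    # Completeness: what fraction of each column is populated?
    report["completeness"] = (1 - df.isna().mean()).to_dict()

    # Accuracy (a rough proxy): flag values outside a plausible range.
    if "age" in df.columns:
        report["age_out_of_range"] = int(((df["age"] < 0) | (df["age"] > 120)).sum())

    # Standardization: are country codes in a consistent two-letter format?
    if "country" in df.columns:
        report["nonstandard_country"] = int(
            (~df["country"].str.fullmatch(r"[A-Z]{2}", na=False)).sum()
        )

    # Uniqueness: duplicate records skew any downstream model.
    report["duplicate_rows"] = int(df.duplicated().sum())

    return report

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "age": [34, -5, -5, None],
    "country": ["US", "usa", "usa", "FR"],
})
print(quality_report(df))
```

None of these checks is sufficient on its own; the point is that each of the aspects Jayanthi lists can be expressed as a measurable rule rather than a vague aspiration.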
There's bias as well when it comes to AI. Essentially, we are giving away some control to machines, which is why data quality is so important. In the past, AI was more deterministic. We controlled the input as well as what we were doing with that data to provide some output. We were in control end to end.
Now we are entering a phase where we're inputting data into AI models. It's not quite a black box, but it's still something that we are giving away control to. The AI systems will take this data, and depending on how you train the models, you will get your output.
What I have seen so far is that it’s never a hundred percent accurate. It's 80 to 90 percent right. That's why I think the quality of data is even more important than before.
RD: I think with the large language models, it definitely feels like there's a bit of magic: taking some documents, converting them into these massive vectors, and then asking, "Who invented the telescope?"
Can you talk about, on a fine-grain level, how you polish data if, say, you want to get into a vector store?
SJ: It depends on the type of data. I have a lot of experience with structured and semi-structured data, and it all goes back to the foundational things: do you have data governance policies that generally apply across the board, regardless of the type of data? Who owns and is accountable for these data sets? Do we have somebody checking these data sets and following certain standards?
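Before those governance questions even come up, there is the mechanical cleanup the question alludes to. Here is a hypothetical sketch of preparing documents before they are embedded into a vector store: deduplicating, stripping boilerplate, normalizing whitespace, and chunking. The cleanup rules and chunk size are illustrative, and the embedding and indexing steps are left out entirely.

```python
# A hypothetical sketch of cleaning documents before they are embedded into
# a vector store. Cleanup rules and chunk size are illustrative only.
import hashlib
import re

def clean(text: str) -> str:
    text = re.sub(r"\s+", " ", text).strip()                # normalize whitespace
    text = re.sub(r"(?i)confidential.*?footer", "", text)   # strip known boilerplate (example pattern)
    return text

def dedupe(docs: list[str]) -> list[str]:
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode()).hexdigest()
        if digest not in seen:                               # drop exact duplicates
            seen.add(digest)
            unique.append(doc)
    return unique

def chunk(text: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking; real pipelines usually split on sentences or sections.
    return [text[i:i + size] for i in range(0, len(text), size)]

raw_docs = [
    "  Galileo   built an improved telescope in 1609. ",
    "  Galileo   built an improved telescope in 1609. ",
]
prepared = [chunk(clean(doc)) for doc in dedupe(raw_docs)]
print(prepared)
```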
It's a multidisciplinary function to improve data quality. You need people who understand the data and treat it with the seriousness it deserves. That's when you get a quality output. It's not one thing that you need to do. You have to have a proper data strategy, data governance policies, accountability, and security—all of these are very important regardless of the type of data. Let me give you an example.
I come from a regulatory background working with financial companies before I co-founded Coalesce and before I worked at WhereScape. One of the things that we used to do was asset management, and we found out that we were charging our customers a lot more than we should because our pricing schedule was incorrect. We were feeding this pricing schedule to a third-party system that would spit out the fees. It was almost like AI in a way because we didn't control the system. We were feeding it a bad pricing schedule, and it was giving us bad outputs, and we were incorrectly charging clients. We only realized that four years later. We had to go back and simulate what the charges should have been because they no longer had all the point-in-time data.
It was pretty serious because of the financial regulations, and it could have been resolved or prevented by having proper data governance policies in place, by having rules and processes in place, and by having someone who would have checked this pricing schedule and approved it. We didn't have that.
RD: So we talked about data governance. What does that mean on a fine-grained level? Is it just having somebody who owns the data and is responsible for it, or is it something bigger?
SJ: It is definitely bigger. It's much bigger than that, actually. You're trying to look at every aspect of how the data is being used, how it's flowing, whose hands it's going to fall into, and how they're using it. That means you have to define policies for ownership and standardization of the data definitions. It also means addressing data retention by determining how long you want to keep data because at some point it may become irrelevant.
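As an illustration of what those policies can look like once they are written down, here is a hypothetical sketch of governance metadata attached to a dataset: an owner, a steward, standardized field definitions, and a retention period. The dataset name, people, and retention value are made up for the example (the pricing-schedule name is a nod to the earlier story), not a Coalesce convention.

```python
# A hypothetical sketch of capturing governance policy as metadata alongside a
# dataset: who owns it, what the fields mean, and how long it is retained.
from dataclasses import dataclass, field

@dataclass
class GovernancePolicy:
    dataset: str
    owner: str                      # team accountable for the data
    steward: str                    # person who reviews and approves changes
    retention_days: int             # after this, data is archived or deleted
    definitions: dict[str, str] = field(default_factory=dict)  # standardized field definitions

policy = GovernancePolicy(
    dataset="billing.pricing_schedule",
    owner="finance-data-team",
    steward="jane.doe",
    retention_days=2555,            # roughly 7 years, e.g. for a regulated financial dataset
    definitions={
        "fee_bps": "Management fee in basis points, effective on effective_date",
        "effective_date": "First date the fee applies, ISO 8601",
    },
)
print(policy)
```

Writing the policy down in a machine-readable form like this is one way to make ownership, definitions, and retention auditable rather than tribal knowledge.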
RD: I'm sure that there must be a little bit of data integrity there, especially around the modern data pipeline structure where you have production data, you run it through an ETL pipeline, and then it ends up in some data lake somewhere, accessible to all the analytics teams.
What are the challenges with maintaining that data being accurate, complete, and secure?
SJ: It all starts with the culture of the organization, and it's a combination of three things: people, process, and technology. You have to have all three. The easiest part of that is technology, and the hardest part is people. There has to be a lot of focus on bringing key stakeholders to work with IT. Get everybody into a room on a periodic basis to understand the importance of data governance.
That has been the number one challenge that people or organizations face: How do you get these people together so they can focus on this? How do you work together to establish and monitor these policies and then put the proper procedures in place? Of course, you need some technology as you scale up.
RD: It sounds like a lot of these are ensuring you have data quality processes for an organization. Is there a particular extra challenge when it comes to AI?
SJ: The challenge is building awareness and raising the importance of data quality. How do leaders explain that to everybody? The excitement around AI is great, but it's not magic. It does what you train it to do.
A good example: imagine there were LLMs in the 1600s and Galileo is in court, and we're going to decide his fate based on what ChatGPT says. If we had asked ChatGPT back then whether the earth is round or flat and ChatGPT said it was flat, that would be because that's what we fed it to believe as the truth. What we give and share with an LLM and how we train it will influence the output. That awareness is the challenge.
A lot of people get carried away with wanting to feed it tons and tons of data, expecting it to magically give them the answers they're looking for, which is false.
RD: I think a lot of people think they can just shovel the raw internet onto an LLM and get an on-call genius.
SJ: Yeah. It's not going to happen that way.
RD: Do you have any examples of bad quality data causing challenges at Coalesce?
SJ: Yeah, we encounter challenges whenever we have incomplete information. We collect a lot of logs to help improve our business processes. We have our own internal data warehouse that we analyze to figure out where and how to make improvements. A lot of times when we do this analysis, we come across issues where something looks wrong and we don't know what's happening. If you go under the hood and take a deeper look, more often than not, somebody has defined a metric in a certain way that skews the results and doesn't produce the output we were looking for. It's possible that both versions of the metric are true, but they weren't labeled properly. Half of our people think it's one thing and the rest think it's something different, which causes a lot of confusion.
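As a toy illustration of the mismatch he's describing, here are two plausible definitions of the same "active users" metric producing different numbers from the same data. The events and the definitions are hypothetical, not Coalesce's.

```python
# A toy illustration of how two plausible definitions of the "same" metric
# can diverge. The event data and the definitions are hypothetical.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "event":   ["login", "view", "login", "view", "view", "purchase"],
})

# Definition A: an active user is anyone who logged in.
active_a = events.loc[events["event"] == "login", "user_id"].nunique()

# Definition B: an active user is anyone who generated any event at all.
active_b = events["user_id"].nunique()

print(active_a, active_b)  # 2 vs. 3: same data, two different "active users" metrics
```

Neither number is wrong; the problem is shipping both under the same label without a shared definition.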
RD: It sounds like the data accuracy would be a huge problem. Does somebody have to read through all of the data? Is there a shortcut so you don't have an intern looking through every database?
SJ: I don't think there is a shortcut, but there are processes and procedures that you can put in place, like training everyone so they understand the issues and making sure that we define things properly. Again, going back to governance, it takes time.
Depending on the organization's maturity, governance can be set at different levels. If it's a small organization, they typically do not spend a lot of time bringing people and processes together for data governance, which is unfortunate.
RD: Yeah. It's another thing to shift left into the development cycle.
SJ: Right. People get carried away with all the shiny dashboards and the shiniest things out there, like LLMs. Does that provide a lot of value? Very few people will take a step back and say, “Hey, what is it that I need to do in order to actually make use of this technology that's there?”
We want to do it properly. Most of the time, what that leads to is foundation building. You have to build a proper foundation by bringing a team together to collaborate. There are no UIs. There are no flashy dashboards. Nothing. You first have to work hard to build the plumbing.