With the proliferation of AI technologies like GitHub Copilot for code completion, Stable Diffusion for image generation, and GPT-3 for text, many critics are starting to scrutinize the data used to train the underlying AI/ML models. The privacy and ownership issues around these tools are thorny, and the data used to train less prominent AI tools can have equally problematic results. Any model trained on real data has a chance of exposing that data or allowing bad actors to reverse engineer it through various attacks.
That’s where synthetic data comes in. Synthetic data is data that is generated via a computer program rather than gathered through real-world events. We reached out to Kalyan Veeramachaneni, principal research scientist at MIT and co-founder of the big data start-up DataCebo, about his project to open-source the power of big data and let machine learning ingest the data it needs to model the real world without real-world privacy issues.
We’ve previously discussed synthetic data on the podcast.
Answers below have been edited for style and clarity.
Q: Can you tell us a little bit about synthetic data and what your team is releasing?
A: The goal of synthetic data is to represent real-world data accurately enough to be used to train artificial intelligence (AI) and machine learning models that are themselves used in the real world.
Consider, for example, companies working to develop navigation systems for self-driving cars. It’s not possible to acquire training data that represents every possible driving scenario that could occur. In this case, synthetic data is a useful method to introduce the system to as many different situations as possible.
In September, my team at DataCebo released SDMetrics 0.7, a set of open-source tools for evaluating the quality of a synthetic database by comparing it to the real database it’s modeled after. SDMetrics can analyze various factors associated with how well the synthetic data represents the original data, from boundary adherence to correlation similarity, as well as the predicted privacy risk. It can also generate reports and visual graphics to make a stronger case for non-engineers about the value of a given synthetic dataset.
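To give a flavor of the kind of check such a toolbox performs, here is a minimal, self-contained sketch of a column-shape score based on the Kolmogorov–Smirnov statistic. The function name and scoring scheme here are hypothetical, written for illustration; they are not SDMetrics’ actual implementation.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_complement(real_column, synthetic_column):
    """Score in [0, 1]: closer to 1 means the synthetic column's
    distribution closely matches the real one.
    (Hypothetical helper, not SDMetrics code.)"""
    statistic, _ = ks_2samp(real_column, synthetic_column)
    return 1.0 - statistic

rng = np.random.default_rng(seed=0)
real = rng.normal(loc=50, scale=10, size=1000)
good_synth = rng.normal(loc=50, scale=10, size=1000)  # same distribution
bad_synth = rng.normal(loc=80, scale=10, size=1000)   # shifted distribution

print(round(ks_complement(real, good_synth), 2))  # close to 1
print(round(ks_complement(real, bad_synth), 2))   # much lower
```

A real toolbox like SDMetrics aggregates many such per-column and cross-column scores into overall reports.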
Check out the SDMetrics website to see the different elements of the SDMetrics toolbox.
Q: What sort of scenarios does synthetic data protect against?
A: Synthetic data has a lot of potential from a privacy perspective. There have been many examples of major privacy issues related to collecting, storing, sharing, and analyzing the data of real people, including instances of researchers and hackers alike being able to de-anonymize supposedly anonymous data. These sorts of issues are generally much less likely with synthetic data, since the dataset doesn’t correspond directly to real events or people in the first place.
Real-world data also often has errors and inaccuracies, and can miss edge cases that don’t occur very regularly. Synthetic datasets can be developed to ensure data quality down to a level of detail that includes automatically correcting erroneous labels and filling in missing values.
In addition, real-world data can be culturally biased in ways that may impact the algorithms that train on it. Synthetic data approaches can employ statistical definitions of fairness to fix these biases right at the core of the problem: in the data itself.
Q: How do you generate synthetic data that looks like real data?
A: Synthetic data is created using machine learning methods that include both classical machine learning and deep learning approaches involving neural networks.
Broadly speaking, there are two kinds of data: structured and unstructured. Structured data is generally tabular—that is, the kind of data that can be sorted in a table or spreadsheet. In contrast, unstructured data encompasses a wide range of sources and formats, including images, text, and videos.
A range of methods has been used to generate different kinds of synthetic data, and the type of data needed may determine which generation methods work best. In classical machine learning, the most common approach is a Monte Carlo simulation, which generates a variety of outcomes given a specific set of initial parameters. These models are usually designed by experts who know the domain for which the synthetic data is being generated very well. In some cases, the simulation is physics-based: for example, a computational fluid dynamics model that can simulate flight patterns.
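As a toy illustration of the Monte Carlo approach, the sketch below draws synthetic daily retail traffic from hand-chosen distributions. All parameter names and values here are invented for illustration; in practice a domain expert would set them from real-world knowledge.

```python
import random

def simulate_day(mean_customers=200, purchase_prob=0.3, avg_basket=40.0):
    """One Monte Carlo trial of a store's day.
    All parameters are hypothetical, expert-chosen inputs."""
    # Foot traffic varies around the expected mean.
    customers = random.randint(int(mean_customers * 0.8),
                               int(mean_customers * 1.2))
    # Each visitor independently decides whether to buy.
    buyers = sum(1 for _ in range(customers)
                 if random.random() < purchase_prob)
    # Basket sizes follow an exponential distribution.
    revenue = sum(random.expovariate(1 / avg_basket) for _ in range(buyers))
    return {"customers": customers, "buyers": buyers, "revenue": revenue}

random.seed(42)
# Run many trials to build a synthetic dataset of plausible days.
synthetic_days = [simulate_day() for _ in range(1000)]
```

Repeating the simulation many times yields a synthetic dataset whose variability reflects the assumptions encoded in the parameters.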
In contrast, deep learning-based methods usually involve a generative adversarial network (GAN), a variational autoencoder (VAE), or a neural radiance field (NeRF). These methods are given a subset of real data and learn a generative model from it. Once the model is learned, you can generate as much synthetic data as you like. This automated approach makes synthetic data creation possible for any type of application. Synthetic data needs to meet certain criteria to be reliable and effective—for example, preserving column shapes, category coverage, and correlations. To enable this, the processes used to generate the data can be controlled by specifying particular statistical distributions for columns, model architectures, and data transformation methods. The choice of which distributions or transformation methods to use depends heavily on the data and the use case.
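One classical way to honor per-column distributions while preserving correlations is a Gaussian copula: model each column’s marginal separately, and capture the dependence structure in normal-score space. The sketch below illustrates that general idea with invented example columns; it is not DataCebo’s implementation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)

# "Real" data: two correlated columns (age, income), invented for illustration.
age = rng.normal(40, 8, size=2000)
income = age * 1500 + rng.normal(0, 5000, size=2000)
real = np.column_stack([age, income])

def to_normal_scores(column):
    """Map a column to standard-normal scores via its empirical ranks."""
    ranks = stats.rankdata(column) / (len(column) + 1)
    return stats.norm.ppf(ranks)

# 1. Transform each column to normal scores.
normal_scores = np.column_stack([to_normal_scores(c) for c in real.T])

# 2. Learn the correlation structure in normal-score space.
corr = np.corrcoef(normal_scores, rowvar=False)

# 3. Sample correlated normal scores, then map back through each
#    real column's empirical quantiles to recover the marginals.
samples = rng.multivariate_normal(np.zeros(2), corr, size=2000)
uniform = stats.norm.cdf(samples)
synthetic = np.column_stack(
    [np.quantile(real[:, i], uniform[:, i]) for i in range(2)]
)
```

The synthetic table ends up with column shapes taken from the real marginals and a correlation close to the real one, which is exactly the kind of criterion mentioned above.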
Q: What's the advantage of using synthetic data vs. mock data?
A: Mock data, which is usually hand-crafted and written using rules, simply isn’t practical at the kind of scale that’s useful for most companies that use big data.
Most data-driven applications require writing software logic that aligns with the correlations seen in data over time—and mock data does not capture these correlations.
For example, imagine that you’re an online retailer who wants to recommend a specific deal for a customer who has, say, bought a TV and made at least seven other transactions. To test whether this logic would work as specified when written in software, you'd need the data that has those patterns, which could either be real production data or synthetic data that's based on real-world data.
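A minimal sketch of that testing workflow, with an invented synthetic-transactions table and the hypothetical “bought a TV plus at least seven other transactions” rule under test (in practice the table would come from a model trained on real production data, not from the naive random draw used here):

```python
import random

random.seed(1)

PRODUCTS = ["tv", "laptop", "phone", "cable", "speaker"]

def synthetic_customer(customer_id):
    """Stand-in for a row of model-generated synthetic data."""
    n = random.randint(1, 12)
    return {
        "id": customer_id,
        "purchases": [random.choice(PRODUCTS) for _ in range(n)],
    }

customers = [synthetic_customer(i) for i in range(500)]

def qualifies_for_deal(customer):
    """The business rule under test: bought a TV and made at least
    eight total transactions (the TV plus seven others)."""
    return ("tv" in customer["purchases"]
            and len(customer["purchases"]) >= 8)

eligible = [c for c in customers if qualifies_for_deal(c)]
```

Because the synthetic table contains customers on both sides of the threshold, the rule can be exercised end to end without touching real customer records.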
There are numerous examples like these where patterns in the data matter for testing the logic written in the software, and mock data isn’t able to capture them. These days, more and more data-based logic is being added to software applications. Capturing this logic by hand-writing rules has become virtually impossible at the kind of scale needed to provide real value to the organizations that rely on these applications.
We discuss the limits of mock data in more detail on our blog.
Q: Are there any benefits or concerns with open sourcing this library? Will it be more secure? Can someone reverse engineer real data knowing the models and algorithms?
A: DataCebo’s Synthetic Data Vault resource includes a number of modeling techniques and algorithms. Making these algorithms public allows for transparency, improved cross-checks from the community, and enhancements to the underlying methods to enable more privacy. A data controller then applies these algorithms to data in a private setting to train a model. One outcome of this approach is that the models themselves are not public.
There are also some privacy-enhancing techniques that are added during the training process. These techniques, while described in the literature, are not part of the open-source library.
Knowing these techniques in and of themselves may not lead to reverse engineering, as there is a sufficient amount of randomness involved. It is, however, an interesting question that the community should think about.
Our new SDMetrics release includes evaluation methods for synthetic data along a variety of axes. These metrics cover the quality of the synthetic data, its efficacy for a specific task, and several aspects of privacy.
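To make the privacy axis concrete, here is a simple, self-contained distance-to-closest-record style check, one common way to flag synthetic rows that sit suspiciously close to real ones. It is illustrative only; the function and data are invented and this is not one of SDMetrics’ actual metrics.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
real = rng.normal(size=(500, 4))        # invented "real" table, 4 columns
synthetic = rng.normal(size=(500, 4))   # independently generated rows
leaky = real[:200] + rng.normal(scale=0.001, size=(200, 4))  # near-copies

def min_distance_to_real(synth_rows, real_rows):
    """For each synthetic row, the Euclidean distance to its nearest
    real row. Very small distances suggest the synthetic data may
    effectively reveal real records."""
    dists = np.linalg.norm(
        synth_rows[:, None, :] - real_rows[None, :, :], axis=2
    )
    return dists.min(axis=1)

print(min_distance_to_real(synthetic, real).mean())  # comfortably large
print(min_distance_to_real(leaky, real).mean())      # suspiciously tiny
```

A privacy report built on a check like this would flag the near-copy rows even though they are not byte-for-byte duplicates of real records.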
We feel that it’s especially important for these metrics to be open-source, as it allows standardization of assessment in the community. The creation of synthetic data—and the synthetic data itself—is ultimately going to be in a “behind the wall” setting. Because of that dynamic, we wanted to create a standard that everyone can refer to when someone references the metric they used to evaluate their (walled-off) data. People can go back to SDMetrics to look at the code underneath the metric, and hopefully have more trust in the metrics being used.