Unlocking the potential of synthetic data: A business game-changer

Understand the potential applications and limitations of generating synthetic data using AI models.


The idea of “synthetic data,” or artificially generated information, has recently caused a stir. Data is a huge asset for businesses in this age, and knowledge often provides a decisive competitive edge. The notion of easily obtaining free data has sparked extravagant claims and controversy.

But, as your mom probably told you — if something seems too good to be true, it is.

However, the reality is a little more nuanced with synthetic data. While we certainly can’t stop collecting data and simply “ask the model” instead, some fascinating middle-ground uses of AI-generated data exist, and judicious use of this data can help drive your business forward. There’s no free lunch here, but there is at least the possibility of a complimentary side or two.

To better understand the opportunities opening up with synthetic data, I will introduce you to three primary modes you can use to generate new data. These aren’t the only ones available, but they are the most common approaches today.

1. Direct querying 

The first mode is the one people most commonly associate with the idea of synthetic data: direct querying. When you first used ChatGPT or another AI chatbot, there was probably a point when you said to yourself, “Wait a second. I can interview this just like I would a research respondent.” You tweak the system prompt (“You are a Gen Z participant who is passionate about RPGs…”) and proceed to ask your questions.
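To make that concrete, here is a minimal sketch of what direct querying can look like in code. It assumes the OpenAI Python client; the model name, persona and interview question are illustrative placeholders, not a recommendation:

```python
# A minimal sketch of "direct querying," assuming the OpenAI Python client.
# The model name, persona and question are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

persona = (
    "You are a Gen Z research participant who is passionate about RPGs. "
    "Answer interview questions in the first person, in your own voice."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; substitute whatever you use
    messages=[
        {"role": "system", "content": persona},
        {"role": "user", "content": "What made you pick up your last game?"},
    ],
)

print(response.choices[0].message.content)
```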

Working with this kind of data can quickly become problematic or uninsightful because training datasets can be old. Responses can be biased, and inappropriate viewpoints can easily bubble up. Additionally, a large chunk of the training data for these models comes from services like Reddit, which can have spicier takes than you’d want in your own data.

Beyond those red flags, the main issue with this kind of data is that it is boring. By its very nature, it produces plausible answers based on the amalgam of all its training. Therefore, it has a tendency to produce obvious answers — the very opposite of the kind of insight we are usually looking for. While direct questioning of the LLMs can be interesting, large-scale generation of synthetic data in this way is likely not the best solution.

Dig deeper: AI in marketing: Examples to help your team today

2. Data augmentation 

We can move beyond direct querying with the second mode: using the models to generate new data from data that you bring to them, often called data augmentation. This method still uses the reasoning and summarization power of the LLMs, but rather than basing the output solely on the model’s original training data, you leverage the model to analyze your own data and generate perturbations of it, as if they were original data.

The process looks something like this. First, you must know the data you are bringing to the table. Perhaps it’s data sourced from an internal system, primary research or a trusted third-party supplier, or data enriched through segmentation or appended behavioral attributes. Once you understand the source of your data, you can use the LLM to analyze it and produce more data with compatible characteristics.

This approach is far more promising and provides you with control you cannot get from the LLMs on their own.

Many in the martech industry might be thinking, “Like look-alikes?” and you would be correct. The new models allow us to generate look-alikes in a way we have never been able to before, letting us augment or generate data that stays consistent and comparable with the known data we already have.
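As an illustration, here is one hedged way to prompt for look-alikes from a handful of seed records. It again assumes the OpenAI Python client, and the customer fields and values are entirely hypothetical:

```python
# A hedged sketch of look-alike generation, assuming the OpenAI Python
# client. The field names and seed records are hypothetical examples of
# data you might bring to the table yourself.
import json
from openai import OpenAI

client = OpenAI()

seed_records = [
    {"age": 24, "segment": "enthusiast", "monthly_spend": 61.50},
    {"age": 31, "segment": "casual", "monthly_spend": 18.00},
    {"age": 27, "segment": "enthusiast", "monthly_spend": 74.25},
]

prompt = (
    "Here are three customer records as JSON:\n"
    f"{json.dumps(seed_records, indent=2)}\n\n"
    "Generate five new records with the same fields and compatible "
    "characteristics (plausible ages, segments and spend levels). "
    "Return only a JSON array, with no other text."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model
    messages=[{"role": "user", "content": prompt}],
)

# In practice, validate the output: models do not always return clean JSON.
look_alikes = json.loads(response.choices[0].message.content)
print(look_alikes)
```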

Often, having a volume of data like this is helpful when testing systems or exploring some of the fringes a system might need to handle. It could also be used to provide truly anonymous data for demonstrations or presentations. Avoid the circular thinking of “Let’s generate a ton of data and analyze it,” when you are better off simply analyzing the root data. 

3. Data retraining 

Finally, the third mode of generating synthetic data is retraining a model to directly represent the data we have. The “holy grail” approach of taking a model and doing custom fine-tuning on a data set has been around for a long time but, until recently, it simply took too many resources and was far too expensive to be a reasonable option for most.

But technologies change. The prevalence of smaller but high-performance models (e.g., LLaMA, Orca and Mistral), together with recent revolutionary approaches to fine-tuning (e.g., Parameter-Efficient Fine-Tuning, or PEFT, and the LoRA, QLoRA and DoRA sisters), means that we can effectively and efficiently produce highly customized models trained on our data. These are likely to be the techniques that truly make synthetic data shine, at least for the near future.
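For the curious, here is a minimal sketch of what LoRA-style fine-tuning looks like with the Hugging Face transformers and peft libraries. The base model name and hyperparameters are assumptions you would tune for your own data:

```python
# A minimal sketch of parameter-efficient fine-tuning with LoRA, using the
# Hugging Face transformers and peft libraries. The base model name and
# hyperparameters are illustrative assumptions, not recommendations.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # assumed base model; swap in your own
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA trains small low-rank adapter matrices instead of all model weights,
# which is what makes fine-tuning on your own data affordable.
config = LoraConfig(
    r=8,                                  # rank of the adapter matrices
    lora_alpha=16,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of all weights

# From here, train with a standard loop or the transformers Trainer on a
# dataset built from the data you want the model to represent.
```

Once trained, sampling from the adapted model produces synthetic records that reflect your own data rather than the generic amalgam of the model’s original training.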

While there is no free lunch, and the dangers of bias, boredom and circular thinking are very real, the opportunities of synthetic data make it highly compelling. When leveraged correctly, it can create efficiencies and open up new possibilities.



Dig deeper: How to make sure your data is AI-ready



Opinions expressed in this article are those of the guest author and not necessarily MarTech.


About the author

Chris Robson
Contributor
In his role as Human8’s Senior Director, Data Science, Chris is charged with driving the growth of Gongos’ analytics and data science capability by expanding the methods, tools and techniques the company brings to its clients. Chris is an acknowledged expert in research methodology and data science and a well-known figure in the insights industry. He strongly believes in the importance of solid methodology combined with a laser focus on the business problem.

Prior to joining Human8, he was co-founder and Principal at Deckchair Data, an Insights and Analytics consultancy. Prior to founding Deckchair, he was Chief Innovation Officer and Head of Research Science for ORC International. Before that, he was Co-Founder of Parametric Marketing, a boutique analytics and methodology consultancy.

He has held various senior technical and marketing positions at companies both small and huge, ranging from VP Engineering at an analytics start-up to managing a global team of over a hundred software developers at Hewlett-Packard.

A mathematician by training, Chris confesses to being a total geek and is never happier than when he is elbows-deep in data.
