
Simulating lousy conversations: Q&A with Silvio Savarese, Chief Scientist & Head of AI Research at Salesforce

AI yells at voice agents so you don't have to.

Credit: Alexandra Francis

One of the big use cases for LLMs and AI agents is customer service. Many of those interactions happen by phone, which means your customer service bots need to understand voice interactions. If you’ve ever answered a phone on behalf of an organization, you know that those voice interactions are messy—hostile, interrupted, full of background noise, and unpredictable. But Salesforce is working on simulating that messiness so their voice agents respond better to real-life phone calls. We reached out to Silvio Savarese, Chief Scientist & Head of AI Research at Salesforce, to ask how they are creating eVerse, a simulation tool that battle-tests AI agents without angering real customers.


Q: Some may consider AI voice agents and simulated voice training environments as an over-engineered solution to a problem handled reasonably well by phone button menus. Why go through the trouble of giving this use case to AI?

Silvio Savarese: Phone menus work fine for simple, scripted interactions—"press 1 for balance"—but they collapse when customers face complex, multi-step problems that don't fit the script. And from a user experience perspective, it’s not ideal.

What agents enable us to do is capture the nuance in human language. And that nuance isn’t just in the nature of the request. I might have a hard time articulating my issue. Or I might need some clarifying questions.

The reality is that there are a lot of edge cases, where the press of a button is simply not enough. And this is also why many people pick up the phone in the first place.

This is why simulation environments like eVerse are so important. We are able to create synthetic representations of many different edge cases so that we can give the best experience for the customer.

And then of course, if a human is still needed, the conversation can be seamlessly transferred, while retaining all of the context collected up to that point.

Q: How did you determine the aspects of real conversations that would be present in simulations? How do you simulate those aspects without the agent being able to anticipate the nature of the problem? Put differently, why simulate a windy conversation instead of just engineering wind-mitigation into the agent?

SS: That’s a great question, and it goes to the heart of our simulation environment framework. Synthetic data generation can be extremely varied; it can create scenarios that we wouldn’t even think of, and it can do so at scale.

And because the synthetic data is generated separately from the training data of the agent, we ensure that the agent doesn’t “anticipate” the nature of the problem.

So while a windy day may be something that our agent should be able to handle, what we are really training the agents for is handling unpredictable scenarios.

These scenarios may include noise, like wind. But the challenge could also be a different type of noise, something language-related, or something different altogether.

Different businesses can also have more specific challenges they deal with. Think about the types of challenges you might face when ordering at a drive-through, or when you’re in a busy airport trying to change a flight.

What synthetic data generation allows you to do is take a small amount of sample data and extrapolate it into many different permutations.
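To make that concrete, here is a minimal sketch of how a handful of seed utterances might be extrapolated into many scenario permutations. This is an illustration rather than eVerse itself; the noise profiles, speaker traits, and function names are assumptions made for the example.

```python
import itertools
import json
import random

# Illustrative only: expand a few seed utterances into many synthetic test
# scenarios by permuting acoustic and conversational conditions. The
# condition lists and names below are assumptions, not eVerse APIs.
SEED_UTTERANCES = [
    "I was double-charged on my last bill.",
    "I need to change my flight to tomorrow morning.",
]

NOISE_PROFILES = ["clean", "wind", "drive_through_speaker", "airport_crowd"]
SPEAKER_TRAITS = ["calm", "frustrated", "hard_to_articulate"]
INTERRUPTIONS = [0, 1, 3]  # number of mid-sentence interruptions


def generate_scenarios(seeds, samples_per_seed=50, rng=random.Random(7)):
    """Extrapolate a small seed set into many scenario permutations."""
    conditions = list(itertools.product(NOISE_PROFILES, SPEAKER_TRAITS, INTERRUPTIONS))
    scenarios = []
    for seed in seeds:
        picks = rng.sample(conditions, k=min(samples_per_seed, len(conditions)))
        for noise, trait, interrupts in picks:
            scenarios.append({
                "utterance": seed,
                "noise_profile": noise,
                "speaker_trait": trait,
                "interruptions": interrupts,
            })
    return scenarios


if __name__ == "__main__":
    scenarios = generate_scenarios(SEED_UTTERANCES)
    print(f"{len(SEED_UTTERANCES)} seeds -> {len(scenarios)} synthetic scenarios")
    print(json.dumps(scenarios[0], indent=2))
```

Even this toy grid turns two seed utterances into dozens of distinct test conditions; a real pipeline would also vary the audio itself and the shape of the dialogue.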

Once we have gone through the eVerse simulation loop and handled all the possible corner cases, these simulation environments will have served their purpose and will no longer be needed.

Q: What’s the difference between the simulation training data and the agent training data? How do you ensure that the agent data isn’t contaminated?

SS: Let me clarify the distinction. LLMs are pre-trained on vast amounts of general data, but most of that isn't relevant to the specific enterprise scenarios we're testing. Agents are sophisticated frameworks built around those LLMs—and with our latest Agentforce capabilities, we can now dial agents to be more deterministic or more creative depending on the use case.

The simulation data is fundamentally different from LLM pre-training data. We take small amounts of real enterprise data and use it to generate realistic synthetic scenarios that would be nearly impossible for LLMs to have encountered during pre-training. This ensures the agent isn't just memorizing—it's learning to generalize.
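One simple way to picture that separation is a contamination check run before evaluation. The sketch below assumes both corpora are available as lists of strings and uses token overlap as a stand-in for the stronger near-duplicate detection a production pipeline would use; nothing in it is specific to eVerse.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word tokens, good enough for a rough overlap check."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def flag_contamination(training_texts, scenario_texts, threshold=0.8):
    """Return synthetic scenarios that look suspiciously close to training data."""
    train_token_sets = [tokens(t) for t in training_texts]
    return [
        s for s in scenario_texts
        if any(jaccard(tokens(s), tt) >= threshold for tt in train_token_sets)
    ]
```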

Q: How do you identify and fix gaps in the simulation’s abilities?

SS: Identifying and fixing gaps is really the heart of eVerse. After we simulate large volumes of synthetic scenarios, we measure agent responses through what we call benchmarking. Some methods are quantifiable—did the agent take the correct action or not? Others are qualitative—was the simulated customer satisfied with the outcome? We also use human annotators to validate agent responses for certain critical scenarios.
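As an illustration of those two kinds of measurement, the sketch below scores a batch of simulated conversations on action correctness and on a satisfaction score. The judge_satisfaction placeholder stands in for a judge model or a human annotator; it and the data shapes are assumptions, not a real eVerse API.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class SimulationResult:
    expected_action: str   # e.g. "issue_refund"
    observed_action: str   # action the agent actually took
    transcript: str        # full simulated conversation

def judge_satisfaction(transcript: str) -> float:
    """Placeholder for an LLM judge or human annotator returning 0.0-1.0."""
    return 1.0 if "sorry" in transcript.lower() else 0.5  # toy heuristic

def benchmark(results: list[SimulationResult]) -> dict:
    """Quantitative check (correct action?) plus qualitative score (satisfied?)."""
    action_accuracy = mean(
        1.0 if r.observed_action == r.expected_action else 0.0 for r in results
    )
    satisfaction = mean(judge_satisfaction(r.transcript) for r in results)
    return {"action_accuracy": action_accuracy, "satisfaction": satisfaction}
```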

The feedback we collect from assessing how well agents handle these simulated scenarios is what drives continuous improvement in agent performance. And critically, edge cases aren't finite—they evolve as customer behavior, regulations, and business rules change. Just like flight simulators remain essential even for experienced pilots, eVerse becomes more valuable as agents scale, providing a safe environment to test changes in cases where the cost of production failure is too high.

Q: People are not always the kindest or most considerate to customer service agents. How do you include that in the simulation while preventing the AI agent from cursing back at the simulation?

SS: Since this is a simulation, there are no real humans whose feelings might be hurt by a rogue agent.

If an agent indeed has a response that is inappropriate in certain situations, this is exactly the environment where we want to discover this behavior and correct it.

We also want to make sure that agents can handle tough situations in the best way possible. Building empathy and helping to defuse the conversation is an important part of customer service.

Overall, situations where agents might “curse back” can be detected either by having humans in the loop assess whether the agent is responding appropriately, or by having judge agents/models trained on sentiment detection automatically flag responses that are out of line.
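A sketch of that kind of judge check might look like the following. The keyword heuristic inside classify_tone is only there so the example runs; in practice it would call a sentiment or toxicity model, or an LLM judge, and the labels and threshold are assumptions.

```python
UNACCEPTABLE = {"hostile", "profane"}
PROFANITY = {"damn", "shut up"}  # toy word list, not a real lexicon

def classify_tone(agent_reply: str) -> tuple[str, float]:
    """Stand-in for a real judge model: returns (label, confidence)."""
    lowered = agent_reply.lower()
    if any(phrase in lowered for phrase in PROFANITY):
        return "profane", 0.95
    return "neutral", 0.90

def review_reply(agent_reply: str, threshold: float = 0.8) -> bool:
    """True if the reply passes; False means escalate for human review."""
    label, confidence = classify_tone(agent_reply)
    return not (label in UNACCEPTABLE and confidence >= threshold)

assert review_reply("I'm sorry about the trouble, let me fix that.")
assert not review_reply("Just shut up and listen.")
```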

Q: On the flip side, you mention “Move 37” from the Go match between Go master Lee Sedol and AlphaGo in your blog post on synthetic data. This move surprised and baffled the Go experts, but was very effective. How do you ensure that the simulation remains rooted in real-world human interactions and doesn’t throw out baffling moves that make sense based on training data?

SS: “Move 37” is a fascinating example of what simulation environments can produce. Go is the world’s most ancient board game still played in its original form, and it has a stupendous number of possible moves.

And in thousands of years of human play, no one had ever made that now-famous “Move 37.” And indeed, that move baffled Lee Sedol and the people watching the game.

But something interesting happened after that match. Rather than trying to prevent AlphaGo from making baffling moves, Go players today leverage AI to learn and improve their own game.

Many players now see AI not as a competitor to be defeated, but as a tool to help them become better players.

I think this is exactly the potential of AI in business scenarios as well. It is a tool that can help salespeople, service people, and many other functions in an organization improve.

It’s also important to establish guardrails that keep agents within proper, trusted boundaries and prevent off-the-charts behavior. This can be enforced either, again, by using judge agents/models, or by using determinism, as we are currently doing in the new release of Agentforce.

Q: You’re partnering with UCSF Health to test this in a medical/billing environment. How are you simulating the complexity of the terminology and how is that trial faring?

SS: The healthcare space is extremely important, and a huge opportunity for AI to help alleviate some of the pressure that physicians and other workers within that ecosystem face.

With UCSF Health, we are starting with billing use cases first, because billing is a big pain point for patients. There are so many different systems that need to be accessed in order to provide an answer, and many times the knowledge of how to get that information is trapped in people’s heads. Those people are the subject matter experts.

Our pilot with UCSF Health is showing really promising results. With the Learning Engine we built using eVerse, AI agents bring in humans to intervene when they don’t know the answer. So instead of hallucinating and making up an incorrect answer, a human is able to step in and “teach” the AI the correct way to handle a certain situation.
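In rough pseudocode, that escalate-and-retain loop might look like the sketch below; the knowledge store, confidence threshold, and helper names are assumptions made for illustration, not the actual UCSF Health or eVerse implementation.

```python
KNOWLEDGE_STORE: dict[str, str] = {}  # answers taught by humans, keyed by question
CONFIDENCE_FLOOR = 0.75               # below this, escalate instead of guessing

def escalate_to_expert(question: str) -> str:
    """Placeholder for routing the question to a human billing expert."""
    return f"[human-provided answer for: {question}]"

def answer_billing_question(question: str, agent_answer: str, agent_confidence: float) -> str:
    # 1. Reuse an answer a human already taught for this question.
    if question in KNOWLEDGE_STORE:
        return KNOWLEDGE_STORE[question]
    # 2. If the agent is confident, let its answer through.
    if agent_confidence >= CONFIDENCE_FLOOR:
        return agent_answer
    # 3. Otherwise escalate to a subject matter expert and retain what they teach.
    human_answer = escalate_to_expert(question)
    KNOWLEDGE_STORE[question] = human_answer
    return human_answer
```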

Industry data suggests that 60-70% of inbound calls to healthcare contact centers are routine inquiries, which can be fully automated and handled by AI. For the remaining 30-40% of more complex cases, eVerse plays a key role—continuously improving its performance through human-in-the-loop feedback and gradually expanding coverage.

The results we’re seeing show that we are able to move the needle from the 60-70% range to 84-88% coverage. What this means is that the new skills human experts are teaching AI agents can be generalized and retained by the Learning Engine to improve coverage, and in doing so relieve pressure on the humans so they can focus on the most complex tasks.
