Lattice: Building Self-Correcting Guardrails for Conversational Agents

Feb 20, 2026

Enterprise guardrails are often shipped as static defenses: a handful of regexes, keyword blocks, or LLM judges tied to known failure modes. They perform well early on but fall behind as attack vectors evolve. Users find novel ways to phrase the same requests, stretch attacks across turns, or exploit integrations the system wasn't designed to defend. The misses aren't dramatic, but they carry real costs: an unauthorized refund processed through a tool call, a policy document leaked turn by turn, or a competitor comparison that legal would never approve. The challenge is keeping protections accurate and performant as usage patterns evolve.

In Building Brand-Optimized Contact Center Agents, we explored how normal-seeming customer support conversations can lead to high-risk behavior and outlined first-line mitigations enterprises can use to defend against it, including input filtering and tool restrictions. This post introduces Lattice, the system we built to learn and maintain guardrails automatically. To see where Lattice fits, we first examine how enterprise guardrails make decisions in production.

How do enterprise guardrails make decisions in production?

Guardrails generally operate alongside the main model, each targeting a specific risk. Some use LLMs to make subjective assessments, like whether a query could lead to sensitive competitor comparisons. Others are deterministic, enforcing hard constraints through pattern matching, allowlists and denylists, schema validation, and tool-call restrictions. Together, they determine whether a response is delivered or blocked.
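As a rough illustration of how several guardrails can combine into a single deliver-or-block decision, here is a minimal Python sketch. It is not Lattice's implementation. The `Turn` structure, the tool names, the card-number pattern, and the keyword stand-in for an LLM judge are all hypothetical, chosen only to show one deterministic allowlist check, one pattern match, and one subjective check running side by side.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Turn:
    """One agent turn awaiting a deliver/block decision (hypothetical shape)."""
    user_message: str
    draft_response: str
    tool_calls: list[str] = field(default_factory=list)


# Hypothetical allowlist of tools the agent may call without review.
ALLOWED_TOOLS = {"lookup_order", "check_refund_status"}

# Illustrative pattern for 13 to 16 digits, optionally separated by spaces or hyphens.
CARD_NUMBER = re.compile(r"\b(?:\d[ -]?){13,16}\b")


def tool_not_allowlisted(turn: Turn) -> bool:
    """Deterministic check: fires if any requested tool is outside the allowlist."""
    return any(name not in ALLOWED_TOOLS for name in turn.tool_calls)


def card_number_in_response(turn: Turn) -> bool:
    """Deterministic check: fires if the draft response contains a card-like number."""
    return CARD_NUMBER.search(turn.draft_response) is not None


def competitor_comparison_judge(turn: Turn) -> bool:
    """Placeholder for an LLM judge; a real system would call a model here."""
    return "better than" in turn.user_message.lower()


GUARDRAILS: dict[str, Callable[[Turn], bool]] = {
    "tool_not_allowlisted": tool_not_allowlisted,
    "card_number_in_response": card_number_in_response,
    "competitor_comparison_judge": competitor_comparison_judge,
}


def decide(
    turn: Turn, guardrails: dict[str, Callable[[Turn], bool]] = GUARDRAILS
) -> tuple[str, list[str]]:
    """Block the response if any guardrail fires; report which ones fired."""
    fired = [name for name, check in guardrails.items() if check(turn)]
    return ("blocked" if fired else "delivered"), fired
```

In this sketch every guardrail runs on every turn, which is why each addition carries a cost and adds another chance of blocking a safe query.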

It’s tempting to add every guardrail you can think of, but guardrails aren’t free. Each one increases costs and risks blocking “safe” queries. Conversely, removing too many in pursuit of cost savings risks collapsing coverage. The real problem isn’t “add guardrails vs. don’t”; it’s finding the smallest set that materially reduces risk and keeping it current as usage evolves. That tension motivated Lattice: a shift toward guardrails as a living system, continuously tested and updated from production data, rather than locked into design-time assumptions.

Lattice Architecture

Lattice works in two phases: self-construction and self-improvement. During construction, it uses labeled conversations to identify the smallest set of guardrails needed to protect against known failure modes. Once deployed, Lattice supports self-improvement. It watches for suspicious patterns in real usage, stress-tests them through adversarial scenarios, and determines whether they represent genuine threats. As new failure modes emerge, the guardrail set evolves: refined where it’s too blunt, expanded where coverage is missing, and pared back when protections are no longer needed.

Self-Construction: Learning guardrails from real conversations

Guardrail creation is fundamentally an optimization problem: maximize detection of harmful behavior while minimizing false positives. Our approach starts with labeled conversations, each tagged as either safe or guardrail-triggering. Using these, Lattice generates realistic variations of each conversation, tweaking the wording, changing customer personas, and letting risky intent surface gradually over multiple turns.

It then proposes a set of guardrails and immediately tests them, surfacing where they miss real risks or fire incorrectly. Over successive iterations, Lattice tracks two types of errors: harmful behavior that slips through and valid queries that get blocked. These errors drive targeted updates. When Lattice encounters a new attack pattern, it introduces a new guardrail. When several guardrails overlap, it consolidates them into a broader one. When guardrails fire too often, their scope is tightened or they’re removed.

The system keeps what works and discards what doesn’t, autonomously retaining only changes that measurably improve the F1 score. The result is a set of battle-tested guardrails that deliver strong coverage without redundancy. Using just 100 conversations, Lattice achieves an F1 score of 91%, outperforming NeMo (87%) and LlamaGuard (66%).
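The keep-what-works rule can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the method from the paper: guardrails are modeled as plain functions over a conversation string, the guardrail set fires if any member fires, and `f1_score` and `accept_if_better` are hypothetical helper names. F1 is computed as 2·TP / (2·TP + FP + FN), where false negatives are harmful conversations that slip through and false positives are valid conversations that get blocked.

```python
from typing import Callable, Sequence

Guardrail = Callable[[str], bool]          # returns True when the guardrail fires
LabeledSet = Sequence[tuple[str, bool]]    # (conversation, is_harmful)


def f1_score(guardrails: Sequence[Guardrail], labeled: LabeledSet) -> float:
    """F1 of a guardrail set, where the set fires if any member fires."""
    tp = fp = fn = 0
    for conversation, is_harmful in labeled:
        fired = any(g(conversation) for g in guardrails)
        if fired and is_harmful:
            tp += 1      # harmful conversation caught
        elif fired:
            fp += 1      # valid conversation blocked
        elif is_harmful:
            fn += 1      # harmful conversation slipped through
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def accept_if_better(
    current: Sequence[Guardrail],
    candidate: Sequence[Guardrail],
    labeled: LabeledSet,
) -> Sequence[Guardrail]:
    """Keep the candidate set only if it strictly improves F1; otherwise keep current."""
    if f1_score(candidate, labeled) > f1_score(current, labeled):
        return candidate
    return current
```

For example, a blunt rule that fires on any mention of "refund" blocks legitimate status questions, while a narrower candidate that fires only on "without a receipt" scores higher on the same labeled set and would be kept.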

Self-Improvement: Learning from failures in the wild

After construction, design-time assumptions inevitably start to break down. To keep up, Lattice adds an improvement loop that executes offline on unlabeled production data. It starts with a coarse, over-cautious “general safety” guardrail. This guardrail acts as a noisy first pass, using an LLM-based judge to flag conversations that might be risky. When the general guardrail fires but none of the specific guardrails respond, the system treats it as a signal that coverage may be missing.
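The coverage-gap signal described above reduces to a simple filter. The sketch below assumes, purely for illustration, that the general guardrail and each specific guardrail can be called as a function on a conversation and return True when they fire. The name `find_coverage_gaps` is hypothetical, and in practice the general check would be an LLM-based judge rather than a local function.

```python
from typing import Callable, Iterable

Check = Callable[[str], bool]  # returns True when the guardrail fires


def find_coverage_gaps(
    conversations: Iterable[str],
    general: Check,
    specific: Iterable[Check],
) -> list[str]:
    """Return conversations the general guardrail flags but no specific guardrail catches."""
    specific_checks = list(specific)  # materialize so the checks can be reused per conversation
    return [
        conversation
        for conversation in conversations
        if general(conversation)
        and not any(check(conversation) for check in specific_checks)
    ]
```

Because the general guardrail is deliberately over-cautious, the conversations this returns are candidates for stress-testing, not confirmed threats.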

From there, Lattice stress-tests the flagged interaction. It extracts the inferred attack goal and runs a multi-turn adversarial beam search, expanding a conversation tree (width k, depth d) with varied personas, phrasings, and slow-roll intent. Each leaf is then labeled as a successful bypass, a correctly blocked attack, or a false alarm.
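A generic beam search over conversation trees, with width k and depth d, might look like the following Python sketch. It makes several assumptions that are not from the paper: a conversation is a tuple of user turns, `propose` stands in for an attacker model that suggests next turns, `score` stands in for whatever heuristic ranks partial attacks, and `label_leaf` assumes ground truth about harmfulness is available from a judge. The fourth label, `benign_pass`, is added here only to make the function total.

```python
from typing import Callable

Conversation = tuple[str, ...]  # user turns in order (hypothetical representation)


def beam_search(
    seed: Conversation,
    propose: Callable[[Conversation], list[str]],
    score: Callable[[Conversation], float],
    k: int,
    d: int,
) -> list[Conversation]:
    """Expand the conversation tree for d rounds, keeping the k best branches per round."""
    beam = [seed]
    for _ in range(d):
        # Extend every surviving conversation with each proposed next turn.
        candidates = [conv + (turn,) for conv in beam for turn in propose(conv)]
        if not candidates:
            break
        # Keep the k highest-scoring extensions; ties keep their original order.
        beam = sorted(candidates, key=score, reverse=True)[:k]
    return beam


def label_leaf(is_harmful: bool, guardrail_fired: bool) -> str:
    """Map a leaf's outcome to the labels used for guardrail updates."""
    if is_harmful and not guardrail_fired:
        return "successful_bypass"
    if is_harmful and guardrail_fired:
        return "correctly_blocked"
    if guardrail_fired:
        return "false_alarm"
    return "benign_pass"  # assumption: not one of the three labels named in the text
```

The beam keeps the search tractable: each round scores at most k times the number of proposals per conversation, instead of the whole tree.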

Those labeled leaves provide the signal Lattice needs to improve. Offline, it updates the guardrail set by broadening policies that should have fired but didn’t, tightening rules that over-triggered, adding new guardrails for genuinely novel patterns, and consolidating redundant ones. Updates are only deployed if they improve performance; any change that lowers the F1 score is automatically rolled back.

In production, we lack labels for real user conversations, making it difficult to measure end-to-end performance directly. Instead, we evaluate improvements on a separate holdout set of labeled conversations. On this offline benchmark, the updated system delivered a 7% improvement over the original.

Safety is a system

AI is already taking actions and speaking on behalf of organizations. But many teams still treat safety like something you configure once and move on from. That mindset breaks down as soon as systems move into production. Lattice is part of Distyl’s effort to treat safety as a living system, something you build, test, monitor, and improve over time, just like any other critical infrastructure. Today’s AI systems are powerful and confident, but not infallible. The real question isn’t whether they’ll make mistakes. It’s whether you’ll catch them quickly, learn from them, and adapt before the consequences compound. That kind of responsiveness doesn’t just happen. It requires continuous performance monitoring. We already track latency, uptime, and reliability by default. Guardrails should live in that same loop. Safety isn’t just a feature you ship. It’s part of how the system performs.

Read the full paper, “Lattice: Generative Guardrails for Conversational Agents,” to learn more.
