When I built the conversational agent for Ubunifu Madness, my NCAA prediction platform, I had a specific problem: the model knew enough about basketball to fabricate plausible-sounding facts. It would confidently report scores that never happened. It would invent player statistics. None of this was malicious. It was the model doing what models do, which is generate plausible continuations of text.

The solution I landed on was structural: instead of asking the model to be careful, I constrained where its facts could come from.

The structural pattern

The agent had access to seven custom tools, each backed by a deterministic database query:

  • getTeam(name): looks up a single team
  • getRecentGames(team, limit): pulls recent results from the database
  • getRanking(team, system): returns ranking from one of several published systems
  • predictMatchup(home, away): runs the prediction model, returns a probability
  • getFeatureBreakdown(home, away): explains what the model weighted
  • getPlayerStats(player, team): pulls box-score-level data
  • compareTeams(a, b, metric): head-to-head comparison on a chosen metric
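To make "backed by a deterministic database query" concrete, here is a minimal sketch of what one such tool could look like. It is not the platform's actual code: the `games` table, its columns, the `make_demo_db` helper, and the placeholder rows are all assumptions made for illustration, and SQLite stands in for whatever database the real system uses.

```python
import sqlite3
from typing import Dict, List


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with placeholder rows (not real results)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE games (team TEXT, opponent TEXT, played_on TEXT, "
        "team_score INTEGER, opp_score INTEGER)"
    )
    conn.executemany(
        "INSERT INTO games VALUES (?, ?, ?, ?, ?)",
        [
            ("Team A", "Team B", "2026-01-03", 70, 65),
            ("Team A", "Team C", "2026-01-10", 58, 61),
            ("Team A", "Team D", "2026-01-17", 80, 72),
        ],
    )
    return conn


def get_recent_games(conn: sqlite3.Connection, team: str, limit: int) -> List[Dict]:
    """Return the most recent games for a team, newest first.

    The same arguments against the same rows always give the same answer,
    which is the property the agent relies on.
    """
    rows = conn.execute(
        "SELECT opponent, played_on, team_score, opp_score FROM games "
        "WHERE team = ? ORDER BY played_on DESC LIMIT ?",
        (team, limit),
    ).fetchall()
    return [
        {"opponent": opp, "date": day, "team_score": ts, "opp_score": os_}
        for opp, day, ts, os_ in rows
    ]


# Hypothetical registry: the tool name the agent sees maps to a plain function.
TOOLS = {"getRecentGames": get_recent_games}
```

The design point is that the tool returns structured rows or an empty list, never prose. A team with no rows yields `[]`, which gives the agent nothing to embellish.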

The system prompt was simple: any claim about teams, players, games, or predictions must be backed by a tool call. If you cannot make a tool call to verify a claim, you must say so explicitly.

That's the grounding rule. Phrased that way it sounds like a content policy. In practice it shapes the agent's behavior more than any other instruction.
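The rule itself lives in the system prompt, but it can also be read as a checkable invariant. The sketch below is one hypothetical way to express it in code, assuming the agent's answer can be broken into claims that each carry the id of the tool call backing them. The `Claim` type, the `flagged_unverified` field, and `ungrounded_claims` are illustrative names, not part of the actual agent.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

# The rule as quoted above, kept as a constant for the system prompt.
GROUNDING_RULE = (
    "Any claim about teams, players, games, or predictions must be backed by "
    "a tool call. If you cannot make a tool call to verify a claim, you must "
    "say so explicitly."
)


@dataclass
class Claim:
    text: str
    tool_call_id: Optional[str] = None  # None: no tool call backs this claim
    flagged_unverified: bool = False    # True: the agent said so explicitly


def ungrounded_claims(claims: List[Claim], executed_call_ids: Set[str]) -> List[Claim]:
    """Return claims that break the rule.

    A claim is acceptable if it cites a tool call that actually ran, or if
    it is explicitly flagged as unverified. Everything else is a violation.
    """
    return [
        c
        for c in claims
        if c.tool_call_id not in executed_call_ids and not c.flagged_unverified
    ]
```

Under these assumptions, a claim citing a call that never ran is treated the same as a claim citing nothing, so a fabricated citation does not count as grounding.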

What it changed

The agent became more cautious, more honest, and considerably less impressive in the surface-level demo. It also became actually useful, because the things it said were true.

A user could ask "how did Gonzaga do last week?" and the agent would call getRecentGames("Gonzaga", 7), receive structured data, and answer with the actual results. A user could ask "who would win between Houston and Duke?" and the agent would call predictMatchup and return the model's probability with the contributing factors from getFeatureBreakdown. Every fact was grounded.

When a user asked something the tools couldn't answer ("what's Coach Calipari's career record against Kentucky?"), the agent would say so, instead of confabulating a number. This is the part most users found surprising.
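The two behaviors described above, answering from tool output and saying so when no tool applies, can be sketched as a single dispatch step. This is an illustration under stated assumptions: the shape of the `call` dictionary, the `run_tool_call` function, the wording of the fallback message, and the stub tool with its placeholder row are all hypothetical.

```python
from typing import Any, Callable, Dict, List


def stub_recent_games(team: str, limit: int) -> List[Dict[str, Any]]:
    """Stand-in for a database-backed tool; returns placeholder data only."""
    rows = [{"team": team, "opponent": "Team B", "team_score": 70, "opp_score": 65}]
    return rows[:limit]


# Hypothetical registry of the tools the agent is allowed to call.
TOOLS: Dict[str, Callable[..., Any]] = {"getRecentGames": stub_recent_games}


def run_tool_call(call: Dict[str, Any]) -> Dict[str, Any]:
    """Execute one requested tool call, or report that nothing can verify it."""
    tool = TOOLS.get(call.get("name"))
    if tool is None:
        # Outside the tool boundary: return an explicit admission, not a guess.
        return {
            "grounded": False,
            "message": "I can't verify that with the tools I have.",
        }
    data = tool(**call.get("arguments", {}))
    return {"grounded": True, "tool": call["name"], "data": data}
```

A question like the career-record one has no matching entry in the registry, so the only thing the dispatcher can hand back is the admission. The structured `data` field is the sole source of facts for a grounded answer.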

Where this approach breaks down

The grounding pattern works well for narrow domains where most useful queries can be reduced to a small set of database operations. Basketball stats fit this perfectly. Conversations about strategy, narrative, or context ("tell me about the rivalry between these two programs") fall outside the tool boundary, and the agent has to either decline or rely on its own knowledge with an explicit caveat.

I haven't fully solved that boundary. For now, the agent is honest about it: anything that isn't backed by a tool gets a hedge. It makes the agent less smooth, but it makes it trustworthy.

What the 2026 tournament showed

The 2026 NCAA tournament was the agent's first live test. The goal, for this season and for the long arc of the project, is the same: gather the results, look at where the agent and model did well or didn't, and feed that back into next year's features and tools.

Bracket (picker)   Correct   Accuracy
Men (agent)        46/63     73.0%
Men (model)        43/63     68.3%
Women (agent)      49/63     77.8%
Women (model)      48/63     76.2%
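The percentages in the table follow directly from the correct-pick counts. A short check, using only the numbers reported above (the variable and function names are mine):

```python
# Correct picks out of 63 games, as reported in the table.
RESULTS = {
    "Men (agent)": (46, 63),
    "Men (model)": (43, 63),
    "Women (agent)": (49, 63),
    "Women (model)": (48, 63),
}


def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


ACCURACY = {name: accuracy_pct(c, t) for name, (c, t) in RESULTS.items()}

# Agent-minus-model gap, measured in games rather than percentage points.
GAP_GAMES = {
    "Men": RESULTS["Men (agent)"][0] - RESULTS["Men (model)"][0],
    "Women": RESULTS["Women (agent)"][0] - RESULTS["Women (model)"][0],
}

for name, pct in ACCURACY.items():
    print(f"{name}: {pct}%")
```

In game terms, the agent finished three picks ahead of the raw model in the men's bracket and one pick ahead in the women's.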

Headline accuracy falls between roughly 68% and 78%, which is in the neighborhood of where you'd expect a calibrated, regularization-friendly ensemble to land for a 63-game tournament with this many close matchups. The point isn't the headline number. The point is the postmortem: which seeds did the model overrate? Which features under-weighted late-season form? Where did the grounding rule force the agent to hedge in ways that turned out to be correct? Those are the questions that drive what gets re-engineered between this March and the next.

A word on why the agent and the model disagree. The model contributes one input: a win probability for a matchup. The agent surfaces that probability faithfully, because the grounding rule forbids it from inventing or distorting the number. But the agent's final pick isn't a passthrough of the model. It reasons over the model's probability alongside the other grounded tool results (recent form, rankings, strength-of-schedule context) and lands its own call. That extra reasoning is why the agent's record edges out the raw model's here: not because it embellishes, but because it weighs more of the evidence the tools put in front of it.

So the grounding rule didn't make the agent a better forecaster by itself. It made the agent honest: every claim tied to a tool, no invented statistics. And an honest agent that reasons over real numbers turns out to beat one that's free to make things up.

I'll write more about this soon: specifically about how the seven tools were scoped, where the grounding rule needs nuance for queries that synthesize across multiple tools, and how I'd extend the pattern for domains where the answer depends on context the database doesn't capture.