Anti-hallucination through tool grounding

A small experiment in forcing every claim from an LLM agent through a deterministic database query.

When I built the conversational agent for Ubunifu Madness, my NCAA prediction platform, I had a specific problem: the model knew enough about basketball to fabricate plausible-sounding facts. It would confidently report scores that never happened. It would invent player statistics. None of this was malicious. It was the model doing what models do, which is generate plausible continuations of text.

The solution I landed on was structural more than prompted: give the agent tools backed by real data, and require every claim to come from them.

The structural pattern

The agent had access to seven custom tools, each backed by a deterministic database query:

lookup_team: a team's Elo, record, conference, seed, Four Factors, and advanced stats
get_matchup_prediction: the ensemble's win probability for a matchup, with a confidence flag and a plain-language explanation
get_conference_info: conference strength metrics and the teams in a conference
get_top_teams: the top teams by Elo, optionally filtered to one conference
get_todays_scores: live scores and results for a given date
get_upset_candidates: games where the lower-seeded team has a real chance of winning
build_bracket: all sixty-three tournament games under a chalk, balanced, or chaos strategy
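To make "backed by a deterministic database query" concrete, here is a minimal sketch of two of the tools over an in-memory SQLite table. The schema, the placeholder team rows, the function signatures, and the call_tool dispatcher are all assumptions for illustration, not the platform's actual code.

```python
import sqlite3

def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway database. Schema and rows are placeholders, not real data."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.execute(
        "CREATE TABLE teams (name TEXT PRIMARY KEY, conference TEXT, elo REAL)"
    )
    conn.executemany(
        "INSERT INTO teams VALUES (?, ?, ?)",
        [
            ("Team A", "Conf X", 1900.0),
            ("Team B", "Conf X", 1850.0),
            ("Team C", "Conf Y", 1875.0),
        ],
    )
    return conn

def lookup_team(conn, name):
    """Return one team's row as a dict, or None if the team is not in the table."""
    row = conn.execute(
        "SELECT name, conference, elo FROM teams WHERE name = ?", (name,)
    ).fetchone()
    return dict(row) if row else None

def get_top_teams(conn, limit=10, conference=None):
    """Return teams ordered by Elo, optionally filtered to one conference."""
    sql = "SELECT name, conference, elo FROM teams"
    params = []
    if conference is not None:
        sql += " WHERE conference = ?"
        params.append(conference)
    # The name tiebreak keeps the ordering deterministic when Elo values are equal.
    sql += " ORDER BY elo DESC, name ASC LIMIT ?"
    params.append(limit)
    return [dict(r) for r in conn.execute(sql, params).fetchall()]

# Hypothetical registry: the model may only request tools listed here.
TOOLS = {"lookup_team": lookup_team, "get_top_teams": get_top_teams}

def call_tool(conn, tool_name, **kwargs):
    """Dispatch a tool request by name; unknown names fail loudly."""
    if tool_name not in TOOLS:
        raise KeyError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](conn, **kwargs)
```

The property that matters is that the same arguments always return the same rows, and a missing team returns None rather than a guess.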

The system prompt's first rule was blunt: only state facts that come directly from your tool results. If a tool did not return a specific piece of data, do not guess or infer it.

That's the grounding rule. Phrased that way it sounds like a content policy. In practice it shapes the agent's behavior more than any other instruction.
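The rule itself lives in the system prompt, but it can also be spot-checked after the fact. The sketch below is a hypothetical audit, not something the agent is described as running: it flags any number in a reply that does not appear verbatim in the tool results. The function names and the result shape are assumptions.

```python
import json
import re

# Matches integers and simple decimals such as "63" or "0.62".
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def numbers_in(text: str) -> set[str]:
    """Collect every numeric token that appears in a piece of text."""
    return set(NUMBER.findall(text))

def ungrounded_numbers(reply: str, tool_results: list[dict]) -> set[str]:
    """Return numbers in the reply that no tool result contains verbatim."""
    allowed: set[str] = set()
    for result in tool_results:
        # Serializing to JSON gives one flat string to scan for numbers.
        allowed |= numbers_in(json.dumps(result))
    return numbers_in(reply) - allowed

# Placeholder tool output, not real data.
results = [{"team_a": "Team A", "team_b": "Team B", "win_probability": 0.62}]
print(ungrounded_numbers("Team A is favored at 0.62.", results))
print(ungrounded_numbers("Team A won 81-70 last week.", results))
```

A check this literal is crude: it would flag "62%" as ungrounded even though it restates 0.62, and it says nothing about non-numeric claims. It is only meant to show how mechanically testable the rule is for statistics.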

What it changed

The agent became more cautious, more honest, and considerably less impressive in the surface-level demo. It also became actually useful, because the things it said were true.

A user could ask "who would win between Houston and Duke?" and the agent would call get_matchup_prediction, then state the model's probability along with the factors behind it, since both come back from that one call. Ask it for the day's upset picks and it calls get_upset_candidates; ask how strong a conference is and it calls get_conference_info. Every fact was grounded in a tool result.

When a user asked something the tools couldn't answer ("what's Coach Calipari's career record against Kentucky?"), the agent would say so, instead of confabulating a number.

Where this approach breaks down

The grounding pattern works well for narrow domains where most useful queries can be reduced to a small set of database operations. Basketball stats fit this perfectly. Conversations about strategy, narrative, or context ("tell me about the rivalry between these two programs") fall outside the tool boundary, and the agent has to either decline or rely on its own knowledge with an explicit caveat.

I haven't fully solved that boundary. For now, the agent is honest about it: anything that isn't backed by a tool gets a hedge. It makes the agent less smooth, but it makes it trustworthy.
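One way to picture that hedge is to carry each claim's provenance alongside its text and let the rendering step add the caveat. This is a sketch under assumptions: the Claim type, the render function, and the caveat wording are hypothetical, and the agent as described does this through its prompt rather than a data structure.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    text: str
    source_tool: Optional[str] = None  # None means no tool backs this claim

def render(claims: list[Claim]) -> str:
    """Render claims, prefixing a caveat to anything without a tool source."""
    lines = []
    for claim in claims:
        if claim.source_tool is not None:
            lines.append(f"{claim.text} (source: {claim.source_tool})")
        else:
            lines.append(f"Not from my data, so treat with caution: {claim.text}")
    return "\n".join(lines)

# Placeholder claims, not real data.
print(render([
    Claim("Team A is favored at 0.62.", source_tool="get_matchup_prediction"),
    Claim("These two programs have a long rivalry."),
]))
```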

What the 2026 tournament showed

The 2026 NCAA tournament was the agent's first live test. The goal, for this season and for the long arc of the project, is the same: gather the performances, look at where the agent and model did well or didn't, and feed that back into next year's features and tools.

Bracket	Correct	Accuracy
Men (agent)	46/63	73.0%
Men (model)	43/63	68.3%
Women (agent)	49/63	77.8%
Women (model)	48/63	76.2%

Headline accuracy runs from about 68 to 78 percent, a little below the model's roughly 80 percent on its 2023-to-2026 holdout. That is about what you'd expect from single-elimination games stacked with close matchups. The point isn't the headline number. The point is the postmortem: which seeds did the model overrate? Which features under-weighted late-season form? Where did the grounding rule force the agent to hedge in ways that turned out to be correct? Those are the questions that drive what gets re-engineered between this March and the next.

A word on why the agent and the model disagree. The model produces one input: a win probability for a matchup. The agent surfaces that probability faithfully, because the grounding rule forbids it from inventing or distorting the number. But the agent's final pick isn't a passthrough of the model. It reasons over the model's probability alongside the other grounded tool results (recent form, rankings, strength-of-schedule context) and lands its own call. That extra reasoning is why the two records differ: the agent came out a touch ahead of the raw model here, by three games on the men's side and one on the women's. At sixty-three games each, a gap that small is well inside the noise; I read it as the agent doing no worse while weighing more of the evidence, not as proof it forecasts better.
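A rough way to see why a three-game gap sits inside the noise is to compare it with the binomial standard error at sixty-three games. The counts below come from the table above; treating each bracket as sixty-three independent picks is a simplifying assumption, and because the agent and the model picked the same games, a paired comparison would be the more careful test.

```python
import math

def accuracy(correct: int, total: int) -> float:
    return correct / total

def standard_error(correct: int, total: int) -> float:
    """Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n)."""
    p = correct / total
    return math.sqrt(p * (1 - p) / total)

# (correct, total) pairs taken from the results table.
results = {
    "Men (agent)": (46, 63),
    "Men (model)": (43, 63),
    "Women (agent)": (49, 63),
    "Women (model)": (48, 63),
}

for label, (correct, total) in results.items():
    acc = accuracy(correct, total)
    se = standard_error(correct, total)
    print(f"{label}: {acc:.1%} accuracy, standard error about {se:.1%}")

# Agent-minus-model gap for each bracket, as a fraction of games.
men_gap = accuracy(46, 63) - accuracy(43, 63)
women_gap = accuracy(49, 63) - accuracy(48, 63)
print(f"Men gap: {men_gap:.1%}, women gap: {women_gap:.1%}")
```

Each accuracy carries a standard error of roughly five to six percentage points, so both gaps are smaller than the uncertainty on a single bracket's accuracy.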

So the grounding rule didn't make the agent a better forecaster. It made it honest: every claim tied to a tool, no invented statistics. On accuracy it came out even with the raw model, inside the margin you'd expect at this sample size. What grounding bought was trust, not points.

I'll write more about this soon: specifically about how the seven tools were scoped, where the grounding rule needs nuance for queries that synthesize across multiple tools, and how I'd extend the pattern for domains where the answer depends on context the database doesn't capture.