Building a procurement AI agent under governance constraints

A case study from four months building a production AI system inside a Microsoft-native enterprise.

I spent the summer of 2025 building a production AI agent for the procurement department at SSA Marine. Sixteen weeks, one architectural pivot in the third week, and at the end of it: a coordinator agent that answers procurement questions for stakeholders across the company, and when it can't, drafts an internal email to the procurement team on the user's behalf and holds it until the user has explicitly approved what's about to be sent.

This case study is the long version of that story. Not the version polished for a résumé bullet, but the version where I describe the decisions I made, the ones I wish I'd made differently, and the architectural patterns I'd carry into any production AI system I built today.

I'm writing it for engineers who are about to build something similar and want a concrete account from someone who recently did.

What the procurement team actually needed

Before any of the technical decisions, the work had a shape. SSA Marine's procurement department was the chokepoint for a familiar pattern of inefficiency. People across the company kept asking the same questions ("what's the threshold for vendor approval on a purchase over $50K?", "do we need a competitive bid for this category?", "how do I expense a software subscription?"), and the answers lived in long policy documents scattered across SharePoint. Most of those questions could be answered from the docs if you knew where to look. The procurement team got stuck answering them anyway, because most stakeholders didn't.

The other half of the workflow was the escalation path: what was supposed to happen when the answer wasn't in the docs. The existing process: stakeholders sent a message to the procurement inbox or pinged someone in Teams, those messages piled up, and questions sat unanswered for days because nobody had a fast, structured way to surface what was being asked, by whom, and with what context. The friction wasn't malicious. It was just unmediated.

The shape of the project, as I understood it walking in: build a system that can answer the policy questions when the answers exist, and escalate cleanly to the procurement team when they don't, drafting the escalation email on the stakeholder's behalf, with the question and the context the agent had already gathered, and routing it through human approval before anything actually went to procurement. Both halves had to live inside the company's existing infrastructure (Microsoft, SharePoint, Teams, Entra ID for identity).

The non-negotiable constraint, which I underestimated: emails could not be dispatched autonomously. Every escalation email had to be reviewed and explicitly approved by the user (the stakeholder asking the question) before it landed in a procurement inbox. This wasn't a nice-to-have, and it wasn't something a thoughtful prompt could solve. It was a governance line the agent could never cross.

Why the build started in Microsoft Copilot Studio

The case for Copilot Studio was real, and most engineers in this position would have made the same call.

The platform offered native Teams integration, which mattered because the company lived in Teams. It had Entra ID SSO essentially for free. It indexed SharePoint documents in minutes. The RAG functionality worked beautifully out of the box: answers were grounded, retrieval was reasonable, and the system felt usable from the second week.

I did not start in Copilot Studio because I was naive about its limitations. I started there because, given a four-month timeline, the platform offered the fastest path to a working RAG-based system that integrated with the company's actual environment. Building all of that custom (the SSO integration, the Teams interface, the document indexing) would have eaten the bulk of the internship. Copilot Studio gave me those for free, a sound engineering trade given the constraints.

By the end of week two, the policy Q&A flow was working. I had stakeholders interacting with a real system. The trajectory looked good.

The wall

Then I tried to build the escalation flow.

The requirement, restated: when the RAG layer couldn't ground an answer, the agent had to draft an escalation email to the procurement team on the stakeholder's behalf, carrying forward the original question and whatever context the agent had already gathered. The stakeholder could then approve the draft (which dispatched it to procurement), edit it (which required the agent to incorporate the changes and re-present), or revise it via further conversation (which required the agent to maintain the context of the original draft alongside the user's feedback).

The conversational state required to support this is not exotic. Any web app with a form, a preview step, and a "go back and edit" button handles it routinely. But Copilot Studio's approval flows were forward-only, built around "approve, reject, escalate" rather than "edit, regenerate, edit again." There was no first-class loop primitive for the iterative draft-and-revise pattern, and the platform's branching logic wasn't designed to express the kind of multi-turn back-and-forth where the same conceptual artifact (a draft email) gets refined across several conversational exchanges.
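To make "not exotic" concrete, here is a minimal sketch of the state the draft-and-revise loop needs. It is plain Python, not code from the project, and the names DraftSession and DraftStatus are hypothetical. The only real rule is that any revision sends the draft back to review, so an approval never outlives the text it was given for.

```python
from dataclasses import dataclass, field
from enum import Enum


class DraftStatus(Enum):
    PENDING_REVIEW = "pending_review"
    APPROVED = "approved"


@dataclass
class DraftSession:
    """Holds one escalation draft across several conversational turns."""
    question: str
    body: str
    status: DraftStatus = DraftStatus.PENDING_REVIEW
    history: list[str] = field(default_factory=list)  # earlier versions of the body

    def revise(self, new_body: str) -> None:
        # Any edit invalidates a prior approval and loops back to review.
        self.history.append(self.body)
        self.body = new_body
        self.status = DraftStatus.PENDING_REVIEW

    def approve(self) -> None:
        self.status = DraftStatus.APPROVED
```

The loop is the part that matters: revise can be called any number of times, before or after approve, and the session always knows which version is current and whether that exact version has been signed off.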

I spent two days trying to make it work. The closest I got involved encoding draft state in the conversation history as serialized text and re-parsing it on every turn. This worked in the way a dam built out of sandbags works: it would hold up under demo conditions and fail silently the first time something unexpected happened in production. I knew that the moment I wrote it, and the question I had to answer was whether I was willing to ship it anyway.

I wasn't.

The decision

The decision, framed as a choice: I could descope the email approval workflow to something simpler (a one-shot draft with no revision capability) and ship the original architecture on schedule. Or I could abandon Copilot Studio and rebuild the system from scratch in a custom architecture, accepting that I'd lose the two-plus weeks I'd already invested and the free integrations the platform provided.

I chose the rebuild.

A few things factored into that decision. The HITL workflow wasn't an extra feature; it was the load-bearing requirement that made the system trustworthy enough to deploy. Descoping it would have meant shipping an agent that could draft an escalation email but couldn't really collaborate on it, which, from a stakeholder's perspective, would have been only marginally better than writing the email from scratch.

I also believed I could rebuild. Not blindly. I'd worked enough with FastAPI, Python, and LangGraph in personal and academic contexts to have a sense of the lift, and I'd designed enough enterprise architecture at LTIMindtree to know what the integration patterns looked like. The rebuild was hard, but it was hard in a way I could see from the start.

The thing I'd say to anyone facing a similar decision: be honest about what you're losing in the rebuild, and don't pretend the original choice was wrong. I wasn't wrong to start in Copilot Studio. The information I had at the time made it the right call. What I learned during the prototype was that the platform couldn't support the project's structurally hardest requirement, and that was new information that justified a new decision.

The architecture I built instead

I rebuilt the system as a coordinator agentic architecture on Azure, with LangGraph as the orchestration layer. The shape:

A top-level coordinator agent inspected each incoming question and routed it to one of two specialized sub-agents based on intent and context. A RAG sub-agent handled the common case (policy questions whose answers existed somewhere in SharePoint) with retrieval against an Azure AI Search index of procurement documents. When the RAG sub-agent couldn't ground an answer (low retrieval confidence, irrelevant chunks, or the user explicitly asking for human help), the coordinator handed the conversation off to an email sub-agent, which drafted an escalation email to the procurement team and managed the multi-turn approval state I couldn't express in Copilot Studio. The coordinator stayed responsible for the conversation as a whole, preserving context across the handoff so the email sub-agent didn't have to re-elicit the original question.
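The handoff rule described above can be sketched in a few lines. This is an illustration in plain Python rather than the LangGraph graph itself. The names RetrievalResult, next_agent, CONFIDENCE_FLOOR, and HELP_PHRASES are all hypothetical, and the 0.5 threshold and the phrase list are placeholder values, not figures from the project.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetrievalResult:
    answer: Optional[str]  # None when nothing could be grounded
    confidence: float      # assumed to be normalised to the range 0..1


CONFIDENCE_FLOOR = 0.5  # illustrative threshold only
HELP_PHRASES = ("talk to a person", "contact procurement", "human")


def next_agent(question: str, retrieval: RetrievalResult) -> str:
    """Return the name of the sub-agent that should handle this turn."""
    wants_human = any(phrase in question.lower() for phrase in HELP_PHRASES)
    ungrounded = retrieval.answer is None or retrieval.confidence < CONFIDENCE_FLOOR
    if wants_human or ungrounded:
        return "email_agent"
    return "rag_agent"
```

In a real graph this decision would sit on a conditional edge out of the RAG node, with the original question kept in shared state so the email sub-agent starts with full context.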

The RAG pipeline used semantic ranking, hybrid search (combining vector similarity with keyword matching), and a chunking strategy that respected document hierarchy: splitting on section boundaries rather than fixed token windows, preserving tables as atomic units, and including parent-section context in chunk metadata. None of these choices were exotic. All of them mattered for retrieval quality on documents whose structure carried meaning.

The backend was FastAPI. The frontend was a React/TypeScript Teams tab application, the part of the rebuild where I lost the most relative to Copilot Studio, since I now had to host and maintain the Teams app myself instead of getting it as a platform feature.

The most important architectural choice in the system was in email dispatch: emails were not sent by the agent. They were sent by Azure Logic Apps, which the agent could trigger via tool call only after the user had explicitly approved a finalized draft. The Logic App was the dispatcher. The agent could request dispatch but could not perform it. This distinction matters more than it sounds, and I'll come back to it.

The infrastructure layer connected Azure App Service for the FastAPI backend, Azure API Management for the public surface, Entra ID for authentication, and an admin dashboard surfacing health checks, metrics, and LangGraph node traces. I wrote the documentation as I went (an indexing guide, a prompt library, architecture diagrams) partly because I knew I was going to hand the system off in three months, and partly because writing the documentation forced me to clarify my own thinking about decisions I might otherwise have made on intuition.

The rebuild was where the summer actually went: something like ten weeks coding the coordinator and its sub-agents in LangGraph and delivering the whole thing as a Microsoft Teams app, so the procurement team could use it where they already worked. It shipped at the end of the internship, with about a week of buffer for stakeholder testing.

The reliability patterns worth carrying forward

I would carry three patterns from this system into any production AI agent.

The first is architectural HITL enforcement. The agent could not send email. Not because the prompt told it not to, but because the agent did not have access to an email-sending tool. The only tool available was a "request dispatch" tool that called Azure Logic Apps with a finalized draft and the user's explicit approval flag. If a clever prompt injection somehow convinced the model to "send" an email, the architecture would still not let it. There was no surface where the LLM could circumvent the human approval step, because the LLM was never the one performing dispatch.

This pattern generalizes. Any time an AI system has access to a high-stakes action (sending email, executing a transaction, updating a record), the question to ask is: is the constraint enforced by the prompt, or by the architecture? If the answer is "the prompt," the constraint is theatrical. The model will not always do what the prompt says, and you cannot rely on prompt-based constraints for governance-critical behavior. Architecture is the only line that holds.
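Here is a minimal sketch of what "enforced by the architecture" can look like. The Dispatcher class stands in for the Logic App, and every name in it is hypothetical. One detail goes beyond what is described above: the approval is bound to a hash of the exact draft the user saw. That is an assumption added for illustration, not a claim about the original system.

```python
import hashlib
from dataclasses import dataclass


def draft_digest(subject: str, body: str) -> str:
    return hashlib.sha256(f"{subject}\n{body}".encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Approval:
    user_id: str
    digest: str  # digest of the exact draft the user saw and approved


class DispatchRefused(Exception):
    pass


class Dispatcher:
    """Stands in for the Logic App: the only component that can send."""

    def __init__(self) -> None:
        self._approvals: dict[str, Approval] = {}
        self.sent: list[str] = []

    def record_approval(self, draft_id: str, approval: Approval) -> None:
        # Called from the approve button in the UI, never from the model.
        self._approvals[draft_id] = approval

    def request_dispatch(self, draft_id: str, subject: str, body: str) -> None:
        # This is the single tool exposed to the agent.
        approval = self._approvals.get(draft_id)
        if approval is None or approval.digest != draft_digest(subject, body):
            raise DispatchRefused("no matching user approval for this draft")
        self.sent.append(draft_id)
```

Whatever the model is talked into, the worst it can do is call request_dispatch, and that call fails unless a human approval for that exact text already exists.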

The second pattern is Pydantic schemas as tool-call gates. Every time the agent attempted to invoke a downstream tool (request a Logic App dispatch, query Azure AI Search with a constructed filter, call any internal service), the parameters were validated against a Pydantic schema before the call was made. If the LLM produced parameters that didn't conform to the schema (wrong types, missing fields, malformed values), the call was rejected before it left the agent.

The validation went beyond type checking. I wrote domain-specific validators on top of the Pydantic types: recipient addresses had to match an approved internal directory, certain procurement-specific fields had to satisfy business-logic constraints, draft email subjects had to match length and content rules. The principle: every boundary where the LLM produces structured output is a potential failure point. The validation layer at that boundary is what turns a probabilistic system into a system you can deploy in front of real users.
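The sketch below shows the shape of such a gate. To keep it dependency-free it uses a standard-library dataclass as a stand-in for a Pydantic model; in Pydantic the same checks would live in field validators. The directory contents, the 120-character subject limit, and all the names are placeholders, not values from the project.

```python
from dataclasses import dataclass

APPROVED_RECIPIENTS = {"procurement@example.com"}  # placeholder directory
MAX_SUBJECT_LEN = 120                               # illustrative limit


class ToolCallRejected(ValueError):
    pass


@dataclass(frozen=True)
class DispatchRequest:
    recipient: str
    subject: str
    body: str
    user_approved: bool

    def __post_init__(self) -> None:
        fields = (self.recipient, self.subject, self.body)
        if not all(isinstance(value, str) for value in fields):
            raise ToolCallRejected("recipient, subject and body must be strings")
        if self.recipient.lower() not in APPROVED_RECIPIENTS:
            raise ToolCallRejected("recipient is not in the approved directory")
        if not self.subject.strip() or len(self.subject) > MAX_SUBJECT_LEN:
            raise ToolCallRejected("subject is empty or too long")
        if self.user_approved is not True:
            raise ToolCallRejected("user approval flag is missing")


def gate(raw_args: dict) -> DispatchRequest:
    """Validate model-produced arguments before any tool is invoked."""
    try:
        return DispatchRequest(**raw_args)
    except TypeError as exc:  # missing or unexpected fields
        raise ToolCallRejected(str(exc)) from exc
```

The tool implementation only ever receives a DispatchRequest, so an argument set that fails validation never reaches the network.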

The third pattern is redacted audit logging at every state transition. Every action the agent took (every retrieval, every tool call, every approval decision) was logged with full context, with sensitive fields (personal data, internal financial figures) redacted at the logging layer. This served both governance (procurement leadership wanted full traceability of agent decisions) and debugging (when something went wrong, we could reconstruct exactly what the agent had done). The pattern: treat logging as a first-class architectural concern, not a thing you bolt on after the fact.
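One way to keep redaction at the logging layer is to route every audit event through a single function that masks sensitive values before anything is written. The sketch below is an illustration only: the key names in SENSITIVE_KEYS and the email pattern are assumptions, and a real deployment would need a redaction list agreed with whoever owns the data.

```python
import json
import logging
import re

SENSITIVE_KEYS = {"email", "user_name", "amount"}  # hypothetical field names
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(value):
    """Recursively mask sensitive keys and any email-like strings."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("[REDACTED_EMAIL]", value)
    return value


def audit(logger: logging.Logger, event: str, context: dict) -> str:
    """Emit one JSON audit line; redaction happens here, not at call sites."""
    line = json.dumps({"event": event, "context": redact(context)}, sort_keys=True)
    logger.info(line)
    return line
```

Because call sites pass raw context and never redact by hand, a new state transition cannot accidentally log an unredacted field that is already on the list.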

The thread tying all three patterns together: the difficult part of production AI is not the model. It is the discipline of treating every boundary between the model and the rest of the system as a place to enforce correctness explicitly. Capability is plentiful. Discipline is scarce.

What I'd do differently

Several gaps are worth naming plainly.

I didn't build a systematic evaluation framework. I tested RAG quality manually, running curated question-answer pairs through the system and iterating on chunking, retrieval, and prompts based on the results. This worked well enough to ship, but it didn't give me automated metrics, and it didn't give stakeholders the kind of quantitative confidence that an LLM-as-judge eval pipeline would have. If I built this system again, I would set up evaluation from day one: a curated eval set, automated retrieval metrics (precision, recall, MRR), and answer-quality scoring. I would treat the eval framework as part of the deliverable, not as a follow-up.
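For reference, the retrieval metrics named above are small enough to write by hand. This sketch assumes each eval item is a ranked list of retrieved document IDs plus the set of IDs a human marked as relevant; the function names are my own.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(doc in relevant for doc in retrieved[:k]) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over a set of eval queries."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

Run over a curated question set on every change to chunking or retrieval settings, these numbers turn "it seems better" into something a stakeholder can read off a chart.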

I hardcoded the prompts. They lived in source code. Changing them required a code deployment, which meant the iteration loop on prompts was slower than it should have been. The right pattern is a versioned prompt configuration layer where prompts can be edited, deployed, and rolled back independently of code, and where new versions can be A/B tested against old ones. I knew this at the time and didn't build it because I was trading off scope; in retrospect, I should have cut scope elsewhere.
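As a sketch of the interface I have in mind, here is an in-memory stand-in for such a layer. PromptStore and its methods are hypothetical, and a real version would persist to a database or configuration service rather than a dict. The point is that publishing and rolling back a prompt are data operations, not deployments.

```python
from dataclasses import dataclass, field


@dataclass
class PromptStore:
    """In-memory stand-in for a versioned prompt configuration layer."""
    _versions: dict[str, list[str]] = field(default_factory=dict)
    _active: dict[str, int] = field(default_factory=dict)

    def publish(self, name: str, text: str) -> int:
        """Store a new version, make it active, and return its number."""
        versions = self._versions.setdefault(name, [])
        versions.append(text)
        self._active[name] = len(versions)  # versions are 1-indexed
        return self._active[name]

    def rollback(self, name: str, version: int) -> None:
        """Point the active version back at an earlier one."""
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise KeyError(f"{name} has no version {version}")
        self._active[name] = version

    def get(self, name: str) -> str:
        """Return the text of the currently active version."""
        return self._versions[name][self._active[name] - 1]
```

A/B testing would extend get to choose between two versions per request and record which one was served.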

I didn't build streaming responses for the chat interface. Users waited for full responses, which made the system feel slower than it needed to. This is a small but real polish gap. Streaming tokens over server-sent events is a well-understood pattern, and it would have made the system feel more responsive.
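The wire format itself is simple. The generator below is a hedged sketch of how model tokens could be framed as server-sent events; the function name and the closing "done" event with its [DONE] payload are conventions I am assuming, not part of the SSE format.

```python
from typing import Iterable, Iterator


def sse_frames(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap model tokens as Server-Sent Events frames."""
    for token in tokens:
        # Each line of a payload needs its own "data:" field, and a
        # blank line terminates the event.
        data = "\n".join(f"data: {part}" for part in token.split("\n"))
        yield f"{data}\n\n"
    # Assumed convention: a named final event so the client can stop listening.
    yield "event: done\ndata: [DONE]\n\n"
```

In a FastAPI backend, a generator like this would typically be returned through StreamingResponse with the media type text/event-stream, and the Teams tab would read it with EventSource or a streaming fetch.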

I didn't get to do real load testing. The user base was small enough during the internship that we could observe performance qualitatively, but I never had the chance to see how the system behaved under sustained concurrent load. If I were doing this again, I'd want a load test in the staging environment before declaring the system production-ready.

What this taught me

A few things I think about now that I wouldn't have articulated before this project.

The first is about prototyping order. I built the RAG functionality first because it was the visible feature. The HITL workflow was structurally harder, but I treated it as a follow-on. By the time I discovered the platform couldn't support it, I had already built the wrong foundation. The lesson, generalized: prototype the structurally hardest workflow first, not the easiest. The easy parts will work in any architecture you choose. The hard parts will fail in some architectures and not others, and you need to know which one you're in before you commit.

The second is about the relationship between AI capability and AI reliability. The capability layer of these systems is extraordinary and improves quickly. The reliability layer (the validation, the architectural constraints, the governance, the audit trails) is what determines whether you can actually deploy the capability in front of real users. Most of the engineering effort in a serious production AI project goes into the second layer, not the first. The model is the easy part now. The system around the model is the work.

The third is about how to make architectural decisions under uncertainty. I picked Copilot Studio with incomplete information. I changed my mind in week three when I had better information. I did not get the original decision wrong; I just got new evidence and updated. Treating that as a failure of judgment would have been wrong. Treating it as good engineering (which is what it was) is part of how I want to approach every architectural choice I make from now on.

I left SSA Marine in September 2025 with a system that worked, documentation that explained how to maintain it, and a much clearer sense of what production AI engineering actually requires. The handoff went to the team that owned procurement systems internally; I don't know its current operational status.

What I do know is that the patterns I built into it (architectural enforcement, validation gates, redacted logging, the refusal to let prompt-based constraints carry governance weight) are the patterns I'd build into the next system, and the one after that. That's the part of the work that travels.