AI/NLP Chatbot Platform on Azure OpenAI

Context & Problem

I was brought in to design and deliver an AI-powered support system for a client whose customer support team was handling around 1,400 tier-1 queries per day — password resets, account status checks, product eligibility questions, and documentation lookups. These queries were low-complexity but consumed a significant proportion of the team's capacity. First response times averaged just over two minutes during business hours; outside office hours, customers waited until the next morning.

The business had three existing knowledge sources that contained the answers to most of these queries:

A SharePoint documentation site (~4,200 pages of product guides and FAQs)
A Salesforce CRM with each customer's account history, open tickets, and product subscriptions
A legacy knowledge base built on classic ASP, maintained by the product team, containing internal operational procedures

The ask was a conversational interface that could answer from all three sources without a human in the loop for routine queries.

Constraints

EU data residency was non-negotiable. Customer data and conversation logs could not leave Azure EU regions. This ruled out several third-party chatbot SaaS platforms immediately.
No dedicated ML team. The client had strong .NET engineers but no ML expertise. Any solution requiring model training or fine-tuning would have needed external resource the project budget didn't allow.
Integration with Salesforce was required for context. A generic Q&A bot that couldn't see "this customer has an outstanding billing dispute" would produce inappropriate responses. CRM data had to be part of every conversation.
Cost had to scale predictably. Per-message pricing from SaaS vendors was a commercial risk at 1,400 queries/day. The cost model needed to be predictable and tied to infrastructure rather than volume.

Options Considered

Third-party chatbot SaaS (Drift, Intercom AI, Zendesk AI): Fast time to market, but data residency controls were opaque, per-message pricing was expensive at scale, and none offered the deep Salesforce + custom knowledge base integration the client needed. Rejected.

Custom fine-tuned NLP model: Full control, no per-token cost. But fine-tuning requires labelled training data (we had none), ML expertise (the team had none), and ongoing model maintenance. An 18-month build estimate to get to production quality made this impractical. Rejected.

Azure OpenAI with RAG (Retrieval-Augmented Generation): GPT-4o stays in the EU data boundary (Azure Switzerland North and Sweden Central). No model training required. Retrieval over the existing knowledge sources handles freshness without retraining. Cost is infrastructure-based. The pattern was proven but required careful prompt engineering and retrieval design. Selected.

Architecture Decision

The core pattern: before calling GPT-4o, retrieve relevant context from the three knowledge sources and include it in the system prompt. GPT-4o synthesises an answer from the retrieved context and the customer's account data. If the model's confidence (proxied by a structured output field) falls below a threshold, the conversation is routed to a human agent.

The system had four distinct pipelines:

Ingestion pipeline — ran nightly, chunked and embedded SharePoint pages and knowledge base articles, stored vectors in Azure AI Search.
Context assembly — at query time, retrieved the top-K relevant chunks from AI Search (semantic + keyword hybrid search) and fetched the customer's relevant CRM data from Salesforce via REST API.
Inference — constructed a structured system prompt containing: customer tier, open tickets, retrieved KB chunks, and conversation history. Called GPT-4o with structured output to get both an answer and a confidence signal.
Routing — answers above the confidence threshold were returned to the Angular widget. Below threshold, a Service Bus message triggered a Salesforce ticket creation with the full conversation context pre-filled.

Implementation Highlights

Semantic Kernel as the orchestration layer. Rather than hand-rolling the prompt assembly and tool-call logic, we used Semantic Kernel's plugin model. Each knowledge source became a Kernel plugin with explicit input/output contracts. This made unit testing the retrieval logic straightforward and made it easy to add a new plugin (the legacy knowledge base) without touching the inference pipeline.

Chunking strategy mattered more than we expected. Our initial 2,000-token chunks with 10% overlap produced poor retrieval — long FAQ articles had multiple topics per chunk and the retrieval was noisy. Moving to a semantic chunking approach (splitting on heading boundaries in the SharePoint HTML) with smaller ~400-token chunks nearly doubled retrieval precision. This was the single most impactful technical decision after the initial architecture.

Streaming via SSE. Waiting 3–5 seconds for a complete GPT-4o response produced a poor UX. We added Server-Sent Events between the ASP.NET Core API and the Angular widget so tokens streamed as they were generated. Perceived response time dropped from ~4 seconds to a near-instant first-token.

Structured output for confidence scoring. We configured GPT-4o to return a JSON object with the answer text and a confidence field ("high" / "medium" / "low"). The model was instructed to return "low" when the retrieved context didn't contain a clear answer. This was imperfect but significantly better than routing based on keyword heuristics. Roughly 18% of queries were routed to human agents — below the 25% target.

Azure Service Bus for audit and analytics. Every conversation — regardless of outcome — was published as a message to a Service Bus topic. A separate consumer wrote to a Log Analytics workspace. This gave the client a full audit trail for compliance and the product team a data source for identifying gaps in the knowledge base.

Outcome

Three months after go-live:

40% of tier-1 queries resolved without human handoff.
Average first-response time outside business hours: from next-morning to under 10 seconds.
Human agent quality improved. When a query was escalated, the Salesforce ticket was pre-filled with the conversation history, the customer's account context, and the KB articles that had been retrieved. Agents reported significantly less time spent on context-gathering.
Zero data residency violations. All data stayed within Azure EU boundaries throughout, which satisfied the compliance team's requirements.

What I'd Do Differently

Build an evaluation framework before writing production code. For the first six weeks, we were shipping prompt changes and chunking updates essentially blind — checking quality by manually reviewing 50 conversations after each change. A proper evaluation dataset of 200–300 labelled query/response pairs with automated scoring would have made this a data-driven process instead of a manual one. This is now the first thing I build on any AI project.

Involve the support team earlier in the prompt design. We designed the initial system prompt with the product team. The support team — who knew which query patterns were most common and most mishandled — weren't consulted until week three. Their input in the first week would have saved two full prompt redesign cycles.

Plan for knowledge base drift. The SharePoint site had 4,200 pages when we ingested it. By month three it had grown to 4,800, and some of the older pages were outdated but still being retrieved. We hadn't built a staleness signal into the ingestion pipeline. Adding a last_modified weight to the retrieval ranking and flagging pages not updated in 18 months became a significant post-launch backlog item.