Embedded AI Architecture

1. What this document is about

This document addresses the architecture and operation of AI capabilities embedded natively within multi-tenant SaaS platforms — not external AI consumption through third-party APIs or standalone AI services used by customers.

Embedded AI means the platform itself uses AI models to enhance, automate, or enable core product features. Examples: intelligent search within the platform, automated data classification, predictive analytics on tenant data, content generation for users, or AI-powered workflow automation.

Where this applies:

  • You're building AI features that operate on tenant data stored in your platform
  • AI capabilities are part of your product's value proposition
  • You own the model lifecycle, prompt orchestration, and output quality
  • You must enforce multi-tenant isolation at the AI layer
  • Compliance, auditability, and cost control are your responsibility

Where this does not apply:

  • Customers bring their own AI models or API keys
  • AI is a pass-through service with no platform-side data processing
  • Single-tenant or on-premise deployments where isolation is architectural, not logical

2. Why this matters in real systems

Embedded AI surfaces when product requirements demand capabilities that rule engines, queries, or workflows cannot provide: semantic understanding, generation, classification at scale, or adaptive behavior.

Typical pressures that force this architecture:

  • Scale pressure: A SaaS platform with 10,000 tenants needs to run AI operations across millions of records daily. Naive approaches (one API call per record) create cost explosions and quota exhaustion. You need batching, caching, model selection by workload, and token budgeting per tenant.

  • Isolation pressure: Tenant A's prompts, context, and outputs must never leak into Tenant B's results. Shared model infrastructure introduces risk: prompt injection across tenants, data bleed in vector indexes, cross-tenant inference via side channels. Multi-tenancy at the AI layer is harder than at the database layer.

  • Governance pressure: Regulators ask: "Where did this AI-generated decision come from? What data was used? Who approved it?" You need audit trails for every inference, input provenance tracking, model version pinning, and explainability hooks.

  • Cost pressure: Azure OpenAI charges per token. A single tenant running unconstrained queries can bankrupt the month's AI budget. You need quota enforcement, cost attribution per tenant, fallback to cheaper models, and rate limiting that doesn't break UX.

  • Latency pressure: Users expect sub-second responses. AI calls add 200ms-2s per operation. You need prompt optimization, streaming responses, edge caching of embeddings, and hybrid architectures where AI is invoked only when necessary.

What breaks when this is ignored:

  • Cross-tenant data leakage: A tenant's PII appears in another tenant's AI-generated content because vector indexes weren't properly partitioned.
  • Runaway costs: A single tenant with a malicious or buggy integration burns $50k in tokens over a weekend.
  • Compliance failure: An auditor asks for the exact model version and input data used to generate a specific output 6 months ago — you have no audit trail.
  • Model drift undetected: Output quality degrades silently over weeks because no one monitors success rate, hallucination frequency, or latency percentiles.
  • Vendor lock-in: Your entire AI architecture assumes Azure OpenAI's specific API contracts, making migration to another provider a 6-month rewrite.

Why simpler approaches stop working:

  • Stateless API calls: No session context, no cost tracking, no tenant isolation, no audit trail.
  • Shared prompts for all tenants: One tenant's injection attack poisons results for others; no per-tenant customization or compliance rules.
  • Synchronous, one-at-a-time inference: Latency compounds; batch-eligible workloads waste quota; no parallelization.
  • Manual model management: Updating model versions across hundreds of features becomes a deployment nightmare; rollback is impossible.

3. Core concept (mental model)

Think of embedded AI as a regulated pipeline with four critical boundaries:

[Tenant Request]
       ↓
[Isolation Layer] ← tenant_id, RBAC/ABAC, data residency check
       ↓
[Orchestration Layer] ← prompt assembly, context injection, model routing
       ↓
[Inference Layer] ← model invocation, quota enforcement, response streaming
       ↓
[Observability Layer] ← logging, cost attribution, quality metrics
       ↓
[Response] → tenant-scoped, auditable, rate-limited

Each boundary enforces a contract:

  1. Isolation Layer: Ensures tenant_id is immutable and propagated through every component. No request proceeds without proven tenant identity. This is where RBAC/ABAC rules determine if the user can invoke AI features and what data they can access.

  2. Orchestration Layer: Assembles prompts dynamically using tenant-specific templates, retrieves context from tenant-scoped data stores (RAG), applies compliance filters (PII masking), and selects the appropriate model based on workload type (cheap model for drafts, expensive model for final outputs).

  3. Inference Layer: Invokes the model with token budgets, timeout enforcement, and retry logic. Implements circuit breakers per tenant to prevent one tenant's failures from cascading. Handles streaming for UX, batching for background jobs.

  4. Observability Layer: Captures every inference: input tokens, output tokens, latency, model version, success/failure, cost. Exports to monitoring systems. Enables cost chargeback, quality dashboards, and incident forensics.

Mental model for multi-tenancy: Every AI operation is a scoped transaction. The tenant_id is the partition key for data, prompts, quotas, and audit logs. If you lose the tenant_id at any point, you have a security incident.

Mental model for cost control: Treat tokens like database queries — measure them, budget them, optimize them. Every prompt is a cost center. Every cache hit is a cost saving.

Mental model for compliance: AI outputs are artifacts that must be retained, versioned, and traceable to inputs. If you can't reproduce an output from 6 months ago, you can't defend it in an audit.


4. How it works (step-by-step)

Step 1 — Request initiation with tenant context

A user action triggers an AI feature (e.g., "Generate summary of this document").

What happens:

  • API gateway receives request with authentication token
  • Middleware extracts tenant_id and user_id from token
  • RBAC/ABAC check: Does this user have permission to invoke AI features?
  • Data residency check: Is this tenant restricted to specific Azure regions?

Why this exists: Without tenant context attached immediately, downstream services can't enforce isolation or quotas.

Invariant: tenant_id is immutable and propagated via correlation headers or context objects through every service call.
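
A minimal middleware sketch of this step, assuming ASP.NET Core with JWT authentication; the claim names ("tid", "sub") and the TenantContext record are illustrative, not a fixed contract:

using System.Security.Claims;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

public sealed record TenantContext(string TenantId, string UserId);

public class TenantContextMiddleware
{
    private readonly RequestDelegate _next;

    public TenantContextMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext context)
    {
        // Extract tenant identity from the validated JWT claims.
        var tenantId = context.User.FindFirst("tid")?.Value;
        var userId = context.User.FindFirst("sub")?.Value;

        // Invariant: no request proceeds without a proven tenant identity.
        if (string.IsNullOrEmpty(tenantId) || string.IsNullOrEmpty(userId))
        {
            context.Response.StatusCode = StatusCodes.Status403Forbidden;
            return;
        }

        // Immutable context object; downstream services read it, never rewrite it.
        context.Items["TenantContext"] = new TenantContext(tenantId, userId);

        await _next(context);
    }
}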


Step 2 — Quota and rate limit enforcement

Before processing begins, check tenant's AI quota.

What happens:

  • Query Redis cache: ai_quota:{tenant_id}:monthly_tokens_used
  • Compare against: ai_quota:{tenant_id}:monthly_tokens_limit
  • If exceeded, return 429 Too Many Requests with reset timestamp
  • Increment a provisional token estimate (actual deduction happens post-inference)

Why this exists: Prevents runaway costs. A tenant with unlimited access could consume $100k in tokens before anyone notices.

Assumption: Token estimation is pessimistic (uses max possible tokens for worst-case prompt). Actual usage is reconciled after inference.
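
A sketch of the reserve-then-reconcile pattern, assuming StackExchange.Redis; QuotaDecision and the key layout mirror the description above but are illustrative:

using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public sealed record QuotaDecision(bool IsAllowed, long RemainingTokens);

public class RedisQuotaService
{
    private readonly IDatabase _redis;

    public RedisQuotaService(IConnectionMultiplexer mux) => _redis = mux.GetDatabase();

    public async Task<QuotaDecision> CheckAndReserveAsync(string tenantId, long estimatedTokens)
    {
        var usedKey = $"ai_quota:{tenantId}:monthly_tokens_used";
        var limitKey = $"ai_quota:{tenantId}:monthly_tokens_limit";

        // Fail closed: a tenant without a configured limit gets no tokens.
        var limitValue = await _redis.StringGetAsync(limitKey);
        var limit = limitValue.HasValue ? (long)limitValue : 0;

        // Reserve pessimistically: add the worst-case estimate, then compare.
        var used = await _redis.StringIncrementAsync(usedKey, estimatedTokens);
        if (used > limit)
        {
            // Roll back the reservation; the caller returns 429 with a reset time.
            await _redis.StringDecrementAsync(usedKey, estimatedTokens);
            return new QuotaDecision(false, Math.Max(0, limit - (used - estimatedTokens)));
        }

        return new QuotaDecision(true, limit - used);
    }

    // After inference, replace the pessimistic estimate with actual usage.
    public Task ReconcileAsync(string tenantId, long estimatedTokens, long actualTokens) =>
        _redis.StringIncrementAsync(
            $"ai_quota:{tenantId}:monthly_tokens_used", actualTokens - estimatedTokens);
}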


Step 3 — Prompt orchestration and context retrieval

Assemble the prompt dynamically using tenant-specific data.

What happens:

  • Load prompt template from configuration: ai_prompts:{feature_name}:template
  • Retrieve tenant-specific context via RAG:
    • Query Azure AI Search with filter: tenant_id eq '{tenant_id}'
    • Retrieve top K embeddings from tenant-partitioned index
    • Fetch actual documents from Azure Blob Storage (tenant-scoped container)
  • Apply PII filters if LGPD/GDPR flags are set for the tenant (see the masking sketch after this step)
  • Inject system message with compliance rules: "Never include PII in output"
  • Assemble final prompt: system message + RAG context + user input

Why this exists: Generic prompts produce generic results. Context from the tenant's own data makes AI outputs relevant and accurate.

Invariant: RAG queries must include tenant_id filter. No cross-tenant data access.
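
A sketch of the PII masking applied before prompt assembly; the regex patterns are illustrative stand-ins (production filters combine regex with NER models and tenant-specific rules):

using System.Text.RegularExpressions;

public static class PiiMasker
{
    // Illustrative patterns only; tune per compliance regime and locale.
    private static readonly (Regex Pattern, string Label)[] Rules =
    {
        (new Regex(@"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"), "[CPF]"),   // Brazilian CPF (LGPD)
        (new Regex(@"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
        (new Regex(@"\b\(?\d{2,3}\)?[\s.-]?\d{4,5}[-\s]?\d{4}\b"), "[PHONE]"),
    };

    public static string Mask(string text)
    {
        foreach (var (pattern, label) in Rules)
            text = pattern.Replace(text, label);
        return text;
    }
}

// Applied only when the tenant's compliance flag is set:
//   var chunks = ragChunks.Select(c => tenant.PiiFilteringEnabled ? PiiMasker.Mask(c) : c);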


Step 4 — Model selection and routing

Choose the appropriate model based on workload characteristics.

What happens:

  • Check feature flag: ai_model_routing:{tenant_id}:{feature_name}
  • Route based on criteria:
    • Draft generation: gpt-4o-mini (cheap, fast)
    • Final production output: gpt-4o (expensive, high quality)
    • Batch processing: Asynchronous queue, use gpt-4o-mini with longer timeout
  • If tenant has custom model preference (enterprise tier), honor it

Why this exists: One model doesn't fit all workloads. Cost optimization requires routing low-stakes tasks to cheaper models.

Operational risk: Model deprecations. Azure OpenAI retires models with 6-month notice. You need a mapping layer that abstracts model names.
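
A sketch of such a mapping layer; the logical workload names and the in-memory dictionary are illustrative (production would load the mapping from a feature-flag service or database so it can change without a deploy):

using System;
using System.Collections.Generic;

public class ModelRouter
{
    // Logical workload IDs map to physical deployments; update the mapping,
    // not the code, when Azure deprecates or renames a model.
    private readonly IReadOnlyDictionary<string, string> _mappings =
        new Dictionary<string, string>
        {
            ["draft_generation"] = "gpt-4o-mini",
            ["final_output"] = "gpt-4o",
            ["batch_processing"] = "gpt-4o-mini",
        };

    public string Resolve(string tenantId, string workload)
    {
        // Per-tenant override first (enterprise tier), then the global default.
        if (_mappings.TryGetValue($"{tenantId}:{workload}", out var custom))
            return custom;
        if (_mappings.TryGetValue(workload, out var standard))
            return standard;
        throw new InvalidOperationException($"No model mapped for workload '{workload}'");
    }
}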


Step 5 — Inference with observability

Invoke the model and capture telemetry.

What happens:

  • Call Azure OpenAI API with:
    • model: selected model version
    • messages: assembled prompt
    • max_tokens: calculated from quota headroom
    • temperature, top_p: from feature configuration
    • user: set to tenant_id for Azure's abuse monitoring
  • Stream response chunks if real-time UX is needed
  • Capture OpenTelemetry span:
    • ai.model.name
    • ai.model.version
    • ai.input.tokens
    • ai.output.tokens
    • ai.response.latency_ms
    • ai.cost.usd
    • tenant_id, user_id, feature_name

Why this exists: Observability is not optional. Without telemetry, you can't debug failures, attribute costs, or detect drift.

Invariant: Every inference generates a structured log entry with cost and performance metrics.
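
A sketch of emitting these metrics with System.Diagnostics.Metrics (the TagList overloads require .NET 7+); the meter and instrument names follow the tags above but are not a fixed convention:

using System.Diagnostics;
using System.Diagnostics.Metrics;

public static class AiTelemetry
{
    private static readonly Meter Meter = new("SaaS.EmbeddedAI");

    private static readonly Counter<long> InputTokens =
        Meter.CreateCounter<long>("ai.input.tokens");
    private static readonly Counter<long> OutputTokens =
        Meter.CreateCounter<long>("ai.output.tokens");
    private static readonly Histogram<double> LatencyMs =
        Meter.CreateHistogram<double>("ai.response.latency_ms");

    public static void RecordInference(
        string tenantId, string feature, string model,
        long inputTokens, long outputTokens, double latencyMs)
    {
        // Tenant and feature dimensions enable per-tenant cost dashboards.
        var tags = new TagList
        {
            { "tenant_id", tenantId },
            { "feature_name", feature },
            { "ai.model.name", model },
        };

        InputTokens.Add(inputTokens, tags);
        OutputTokens.Add(outputTokens, tags);
        LatencyMs.Record(latencyMs, tags);
    }
}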


Step 6 — Response validation and post-processing

Ensure output meets safety and compliance requirements.

What happens:

  • Run output through content safety filter (Azure Content Safety API)
  • Check for PII leakage (regex patterns, NER models)
  • Validate output structure if schema is expected (JSON, XML)
  • If validation fails, log failure, redact output, return error

Why this exists: Models hallucinate, leak training data, or generate harmful content. Validation is defense-in-depth.

Trade-off: Adds 50-200ms latency. Acceptable for high-stakes features; skip it for low-risk drafts.
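
A sketch of the structural check for features that expect JSON; the fence-stripping heuristic is illustrative:

using System;
using System.Text.Json;

public static class OutputValidator
{
    public static bool TryParseJson(string modelOutput, out JsonDocument? parsed)
    {
        parsed = null;
        var cleaned = modelOutput.Trim();

        // Models often wrap JSON in markdown fences despite instructions; strip them.
        if (cleaned.StartsWith("```"))
        {
            var firstNewline = cleaned.IndexOf('\n');
            var lastFence = cleaned.LastIndexOf("```", StringComparison.Ordinal);
            if (firstNewline >= 0 && lastFence > firstNewline)
                cleaned = cleaned[(firstNewline + 1)..lastFence].Trim();
        }

        try
        {
            parsed = JsonDocument.Parse(cleaned);
            return true;
        }
        catch (JsonException)
        {
            // Caller logs the failure, redacts the raw output, and returns an error.
            return false;
        }
    }
}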


Step 7 — Audit logging and cost reconciliation

Persist the inference event for compliance and cost attribution.

What happens:

  • Write audit record to SQL (tenant-partitioned table):

    INSERT INTO ai_inference_log (
        tenant_id, user_id, feature_name, model_version,
        input_hash, output_hash, input_tokens, output_tokens,
        cost_usd, latency_ms, timestamp, correlation_id
    )
  • Update Redis quota counter: ai_quota:{tenant_id}:monthly_tokens_used += actual_tokens
  • Publish event to Azure Service Bus for analytics pipeline

Why this exists: Regulators require audit trails. Finance requires cost chargeback per tenant.

Retention policy: 2 years minimum for LGPD/GDPR compliance.


Step 8 — Return response to user

Stream or return the final output.

What happens:

  • If streaming: Send Server-Sent Events (SSE) to frontend
  • If synchronous: Return JSON payload
  • Include metadata: model_version, cost_tokens, cache_hit (if applicable)

Why this exists: Transparency builds trust. Users understand when AI is being used and can report issues.
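
A sketch of the SSE path, assuming ASP.NET Core minimal APIs; GetSummaryStreamAsync is a hypothetical streaming variant of the summarizer shown in section 5:

using System.Text.Json;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddScoped<AIDocumentSummarizer>(); // service from section 5
var app = builder.Build();

app.MapGet("/api/documents/{id}/summary/stream", async (
    string id, HttpContext http, AIDocumentSummarizer summarizer) =>
{
    http.Response.ContentType = "text/event-stream";
    http.Response.Headers["Cache-Control"] = "no-cache";

    // Hypothetical streaming variant yielding completion chunks as they arrive.
    await foreach (var chunk in summarizer.GetSummaryStreamAsync(id, http.RequestAborted))
    {
        await http.Response.WriteAsync($"data: {JsonSerializer.Serialize(chunk)}\n\n");
        await http.Response.Body.FlushAsync(http.RequestAborted);
    }

    // Terminal event carries the metadata listed above (model version, tokens).
    await http.Response.WriteAsync("event: done\ndata: {\"model_version\":\"gpt-4o-mini\"}\n\n");
});

app.Run();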


5. Minimal but realistic example

Scenario: A tenant wants to generate a summary of uploaded documents using RAG.

// Service class: AIDocumentSummarizer.cs
public class AIDocumentSummarizer
{
    private readonly IAzureOpenAIClient _openAIClient;
    private readonly IAzureAISearchClient _searchClient;
    private readonly IQuotaService _quotaService;
    private readonly IAuditLogger _auditLogger;
    private readonly ILogger<AIDocumentSummarizer> _logger;

    public AIDocumentSummarizer(
        IAzureOpenAIClient openAIClient,
        IAzureAISearchClient searchClient,
        IQuotaService quotaService,
        IAuditLogger auditLogger,
        ILogger<AIDocumentSummarizer> logger)
    {
        _openAIClient = openAIClient;
        _searchClient = searchClient;
        _quotaService = quotaService;
        _auditLogger = auditLogger;
        _logger = logger;
    }

    public async Task<SummaryResult> GenerateSummaryAsync(
        string tenantId,
        string userId,
        string documentId,
        CancellationToken cancellationToken)
    {
        // Step 1: Check quota
        var quotaCheck = await _quotaService.CheckAndReserveAsync(
            tenantId,
            estimatedTokens: 2000,
            cancellationToken);

        if (!quotaCheck.IsAllowed)
        {
            throw new QuotaExceededException(quotaCheck.ResetDate);
        }

        // Step 2: Retrieve document context via RAG
        var searchResults = await _searchClient.SearchAsync(
            indexName: "documents",
            searchText: documentId,
            filter: $"tenant_id eq '{tenantId}'",
            top: 3,
            cancellationToken);

        var contextChunks = searchResults.Results
            .Select(r => r.Document.GetString("content"))
            .ToList();

        // Step 3: Assemble prompt
        var systemMessage = "You are a document summarizer. Output concise summaries. Never include PII.";
        var userMessage = $"Summarize the following document:\n\n{string.Join("\n\n", contextChunks)}";

        var messages = new List<ChatMessage>
        {
            new ChatMessage(ChatRole.System, systemMessage),
            new ChatMessage(ChatRole.User, userMessage)
        };

        // Step 4: Invoke model with observability
        using var activity = Activity.Current?.Source.StartActivity("AI.Inference");
        activity?.SetTag("tenant_id", tenantId);
        activity?.SetTag("feature", "document_summary");

        var startTime = DateTime.UtcNow;

        var response = await _openAIClient.GetChatCompletionsAsync(
            deploymentName: "gpt-4o-mini", // Cost-optimized model
            new ChatCompletionsOptions
            {
                Messages = messages,
                MaxTokens = 500,
                Temperature = 0.3f,
                User = tenantId // For Azure's abuse monitoring
            },
            cancellationToken);

        var latency = DateTime.UtcNow - startTime;
        var choice = response.Value.Choices[0];
        var summary = choice.Message.Content;

        // Step 5: Calculate actual cost
        var inputTokens = response.Value.Usage.PromptTokens;
        var outputTokens = response.Value.Usage.CompletionTokens;
        var costUsd = CalculateCost(inputTokens, outputTokens, "gpt-4o-mini");

        // Step 6: Reconcile quota
        await _quotaService.ReconcileUsageAsync(
            tenantId,
            actualTokens: inputTokens + outputTokens,
            reservationId: quotaCheck.ReservationId,
            cancellationToken);

        // Step 7: Audit log
        await _auditLogger.LogInferenceAsync(new InferenceAudit
        {
            TenantId = tenantId,
            UserId = userId,
            Feature = "document_summary",
            ModelVersion = "gpt-4o-mini-2024-07-18",
            InputTokens = inputTokens,
            OutputTokens = outputTokens,
            CostUsd = costUsd,
            LatencyMs = (int)latency.TotalMilliseconds,
            CorrelationId = Activity.Current?.Id
        });

        // Step 8: Return result
        return new SummaryResult
        {
            Summary = summary,
            ModelVersion = "gpt-4o-mini-2024-07-18",
            TokensUsed = inputTokens + outputTokens,
            CostUsd = costUsd
        };
    }

    private decimal CalculateCost(int inputTokens, int outputTokens, string model)
    {
        // gpt-4o-mini: $0.15/1M input, $0.60/1M output (as of Jan 2025)
        return (inputTokens * 0.15m / 1_000_000) + (outputTokens * 0.60m / 1_000_000);
    }
}

How this maps to the concept:

  • Isolation: tenant_id is used to filter search results and set User field
  • Quota enforcement: Pessimistic reservation before inference, reconciliation after
  • Orchestration: Prompt assembly with RAG context from tenant-scoped index
  • Observability: OpenTelemetry activity, structured audit log, cost calculation
  • Cost control: Uses cheaper model (gpt-4o-mini) for summarization workload

6. Design trade-offs

For each approach below: what you gain, what you give up, and when to choose it.

  • Synchronous inference
    Gain: simple request/response flow, immediate feedback, easier debugging
    Give up: higher latency, blocking threads, quota exhaustion on spikes
    Choose when: low-volume features, interactive UX where the user waits

  • Asynchronous queue-based
    Gain: batching, cost optimization, backpressure handling, retry logic
    Give up: complexity (queue management), delayed results, polling UX
    Choose when: batch operations, non-interactive workflows, cost-sensitive workloads

  • Shared model deployment
    Gain: lower infrastructure cost, simpler operations
    Give up: cross-tenant isolation risk, noisy neighbor problem, quota conflicts
    Choose when: early-stage products, trusted tenants only

  • Per-tenant model deployment
    Gain: perfect isolation, custom model tuning per tenant
    Give up: 10x infrastructure cost, operational overhead, slower rollouts
    Choose when: enterprise tier, regulated industries, contractual isolation requirements

  • Prompt caching
    Gain: 50-90% cost reduction on repeated prompts, faster responses
    Give up: cache invalidation complexity, stale results risk, storage cost
    Choose when: high-repetition workloads (same docs, same queries), RAG with static corpora

  • Streaming responses
    Gain: better UX (perceived speed), lower timeout risk
    Give up: complexity in error handling, partial result management, harder to validate
    Choose when: real-time features, long-form generation, chatbots

  • Batch API (Azure OpenAI)
    Gain: 50% cost reduction vs real-time API
    Give up: 24-hour max latency, no streaming, requires job management
    Choose when: nightly data processing, report generation, non-urgent tasks

  • Vendor abstraction layer
    Gain: model portability, multi-cloud strategy, easier migration
    Give up: abstraction tax (10-20% dev overhead), lowest-common-denominator API
    Choose when: regulated industries avoiding lock-in, multi-region deployment

  • Direct Azure OpenAI SDK
    Gain: full feature access, optimal performance, simpler code
    Give up: vendor lock-in, migration cost if switching providers
    Choose when: startups, single-cloud strategy, speed over portability

Implicit acceptance with embedded AI:

  • You own the output quality: Users blame you, not OpenAI, if the AI hallucinates or produces garbage.
  • You own the cost: A bug in prompt orchestration can burn $10k overnight.
  • You own compliance: If a model leaks PII, you face LGPD/GDPR fines, not the model vendor.
  • You own the latency: If Azure OpenAI has an outage, your platform's AI features are down.

7. Common mistakes and misconceptions

No tenant_id propagation through AI pipelines

Why it happens:

  • Developers treat AI calls like stateless functions and forget to pass tenant context.

Problem:

  • Cross-tenant data leakage. A RAG query without tenant_id filter returns results from all tenants.

How to avoid:

  • Make tenant_id a required parameter in every AI-related service method. Use middleware to inject into correlation context. Add integration tests that verify isolation.
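
A sketch of such an isolation test with xUnit; ISearchFixture is a hypothetical test double (the IClassFixture wiring against a real index is omitted):

using System.Collections.Generic;
using System.Threading.Tasks;
using Xunit;

// Hypothetical fixture: the real one would wrap a tenant-partitioned search index.
public interface ISearchFixture
{
    Task IndexDocumentAsync(string tenantId, string content);
    Task<IReadOnlyList<(string TenantId, string Content)>> SearchAsync(string tenantId, string query);
}

public class TenantIsolationTests
{
    private readonly ISearchFixture _fixture;

    public TenantIsolationTests(ISearchFixture fixture) => _fixture = fixture;

    [Fact]
    public async Task RagSearch_NeverReturnsOtherTenantsDocuments()
    {
        // Arrange: one distinctive document per tenant.
        await _fixture.IndexDocumentAsync("tenant-a", "alpha confidential report");
        await _fixture.IndexDocumentAsync("tenant-b", "beta confidential report");

        // Act: search as tenant A with a query that matches both documents.
        var results = await _fixture.SearchAsync("tenant-a", "confidential report");

        // Assert: every hit belongs to tenant A; tenant B content never appears.
        Assert.All(results, r => Assert.Equal("tenant-a", r.TenantId));
        Assert.DoesNotContain(results, r => r.Content.Contains("beta"));
    }
}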

Synchronous AI calls in request/response cycles

Why it happens:

  • Easiest to implement. Just await the OpenAI call and return.

Problem:

  • 95th percentile latency balloons to 5+ seconds when the model is slow or quota-throttled. Users abandon the page.

How to avoid:

  • Use async/background jobs for non-interactive AI. For interactive features, implement optimistic UI updates and streaming.

No cost attribution per tenant

Why it happens:

  • "We'll worry about costs later."

Problem:

  • Cannot identify which tenants are driving costs. Cannot implement tiered pricing or quota enforcement. Finance cannot chargeback AI spend.

How to avoid:

  • Log every inference with tenant_id and cost_usd. Build dashboards from day 1. Implement quota enforcement before general availability.

Storing raw prompts and outputs without PII filtering

Why it happens:

  • Audit logs need full input/output for debugging.

Problem:

  • Logs become a LGPD/GDPR liability. An attacker who gains access to logs can exfiltrate PII.

How to avoid:

  • Hash inputs and outputs in audit logs. Store only metadata (token counts, model version, timestamps). Retain raw data in a separate, encrypted, access-controlled store with short retention (30 days).
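
A sketch of the hashing step using SHA-256 from the BCL; AuditHasher is an illustrative helper:

using System;
using System.Security.Cryptography;
using System.Text;

public static class AuditHasher
{
    // The hash proves which input produced an output without storing the content.
    public static string Sha256Hex(string content)
    {
        var bytes = SHA256.HashData(Encoding.UTF8.GetBytes(content));
        return Convert.ToHexString(bytes); // 64 hex chars, e.g. "A1B2C3..."
    }
}

// Usage when building the audit record:
//   InputHash  = AuditHasher.Sha256Hex(assembledPrompt),
//   OutputHash = AuditHasher.Sha256Hex(modelOutput),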

No circuit breaker per tenant

Why it happens:

  • Shared infrastructure, no isolation of failure domains.

Problem:

  • One tenant with a runaway loop (infinite retries) exhausts quota for all tenants, causing cascading failures.

How to avoid:

  • Implement per-tenant circuit breakers. After N failures in M seconds, return 503 Service Unavailable for that tenant only. Other tenants unaffected.
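
A sketch using Polly's circuit breaker (v7-style API) keyed by tenant; the thresholds are illustrative:

using System;
using System.Collections.Concurrent;
using Polly;
using Polly.CircuitBreaker;

public class TenantCircuitBreakerRegistry
{
    private readonly ConcurrentDictionary<string, AsyncCircuitBreakerPolicy> _breakers = new();

    public AsyncCircuitBreakerPolicy For(string tenantId) =>
        _breakers.GetOrAdd(tenantId, _ =>
            Policy
                .Handle<Exception>()
                .CircuitBreakerAsync(
                    exceptionsAllowedBeforeBreaking: 5,           // N failures...
                    durationOfBreak: TimeSpan.FromSeconds(30)));  // ...break for M seconds

    // Usage: failures from one tenant trip only that tenant's breaker.
    //   try { return await registry.For(tenantId).ExecuteAsync(() => InvokeModelAsync(req)); }
    //   catch (BrokenCircuitException) { /* return 503 for this tenant only */ }
}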

Hardcoding model names in application code

Why it happens:

  • model: "gpt-4" is simple and works.

Problem:

  • When Azure deprecates gpt-4, you need to deploy code changes across all services. No A/B testing of models. No gradual rollout.

How to avoid:

  • Store model name in configuration (feature flag, database). Map logical model IDs (summarization_model) to physical models (gpt-4o-mini). Update mappings without code changes.

No validation of AI outputs

Why it happens:

  • "The model is smart, it won't produce bad outputs."

Problem:

  • Models hallucinate, leak training data, ignore instructions. One hallucinated SQL query in a generated report can cause a security incident.

How to avoid:

  • Always validate outputs. For structured data, parse and schema-validate. For text, run content safety checks. For code, use static analysis. Log validation failures.

Ignoring token economics in prompt design

Why it happens:

  • "More context is better."

Problem:

  • Embedding entire documents in prompts wastes 80% of tokens on irrelevant content. Costs explode.

How to avoid:

  • Use RAG to retrieve only relevant chunks. Limit context to top-K embeddings. Experiment with chunk sizes (256 tokens vs 512 tokens). Monitor cost-per-inference and optimize aggressively.

8. Operational and production considerations

What to monitor

Per-tenant quotas:

  • ai_tokens_used_today vs ai_tokens_limit_daily
  • Alert when tenant exceeds 80% of quota
  • Dashboard showing top 10 tenants by token usage

Model performance:

  • P50, P95, P99 latency per model and feature
  • Success rate (2xx vs 4xx vs 5xx responses from Azure OpenAI)
  • Token efficiency: output_tokens / input_tokens ratio

Cost metrics:

  • Total AI spend per day, per tenant, per feature
  • Cost-per-request trend over time
  • Quota burn rate (time until tenants hit limits)

Quality signals:

  • User feedback (thumbs up/down on AI outputs)
  • Retry rate (how often do users regenerate outputs?)
  • Validation failure rate (malformed outputs)

Infrastructure health:

  • Azure OpenAI quota utilization (you have regional quotas)
  • Circuit breaker trip rate per tenant
  • RAG index lag (how stale are embeddings?)

What degrades first

Under load:

  • Azure OpenAI quota exhaustion → 429 errors → circuit breakers trip → AI features unavailable
  • RAG search latency increases → total request latency spikes → timeouts

Under cost pressure:

  • Tenants hit quota limits → cannot use AI features until next billing cycle
  • Shared quota depleted by one tenant → all tenants affected

Under model drift:

  • Output quality declines silently → users lose trust → support tickets increase
  • New model version has different token pricing → costs spike unexpectedly

What becomes expensive

Storage:

  • Audit logs with full prompts/outputs grow to terabytes
  • Vector indexes for RAG (Azure AI Search charges per GB and per query)

Compute:

  • Per-tenant model deployments (if you go that route) cost $500 - $5000/month per tenant
  • Real-time embeddings generation for every document upload

Network:

  • Streaming responses to thousands of concurrent users
  • Large payloads (images, documents) sent to Azure OpenAI

Operational risks

Model deprecation: Azure OpenAI deprecates models with 6-month notice. You need a migration plan: update model mappings, regression test all features, monitor for quality degradation.

Quota exhaustion: Regional quotas are shared across all tenants. One tenant's spike can exhaust quota. Mitigation: Distribute tenants across multiple regions, implement aggressive rate limiting.

Data residency violations: LGPD/GDPR requires data to stay in specific regions. If a tenant's data is processed by an Azure OpenAI instance in the wrong region, you're non-compliant. Mitigation: Enforce region selection at the tenant level, block cross-region calls.

Prompt injection attacks: Malicious users craft inputs that manipulate the model into ignoring system instructions. Mitigation: Validate inputs, use prompt sandboxing, monitor for anomalous outputs.

Cost runaway: A bug in retry logic or a tenant with a scripted attack burns $50k in tokens overnight. Mitigation: Hard caps per tenant, kill switches, alerting on cost spikes.

Observability signals

Latency anomalies:

  • Sudden increase in P95 latency → model is slow or quota-throttled
  • Check Azure OpenAI status page, inspect retry logs

Cost spikes:

  • Daily spend increases 3x → investigate which tenant, which feature
  • Correlate with deployment changes (new prompt template?)

Quality degradation:

  • User feedback score drops from 4.2 to 3.1 → model update broke something
  • A/B test rollback, inspect changed outputs

Isolation breaches:

  • RAG query returns results with multiple tenant_id values → index misconfiguration
  • Immediate incident response, audit all recent queries

9. When NOT to use this

Do not use embedded AI in the following situations:

  • When you have <100 users total: Embedded AI infrastructure is overkill. Use a simple API wrapper around OpenAI with no tenant isolation. Focus on product-market fit, not architecture.

  • When AI is a "nice-to-have" feature: If the product works perfectly without AI, don't embed it. Users will tolerate external AI integrations (e.g., Zapier + OpenAI) until demand justifies native support.

  • When compliance requirements prohibit cloud AI: Some industries (defense, healthcare in certain jurisdictions) require on-premise models. SaaS platforms cannot meet these requirements without self-hosted infrastructure.

  • When latency budgets are <100ms: AI inference takes 200ms-2s even with optimized prompts. If your feature needs sub-100ms responses (e.g., autocomplete), use traditional ML models or heuristics.

  • When you cannot afford $10k+/month in AI costs: Embedded AI at scale is expensive. If your total revenue is <$50k/month, you cannot sustainably operate AI features with thousands of users.

  • When your team has no ML/AI expertise: Operating embedded AI requires understanding model behavior, debugging hallucinations, optimizing prompts, and managing model lifecycles. If your engineers are purely backend/frontend, outsource to an AI-as-a-service platform (e.g., customers bring their own API keys).

  • When data isolation is impossible: If your architecture has shared databases without row-level security, or shared blob storage without container-level isolation, you cannot safely embed AI. Fix the data architecture first.

  • When vendor lock-in is unacceptable: If your business strategy requires multi-cloud or model portability, the overhead of abstraction layers (20-30% dev time tax) may not be worth it. Consider AI-as-a-service offerings with standardized APIs.


10. Key takeaways

  • Tenant isolation at the AI layer is harder than database isolation. Every component—prompts, RAG indexes, quotas, audit logs—must enforce tenant_id scoping. One missing filter causes data leakage.

  • Cost control is not optional. Without per-tenant quotas, one user can bankrupt your AI budget. Implement token budgeting, rate limiting, and cost attribution from day 1.

  • Compliance requires full traceability. LGPD/GDPR audits will ask: "What model version generated this output? What input data was used? Where is it stored?" If you cannot answer, you fail the audit.

  • Model deprecation is a when, not if. Abstract model names from code. Store mappings in configuration. Test new models before Azure forces the upgrade.

  • Observability is your defense against silent failures. Models degrade, costs spike, and latency increases without error messages. Monitor token usage, latency percentiles, and output quality continuously.

  • Streaming improves perceived performance more than faster models. A 2-second response feels instant if tokens appear progressively. Invest in SSE or WebSockets before optimizing prompts.

  • Prompt engineering is cost engineering. Every unnecessary token in a prompt costs money at scale. Optimize context retrieval, chunk sizes, and system messages aggressively.


11. High-level overview

Visual representation of the end-to-end embedded AI flow, highlighting tenant-scoped isolation boundaries, quota enforcement, prompt orchestration with RAG, controlled model invocation, audit persistence, and optional asynchronous processing for batch and embedding workloads.

[Sequence diagram: Embedded AI in Multi-tenant SaaS — High-level Flow]