The Internal AI Stack: What It Looks Like When You Build It Properly
In brief
Most teams are still at the ChatGPT-tab stage. A mature internal AI stack has five layers. Here's what each layer does, how the layers connect, and the metrics that tell you it's working — versus the metrics that tell you you're still paying 400,000 tokens just to start a session.
Most AI rollouts look like this: someone on the team discovers Claude is useful, tells their manager, the manager approves a company account, the team starts using it for drafting things. After a few months, usage is uneven — a handful of people use it heavily, most use it occasionally for simple tasks, and no one is quite sure what value they're actually getting.
This isn't a failure. It's Phase 1. Individual discovery is how every AI rollout starts.
But there's a significant gap between Phase 1 (individuals using AI as a better search engine) and a mature internal AI stack (AI with structured access to company data, enforced permissions, and reusable workflows that run across the team). Crossing that gap is an engineering problem as much as it is an organizational one.
Here's what the mature version looks like, layer by layer.
The five-layer stack
┌─────────────────────────────────────┐
│ Layer 5: UI                         │
│ Claude Code (dev), Claude Chat      │
│ (everyone else)                     │
├─────────────────────────────────────┤
│ Layer 4: Skills                     │
│ Reusable workflows stored as        │
│ markdown, invoked by natural        │
│ language, published to team plugin  │
├─────────────────────────────────────┤
│ Layer 3: Internal MCP Server        │
│ Single routing layer with ACL       │
│ middleware, shaped responses,       │
│ audit logging                       │
├─────────────────────────────────────┤
│ Layer 2: Data Access                │
│ Live API (real-time) + Data         │
│ Warehouse (aggregates/history)      │
├─────────────────────────────────────┤
│ Layer 1: Pre-Fetch Cache            │
│ Nightly cron job, session-start     │
│ context ready before user appears   │
└─────────────────────────────────────┘
Each layer handles a specific concern. Remove any one of them and the others don't work as well.
Layer 1: Pre-Fetch Cache
The pre-fetch cache is the foundation no one talks about until they get the bill.
Every morning at 3 AM, a cron job runs. It fetches everything your agents will need at session start — pipeline state, meeting notes from yesterday, open ticket summaries, call summaries from the past week — and writes it to a cache table.
When users start their sessions at 8 AM, their agents read from cache. Session initialization goes from "call 6 APIs and wait 8 seconds" to "read one database row." Token cost goes from 400,000 tokens per session (for native connectors pulling everything live) to near zero on startup.
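Concretely, the job can be a small script on a cron schedule. A minimal sketch, assuming hypothetical fetch helpers for each source and a simple SQLite cache table; swap in whatever connectors and storage you already run:

```python
# prefetch.py -- run nightly, e.g. cron: 0 3 * * * python prefetch.py
# Sketch only: the fetch_* helpers and the cache schema are placeholders.
import json
import sqlite3
from datetime import date, datetime, timezone

def fetch_pipeline_state() -> dict:
    # e.g. call the CRM API and summarize open deals per owner
    return {"open_deals": []}  # placeholder

def fetch_recent_context() -> dict:
    # yesterday's meeting notes, open ticket summaries,
    # call summaries from the past week
    return {"meeting_notes": [], "open_tickets": [], "call_summaries": []}

def main() -> None:
    snapshot = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "pipeline": fetch_pipeline_state(),
        **fetch_recent_context(),
    }
    db = sqlite3.connect("prefetch_cache.db")
    db.execute(
        "CREATE TABLE IF NOT EXISTS session_cache (day TEXT PRIMARY KEY, payload TEXT)"
    )
    db.execute(
        "INSERT OR REPLACE INTO session_cache VALUES (?, ?)",
        (date.today().isoformat(), json.dumps(snapshot)),
    )
    db.commit()

if __name__ == "__main__":
    main()
```

At 8 AM, the agent reads the latest row instead of fanning out to six APIs.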
This layer is invisible when it's working. It's very visible when it's not — slow sessions, high token costs, unreliable startup.
What to monitor:
- Cache generation time (did the cron finish before 6 AM?)
- Cache hit rate per session (are sessions actually reading from cache or falling back to live?)
- Cache staleness at session start (how old is the data when users start their day?)
Layer 2: Data Access (Live API + Warehouse)
The data access layer has two paths, and both are necessary.
The warehouse path handles aggregates, historical data, and cross-system joins. "What's our renewal rate by industry this quarter?" "Which customers have both open tickets and renewals in the next 90 days?" These queries touch large datasets across multiple systems. The warehouse handles them in a single fast query.
The live API path handles anything where freshness matters more than cost. "Did Acme's payment clear?" "What's the status of this open ticket?" "Is this deploy still in progress?" These need current answers — the warehouse might be hours stale.
The right warehouse for AI agent workloads is often different from your BI warehouse. BigQuery is excellent for large-scale analytics and works well in GCP. For AI agent queries — many small lookups, sub-second latency requirements, flat per-month pricing — StarRocks is often a better fit at roughly $200-400/month regardless of query volume.
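In code, the split shows up as two kinds of tools behind the MCP server. A sketch, assuming a hypothetical warehouse client with a query() method and a hypothetical billing API wrapper; table and field names are illustrative:

```python
def renewal_rate_by_industry(warehouse, quarter: str) -> list[dict]:
    # Aggregate, historical question: one warehouse query, cheap and fast.
    return warehouse.query(
        """
        SELECT industry,
               AVG(CASE WHEN renewed THEN 1 ELSE 0 END) AS renewal_rate
        FROM renewals
        WHERE quarter = %(quarter)s
        GROUP BY industry
        """,
        {"quarter": quarter},
    )

def payment_cleared(billing_api, invoice_id: str) -> bool:
    # Freshness question: the warehouse may be hours stale, so ask the
    # live system directly and accept the per-call latency and cost.
    return billing_api.get_invoice(invoice_id)["status"] == "paid"
```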
What to monitor:
- Query latency by source (warehouse vs. live API)
- ETL lag (when did the warehouse last update?)
- Live API error rates (so you know when to expect stale cache reads)
Layer 3: Internal MCP Server
The MCP server is the routing and control layer. Claude makes a single connection here; the server handles everything else.
Every tool call passes through ACL middleware (sketched below) that:
- Identifies the caller (from the authenticated session — not from user input)
- Checks whether the caller's permission group includes access to this tool
- Injects caller identity into every downstream call (so services know who's asking)
- Returns a clean denial for unauthorized tools (and they don't appear in the tool list to begin with)
- Logs every call with the user ID, tool name, parameters, status, and timestamp
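A minimal sketch of that middleware as a plain dispatcher (the Caller object, permission mapping, and tool registry here are placeholders, not any particular MCP SDK):

```python
import logging
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Callable

audit = logging.getLogger("mcp.audit")

@dataclass
class Caller:
    user_id: str
    group: str  # resolved from the authenticated session, never from user input

def call_tool(
    caller: Caller,
    tool_name: str,
    params: dict[str, Any],
    permissions: dict[str, set[str]],      # group -> allowed tool names
    tools: dict[str, Callable[..., Any]],  # tool name -> implementation
) -> Any:
    allowed = tool_name in permissions.get(caller.group, set())
    # Every call is logged: who, which tool, which parameters, and the outcome.
    audit.info("user=%s tool=%s params=%s allowed=%s at=%s",
               caller.user_id, tool_name, params, allowed,
               datetime.now(timezone.utc).isoformat())
    if not allowed:
        return {"error": "not authorized for this tool"}
    # Caller identity is injected downstream so the backing service knows who is asking.
    return tools[tool_name](**params, _caller_id=caller.user_id)
```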
The tools themselves are named functions with descriptions. Claude reads the description and decides when to call the tool. The implementation is hidden — it might call the warehouse, it might call a live API, it might call both and join the results. Claude gets the shaped output.
Tool response shaping is where you reclaim the 10-50x token cost difference between raw API responses and clean data. A HubSpot deal object has 40+ fields. Your tool returns 6. You pay for 6.
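Shaping is usually just a projection at the tool boundary. A sketch with an illustrative field list; the handful of fields worth keeping will differ per team:

```python
# Illustrative field list; keep only what your workflows actually read.
DEAL_FIELDS = ("dealname", "amount", "dealstage", "closedate", "owner", "next_step")

def shape_deal(raw_deal: dict) -> dict:
    """Project a raw CRM deal object (40+ fields) down to what the agent needs."""
    props = raw_deal.get("properties", raw_deal)
    return {field: props.get(field) for field in DEAL_FIELDS}
```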
What to monitor:
- Tool call latency (p50, p95 — where are the slow spots?)
- Permission denial rate (are users hitting tools they shouldn't have? Or tools that should exist for them but don't?)
- Most-called tools (what does your team actually use Claude for?)
- Audit log (every access, for compliance and debugging)
Layer 4: Skills
Skills are the layer that converts institutional knowledge into team capability.
A skill is a SKILL.md file describing a multi-step workflow: what it does, when to invoke it, what tools to call in what order, and what the output should look like. Skills are stored in a plugin folder, published via PR, and available to everyone immediately.
The person who figures out the best way to run a BDR compliance audit, prep for a QBR, or pull a deal intelligence report — their knowledge goes into a skill file. Now anyone on the team can run the same workflow with the same quality.
The most valuable skills are the ones that encode decisions, not just steps. The pre-call research brief isn't valuable because it pulls from HubSpot and Gong — any tool can do that. It's valuable because it knows which fields matter, which signals to flag, and how to format output for a 15-minute read before a call.
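As a rough illustration, a pre-call brief skill might look something like this; the frontmatter fields and step wording are a sketch, and the tool names mirror the hypothetical sales tools listed further down, not a prescribed format:

```markdown
---
name: pre-call-research-brief
description: Build a brief that can be read in 15 minutes before a customer call.
---

When the user asks for a pre-call brief:

1. Call `deal_lookup` for the account; keep stage, amount, close date, and open next steps.
2. Call `call_history` for the last three calls; flag pricing objections and champion changes.
3. Output a one-page brief: account snapshot, three signals to raise, suggested opening question.
```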
What to monitor:
- Skill invocation frequency (which skills get used? Which don't?)
- Skill success rate (does the skill complete without error?)
- Skill coverage (what percentage of your team's repeated workflows have a skill?)
Layer 5: UI
The UI layer is the interface through which your team actually uses everything above.
Claude Code is the primary UI for developers and technical users. It has direct access to the internal MCP server, can invoke skills by name, can write and run code, and can operate in longer multi-step sessions with tool use.
Claude Chat (claude.ai or a custom front-end) is the primary UI for everyone else. Sales, finance, CS, operations — they interact through a conversational interface that uses the same MCP server and skills in the background.
For most teams, this means two interface surfaces but one data layer underneath.
What each role gets
The stack's value is that it gives different teams access to different things — shaped to their role, with their permissions, and with their most common workflows automated.
Sales:
- CRM pipeline data in real-time
- Call transcripts and Gong summaries
- Deal intelligence reports on demand
- Pre-call research briefs automatically
- Tools: deal lookup, call history, contact profiles
Finance:
- Accounting reports from QuickBooks
- Billing and AR data from ChargeOver
- Monthly variance analysis vs. forecast
- Revenue recognition reports
- Tools: SQL access (finance_read), billing lookup, report generation
Customer Success:
- Customer health scores
- Open ticket status and history
- Renewal pipeline with risk signals
- QBR prep on demand
- Tools: customer profile, ticket lookup, usage data
Engineering:
- Codebase access via GitHub MCP
- Roadmap and spec docs from Notion
- Deploy status and incident history
- PR summaries and review briefs
- Tools: repo lookup, deploy status, docs search
Executives:
- Cross-functional dashboards
- Pipeline and revenue summaries
- Meeting briefs from Granola
- KPI snapshots
- Tools: broad read access across all tools, SQL access
The same MCP server, the same skills layer, the same cache — but each person's Claude looks different because the tool list is filtered to their permissions.
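Concretely, that per-role difference can be a single mapping inside the MCP server's ACL layer. A sketch with hypothetical group and tool names that mirror the lists above:

```python
# Hypothetical group-to-tool mapping; names mirror the role lists above.
PERMISSION_GROUPS: dict[str, set[str]] = {
    "sales":       {"deal_lookup", "call_history", "contact_profile"},
    "finance":     {"finance_read_sql", "billing_lookup", "report_generate"},
    "cs":          {"customer_profile", "ticket_lookup", "usage_data"},
    "engineering": {"repo_lookup", "deploy_status", "docs_search"},
    "executive":   {"deal_lookup", "customer_profile", "billing_lookup",
                    "ticket_lookup", "finance_read_sql", "kpi_snapshot"},
}

def visible_tools(group: str) -> list[str]:
    # Same server for everyone; only the advertised tool list differs.
    return sorted(PERMISSION_GROUPS.get(group, set()))
```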
The metrics that tell you it's working
Cold start token cost. Target: under 10,000 tokens per session startup. If you're over 100,000, you don't have a working pre-fetch cache. If you're over 400,000, you're running native connectors with no shaping.
Query latency. Target: under 2 seconds for 95% of tool calls. If you're regularly at 3-5 seconds, your warehouse query layer needs work — or you're hitting live APIs for things that should be cached.
Cross-system query availability. Can you ask "which customers have an open enterprise renewal and more than two unresolved support tickets?" and get an answer in one query? If not, your warehouse layer isn't joining across systems.
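For illustration, that question should compile to a single warehouse query along these lines (table and column names are hypothetical and depend on how your ETL lands CRM and ticketing data):

```python
CROSS_SYSTEM_QUERY = """
-- renewals comes from the CRM, tickets from the support system;
-- the join only works because both land in the same warehouse.
SELECT r.customer_name,
       r.renewal_date,
       COUNT(t.ticket_id) AS unresolved_tickets
FROM renewals r
JOIN tickets t
  ON t.customer_id = r.customer_id
 AND t.status != 'resolved'
WHERE r.segment = 'enterprise'
  AND r.status = 'open'
GROUP BY r.customer_name, r.renewal_date
HAVING COUNT(t.ticket_id) > 2
"""
# rows = warehouse.query(CROSS_SYSTEM_QUERY)  # hypothetical warehouse client
```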
Permission violations. Target: zero. If unauthorized tool access is possible (not just detected), your ACL middleware isn't working correctly. If it's happening and being caught, your permission groups need review.
Skill coverage. What percentage of your team's regularly repeated workflows have a skill? Under 20%: you're at individual discovery. Over 60%: your institutional knowledge is becoming a shared asset. Over 80%: most of your team's repeated work is automated.
The metrics that tell you something is wrong
High session startup cost. You're paying for cold start. Check cache hit rate, check whether pre-fetch is running, check whether tools are shaping responses.
Slow tool calls. Check which tools are slow (warehouse, live API, or MCP overhead). Check warehouse query plans. Check live API rate limiting.
Low skill invocation. Skills exist but nobody uses them. Either discovery is broken (the trigger phrases don't match how people ask) or the output isn't good enough to trust.
High permission denial rate. People are hitting tool walls. Either your permission groups are too restrictive, or you haven't built tools that certain roles actually need.
Token cost spikes. Log token cost per session by user. Spikes usually mean one of: a user who's asking Claude to process a lot of raw data, a skill that's not shaping its tool calls, or a new data source that got connected without thinking about response size.
Where most teams actually are
Most teams have only part of this stack: a UI on top and native connectors for data access (no shaping, no ACL, no audit trail), with no pre-fetch cache, no internal MCP server, and no skills layer.
This gives them an AI that's useful for individuals but doesn't compound across the team. Token costs are high. Session startup is slow. The same workflows get reinvented by each person independently. And there's no access control — everyone gets everything.
The path forward isn't a complete rebuild. It's adding the missing layers in order:
- Add a pre-fetch cache for your top 3 session startup queries (one week of engineering)
- Build an internal MCP server with 3-5 tools and ACL middleware (two weeks)
- Write your first 5 skills for the team's most common workflows (two weeks)
- Add the warehouse query layer for aggregates and cross-system joins (two weeks)
Six to eight weeks of focused engineering. The result is an internal AI stack where every tool call is shaped, permissioned, logged, and fast — and where the team's institutional knowledge compounds instead of staying locked in individuals' heads.