
Rate limiting patterns for multi-user Claude apps

In brief

When your app has more than one user, naive retry logic is not enough. This article covers token budgeting, per-user quotas, request queuing, and graceful degradation, with working code.



A single-user app hitting rate limits is annoying. A multi-user app hitting rate limits is a product failure. The difference is not just retry logic — it is designing your application to distribute usage correctly and degrade gracefully when limits are approached.

Anthropic's rate limits: what you are actually working around

Anthropic enforces limits at two levels: requests per minute (RPM) and tokens per minute (TPM). These limits vary by model, account tier, and change over time as Anthropic updates its capacity. Always check the current values in the Anthropic console under "Limits" or in the Anthropic rate limits documentation before hardcoding any numbers.

As of early 2026, tiers range from the default (limited RPM/TPM for new accounts) to higher production tiers granted after a usage and payment history is established. New accounts are restricted until Anthropic has established trust. If you are building a multi-user application and hitting limits in testing, the first step is usually requesting a tier increase through the console, not redesigning your architecture.

The limits that matter most in practice:

  • TPM (tokens per minute) — the one most apps hit first. Long system prompts × many concurrent users = TPM exhaustion.
  • RPM (requests per minute) — less commonly the bottleneck, but relevant if your app makes many short calls rather than fewer long ones.

Design your application knowing that these limits are per API key. If you need more headroom without a tier upgrade, you can use multiple keys on separate accounts — but Anthropic's terms require each to belong to a separate legal entity, so this is not a general workaround.

Why application-layer rate limiting is still required

Even after a tier upgrade, you still need application-layer rate limiting. Anthropic's limits protect Anthropic's infrastructure. Your application needs limits to protect your own budget and to ensure fair access across users. These are different problems.

When you have 500 users hitting your app simultaneously, exponential backoff does not help. You cannot tell 400 users to wait 16 seconds. You need to:

  1. Prevent rate limit exhaustion in the first place
  2. Queue requests intelligently when limits are approached
  3. Give users informative feedback, not spinner timeouts

Token budgeting

The most effective preventive measure is setting token budgets per request.

import Anthropic from '@anthropic-ai/sdk'

const anthropic = new Anthropic()  // reads ANTHROPIC_API_KEY from the environment

const MAX_TOKENS_PER_REQUEST = 1024

const response = await anthropic.messages.create({
  model: 'claude-opus-4-5',
  max_tokens: MAX_TOKENS_PER_REQUEST,
  messages: conversation,
})

Tighter budgets mean more headroom before TPM limits hit. For conversational apps, you rarely need 4096 tokens per response — 1024 is sufficient for most turns.
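To see how much headroom a budget buys, a back-of-the-envelope calculation helps. Every number below is an illustrative placeholder, and it assumes your tier exposes a single combined TPM ceiling (some tiers split input and output tokens; check your console):

```typescript
// Rough TPM headroom check: how many requests per minute fit under the limit?
const TPM_LIMIT = 80_000              // placeholder; use your tier's real value
const SYSTEM_PROMPT_TOKENS = 1_500    // measured size of your system prompt
const AVG_CONVERSATION_TOKENS = 2_000 // typical accumulated history per turn
const MAX_TOKENS_PER_REQUEST = 1_024  // the output budget from above

const tokensPerRequest =
  SYSTEM_PROMPT_TOKENS + AVG_CONVERSATION_TOKENS + MAX_TOKENS_PER_REQUEST

const requestsPerMinute = Math.floor(TPM_LIMIT / tokensPerRequest)
console.log(requestsPerMinute) // → 17
```

With the same placeholder numbers but a 4096-token output budget, the ceiling drops to 10 requests per minute, which is exactly the headroom trade-off described above.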

Per-user rate limiting at the application layer

Before requests even reach Anthropic, throttle at the application layer. Supabase + a simple counter works well:

// lib/rateLimit.ts
import { createClient } from '@/lib/supabase/server'

const USER_LIMIT = 20  // requests per hour
const WINDOW_MS = 60 * 60 * 1000

export async function checkUserRateLimit(userId: string): Promise<{
  allowed: boolean
  remaining: number
  resetAt: Date
}> {
  const supabase = await createClient()
  const windowStart = new Date(Date.now() - WINDOW_MS).toISOString()

  const { count } = await supabase
    .from('api_usage')
    .select('*', { count: 'exact', head: true })
    .eq('user_id', userId)
    .gte('created_at', windowStart)

  const used = count ?? 0
  const remaining = Math.max(0, USER_LIMIT - used)
  // Conservative estimate: reports a full window from now. The true reset is
  // when the oldest counted request ages out of the window.
  const resetAt = new Date(Date.now() + WINDOW_MS)

  // The check here and the insert in recordUsage are separate queries, so a
  // burst of simultaneous requests can briefly exceed the limit. That is fine
  // for soft per-user quotas; use an atomic counter if you need a hard cap.
  return { allowed: remaining > 0, remaining, resetAt }
}

export async function recordUsage(userId: string, tokens: number) {
  const supabase = await createClient()
  await supabase.from('api_usage').insert({
    user_id: userId,
    tokens_used: tokens,
    created_at: new Date().toISOString(),
  })
}

The table:

create table api_usage (
  id uuid primary key default gen_random_uuid(),
  user_id uuid references auth.users(id),
  tokens_used integer not null,
  created_at timestamptz default now()
);

create index on api_usage(user_id, created_at);

Wrapping your route handler

// app/api/chat/route.ts
import { checkUserRateLimit, recordUsage } from '@/lib/rateLimit'
import Anthropic from '@anthropic-ai/sdk'

export async function POST(request: Request) {
  // In production, derive userId from the authenticated session rather than
  // trusting the request body, or users can spend each other's quota.
  const { userId, messages } = await request.json()

  const rateCheck = await checkUserRateLimit(userId)
  if (!rateCheck.allowed) {
    return Response.json(
      {
        error: 'Rate limit reached',
        resetAt: rateCheck.resetAt,
        message: `You have used your hourly limit. Try again at ${rateCheck.resetAt.toLocaleTimeString()}.`
      },
      { status: 429 }
    )
  }

  const anthropic = new Anthropic()
  const response = await anthropic.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 1024,
    messages,
  })

  // Record actual token usage
  await recordUsage(userId, response.usage.input_tokens + response.usage.output_tokens)

  return Response.json({ content: response.content[0] })
}
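The handler above also assumes Anthropic's side never pushes back. The SDK's errors expose a numeric `status`, so the route can detect a retryable upstream failure and return the same 429 shape the client already understands. A sketch; `isRetryableUpstreamError` is a hypothetical helper, not an SDK export:

```typescript
// Detect upstream responses worth surfacing as "busy" rather than as a
// generic 500. 429 is Anthropic's rate limit; 529 is its "overloaded" status.
function isRetryableUpstreamError(err: unknown): boolean {
  const status = (err as { status?: number }).status
  return status === 429 || status === 529
}
```

In the route, wrap the `anthropic.messages.create` call in a try/catch and, when this returns true, respond with a 429 and a friendly message instead of letting the error bubble up as a 500.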

Request queuing for burst scenarios

For apps where simultaneous requests are common (e.g., a classroom tool, a team product), a simple queue prevents pile-ups:

// lib/queue.ts — in-memory, per-process queue; it will not coordinate across
// serverless instances (use BullMQ or similar in production)
type QueuedRequest = {
  resolve: (value: unknown) => void
  reject: (reason: unknown) => void
  fn: () => Promise<unknown>
}

class SimpleQueue {
  private queue: QueuedRequest[] = []
  private running = 0
  private maxConcurrent = 3  // tune to stay within your tier's RPM

  async add<T>(fn: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      this.queue.push({ resolve: resolve as (v: unknown) => void, reject, fn: fn as () => Promise<unknown> })
      this.drain()
    })
  }

  private async drain() {
    if (this.running >= this.maxConcurrent || this.queue.length === 0) return
    const item = this.queue.shift()!
    this.running++
    try {
      const result = await item.fn()
      item.resolve(result)
    } catch (err) {
      item.reject(err)
    } finally {
      this.running--
      this.drain()
    }
  }
}

export const anthropicQueue = new SimpleQueue()
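To convince yourself the pattern actually caps concurrency, a standalone harness can push dummy tasks through the same drain logic and record the high-water mark. This inlines a trimmed copy of the class so the snippet runs on its own:

```typescript
// Run 10 dummy tasks through a concurrency cap of 3 and track the peak
// number of tasks in flight; it should never exceed the cap.
type Task = () => Promise<void>

function makeLimiter(maxConcurrent: number) {
  const queue: Task[] = []
  let running = 0
  let peak = 0

  const drain = () => {
    if (running >= maxConcurrent || queue.length === 0) return
    const task = queue.shift()!
    running++
    peak = Math.max(peak, running)
    task().finally(() => {
      running--
      drain()
    })
  }

  return {
    add(task: Task) {
      queue.push(task)
      drain()
    },
    peakConcurrency: () => peak,
  }
}

const limiter = makeLimiter(3)
for (let i = 0; i < 10; i++) {
  limiter.add(() => new Promise<void>((resolve) => setTimeout(resolve, 10)))
}
setTimeout(() => console.log(limiter.peakConcurrency()), 200) // → 3
```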

Graceful degradation

When limits are hit despite queuing, tell users clearly:

// components/ChatInput.tsx
if (error.status === 429) {
  const resetTime = new Date(error.resetAt).toLocaleTimeString()
  setErrorMessage(`You've reached your request limit. It resets at ${resetTime}.`)
  return
}

Never show a raw "429 Too Many Requests" error. Users think the app is broken. A clear message about limits — and when they reset — keeps trust intact.
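The `error` object that component inspects has to come from somewhere. A minimal client-side helper for the `/api/chat` route shown earlier (the `sendChat` name is made up for illustration) can attach `status` and `resetAt` when the server answers 429:

```typescript
// On a 429, parse the JSON body from the route and rethrow with status and
// resetAt attached, so the component can render a friendly reset-time message
// instead of a raw error.
async function sendChat(messages: unknown[]): Promise<unknown> {
  const res = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages }),
  })
  if (res.status === 429) {
    const body = await res.json()
    throw Object.assign(new Error(body.message), {
      status: 429,
      resetAt: body.resetAt,
    })
  }
  if (!res.ok) throw new Error(`Request failed: ${res.status}`)
  return res.json()
}
```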

Monitoring usage in production

Track a few key metrics:

  • Average tokens per request (spot bloated prompts)
  • Requests per user (identify power users and abuse)
  • Rate limit hits per hour (signals you need a higher tier)
  • P95 response latency (correlates with token count)

Supabase makes this straightforward since you already have api_usage with token counts. A simple dashboard query:

select
  date_trunc('hour', created_at) as hour,
  count(*) as requests,
  sum(tokens_used) as total_tokens,
  avg(tokens_used) as avg_tokens
from api_usage
where created_at > now() - interval '24 hours'
group by 1
order by 1;

Rate limiting is unglamorous but it is what separates toys from products that survive their first week of real traffic.


Try this today: go to console.anthropic.com → Settings → Limits and note your current RPM and TPM ceilings. Then estimate your app's peak concurrent users × average tokens per request. If the math puts you within 2× of the limit under normal load, you are too close and need either a tier upgrade or tighter per-request budgets before you hit traffic.
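The exercise's arithmetic, as a scratch calculation. Every number is a placeholder, and it assumes each concurrent user sends roughly one request per minute:

```typescript
const TPM_LIMIT = 80_000             // your tier's ceiling, from the console
const PEAK_CONCURRENT_USERS = 40     // your own peak estimate
const AVG_TOKENS_PER_REQUEST = 3_500 // prompt + history + output budget

// Estimated tokens per minute at peak, assuming ~1 request per user per minute.
const peakTokensPerMinute = PEAK_CONCURRENT_USERS * AVG_TOKENS_PER_REQUEST

// The 2× rule from above: if doubling normal load blows the ceiling, you are
// too close and need a tier upgrade or tighter budgets.
const tooClose = peakTokensPerMinute * 2 > TPM_LIMIT
console.log(peakTokensPerMinute, tooClose) // → 140000 true
```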
