Deploying a Claude application: from localhost to production
In brief
Environment variables, rate limits, error handling, costs, and the things that bite you on your first production deploy. A practical checklist.
Getting Claude working locally is one thing. Shipping it to real users is another. The gap is not about code — it is about secrets management, rate limits, cost controls, error handling, and observability. Here is what to sort out before you deploy.
Secrets and environment variables
Never hardcode your API key. This seems obvious, but it is the most common mistake in Claude apps shipped by first-time builders.
The right pattern:
import os
import anthropic
# Load from environment — never hardcode
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
For deployment platforms:
- Vercel: Settings → Environment Variables → add ANTHROPIC_API_KEY
- Railway: Variables tab in your service settings
- Fly.io: fly secrets set ANTHROPIC_API_KEY=sk-...
- AWS/GCP/Azure: use their secrets manager services, not env vars directly, for production
Never commit .env files. Add .env, .env.local, .env.production to .gitignore before your first commit, not after.
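A cheap safeguard on top of this: fail fast at startup when the key is missing, instead of failing on the first API call. A minimal sketch; the require_env helper is our own, not part of the SDK:

```python
import os
import sys

def require_env(name: str) -> str:
    """Return a required environment variable, or exit with a clear error."""
    value = os.environ.get(name)
    if not value:
        sys.exit(f"Missing required environment variable: {name}")
    return value

# At startup:
# api_key = require_env("ANTHROPIC_API_KEY")
```

Exiting at import or startup time turns a misconfigured deploy into an immediate, obvious failure rather than a stream of 401s at runtime.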
Rate limits and what happens when you hit them
The Anthropic API has rate limits: requests per minute (RPM), tokens per minute (TPM), and tokens per day (TPD). Your tier determines the limits. When you exceed them, you get a 429 RateLimitError.
Retry with exponential backoff:
import anthropic
import time
def call_with_retry(client, max_retries=5, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt  # 1, 2, 4, 8 seconds (the last attempt raises)
            print(f"Rate limited. Waiting {wait}s...")
            time.sleep(wait)
        except anthropic.APIStatusError as e:
            if e.status_code >= 500:
                # Server error — retry
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)
            else:
                raise  # 4xx errors: don't retry
For TypeScript:
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic();
async function callWithRetry(params: Anthropic.MessageCreateParamsNonStreaming, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.messages.create(params);
    } catch (err) {
      if (err instanceof Anthropic.RateLimitError) {
        if (attempt === maxRetries - 1) throw err;
        const wait = Math.pow(2, attempt) * 1000;
        await new Promise(r => setTimeout(r, wait));
        continue;
      }
      if (err instanceof Anthropic.APIError && (err.status ?? 0) >= 500) {
        if (attempt === maxRetries - 1) throw err;
        await new Promise(r => setTimeout(r, Math.pow(2, attempt) * 1000));
        continue;
      }
      throw err; // don't retry 4xx
    }
  }
}
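Both helpers use a plain exponential schedule, which means many simultaneously rate-limited clients all retry at the same moments. Adding random jitter to each delay spreads those retries out. A sketch of a full-jitter delay helper (our own function, not part of either SDK):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full jitter: pick a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

In the Python version above, you would replace the fixed `2 ** attempt` sleeps with `time.sleep(backoff_delay(attempt))`; the cap keeps late retries from waiting minutes.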
Cost controls
Without controls, a single runaway request or an attacker hammering your endpoint can generate a large unexpected bill.
Controls to put in place before launch:
1. Set max_tokens tightly. Default to the smallest value that covers your actual outputs. If your app generates summaries under 300 words, use max_tokens=512, not max_tokens=4096.
2. Rate-limit your own users. Implement per-user request throttling before Claude calls. Use Redis or an in-memory counter:
import redis

r = redis.Redis()

def check_rate_limit(user_id: str, limit: int = 20, window_seconds: int = 60) -> bool:
    key = f"ratelimit:{user_id}"
    count = r.incr(key)
    if count == 1:
        # Start the window only on the first request; setting the expiry on
        # every call would keep extending the window and never reset the count.
        r.expire(key, window_seconds)
    return count <= limit
3. Set Anthropic spend limits. In the Anthropic console, set a monthly spend limit. This is a hard stop — requests fail once you hit it, but you won't get a surprise bill.
4. Log token usage per request. Capture response.usage.input_tokens and response.usage.output_tokens and store them. You need this data to understand costs by user, by route, and over time.
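Points 1 through 4 reinforce each other: the usage numbers you log in step 4 feed directly into per-user cost reporting. A minimal in-memory sketch; the rates below are placeholders, not Anthropic's actual pricing, so substitute the current published per-model numbers:

```python
from collections import defaultdict

# Placeholder rates in USD per million tokens; check current published pricing.
RATES = {"input": 3.00, "output": 15.00}

usage_by_user = defaultdict(lambda: {"input": 0, "output": 0})

def record_usage(user_id: str, input_tokens: int, output_tokens: int) -> None:
    """Accumulate per-user token counts from response.usage."""
    usage_by_user[user_id]["input"] += input_tokens
    usage_by_user[user_id]["output"] += output_tokens

def estimated_cost_usd(user_id: str) -> float:
    u = usage_by_user[user_id]
    return (u["input"] * RATES["input"] + u["output"] * RATES["output"]) / 1_000_000
```

In production you would persist these counters to a database rather than process memory, so they survive restarts and can be queried by route and time window.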
Observability: what to log
Log enough to debug problems without logging sensitive user data:
import logging
import time

import anthropic

logger = logging.getLogger(__name__)
client = anthropic.Anthropic()

def logged_claude_call(user_id: str, route: str, **kwargs):
    start = time.time()
    try:
        response = client.messages.create(**kwargs)
        duration_ms = int((time.time() - start) * 1000)
        logger.info({
            "event": "claude_call_success",
            "user_id": user_id,
            "route": route,
            "model": kwargs.get("model"),
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "duration_ms": duration_ms,
        })
        return response
    except Exception as e:
        logger.error({
            "event": "claude_call_error",
            "user_id": user_id,
            "route": route,
            "error_type": type(e).__name__,
            "error_message": str(e),
        })
        raise
Do not log user message content unless you have a clear business reason and appropriate user consent. Log metadata instead.
The pre-launch checklist
Before your first real users:
- API key in environment variable, not code
- API key never committed to git (check your history)
- Retry logic with exponential backoff in place
- Per-user rate limiting implemented
- max_tokens set to realistic values
- Spend limit set in Anthropic console
- Token usage logged per request
- Error responses to users are friendly, not raw API errors
- Tested what happens when the API is down (graceful degradation)
- Tested with your actual production environment variables
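For the graceful-degradation item, one pattern is a wrapper that returns a friendly fallback message when a transient error surfaces. A sketch under our own names (answer_or_fallback and FALLBACK are illustrative, not SDK APIs); with the Anthropic SDK you would pass its exception types, e.g. anthropic.APIConnectionError, as the transient tuple:

```python
FALLBACK = "The assistant is temporarily unavailable. Please try again in a moment."

def answer_or_fallback(call, fallback: str = FALLBACK,
                       transient: tuple = (ConnectionError, TimeoutError)) -> str:
    """Run a zero-argument callable; return its result, or a fallback on transient failure."""
    try:
        return call()
    except transient:
        # Degrade gracefully instead of surfacing a raw API error to the user.
        return fallback

# Illustrative usage with the SDK:
# answer_or_fallback(lambda: client.messages.create(...).content[0].text,
#                    transient=(anthropic.APIConnectionError,))
```

This keeps the "friendly errors" and "API is down" checklist items testable: point the transient tuple at whatever exceptions your stack raises and assert the fallback comes back.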
The Claude API is reliable, but building on any external API means planning for the moments it is not.
Further reading
- Secure deployment — security practices for production deployments