Deploying a Claude application: from localhost to production
In brief
Environment variables, rate limits, error handling, costs, and the things that bite you on your first production deploy. A practical checklist.
Getting Claude working locally is one thing. Shipping it to real users is another. The gap is not about code — it is about secrets management, rate limits, cost controls, error handling, and observability. Here is what to sort out before you deploy.
Secrets and environment variables
Never hardcode your API key. This seems obvious, but it is the most common mistake in Claude apps shipped by first-time builders.
The right pattern:
import os
import anthropic
# Load from environment — never hardcode
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
For deployment platforms:
- Vercel: Settings → Environment Variables → add ANTHROPIC_API_KEY
- Railway: Variables tab in your service settings
- Fly.io: fly secrets set ANTHROPIC_API_KEY=sk-...
- AWS/GCP/Azure: use their secrets manager services, not env vars directly, for production
Never commit .env files. Add .env, .env.local, .env.production to .gitignore before your first commit, not after.
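A cheap safeguard on top of this: fail fast at startup when the key is missing, instead of failing on the first API call. A minimal sketch; the require_env helper is our own, not part of the SDK:

```python
import os
import sys

def require_env(name: str) -> str:
    """Return a required environment variable, or exit with a clear error."""
    value = os.environ.get(name)
    if not value:
        sys.exit(f"Missing required environment variable: {name}")
    return value

# At startup:
# api_key = require_env("ANTHROPIC_API_KEY")
```

Exiting at import or startup time turns a misconfigured deploy into an immediate, obvious failure rather than a stream of 401s at runtime.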
Rate limits and what happens when you hit them
The Anthropic API has rate limits: requests per minute (RPM), tokens per minute (TPM), and tokens per day (TPD). Your tier determines the limits. When you exceed them, you get a 429 RateLimitError.
Retry with exponential backoff:
import anthropic
import time
def call_with_retry(client, max_retries=5, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait = 2 ** attempt  # 1, 2, 4, 8 seconds (the last attempt raises)
            print(f"Rate limited. Waiting {wait}s...")
            time.sleep(wait)
        except anthropic.APIStatusError as e:
            if e.status_code >= 500:
                # Server error — retry
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)
            else:
                raise  # 4xx errors: don't retry
For TypeScript:
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic();
async function callWithRetry(params: Anthropic.MessageCreateParamsNonStreaming, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.messages.create(params);
    } catch (err) {
      if (err instanceof Anthropic.RateLimitError) {
        if (attempt === maxRetries - 1) throw err;
        const wait = Math.pow(2, attempt) * 1000;
        await new Promise(r => setTimeout(r, wait));
        continue;
      }
      if (err instanceof Anthropic.APIError && (err.status ?? 0) >= 500) {
        if (attempt === maxRetries - 1) throw err;
        await new Promise(r => setTimeout(r, Math.pow(2, attempt) * 1000));
        continue;
      }
      throw err; // don't retry 4xx
    }
  }
}
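Both helpers use a plain exponential schedule, which means many simultaneously rate-limited clients all retry at the same moments. Adding random jitter to each delay spreads those retries out. A sketch of a full-jitter delay helper (our own function, not part of either SDK):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full jitter: pick a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

In the Python version above, you would replace the fixed `2 ** attempt` sleeps with `time.sleep(backoff_delay(attempt))`; the cap keeps late retries from waiting minutes.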
Cost controls
Without controls, a single runaway request or an attacker hammering your endpoint can generate a large unexpected bill.
Controls to put in place before launch:
1. Set max_tokens tightly. Default to the smallest value that covers your actual outputs. If your app generates summaries under 300 words, use max_tokens=512, not max_tokens=4096.
2. Rate-limit your own users. Implement per-user request throttling before Claude calls. Use Redis or an in-memory counter:
import redis

r = redis.Redis()

def check_rate_limit(user_id: str, limit: int = 20, window_seconds: int = 60) -> bool:
    key = f"ratelimit:{user_id}"
    count = r.incr(key)
    if count == 1:
        # Start the window only on the first request; setting the expiry on
        # every call would keep extending the window and never reset the count.
        r.expire(key, window_seconds)
    return count <= limit
3. Set Anthropic spend limits. In the Anthropic console, set a monthly spend limit. This is a hard stop — requests fail once you hit it, but you won't get a surprise bill.
4. Log token usage per request. Capture response.usage.input_tokens and response.usage.output_tokens and store them. You need this data to understand costs by user, by route, and over time.
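Points 1 through 4 reinforce each other: the usage numbers you log in step 4 feed directly into per-user cost reporting. A minimal in-memory sketch; the rates below are placeholders, not Anthropic's actual pricing, so substitute the current published per-model numbers:

```python
from collections import defaultdict

# Placeholder rates in USD per million tokens; check current published pricing.
RATES = {"input": 3.00, "output": 15.00}

usage_by_user = defaultdict(lambda: {"input": 0, "output": 0})

def record_usage(user_id: str, input_tokens: int, output_tokens: int) -> None:
    """Accumulate per-user token counts from response.usage."""
    usage_by_user[user_id]["input"] += input_tokens
    usage_by_user[user_id]["output"] += output_tokens

def estimated_cost_usd(user_id: str) -> float:
    u = usage_by_user[user_id]
    return (u["input"] * RATES["input"] + u["output"] * RATES["output"]) / 1_000_000
```

In production you would persist these counters to a database rather than process memory, so they survive restarts and can be queried by route and time window.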
Observability: what to log
Log enough to debug problems without logging sensitive user data:
import logging
import time

import anthropic

logger = logging.getLogger(__name__)
client = anthropic.Anthropic()

def logged_claude_call(user_id: str, route: str, **kwargs):
    start = time.time()
    try:
        response = client.messages.create(**kwargs)
        duration_ms = int((time.time() - start) * 1000)
        logger.info({
            "event": "claude_call_success",
            "user_id": user_id,
            "route": route,
            "model": kwargs.get("model"),
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "duration_ms": duration_ms,
        })
        return response
    except Exception as e:
        logger.error({
            "event": "claude_call_error",
            "user_id": user_id,
            "route": route,
            "error_type": type(e).__name__,
            "error_message": str(e),
        })
        raise
Do not log user message content unless you have a clear business reason and appropriate user consent. Log metadata instead.
The pre-launch checklist
Before your first real users:
- API key in environment variable, not code
- API key never committed to git (check your history)
- Retry logic with exponential backoff in place
- Per-user rate limiting implemented
- max_tokens set to realistic values
- Spend limit set in Anthropic console
- Token usage logged per request
- Error responses to users are friendly, not raw API errors
- Tested what happens when the API is down (graceful degradation)
- Tested with your actual production environment variables
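For the graceful-degradation item, one pattern is a wrapper that returns a friendly fallback message when a transient error surfaces. A sketch under our own names (answer_or_fallback and FALLBACK are illustrative, not SDK APIs); with the Anthropic SDK you would pass its exception types, e.g. anthropic.APIConnectionError, as the transient tuple:

```python
FALLBACK = "The assistant is temporarily unavailable. Please try again in a moment."

def answer_or_fallback(call, fallback: str = FALLBACK,
                       transient: tuple = (ConnectionError, TimeoutError)) -> str:
    """Run a zero-argument callable; return its result, or a fallback on transient failure."""
    try:
        return call()
    except transient:
        # Degrade gracefully instead of surfacing a raw API error to the user.
        return fallback

# Illustrative usage with the SDK:
# answer_or_fallback(lambda: client.messages.create(...).content[0].text,
#                    transient=(anthropic.APIConnectionError,))
```

This keeps the "friendly errors" and "API is down" checklist items testable: point the transient tuple at whatever exceptions your stack raises and assert the fallback comes back.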
The Claude API is reliable, but building on any external API means planning for the moments it is not.
Further reading
- Secure deployment — security practices for production deployments