Your first Claude API call: what you actually need to know
In brief
The official quickstart gets you to 'Hello, world.' This gets you to understanding why Claude gave you a worse answer than the web app — and exactly how to fix it.
This is not a rehash of the docs. The docs are fine. This is the stuff that trips people up in the first week — the mental models that make the API click, the errors you will hit, and the patterns worth getting right from the start.
Prerequisites: You have an Anthropic API key. You know how to make HTTP requests. Everything else is covered here.
The request structure
Every Claude API request is a POST to https://api.anthropic.com/v1/messages. The payload has three required fields: model, max_tokens, and messages.
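Before reaching for an SDK, it helps to see what that POST actually contains. A sketch of the raw request, assembled by hand with the standard library (the key value is a placeholder — load the real one from your environment):

```python
import json

# Headers the HTTP API expects: authentication goes in x-api-key,
# and an anthropic-version header is required (2023-06-01 is the
# stable value at the time of writing).
headers = {
    "x-api-key": "sk-ant-...",  # placeholder — use ANTHROPIC_API_KEY from env
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
}

# The three required fields, and nothing else.
payload = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "messages": [
        {"role": "user", "content": "Explain what a transformer is in two sentences."}
    ],
}
body = json.dumps(payload)
```

The SDKs below build exactly this request for you, plus retries and typed responses.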
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain what a transformer is in two sentences."}
    ]
)
print(message.content[0].text)
```
```typescript
import Anthropic from '@anthropic-ai/sdk'

const client = new Anthropic() // reads ANTHROPIC_API_KEY from env

const message = await client.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  messages: [
    { role: 'user', content: 'Explain what a transformer is in two sentences.' }
  ]
})
console.log(message.content[0].text)
```
The messages array
The messages array is a conversation history. Each entry is a turn: user or assistant. You build up the array yourself — the API is stateless and has no memory between requests.
```python
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
    {"role": "user", "content": "And the population?"},
]
```
The array must always start with a user turn and alternate user/assistant. The API will error if you send two consecutive turns from the same role. If you are maintaining a chat history in your app, you are responsible for ensuring this alternation is correct.
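If you manage chat history yourself, a quick validation pass before each request catches ordering bugs early. A minimal sketch — the helper name is ours, not part of the SDK:

```python
def validate_history(messages: list[dict]) -> None:
    """Raise ValueError if the history violates the API's ordering rules."""
    if not messages:
        raise ValueError("messages must not be empty")
    if messages[0]["role"] != "user":
        raise ValueError("conversation must start with a user turn")
    # Walk adjacent pairs and reject consecutive same-role turns.
    for prev, curr in zip(messages, messages[1:]):
        if prev["role"] == curr["role"]:
            raise ValueError(f"two consecutive {curr['role']} turns")
```

Call it right before `client.messages.create(...)` and you get a clear local error instead of a 400 from the API.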
The system prompt
The system prompt sets context and instructions that apply to the whole conversation. It goes in a separate top-level system parameter — not in the messages array.
```python
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="You are a technical writer. Be precise and concise. Use code examples when relevant.",
    messages=[
        {"role": "user", "content": "Explain embeddings."}
    ]
)
```
The system prompt is not charged differently from message content — it consumes input tokens just like everything else.
Streaming
For anything user-facing, stream. Waiting for the full response before rendering is a bad user experience, and for long outputs it is a long wait.
```python
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a haiku about APIs."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```
```typescript
const stream = await client.messages.stream({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024,
  messages: [{ role: 'user', content: 'Write a haiku about APIs.' }]
})

for await (const chunk of stream) {
  if (chunk.type === 'content_block_delta' && chunk.delta.type === 'text_delta') {
    process.stdout.write(chunk.delta.text)
  }
}
```
Tokens and max_tokens
max_tokens caps the output, not the total context. If your input is 2000 tokens and max_tokens is 1024, the model will stop generating at 1024 output tokens — but your total context can be much larger.
Each model has a context window limit (the combined input + output). Claude Sonnet 4.6 has a 200K-token context window. If your input exceeds the limit, the request fails with an invalid request error. If the model reaches max_tokens before finishing its response, the stop_reason in the response will be max_tokens rather than end_turn — worth checking if truncated responses are a problem in your app.
A rough guide to token counts: 1 token ≈ 4 characters in English. 1000 tokens ≈ 750 words.
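That rule of thumb is easy to encode for quick budget checks before sending a request. This is an estimate only — exact counts come back in the response's usage field — and both helper names are ours:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text: ~1 token per 4 characters."""
    return max(1, len(text) // 4)

def fits_context(prompt: str, max_tokens: int, context_window: int = 200_000) -> bool:
    """Check whether prompt + requested output plausibly fits the window."""
    return estimate_tokens(prompt) + max_tokens <= context_window
```

Useful as a cheap guard in front of the API call; for exact counts, rely on the usage data the API returns.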
Errors you will hit
401 Unauthorized — your API key is wrong or missing. Check ANTHROPIC_API_KEY in your environment. Do not hardcode the key.
429 Too Many Requests — you have hit a rate limit. The response includes a retry-after header. Implement exponential backoff: wait, retry, wait longer, retry. The SDK has built-in retry logic you can configure:

```python
client = anthropic.Anthropic(max_retries=3)
```
529 Overloaded — Anthropic's servers are under load. Treat like a 429 — wait and retry.
InvalidRequestError — you sent a malformed request. Common causes: messages array starts with assistant, consecutive same-role messages, max_tokens set to 0, or a model name that does not exist. Read the error message — it is usually specific.
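The wait-and-retry advice for 429s and 529s can be sketched as a small wrapper. This is a generic sketch, not the SDK's internal implementation — the helper and the stand-in exception are ours:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for the SDK's rate-limit / overloaded errors."""

def with_backoff(call, max_attempts=4, base_delay=1.0):
    """Retry `call` on transient errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts — surface the error
            # Double the delay each attempt, with jitter to avoid
            # synchronized retries across clients.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```

In practice, prefer the SDK's built-in `max_retries` and reserve a wrapper like this for raw HTTP clients or custom retry policies.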
Structured output
Claude does not have a native structured output mode, but it follows instructions reliably. Ask for JSON explicitly and tell it the shape:
```python
import json

message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system="Always respond with valid JSON. No explanation, no markdown — raw JSON only.",
    messages=[{
        "role": "user",
        "content": 'Extract: {"name": string, "email": string, "company": string} '
                   'from: "Hi, I\'m Alex Chen from Vercel, alex@vercel.com"'
    }]
)
data = json.loads(message.content[0].text)
```
For production use, wrap the parse in a try/except and consider asking Claude to double-check its output before responding. Alternatively, use a library like Instructor that wraps the API and handles retries on parse failure.
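Even with a strict system prompt, models occasionally wrap JSON in markdown fences or add a stray sentence. A defensive parse helper illustrating the try/except advice above (the function is ours, not from any library):

```python
import json

def parse_model_json(text: str):
    """Parse JSON from a model reply, tolerating ```json fences and stray prose."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence (with optional language tag) and the closing fence.
        cleaned = cleaned.split("\n", 1)[1]
        cleaned = cleaned.rsplit("```", 1)[0]
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span, if any.
        start, end = cleaned.find("{"), cleaned.rfind("}")
        if start != -1 and end > start:
            return json.loads(cleaned[start:end + 1])
        raise
```

If parsing still fails, re-prompt with the parse error included — or let a library like Instructor handle that loop for you.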
Cost awareness from day one
Track tokens in every response:
```python
print(message.usage.input_tokens, message.usage.output_tokens)
```
Input and output tokens are priced separately — output is more expensive. Long system prompts that repeat across every request add up fast. If you are sending the same multi-paragraph system prompt with every call, look at prompt caching early. It is a single parameter change and can cut costs significantly for stateless workloads with repeated context.
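Turning those usage numbers into dollars is a one-liner worth having from day one. The prices below are placeholders — rates change and vary by model, so pass the current values from Anthropic's pricing page rather than hardcoding them:

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_per_mtok: float, output_per_mtok: float) -> float:
    """Cost of one request given per-million-token prices."""
    return (input_tokens * input_per_mtok
            + output_tokens * output_per_mtok) / 1_000_000

# Example with placeholder prices ($3/MTok in, $15/MTok out — check the
# pricing page for real, current numbers):
cost = request_cost_usd(2_000, 800, input_per_mtok=3.0, output_per_mtok=15.0)
```

Feed it `message.usage.input_tokens` and `message.usage.output_tokens` and log the result alongside each request.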
What to set up before you build anything larger
Before your API usage grows:
- Separate API keys per environment — development, staging, production. Revoke them independently if compromised.
- Log input/output tokens per request — you want this data when usage spikes unexpectedly.
- Set a spending limit in the Anthropic console — it stops runaway costs when a bug hammers the API during testing.
- Handle errors with retries — any network-facing code needs backoff logic. The SDK does this if you configure it.
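The token-logging item can start as simple as a wrapper that records usage from every response — a sketch, with the wrapper and log structure entirely ours:

```python
import time

usage_log = []  # in production, write to your metrics/logging system instead

def log_usage(message, label=""):
    """Record per-request token counts for later cost analysis."""
    usage_log.append({
        "ts": time.time(),
        "label": label,
        "input_tokens": message.usage.input_tokens,
        "output_tokens": message.usage.output_tokens,
    })
    return message  # pass-through, so it wraps any call site cleanly
```

Wrap every call as `log_usage(client.messages.create(...), label="summarize")` and you have the data ready when usage spikes.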
The rest — batching, caching, routing between models — you can add when you have a reason to. Get the basics right first.
Further reading
- Introducing web search on the Anthropic API — how web search works in the API
- Get started with Claude — the quickstart guide for your first API call