Write system prompts that produce consistent, production-quality behavior. Apply chain-of-thought prompting to reasoning tasks. Structure prompts for JSON output and validate results. Build an evaluation harness for testing prompt quality.
1. The System Prompt Is Your API Contract
The most important decision in a Claude-powered application comes before you write a single line of code: the system prompt. It establishes the model's role, constraints, output format, and behavior for the entire interaction. Treat it like a function signature: be explicit about inputs, outputs, and constraints.
WEAK (inconsistent behavior):
"You are a helpful customer service bot."
STRONG (consistent, predictable behavior):
"""You are a customer service agent for Acme Corp's online store.
Your role:
- Answer questions about orders, shipping, and returns
- Provide accurate information from the knowledge base you're given
- Escalate to human support for billing disputes and technical issues
Constraints:
- Never speculate about information not provided in the context
- If you don't know something, say "I don't have that information"
  and provide the contact link: [email protected]
- Respond in under 100 words unless a longer answer is clearly needed
Output format: Plain text, friendly and professional tone."""
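One way to keep a prompt like the STRONG example maintainable is to assemble it from named parts, much as a function signature names its parameters. The sketch below is illustrative only: `build_system_prompt` and its parameter names are hypothetical helpers, not part of any SDK, and the section labels simply mirror the example above.

```python
def build_system_prompt(identity, responsibilities, constraints, output_format):
    """Assemble a system prompt from explicit, named sections.

    Hypothetical helper for illustration; not part of any library.
    """
    lines = [identity, "", "Your role:"]
    lines += [f"- {item}" for item in responsibilities]
    lines += ["", "Constraints:"]
    lines += [f"- {item}" for item in constraints]
    lines += ["", f"Output format: {output_format}"]
    return "\n".join(lines)


system_prompt = build_system_prompt(
    identity="You are a customer service agent for Acme Corp's online store.",
    responsibilities=[
        "Answer questions about orders, shipping, and returns",
        "Escalate to human support for billing disputes and technical issues",
    ],
    constraints=[
        "Never speculate about information not provided in the context",
        "Respond in under 100 words unless a longer answer is clearly needed",
    ],
    output_format="Plain text, friendly and professional tone.",
)
print(system_prompt)
```

Keeping each section in its own argument makes it easier to change one constraint at a time and see which change affected behavior.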
2. Chain-of-Thought for Reasoning Tasks
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The same question is sent in both calls so the outputs are comparable
question = (
    "A train leaves Chicago at 2pm going 60mph. "
    "Another train leaves Detroit (280 miles away) at 3pm "
    "going 80mph toward Chicago. When do they meet?"
)

# WITHOUT chain-of-thought: often wrong on multi-step problems
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=100,
    messages=[{"role": "user", "content": question}],
)
print(response.content[0].text)
# Might just guess: "Around 5:20pm"

# WITH chain-of-thought: shows its work, usually more accurate
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=400,
    system="Think through problems step by step before giving your final answer.",
    messages=[{"role": "user", "content": question}],
)
print(response.content[0].text)
# Shows: "Train 1 travels for t hours. Train 2 travels for (t-1) hours..."
# Then solves correctly.
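To judge whether a chain-of-thought response is actually right, it helps to have the reference answer on hand. The following is a small worked check of the train problem using only the standard library; the variable names are illustrative, and the calendar date is arbitrary because only the time of day matters.

```python
from datetime import datetime, timedelta
from fractions import Fraction

distance_miles = 280        # Chicago to Detroit, from the problem statement
speed_a, speed_b = 60, 80   # mph: train from Chicago, train from Detroit
head_start_hours = 1        # the Chicago train leaves at 2pm, the other at 3pm

# Let t be hours since 2pm. The trains meet when their distances sum to 280:
#   60*t + 80*(t - 1) = 280  =>  t = (280 + 80*1) / (60 + 80)
t = Fraction(distance_miles + speed_b * head_start_hours, speed_a + speed_b)

departure = datetime(2024, 1, 1, 14, 0)  # arbitrary date, 2pm departure
meeting = departure + timedelta(hours=float(t))

print(f"t = {t} hours after 2pm")              # t = 18/7 hours after 2pm
print(f"They meet at about {meeting:%H:%M}")   # They meet at about 16:34
```

The trains meet at roughly 4:34pm, which gives a concrete value to compare against either response.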
3. Structured JSON Output
import json
from typing import Literal

from pydantic import BaseModel, Field


class ProductReview(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    rating: int = Field(ge=1, le=5)  # 1-5
    key_issues: list[str]
    would_recommend: bool


def analyze_review(review_text: str) -> ProductReview:
    # `client` is the anthropic.Anthropic() instance created in section 2
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Use cheaper model for structured extraction
        max_tokens=300,
        system="""Analyze product reviews and extract structured information.
Respond with valid JSON matching this schema:
{
  "sentiment": "positive" | "negative" | "neutral",
  "rating": 1-5 integer,
  "key_issues": ["issue1", "issue2"],
  "would_recommend": true | false
}
Nothing before or after the JSON. Just the JSON object.""",
        messages=[{"role": "user", "content": review_text}],
    )
    # Parse the JSON, then let pydantic validate field types and allowed values
    data = json.loads(response.content[0].text)
    return ProductReview(**data)


# Test (model output varies; the values below are typical, not guaranteed)
review = "Great build quality but the battery only lasts 4 hours. Disappointed."
result = analyze_review(review)
print(result.sentiment)   # e.g. "negative"
print(result.rating)      # e.g. 2
print(result.key_issues)  # e.g. ["short battery life"]
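Even with the "Nothing before or after the JSON" instruction, a model may occasionally wrap its answer in a Markdown code fence or add a sentence around it, and `json.loads` will then raise. A defensive sketch, using only the standard library, is shown below; `extract_json_object` is a hypothetical helper name, and the sample strings are simulated outputs rather than real API responses.

```python
import json


def extract_json_object(text: str) -> dict:
    """Parse the outermost {...} span in text, ignoring fences or stray prose."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])


fence = "`" * 3  # built programmatically so this example nests cleanly
# Simulated model outputs; no API call is made here.
fenced = f'{fence}json\n{{"sentiment": "negative", "rating": 2}}\n{fence}'
chatty = 'Here is the analysis: {"sentiment": "positive", "rating": 5} Hope that helps!'

print(extract_json_object(fenced))  # {'sentiment': 'negative', 'rating': 2}
print(extract_json_object(chatty))  # {'sentiment': 'positive', 'rating': 5}

try:
    extract_json_object("I could not analyze this review.")
except ValueError as err:
    print(f"rejected: {err}")  # rejected: no JSON object found in model output
```

Inside `analyze_review`, this helper could stand in for the direct `json.loads(...)` call. Because `json.JSONDecodeError` is a subclass of `ValueError`, a caller can catch both failure modes with a single `except ValueError`.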
4. Building an Eval Harness
Every production prompt needs systematic evaluation. Here's a minimal eval harness:
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalCase:
    input: str
    expected_sentiment: str
    expected_would_recommend: Optional[bool]  # None means "don't check this field"


eval_cases = [
    EvalCase("Amazing product! Best purchase this year.", "positive", True),
    EvalCase("Arrived broken. Won't buy again.", "negative", False),
    EvalCase("It's okay. Does what it says.", "neutral", None),
]


def run_eval(eval_cases):
    correct = 0
    for case in eval_cases:
        result = analyze_review(case.input)
        sentiment_correct = result.sentiment == case.expected_sentiment
        # A case with expected_would_recommend=None passes this check automatically
        rec_correct = (case.expected_would_recommend is None or
                       result.would_recommend == case.expected_would_recommend)
        if sentiment_correct and rec_correct:
            correct += 1
        else:
            print(f"FAIL: {case.input[:50]}...")
            print(f"  Expected: {case.expected_sentiment}, got: {result.sentiment}")
            print(f"  Expected would_recommend: {case.expected_would_recommend}, "
                  f"got: {result.would_recommend}")
    print(f"\nAccuracy: {correct}/{len(eval_cases)} = {correct/len(eval_cases):.0%}")


run_eval(eval_cases)
Add 10 more eval cases to the harness, including edge cases (mixed positive/negative reviews, non-English reviews, very short reviews like "ok"). Measure accuracy. Then modify the system prompt to improve accuracy on the failing cases. Does improving one case type hurt another? This is prompt engineering's version of overfitting.
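To see whether a prompt change helps one case type while hurting another, it can help to report accuracy per category rather than a single overall number. The sketch below runs offline: `classify_v1` and `classify_v2` are hypothetical keyword-based stand-ins for `analyze_review` under two different system prompts, and the tags and expected labels are invented purely for illustration.

```python
from collections import defaultdict

# (review text, category tag, expected sentiment); all values are illustrative.
cases = [
    ("Amazing product! Best purchase this year.", "clear", "positive"),
    ("Arrived broken. Won't buy again.", "clear", "negative"),
    ("Great screen, terrible battery.", "mixed", "negative"),
    ("ok", "short", "neutral"),
]


def classify_v1(text: str) -> str:
    """Stand-in for prompt version 1 (keyword rules, not a real model)."""
    lowered = text.lower()
    if "great" in lowered or "amazing" in lowered:
        return "positive"
    if "broken" in lowered or "terrible" in lowered:
        return "negative"
    return "neutral"


def classify_v2(text: str) -> str:
    """Stand-in for prompt version 2: fixes mixed reviews, breaks short ones."""
    lowered = text.lower()
    if "broken" in lowered or "terrible" in lowered:
        return "negative"
    if len(lowered.split()) < 2:
        return "positive"  # deliberately wrong, to simulate a regression
    if "great" in lowered or "amazing" in lowered:
        return "positive"
    return "neutral"


def accuracy_by_tag(classify, cases):
    """Return {tag: (correct, total)} for one classifier."""
    tally = defaultdict(lambda: [0, 0])
    for text, tag, expected in cases:
        tally[tag][0] += classify(text) == expected
        tally[tag][1] += 1
    return {tag: tuple(counts) for tag, counts in tally.items()}


before = accuracy_by_tag(classify_v1, cases)
after = accuracy_by_tag(classify_v2, cases)
for tag in before:
    print(f"{tag:6} v1 {before[tag][0]}/{before[tag][1]}  v2 {after[tag][0]}/{after[tag][1]}")
```

In this toy run both versions score 3 out of 4 overall, yet version 2 fixes the "mixed" case and breaks the "short" one. A single accuracy figure would hide that trade-off.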