Prompt Engineering Is Software Engineering: Patterns That Work

Why "Just Ask It" Is Not a Strategy

When LLMs first became accessible to developers, many assumed the interface was simple: write a question, get an answer. This worked for demos but fails in production. Building reliable LLM-powered applications requires systematic prompt engineering — the art and science of writing instructions that consistently elicit the behavior you want.

Prompt engineering is software engineering. It has idioms, patterns, anti-patterns, and testing methodologies. The gap between an amateur's prompt and a professional's prompt can be the difference between a system that works 60% of the time and one that works 95% of the time.

The System Prompt Is Your API Contract

Every production LLM application has a system prompt — the persistent instruction set that frames every user interaction. Think of it as the constructor for your LLM object. A good system prompt:

Defines the model's role and persona precisely: "You are a customer service agent for Acme Corp. You help customers with billing, shipping, and returns. You do not provide technical support."
Specifies the output format: "Always respond in JSON with keys: {answer: string, confidence: 'high'|'medium'|'low', sources: string[]}"
Establishes constraints: "Do not speculate about matters not in the provided context. If you don't know, say so."
Provides relevant context: background information, terminology, tone guidelines

A weak system prompt produces inconsistent behavior. A strong one creates a reliable interface.

Chain-of-Thought: Show Your Work

Chain-of-thought (CoT) prompting instructs the model to reason step by step before producing a final answer. The two canonical approaches:

Zero-shot CoT: "Think step by step." — simply appending this instruction to a prompt dramatically improves performance on reasoning tasks.
Few-shot CoT: Provide examples of step-by-step reasoning before the actual question. The model learns the desired reasoning style from examples.

CoT works because it forces the model to allocate computation to intermediate reasoning steps rather than jumping directly to an output. The scratchpad improves accuracy on arithmetic, logic, and multi-step inference tasks. The tradeoff: CoT uses more tokens (and therefore costs more).

Structured Output: JSON Mode and Schema Validation

If your application needs to parse the model's output programmatically, you need structured output. The naive approach (asking for JSON and hoping) fails too often. The robust approaches:

JSON mode: Most major APIs (OpenAI, Anthropic, Google) support forcing JSON output. Enable it.
Schema in the prompt: Include an explicit JSON schema in your prompt. "Respond with JSON matching this schema: {name: string, score: number, reasons: string[]}"
Pydantic/structured extraction: Libraries like instructor and outlines constrain model output to match a Pydantic schema, using constrained decoding or retry loops.
Validation + retry: Validate parsed output against your schema; if validation fails, retry with the error message in the prompt.

Retrieval Augmentation: Giving the Model Facts

LLMs have training data cutoffs and no access to your proprietary information. RAG (Retrieval-Augmented Generation) solves this by retrieving relevant documents and including them in the prompt context. The pattern:

system: You are a helpful assistant. Use the provided documents to answer questions.
user: [retrieved documents]
---
Question: {user_question}

Key prompt engineering considerations for RAG: tell the model explicitly to only use the provided context; instruct it to say "I don't know" when the context doesn't answer the question; include citation instructions if attribution is important.

Few-Shot Examples: The Most Underused Technique

Providing 3-5 examples of the input-output mapping you want is often the single most effective prompt improvement. Few-shot examples teach the model your format, style, and edge case handling in a way that instructions alone often can't. Guidelines: use diverse examples that cover different cases; use your actual data (not synthetic examples); include at least one "hard" example where the correct behavior is non-obvious.

Anti-Patterns to Avoid

Vague instructions: "Be professional" is meaningless. "Use formal language, avoid contractions, respond in 2-3 sentences" is actionable.
Overloaded prompts: Asking the model to do 12 things simultaneously reduces performance on each. Break complex tasks into steps.
No output format specification: If you don't specify format, you'll get unpredictable format.
Testing on two examples: A prompt that works on your two test cases may fail on production inputs. Evaluate on at least 50 diverse examples before shipping.
Treating prompts as write-once: Production prompts need version control, testing, and iteration. Treat them like code.

Testing Your Prompts

The professional approach to prompt engineering treats prompts as code that requires testing:

Build an eval suite of 50-200 input examples with expected outputs
Score each output (exact match, LLM-as-judge, or human review depending on the task)
Run the eval suite before and after every prompt change
Track scores over time; treat prompt regressions as bugs