This is Part 2. If you haven't read Part 1, go do that first - it covers the five foundational techniques (negative constraints, instruction placement, inline instructions, imperative framing, and few-shot examples). This part gets into the system-level patterns: structuring complex reasoning, defining roles, managing temperature, output templates, and the one thing nobody does but everyone should.
6. Break complex tasks into explicit steps - models can't plan
I'm going to say something controversial: most models are terrible at planning. They can execute brilliantly when told what to do, but "figure out the right approach" is where things go sideways. The fix is chain-of-thought prompting - spelling out the steps yourself - and it matters enormously for smaller models.
Jarvis's intent classifier was originally just "analyse the user's message and pick the right handler." The 4B model would classify "what's the weather?" as a CHAT intent and just... answer with a guess. Helpful. Once I broke it into explicit steps - extract the topic, check if it matches a tool category, assign confidence - it started nailing intent classification even on the tiny model. Same technique, applied to Iris's delegation logic:
Bad: "Analyse the user's request and delegate to the appropriate specialist."
Good: "Follow these steps:
1. Identify what the user is asking for (research, planning, code, writing, or general knowledge).
2. Check if this genuinely requires a specialist or if you can answer directly.
3. If delegation is needed, pick the single most relevant agent from the roster.
4. Write a clear, self-contained prompt for the specialist that includes all necessary context.
5. If the user didn't name a specific agent, do NOT add extra agents."
The bad version says "figure it out." The good version says "here's exactly how to think about this." Bigger models can often derive good processes on their own. Smaller models skip steps, make assumptions, or - my personal favourite - invent unnecessary complexity. I've had a 30B model decide to delegate a simple greeting to three specialists simultaneously. Mate, the user said "hello."
Key insight: the steps should follow the order the model needs to think in. Step 2 (checking if delegation is even needed) comes before step 3 (picking an agent) because you want the model to consider "do I even need to delegate?" before it commits to a delegation path. If you put step 3 first, the model is already shopping for agents and is less likely to conclude "actually, I don't need one." It's like putting the biscuit aisle before the checkout - you're going to buy biscuits.
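To make this concrete, here's a minimal sketch of that step-structured routing prompt wired into code. It assumes an OpenAI-compatible Python client; the model name, roster, and prompt wording are placeholders, not Iris's actual configuration.

```python
# Minimal sketch: a routing prompt with explicit, ordered steps.
# Assumes an OpenAI-compatible client; model name and roster are placeholders.
from openai import OpenAI

client = OpenAI()

ROUTING_PROMPT = """You are the routing layer for a multi-agent assistant.

Follow these steps:
1. Identify what the user is asking for (research, planning, code, writing, or general knowledge).
2. Check if this genuinely requires a specialist or if you can answer directly.
3. If delegation is needed, pick the single most relevant agent from the roster.
4. Write a clear, self-contained prompt for the specialist that includes all necessary context.
5. If the user didn't name a specific agent, do NOT add extra agents.

Roster: researcher, planner, code_engineer, writer."""


def route(user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        temperature=0,        # routing should be deterministic (see technique 8)
        messages=[
            {"role": "system", "content": ROUTING_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    return response.choices[0].message.content
```

Note the steps live in the system prompt in the same order you want the model to think in - the code doesn't do anything clever, it just stops the prompt from being vague.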
7. Set the role with specificity, not vagueness
"You are a helpful assistant" is the prompt engineering equivalent of a CV that says "hard worker and team player." Tells me absolutely nothing. I've seen that line in about ten thousand system prompts and it does precisely zero work every single time.
Bad: "You are Iris, a helpful AI assistant." Good: "You are Iris, a personal AI life management assistant. You are helpful and deeply personalised. You remember context from previous conversations and aim to be genuinely useful in managing your user's life."
Better: "Key traits:
- Be concise and direct but thorough
- Prefer action over explanation
- Remember and reference previous context when relevant
- When you don't know something, say so honestly
- Never use emdash
- Never moralise or lecture"
The "key traits" block is powerful because it's a list of specific, testable behaviours. "Be concise" alone is vague, but combined with "prefer action over explanation" and "never moralise," it paints a specific personality. The model isn't guessing what kind of assistant to be - it has a behavioural checklist.
The "never moralise or lecture" and "never use emdash" rules are negative constraints (technique #1 from Part 1) applied to personality. Every model has default behaviours it falls back to when it's not sure what to do. Claude's defaults include hedging, offering balanced perspectives, and an almost romantic relationship with em dashes. Explicitly prohibiting these defaults forces the model out of its comfort zone into the personality you actually want. It's like training a dog not to jump on guests - the dog knows how to not jump, it just needs to be told.
8. Temperature is a tool, not a setting
Most people set temperature once and forget about it. That's like using one knife for everything - technically works, but you're making life harder than it needs to be.
Temperature 0 (or very low): deterministic output. Use this for tasks where correctness matters more than creativity - tool use decisions, structured data extraction, classification, routing logic. Jarvis's intent classifier and Iris's parent agent both run near-zero for routing decisions. You want the same delegation every time given the same input. Creativity in your routing layer is not the kind of surprise you want.
Temperature 0.7-0.9: creative but coherent. Use this for writing, brainstorming, and conversational responses. The specialist agents that write content benefit from some randomness because it produces more natural, varied language. Without it, every response has the same cadence and you start to feel like you're talking to a corporate FAQ.
Temperature 1.0+: chaos mode. Useful for creative fiction, poetry, or brainstorming where you want genuinely unexpected connections. Not useful for anything that needs to be reliable. I set this once on Jarvis's routing layer by accident and it started delegating cooking questions to the code engineer. Fun times.
The meta-point: if you're running a multi-agent system, different agents should have different temperatures. Your routing agent should be near-zero. Your creative writer should be 0.8. One size does not fit all.
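In practice this can be as boring as a per-agent lookup table. A sketch, with illustrative agent names and values rather than Iris's real settings:

```python
# Sketch: per-agent temperature settings for a multi-agent system.
# Agent names and values are illustrative, not Iris's actual configuration.
AGENT_TEMPERATURES = {
    "router": 0.0,        # deterministic delegation: same input, same decision
    "classifier": 0.0,    # intent classification and structured extraction
    "researcher": 0.3,    # mostly factual, slight variation in phrasing
    "writer": 0.8,        # creative but coherent prose
    "brainstormer": 1.0,  # deliberately loose; never wire this into routing
}


def temperature_for(agent: str) -> float:
    # Conservative default for anything not explicitly configured.
    return AGENT_TEMPERATURES.get(agent, 0.2)
```

The exact numbers matter less than where they live: next to the agent definitions, not in one global constant that quietly applies chaos mode to your routing layer.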
9. Structure your output format and the model will fill it
If you want structured output, give the model a template. Do not describe the structure and hope for the best. I learned this the hard way in Jarvis after parsing approximately one million broken JSON responses from local models that thought "JSON" was more of a suggestion than a format.
Bad: "Return the status, agent name, and output." Good: "Return in this exact format: SUBAGENT_RESPONSE agent_id={id} agent_name="{name}" status={status} {output}"
The model will fill in the template slots like a form. This works because template completion is a much simpler task than "understand a description of a format and generate it from scratch." The template acts as a scaffold that constrains the output into the right shape. Smaller models especially benefit - they might botch a freeform structured output but reliably fill in a template. It's like the difference between "draw me a house" and "colour in this colouring book." Same output, very different success rate.
This technique scales to complex outputs too. The parallel sub-agent results use a line-based format with consistent prefixes (PARALLEL_RESULTS, VALIDATION_ERRORS, TAKEOVER_REQUIRED). Each prefix is a signal the consuming code can parse reliably. The model produces this format consistently because the examples in the tool definition show exactly what it should look like.
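On the consuming side, a prefix-based template is cheap to parse. Here's a sketch of what that parsing might look like - the field names follow the SUBAGENT_RESPONSE template above, but the regex is illustrative, not the actual consuming code.

```python
# Sketch: parse the line-based SUBAGENT_RESPONSE template with a regex.
# Field names follow the template quoted above; the parser is illustrative.
import re

SUBAGENT_PATTERN = re.compile(
    r'^SUBAGENT_RESPONSE\s+'
    r'agent_id=(?P<agent_id>\S+)\s+'
    r'agent_name="(?P<agent_name>[^"]*)"\s+'
    r'status=(?P<status>\S+)\s*'
    r'(?P<output>.*)$',
    re.DOTALL,  # let the output slot span multiple lines
)


def parse_subagent_response(text: str) -> dict | None:
    match = SUBAGENT_PATTERN.match(text.strip())
    if match is None:
        return None  # template not followed; the caller decides whether to retry
    return match.groupdict()
```

A None return gives the caller one unambiguous signal that the model broke the format - much easier to handle than a half-valid JSON blob you have to repair.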
10. The secret nobody talks about: test your prompts like you test code
This is the one that separates people who do prompt engineering from people who do prompt guessing.
Every prompt in Iris has automated tests. Not just "does the tool return something" but "does the response contain the expected markers" and "does it NOT contain things it shouldn't."
When we added the takeover directive, we wrote a test that verifies the prompt appears in the output when all agents fail. We also wrote a test that verifies the directive does NOT appear when agents succeed. That negative test is just as important - without it, you might have a takeover directive that fires every time, successful or not. Ask me how I know.
Prompt engineering without testing is vibes-based development. You change a word, it seems to work in the three examples you tried, you ship it, and then it breaks on the fourth edge case at 2am. Automated tests catch regressions immediately. When you change "you may answer yourself" to "you MUST answer the following prompts yourself directly," you can verify the behaviour change with a test run, not gut feeling and crossed fingers.
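Here's roughly what that pair of tests looks like as pytest functions. build_failure_summary is a hypothetical stand-in for whatever assembles the parent agent's context after the sub-agents run; a trivial version is stubbed in so the example is self-contained, but the shape of the assertions is the point.

```python
# Sketch: positive and negative prompt tests in pytest. build_failure_summary
# is a hypothetical stand-in, stubbed here so the example runs on its own.
TAKEOVER_DIRECTIVE = (
    "TAKEOVER_REQUIRED: all sub-agents failed. "
    "You MUST answer the following prompts yourself directly."
)


def build_failure_summary(results: list[dict]) -> str:
    """Stand-in: summarise sub-agent results, appending the takeover
    directive only when every agent failed."""
    lines = [f"{r['agent']}: {r['status']}" for r in results]
    if results and all(r["status"] == "error" for r in results):
        lines.append(TAKEOVER_DIRECTIVE)
    return "\n".join(lines)


def test_takeover_directive_present_when_all_agents_fail():
    summary = build_failure_summary([
        {"agent": "researcher", "status": "error"},
        {"agent": "writer", "status": "error"},
    ])
    assert "TAKEOVER_REQUIRED" in summary


def test_takeover_directive_absent_when_any_agent_succeeds():
    # The negative test matters just as much: a directive that fires on
    # success is exactly the bug described above.
    summary = build_failure_summary([
        {"agent": "researcher", "status": "ok"},
        {"agent": "writer", "status": "error"},
    ])
    assert "TAKEOVER_REQUIRED" not in summary
```

Run with pytest and you get a regression gate on the prompt itself: change the wording, run the tests, and know immediately whether the directive still fires when - and only when - it should.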
The real secret
There's no magic. I know that's anticlimactic. Sorry.
Every technique above is about reducing ambiguity. That's it. The gap between a small model and a large model isn't intelligence - it's the ability to handle ambiguity. A large model can take a vague instruction and figure out what you probably meant. A small model takes that same vague instruction and flips a coin.
Negative constraints remove ambiguous interpretations. Inline instructions remove "which rule applies here?" Imperative framing removes "is this a suggestion or a requirement?" Examples remove "what does good output look like?" Step-by-step instructions remove "what order should I think in?"
Strip away enough ambiguity and a 30B model starts producing output that looks a lot like a 200B model. Not because it got smarter, but because you stopped making it guess. The model was always capable - you were just giving it bad directions and blaming it for getting lost. I proved this to myself when Jarvis's 4B local model started outperforming cloud 30B models on specific tasks, purely because the prompts were tighter. The 30B model had more raw capability but worse instructions. It's humbling, honestly. And a little bit funny.