AI

Prompt Engineering That Actually Works - Part 1, The Fundamentals

February 8, 2026

Iris runs its specialist agents on open-source models - 30B to 235B to 1T parameters, hosted through Chutes AI. These aren't frontier models. They don't have the raw reasoning capacity of Claude Opus or GPT-5.3. But with the right prompting, they produce output that's genuinely hard to distinguish from something much larger. Which is good, because I'm not made of money and Opus tokens aren't cheap.

Everything in this post comes from building two multi-agent systems - Jarvis and Iris - with Claude, Codex, and Jarvis itself as my co-pilots. Jarvis is my local-first AI assistant (Python/FastAPI, 35+ tools, intent routing, RAG, the works) and it's where a lot of these techniques were first stress-tested. When you're running a 4B parameter Qwen model on your laptop and trying to make it route intents, use tools, and not embarrass itself, you learn prompt engineering fast. Or you learn to drink. Preferably both.

Claude helped me understand why certain prompts worked and others didn't - turns out the AI that processes your prompts has useful opinions about what makes a good one. Codex helped me batch-test variations across commits. Jarvis was the guinea pig - the place where every technique got thrown at tiny local models first, and if it survived a 4B model on an M-series Mac, it'd work anywhere. Between the four of us - plus the dozens of specialist agents running through all of them, from Jarvis's local intent classifiers to Iris's cloud-hosted researchers and engineers - we've iterated on hundreds of prompt variations and these are the patterns that consistently made the difference.

Most prompt engineering advice online is generic fluff. "Be specific." "Provide context." Wow, groundbreaking. That's like telling someone who can't cook to "use ingredients." And then there's the other end of the spectrum - people who write "build me a billion-dollar SaaS product" and wonder why the output is rubbish. Mate, the model is good, it's not a venture capitalist. Here's what actually moves the needle.

1. Tell the model what NOT to do - it works embarrassingly well

I spent way too long politely asking models to be concise. "Please keep it short." "Be brief." The model would nod (metaphorically) and then write me four paragraphs starting with "Absolutely! Great question!"

Turns out, negative constraints are way more powerful than positive ones. When you say "write a concise response," the model interprets "concise" however it feels like. When you say "Do NOT exceed three sentences. Do NOT use filler phrases like 'Sure thing' or 'Absolutely'," you've drawn a line in the sand.

Real example from Iris's system prompt:

Bad: "Keep replies short and to the point." Good: "No filler intros (for example: 'Sure thing', 'Absolutely', 'Understood'). Do not narrate what you will do; do it and report the concrete result."

The bad version is a suggestion. The good version is a checklist. Every model, from a 7B to Opus, processes the good version more reliably because there's nothing left to interpret. You've turned vibes into a pass/fail test.

I asked Claude why this works so well (yes, I asked the AI about itself - we live in strange times). Language models generate tokens one at a time, and at each step they're weighing "does this next word fit the instructions?" A vague instruction like "be concise" creates a soft bias. A specific prohibition like "do NOT use filler intros" creates a hard gate. The token "Sure" at the start of a response gets actively suppressed instead of just mildly discouraged. Once I understood that, I went through every vague rule in both Jarvis and Iris and rewrote them as specific prohibitions. Immediate improvement across all agents - even Jarvis's little 4B Qwen model stopped opening with "Of course!" I felt stupid for not doing it sooner.
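
Here's roughly what that looks like wired up. A minimal sketch, assuming an OpenAI-compatible chat endpoint - the base URL, API key, and model name are placeholders, not Iris's actual config:

```python
from openai import OpenAI

# Placeholders: swap in your own endpoint, key, and model.
client = OpenAI(base_url="https://your-host.example/v1", api_key="sk-...")

# Every rule is a specific prohibition - a pass/fail test, not a vibe.
SYSTEM_PROMPT = """You are a concise assistant.
Do NOT exceed three sentences.
Do NOT use filler intros ('Sure thing', 'Absolutely', 'Understood').
Do NOT narrate what you will do; do it and report the concrete result."""

response = client.chat.completions.create(
    model="your-30b-model",  # placeholder model name
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Summarise this meeting: ..."},
    ],
)
print(response.choices[0].message.content)
```

The point is that you can read any output against that prompt and say immediately whether each prohibition held. No interpretation required, by you or the model.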

2. Put the most important instructions last

Here's a fun one. You write a beautiful 3,000-token system prompt. You put your most important rule on line 4. The model completely ignores it. You add the same rule at the very end. Suddenly it works perfectly. Welcome to the "lost in the middle" problem.

Models have a well-documented attention pattern where the very beginning and very end of a prompt get the most attention. Everything in the middle? Might as well be terms and conditions - nobody's reading that.

Iris exploits this deliberately. The system prompt builds up in layers - base personality, then memories, then skills, then agent roster. But the voice mode instructions (the most critical constraints when active) go dead last:

"VOICE MODE IS ACTIVE - OVERRIDE ALL PREVIOUS FORMATTING RULES"

This isn't just me being dramatic with caps lock. It's positioned at the end specifically because that's where the model's attention is strongest. The voice constraints (no markdown, no lists, natural numbers) need to override the default formatting rules that appear earlier. By placing them last, they win the attention competition.

If you have a system prompt longer than about 2,000 tokens: essential identity up top, context and reference material in the middle, critical behavioural rules at the bottom. Think of it as a sandwich where the bread matters and nobody cares about the lettuce.
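
In code, the layering is almost insultingly simple. This is a sketch of the idea, not Iris's actual prompt builder - the section contents and function shape are illustrative:

```python
def build_system_prompt(base: str, memories: str, skills: str,
                        agents: str, voice_mode: bool) -> str:
    # Identity first, context and reference material in the middle.
    sections = [base, memories, skills, agents]
    if voice_mode:
        # The critical override goes dead last, where attention is strongest.
        sections.append(
            "VOICE MODE IS ACTIVE - OVERRIDE ALL PREVIOUS FORMATTING RULES.\n"
            "No markdown. No lists. Write numbers naturally."
        )
    return "\n\n".join(sections)
```

The only design decision that matters here is the append order: whatever lands last wins the attention competition.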

3. Inline instructions beat system prompt rules - this one changed everything

This is the single most impactful technique I've found, and it's the one most people miss. I almost want to charge for this section.

When you put a rule in the system prompt, the model has to remember it across the entire conversation and apply it in the right context. When you put an instruction inline - right next to the data it governs - the model can't miss it because it's literally right there.

Real example from the sub-agent pipeline. The system prompt says: "Present the sub-agent's output directly. Do not rewrite or pad it." Nice rule. Compliance rate? About 60%. The model would cheerfully ignore it and re-summarise everything.

The fix: move the instruction into the tool result itself, right next to the specialist's output: "This output was already streamed directly to the user. Do NOT repeat, rephrase, or re-summarise it. Simply confirm the task was completed."

Same intent. Nearly 100% compliance. Even on smaller models. I wanted to scream when I realised how simple the fix was. Weeks of tweaking system prompt wording when I could have just... put the instruction next to the data.

Why? The inline instruction is in the model's immediate attention window when it's deciding what to do with that specific piece of data. The system prompt rule requires the model to recall and apply something from 3,000 tokens ago while processing new input. That's like asking someone to remember a Post-it note from last Tuesday while they're in the middle of a conversation. Good luck.

Use this everywhere. Don't just give the model data - give the model data plus instructions about what to do with it, right there together. Joined at the hip.
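
Concretely, the trick is a wrapper that staples the instruction onto the tool result before it goes back to the model. A sketch assuming OpenAI-style tool messages - the function name, ids, and report text are made up, not the actual Iris pipeline code:

```python
def wrap_tool_result(specialist_output: str) -> str:
    """Glue the handling instruction directly onto the data it governs."""
    return (
        f"{specialist_output}\n\n"
        "[This output was already streamed directly to the user. "
        "Do NOT repeat, rephrase, or re-summarise it. "
        "Simply confirm the task was completed.]"
    )

specialist_output = "Research report: three viable approaches found..."
tool_message = {
    "role": "tool",
    "tool_call_id": "call_abc123",  # placeholder id
    "content": wrap_tool_result(specialist_output),
}
```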

4. Imperative framing, not descriptive framing

Stop being polite to your AI. I'm serious.

Descriptive: "The response should be formatted as a bulleted list with no more than five items." Imperative: "Format your response as a bulleted list. Maximum five items. No exceptions."

Descriptive: "It would be helpful if you could check for existing reminders before creating a new one." Imperative: "Before creating a reminder, check existing ones to avoid duplicates."

See the difference? The first version reads like a suggestion from a passive-aggressive manager. The model treats it as guidance that can be weighted against other factors ("well, they said it would be helpful, but I have other ideas..."). Imperative framing reads like a command. The model treats it as a constraint that must be satisfied.

This matters most for smaller models. Jarvis's intent classifier runs on a 4B model locally - it cannot afford to misinterpret a suggestion as optional. A 70B+ model might follow both versions equally well. A 7B-30B model will reliably follow the imperative version and roll the dice on the descriptive one. If you're trying to get a smaller model to perform like a bigger one, switch every "should" to a direct command. Every "it would be helpful if" becomes "do this." You're not being rude. It's a language model. It doesn't have feelings. (Probably.)
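
This one is easy to verify for yourself. Here's a throwaway harness in the spirit of the batch tests mentioned earlier - it assumes a local OpenAI-compatible endpoint (Ollama's, in this case), the model name is a placeholder for whatever small model you have lying around, and the compliance check is deliberately crude:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

FRAMINGS = {
    "descriptive": ("The response should be formatted as a bulleted list "
                    "with no more than five items."),
    "imperative": ("Format your response as a bulleted list. "
                   "Maximum five items. No exceptions."),
}

def passes(text: str) -> bool:
    # Crude check: between one and five bullet lines in the output.
    bullets = [l for l in text.splitlines() if l.strip().startswith(("-", "*"))]
    return 0 < len(bullets) <= 5

for name, rule in FRAMINGS.items():
    wins = sum(
        passes(
            client.chat.completions.create(
                model="qwen3:4b",  # placeholder small local model
                messages=[
                    {"role": "system", "content": rule},
                    {"role": "user", "content": "How do I speed up a Python script?"},
                ],
            ).choices[0].message.content
        )
        for _ in range(20)
    )
    print(f"{name}: {wins}/20 compliant")
```

Run both framings twenty times each on your smallest model and the gap is usually obvious.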

5. Give examples - models learn from patterns, not your essays

This is called few-shot prompting, and it's the closest thing to a cheat code in prompt engineering. I genuinely don't understand why more people don't do this.

Instead of writing a paragraph explaining the format you want (which the model will half-read and then freestyle), just show it:

Bad: "When reporting weather, include the temperature, conditions, and any relevant details in a natural sentence."

Good: "When reporting weather, follow this format: Example: 'Currently 12 degrees and cloudy in London, with light rain expected after 3pm.' Example: 'Sunny and 28 degrees in Sydney. UV index is high - worth grabbing sunscreen.' Now respond to the user's weather query in the same style."

The model extracts the pattern from the examples - conversational tone, specific numbers, practical advice - far more reliably than it infers those qualities from a description. Language models are fundamentally pattern completion engines. Showing a pattern and asking them to continue it is literally what they were built for. Describing a pattern and asking them to generate it from scratch adds an extra abstraction step where half the details get lost.

Two to three examples is the sweet spot. One example might be treated as a coincidence. Four or more starts eating into your context window for diminishing returns. It's like seasoning - enough to taste, not so much you ruin it.
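
As a prompt constant, the good version is just the examples plus one closing instruction. A sketch - the wording mirrors the weather example above, and the message shape assumes an OpenAI-style chat API:

```python
# Two examples: enough to establish the pattern, not enough to waste context.
FEW_SHOT_WEATHER = (
    "When reporting weather, follow this format:\n\n"
    "Example: 'Currently 12 degrees and cloudy in London, with light rain "
    "expected after 3pm.'\n"
    "Example: 'Sunny and 28 degrees in Sydney. UV index is high - worth "
    "grabbing sunscreen.'\n\n"
    "Now respond to the user's weather query in the same style."
)

messages = [
    {"role": "system", "content": FEW_SHOT_WEATHER},
    {"role": "user", "content": "What's it like in Berlin today?"},
]
```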

That's the first five. These are the fundamentals - the techniques that give you the biggest bang for the least effort. In Part 2, we get into the system-level stuff: how to structure complex tasks, set roles that actually work, use temperature strategically, and why testing your prompts is the thing that separates prompt engineering from prompt gambling.