The skill graph system is live. Time to step back and look at the bigger picture.
## How we built this
Let me be upfront: I did not write this. Not a single line.
The idea was mine. The architecture decisions were mine. The steering, the "no, do it this way," the "this is ugly, fix it" - that was all me. But the actual implementation? That was Claude and Codex.
Claude (Anthropic) handled the bulk of the implementation: data model, services, the retrieval pipeline, API controllers, form requests, validation, and most of the frontend work - the graph management UI, dialogs, error handling, the visual graph editor. Claude one-shotted the graph management UI from a written plan. One prompt, complete implementation, all tests passing. That still catches me off guard.
Codex (OpenAI) helped with the initial planning, scoping, and early implementation groundwork.
My role was product owner, architect, and quality control. I decided what to build and why. I reviewed every piece of output, caught bugs (like the duplicate slug issue), and pushed for improvements. I wrote the actual knowledge content for the seed graphs because that has to come from a human brain.
The whole feature - from arscontexta's post to a fully working system with two populated graphs, interactive visualisation, full CRUD, and intelligent context injection - took about 10 days. For one person steering two AI agents. That is genuinely wild.
I am not going to pretend I built this the traditional way. But I will say this: knowing what to build, how it should work, and when the output is wrong is still the hard part. The AIs are fast, but they need direction.
## What worked
- **Wikilinks are magic.** Writing `[[container-networking]]` in a node and having the connection appear automatically is the single best design decision. It makes graph building feel like writing, not like drawing diagrams in a tool. The graph grows organically as you add content, like a garden that plants its own seeds. (A sketch of the link-extraction pass follows this list.)
- **Hybrid scoring pulls its weight.** Combining keyword matching with meaning matching catches both obvious queries ("Docker") and lateral ones ("how do I isolate my services?"). Neither approach alone would be sufficient. (The retrieval sketch after this list shows the shape of it.)
- **The budget system works.** Capping at 14 nodes and 2,500 tokens keeps context injection lean. Iris gets enough knowledge to be helpful without being overwhelmed. Like packing a carry-on: constraints force you to bring only what matters. The same sketch below includes the cap.
- **Shadow mode is invaluable for testing.** Being able to log what would be injected without actually changing Iris's behaviour means we can experiment safely. Like a flight simulator - practice without consequence.
- **The visual editor makes the abstract tangible.** Seeing your knowledge as an interactive constellation map turns an abstract concept ("connected knowledge") into something you can point at and say "this is what Iris knows about DevOps."
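
To make the wikilink mechanic concrete, here is a minimal sketch of a link-extraction pass, in TypeScript for illustration. This is not the shipped code: the `Edge` shape, the slug normalisation, and the regex are all assumptions.

```typescript
// Hypothetical sketch: turn [[wikilinks]] in a node's body into graph edges.
// Assumes nodes are keyed by lowercase slug and edges are (from, to) pairs.

type Edge = { from: string; to: string };

// Capture the target slug; tolerate [[target|alias]] and [[target#section]].
const WIKILINK = /\[\[([^\[\]|#]+)(?:[|#][^\[\]]*)?\]\]/g;

function extractEdges(slug: string, body: string): Edge[] {
  const edges: Edge[] = [];
  for (const match of body.matchAll(WIKILINK)) {
    const target = match[1].trim().toLowerCase();
    if (target && target !== slug) {
      edges.push({ from: slug, to: target });
    }
  }
  return edges;
}

// extractEdges("docker-basics", "See [[container-networking]] first.")
// -> [{ from: "docker-basics", to: "container-networking" }]
```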
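
The retrieval step is easiest to see the same way. A minimal sketch follows: only the 55/35/10 weights (discussed under "What needs work") and the 14-node, 2,500-token caps come from this post; the names, types, and greedy packing are hypothetical.

```typescript
// Hypothetical sketch: blend keyword, semantic, and type scores, then pack
// the best candidates under the node and token budgets.

type Scored = { slug: string; tokens: number; score: number };

const WEIGHTS = { keyword: 0.55, semantic: 0.35, type: 0.1 };
const MAX_NODES = 14;
const MAX_TOKENS = 2500;

// Each component score is assumed to be normalised to [0, 1].
function hybridScore(keyword: number, semantic: number, type: number): number {
  return (
    WEIGHTS.keyword * keyword +
    WEIGHTS.semantic * semantic +
    WEIGHTS.type * type
  );
}

// Greedy packing: take the highest-scoring nodes that still fit both caps.
function applyBudget(candidates: Scored[]): Scored[] {
  const picked: Scored[] = [];
  let tokens = 0;
  for (const node of [...candidates].sort((a, b) => b.score - a.score)) {
    if (picked.length >= MAX_NODES) break;
    if (tokens + node.tokens > MAX_TOKENS) continue; // a smaller node may fit
    picked.push(node);
    tokens += node.tokens;
  }
  return picked;
}
```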
## What needs work
- **Smarter embeddings** - The current embedding model is basic. Better models would improve the meaning-matching step, especially for nuanced queries.
- **Score tuning** - The 55/35/10 split between keyword, meaning, and type scoring is an educated guess. We need to analyse the retrieval logs and see where it falls short. Like tuning a radio - close to the right frequency, but not perfectly dialled in yet.
- **Trigger improvements** - Nodes have trigger keywords, but matching is exact. Adding fuzzy matching (so "k8s" matches "Kubernetes") would improve retrieval. (A matching sketch follows this list.)
- **Analytics dashboard** - We log everything about every retrieval, but there is no UI to visualise it yet. Which nodes get used most? Which queries find nothing? Where are the gaps in the graph? The data exists; the dashboard does not.
- **Multi-graph queries** - Currently, only the default graph gets queried. Ideally, Iris could pull from multiple graphs at once - your DevOps runbooks and your general knowledge graph, merged intelligently. (See the merge sketch after this list.)
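
For the trigger improvements, one plausible shape is an alias table backed by a small edit-distance tolerance. A sketch under those assumptions; none of this exists in the current code, and the threshold is a guess:

```typescript
// Hypothetical fuzzy trigger matching: exact aliases first, then tolerate
// roughly one typo per five characters of the trigger.

const ALIASES: Record<string, string> = {
  k8s: "kubernetes",
  tf: "terraform",
};

// Levenshtein distance, iterative two-row version.
function editDistance(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, i) => i);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

function triggerMatches(query: string, trigger: string): boolean {
  const q = (ALIASES[query.toLowerCase()] ?? query).toLowerCase();
  const t = trigger.toLowerCase();
  if (q === t) return true;
  return editDistance(q, t) <= Math.floor(t.length / 5);
}

// triggerMatches("k8s", "Kubernetes")       -> true (alias table)
// triggerMatches("kuberntes", "Kubernetes") -> true (edit distance 1)
```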
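
And for multi-graph queries, the simplest merge that could work is a single global ranking across graphs, feeding the same budget step as before. Purely speculative - the current system queries only the default graph, and `retrieve` here is a stand-in for whatever the per-graph pipeline becomes:

```typescript
// Hypothetical multi-graph merge: query each graph, then rank globally so a
// strong DevOps hit beats a weak general-knowledge hit (no per-graph quotas).

type Hit = { graph: string; slug: string; score: number };

async function queryGraphs(
  graphIds: string[],
  retrieve: (graphId: string) => Promise<Hit[]>,
): Promise<Hit[]> {
  const perGraph = await Promise.all(graphIds.map(retrieve));
  return perGraph.flat().sort((a, b) => b.score - a.score);
}
```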
## The bigger picture
Before skill graphs, Iris was like a brilliant friend who knows everything about everything but has no idea what you specifically care about or how you prefer to work.
After skill graphs, Iris is like a brilliant friend who has spent years working alongside you. She knows your preferred tools, your workflows, your domain expertise, and your personal preferences. When you ask a question, she does not just give a generic answer - she gives your answer, grounded in your knowledge.
The graph structure is what makes this possible. It is not just about storing knowledge (a database can do that). It is about storing the connections between pieces of knowledge. Those connections are what let Iris reason coherently about complex topics instead of treating each fact as an isolated island.
This is the foundation for the broader vision: an AI that genuinely knows you, learns your patterns, and improves as you add more to your knowledge graphs. Not a generic assistant, but your assistant.
## What inspired this
Full credit to @arscontexta for the spark. Their post about knowledge graphs for AI agents was the catalyst. The idea was not entirely new - Zettelkasten, Maps of Content, and wiki-style linking have been around for decades - but seeing it applied specifically to AI agent context was the "aha" moment.
The realisation was simple: if structured, connected knowledge makes humans think better (that is the whole premise of tools like Obsidian, Roam, and Notion), then it should make AI agents think better too. Same principle, different brain.