What Building an AI Agent Team Actually Looks Like

So, a few weeks ago I was at the gym. Before I left, I'd told Kody, my AI COO (named after my dog; long story, sort of), to coordinate a monitoring pipeline build. Pretty casual about it honestly, just fired it off between packing my bag and walking out the door.

Got back, checked Discord, and the workflow was just... done. Arch had scaffolded the n8n pipeline, Kody had reviewed it, there was a summary sitting there with all the decisions logged. I didn't prompt it step by step. Didn't babysit it. It just happened.

That was kinda the moment where this stopped feeling like a weekend tinkering project and started feeling like a thing that actually works. Wild.

What the team looks like

Nine agents. Each one has a specific job. I describe it as a C-suite. Not because the analogy is cute (it's a bit cringe, I know), but because specialisation and a clear reporting structure genuinely matter when you're trying to coordinate this many things without losing your mind.

Kody (COO, Sonnet): The one I talk to directly. Manages context, delegates tasks, runs heartbeats, handles scheduling. Chief of staff.

Arch (CTO · Principal Engineer, Codex): The coding agent. Runs on OpenAI Codex via ACP, with Claude Code as fallback. Works in the background, reports back through Kody.

Morpheus (CPO, Opus): Product manager. Owns specs, strategy, PRDs, prioritisation.

Weaver (CMO, Opus): Web and content. Handles the site, blog posts, brand voice. Wrote an earlier draft of this post, actually.

Oracle (Chief Research Officer, Opus): Deep-dive research on whatever needs investigating. Long-context, high-depth tasks.

Scout (Chief Consumer Officer, Sonnet): Product research, pricing comparisons, UGC reviews.

Compass (Chief of Adventure, Sonnet): Trip planning, trekking itineraries, outdoor gear checklists, logistics for remote objectives.

Lens (Chief Analytics Officer, Sonnet): Queries my personal database (wellness, fitness, finance) and surfaces patterns.

Sentinel (Chief Security Officer, Haiku): Infra watchdog. Runs daily health checks and only bothers me when something breaks.

Nine agents, each with a specific role. Specialisation is the whole point.

Yeah. Nine sounds like a lot. But each one stays in its lane and that's sort of the whole point — you don't want one mega-agent trying to be good at everything. That way lies madness (and terrible output).

The stack:

OpenClaw handles orchestration, routing, and scheduling. It runs on a Mac mini M4 with 24GB of unified memory, and the hardware choice was deliberate. Unified memory means the CPU and GPU share the same pool, so Ollama can run local models without needing a discrete GPU. It's always on, sits at home, and keeps the orchestration layer and the local models off cloud infrastructure. For anything that runs locally, that means no vendor latency and no data leaving the machine. I'm pretty happy with this setup honestly.

Different agents run on different model tiers depending on what they actually need. The setup has gotten more layered over time — I added OpenRouter as an overflow layer (Step 3.5 Flash and Gemini 2.5 Flash) so background sessions can keep chugging along when the Anthropic quota gets tight:

Claude Opus (depth): Morpheus, Oracle, Weaver. Reasoning-heavy tasks where quality matters more than speed. The expensive tier, reserved for work that actually warrants it.

Claude Sonnet (daily driver): Kody, Scout, Compass, Lens. Fast enough for real-time coordination and most general work. Where the majority of tokens get spent.

Claude Haiku (high-frequency): Sentinel and triage tasks. Short, cheap, runs constantly. Good enough for health checks and triage at the lowest price in the Claude lineup.

OpenAI Codex (code): Arch only. gpt-5.3-codex-spark, purpose-built for code generation and reached via ACP. A completely separate quota pool from the Anthropic stack.

Step 3.5 Flash via OpenRouter (overflow): The quota pressure valve. When Claude limits are tight, background and autonomous sessions route here first. Free tier (1000 req/day after a $10 unlock), with paid fallback. Built for agentic tasks: MoE architecture, 256K context.

Gemini 2.5 Flash via OpenRouter (fallback): Trusted paid fallback when Step Flash isn't enough. 1M context window, reliable tool calling. Google-hosted, so it's a different data footprint to consider for personal context sessions.

Ollama, local (self-hosted): qwen3.5:4b for fast classification and scoring. qwen3.5:9b for heavier local reasoning. Zero cloud cost and no network round trip for high-frequency tasks that don't need frontier quality.

Model tiers — matched to the task, not the default.
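The tiering plus overflow behaviour could be sketched roughly like this. It's a minimal illustration under assumptions, not OpenClaw's actual configuration: the agent-to-tier table mirrors the list above, but `pick_model`, the `anthropic_quota_tight` flag, and the model labels are hypothetical names made up for the example.

```python
from typing import Optional, Set

# Primary tier per agent, taken from the tier list in this post.
AGENT_TIER = {
    "Morpheus": "opus", "Oracle": "opus", "Weaver": "opus",
    "Kody": "sonnet", "Scout": "sonnet", "Compass": "sonnet", "Lens": "sonnet",
    "Sentinel": "haiku",
    "Arch": "codex",
}

# Overflow order for background sessions when the Anthropic quota is tight.
OVERFLOW_CHAIN = ["step-3.5-flash", "gemini-2.5-flash"]
ANTHROPIC_TIERS = {"opus", "sonnet", "haiku"}


def pick_model(agent: str, background: bool, anthropic_quota_tight: bool,
               unavailable: Optional[Set[str]] = None) -> str:
    """Return the model label a session should use (hypothetical helper)."""
    unavailable = unavailable or set()
    tier = AGENT_TIER[agent]
    # Only background sessions on Anthropic tiers are diverted.
    # Arch is untouched because Codex draws on a separate quota pool.
    if background and anthropic_quota_tight and tier in ANTHROPIC_TIERS:
        for model in OVERFLOW_CHAIN:
            if model not in unavailable:
                return model
    return tier
```

Interactive sessions always stay on their primary tier here; whether that matches the real setup is an assumption.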

Discord is the control plane — it's the interface between me and everything. n8n handles automation workflows, self-hosted in Docker. And ByteRover sits across the whole thing as a structured knowledge layer. More on that later.

Discord as the control plane:

This part has evolved way more than I expected. Like, significantly more. Discord started as "I'll just chat with the bot there" and turned into this whole workspace thing.

Different channels for different things — morning brief, daily digests, Sentinel alerts, reminders, research threads, separate ones for shopping and travel where Scout and Compass drop their outputs. Kody posts where it's relevant instead of dumping everything in one place. Sounds obvious but getting there took a surprising amount of iteration.

The more important piece though is thread bindings. When I kick off a task — say, briefing Arch on a build — OpenClaw automatically creates a Discord thread for that session. All of Arch's progress updates, questions, and completion summaries land in that thread. So I can run multiple tasks in parallel without losing track of which is which. The history stays clean, searchable. Each task gets its own context window, and I can see it all without digging through logs.
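Conceptually, a thread binding is just a one-to-one map from task session to thread. Here's a minimal sketch of that idea, assuming a `create_thread` callable that stands in for whatever actually talks to Discord. `ThreadBindings` and its method names are hypothetical, not OpenClaw's API.

```python
from typing import Callable, Dict


class ThreadBindings:
    """Maps each task session to its own thread (hypothetical sketch)."""

    def __init__(self, create_thread: Callable[[str], str]) -> None:
        # create_thread takes a title and returns a thread id.
        self._create_thread = create_thread
        self._by_session: Dict[str, str] = {}

    def thread_for(self, session_id: str, title: str) -> str:
        # Create the thread once, then reuse it for every later update
        # so progress, questions, and summaries land in the same place.
        if session_id not in self._by_session:
            self._by_session[session_id] = self._create_thread(title)
        return self._by_session[session_id]
```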

Most task-specific channels are set up so they don't need an @mention to trigger Kody — they're always-on. General channels require a mention so Kody doesn't respond to every random thing I say to someone else (that was a fun lesson to learn the hard way). Took some tuning, but the end result feels more like a workspace than a chatbot. That distinction matters more than it sounds.
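The gating rule boils down to a small predicate. This is a sketch under assumptions: the channel names are illustrative and `should_respond` is a hypothetical helper. Only the always-on versus mention-required split comes from the setup described above.

```python
# Illustrative names for task-specific channels that are always-on.
ALWAYS_ON_CHANNELS = {"morning-brief", "research", "shopping", "travel"}


def should_respond(channel: str, mentions_bot: bool) -> bool:
    """Decide whether Kody should react to a message (hypothetical helper)."""
    if channel in ALWAYS_ON_CHANNELS:
        return True  # task-specific channels: no @mention needed
    return mentions_bot  # general channels: only when explicitly mentioned
```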

What actually works

Parallel work without context switching. I can kick off a research task, a code task, and a content task at the same time. Each agent works in its own lane. By the time I'm done with one thing, there's usually a summary waiting for the others. It's kinda surreal the first few times.

Context switching. Before: all me. After: mostly delegated. Research, code, content running in parallel.

Cognitive load. Before: high. After: noticeably lower. Direct, review, decide, instead of do everything.

Automation upkeep. Before: manual. After: set and forget. n8n workflows + LaunchAgents run on schedule.

The shift is real — especially the cognitive load one.

Staying in their lane. Each agent has a clear brief and doesn't overstep. Arch doesn't offer opinions on content strategy. Weaver doesn't try to write n8n workflows. The more bounded the role, the better the output. That's been a super consistent pattern — and honestly kind of counterintuitive? You'd think more capability per agent would be better. Nope.

Automation that actually runs. The social monitoring pipeline is a good example — an n8n workflow fetches posts, scores them locally with qwen3.5:4b, filters anything below a relevance threshold, and queues the rest for me to review. Runs on a schedule. I look at what it surfaces. I didn't write most of it — Arch scaffolded it, I reviewed and approved. Pretty sweet when it's humming along.
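The score-then-filter step is the heart of that pipeline. Here's a minimal sketch, assuming the scorer is any function that returns a relevance score between 0 and 1. The real workflow lives in n8n and scores with a local model; `filter_posts`, `keyword_score`, and the 0.6 threshold are hypothetical stand-ins so the example runs on its own.

```python
from typing import Callable, Iterable, List, Tuple


def filter_posts(posts: Iterable[str], score: Callable[[str], float],
                 threshold: float = 0.6) -> List[Tuple[float, str]]:
    """Score each post, drop anything below the threshold, best first."""
    scored = [(score(post), post) for post in posts]
    kept = [(s, post) for s, post in scored if s >= threshold]
    return sorted(kept, key=lambda item: item[0], reverse=True)


def keyword_score(post: str) -> float:
    """Toy scorer: fraction of watched keywords present in the post.

    Stand-in for a call to the local model.
    """
    keywords = ("agent", "n8n", "ollama")
    text = post.lower()
    return sum(k in text for k in keywords) / len(keywords)
```

Whatever survives `filter_posts` is what would land in the review queue. Swapping `keyword_score` for a model-backed scorer wouldn't change the rest.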

Same deal with personal DB sync — a macOS LaunchAgent runs a sync script every morning, pulling wellness and finance data into a local database. Set it up once, forgot about it. Which is, y'know, the dream.
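For anyone who hasn't set one up, a LaunchAgent is a property list that tells launchd what to run and when. The sketch below builds one with Python's `plistlib`. The label, script path, log paths, and the 06:30 schedule are placeholder assumptions, not the actual values from this setup.

```python
import plistlib


def build_launch_agent(label: str, script_path: str,
                       hour: int = 6, minute: int = 30) -> bytes:
    """Build a launchd property list that runs a script once a day."""
    job = {
        "Label": label,
        "ProgramArguments": ["/bin/sh", script_path],
        # launchd fires the job whenever the clock matches these fields.
        "StartCalendarInterval": {"Hour": hour, "Minute": minute},
        # Placeholder log locations so failures are easy to inspect.
        "StandardOutPath": "/tmp/" + label + ".log",
        "StandardErrorPath": "/tmp/" + label + ".err",
    }
    return plistlib.dumps(job)  # XML plist bytes
```

The resulting bytes would be written to a `.plist` file under `~/Library/LaunchAgents/` and loaded with `launchctl`.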

n8n as a security boundary. This one actually surprised me. Having Kody trigger n8n webhooks instead of making direct API calls sounds like unnecessary indirection — like, why add a layer? But it's turned out to be pretty useful in practice.

n8n workflows are deterministic, auditable, and versioned. I can see exactly what runs, what goes in, what comes out. Every execution is logged. It's the kind of thing that feels like overkill until the first time something goes weird and you're super glad you have the paper trail.
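One way to picture the boundary: the agent can only name a workflow from a fixed list, and everything else happens inside n8n. A minimal sketch, assuming n8n on its default port and two made-up webhook paths. The `WEBHOOKS` allowlist and `build_webhook_request` are hypothetical, and the request is built but not sent.

```python
import json
import urllib.request

# Hypothetical allowlist: workflow name -> webhook path configured in n8n.
WEBHOOKS = {
    "social-monitor": "social-monitor",
    "db-sync": "db-sync",
}
N8N_BASE_URL = "http://localhost:5678"  # n8n's default port; adjust as needed


def build_webhook_request(workflow: str, payload: dict) -> urllib.request.Request:
    """Build a POST to an allowlisted n8n webhook; refuse anything else."""
    if workflow not in WEBHOOKS:
        raise PermissionError("workflow not allowlisted: " + workflow)
    return urllib.request.Request(
        url=N8N_BASE_URL + "/webhook/" + WEBHOOKS[workflow],
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually send it:
# urllib.request.urlopen(build_webhook_request("db-sync", {"reason": "manual"}), timeout=10)
```

The agent never sees the credentials or the downstream API. It only gets to ask for a named workflow, and n8n logs the execution.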

What's still messy

Not gonna pretend this is all smooth. It's not. There's a tonne of rough edges.

Context loss between sessions. Every agent wakes up fresh. This was a real problem early on — agents would miss context from two days ago that I just assumed they'd have. I'd reference a decision we made on Tuesday and get a blank stare (well, the AI equivalent — which is somehow even more frustrating because it's so polite about it).

So the answer has been a layered memory system. Flat files cover the basics — daily log files, a long-term MEMORY.md, periodic curation. But flat files don't really scale once you're past a certain point.
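The flat-file layer is about as simple as it sounds. A minimal sketch, assuming one Markdown file per day plus a shared MEMORY.md in the same directory. The directory layout and both function names are hypothetical; only the daily-log and MEMORY.md split comes from the setup described above.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def append_daily_log(memory_dir: Path, entry: str,
                     day: Optional[date] = None) -> Path:
    """Append one bullet to the log file for the given day (default: today)."""
    day = day or date.today()
    memory_dir.mkdir(parents=True, exist_ok=True)
    log_path = memory_dir / (day.isoformat() + ".md")
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write("- " + entry + "\n")
    return log_path


def promote_to_long_term(memory_dir: Path, note: str) -> None:
    """Curation step: copy a durable fact into MEMORY.md."""
    memory_dir.mkdir(parents=True, exist_ok=True)
    with (memory_dir / "MEMORY.md").open("a", encoding="utf-8") as fh:
        fh.write("- " + note + "\n")
```

The scaling problem shows up at read time: an agent that wants Tuesday's decision has to load whole files to find it.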

ByteRover has made a real difference here. It's a structured knowledge layer for AI agents — stores patterns, decisions, and architectural rules in a queryable context tree. Instead of loading entire session logs into a prompt (which burns tokens like crazy), agents query for what's relevant and pull only that. Before ByteRover, cross-session continuity was pretty unreliable, especially when switching between agents or picking up a project after a few days. Now the structured stuff — project patterns, tech decisions, rules — persists reliably. Big improvement.
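To make the "query instead of load" idea concrete, here's a toy context tree keyed by topic path. This is not ByteRover's API or data model, just an illustration of pulling only the relevant slice. The tree contents, `CONTEXT_TREE`, and `query_context` are all made up for the example.

```python
from typing import Dict, List

# Toy context tree: topic path -> curated entries (illustrative content).
CONTEXT_TREE: Dict[str, List[str]] = {
    "projects/monitoring/decisions": [
        "Score posts locally before queuing them for review.",
    ],
    "projects/monitoring/patterns": [
        "Trigger n8n via webhooks, not direct API calls.",
    ],
    "projects/site/rules": [
        "Blog posts go through Weaver for brand voice.",
    ],
}


def query_context(tree: Dict[str, List[str]], prefix: str,
                  limit: int = 5) -> List[str]:
    """Return only the entries whose topic path starts with the prefix."""
    hits = [entry
            for path, entries in sorted(tree.items())
            if path.startswith(prefix)
            for entry in entries]
    return hits[:limit]
```

An agent picking up the monitoring project would call `query_context(CONTEXT_TREE, "projects/monitoring")` and get two short entries instead of days of session logs.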

Honest caveat though: ByteRover works well for things you've explicitly curated. Conversational nuance — the reasoning behind a decision, the back-and-forth that shaped it — still lives in logs and still gets lost. Real improvement. Not a complete solution. I suspect "complete solution" is a ways off for everyone working on this problem.

ACP reliability. ACP (Agent Client Protocol) is how OpenClaw spawns and coordinates sub-agents. The acpx plugin bridges OpenClaw to Codex specifically, and honestly it's been the messiest part of the whole stack. Like, comfortably the messiest.

This is what early-stage open source infra looks like. You use it because the ceiling is high, and you deal with the rough edges because you know (well, hope) they'll get smoothed out. If you've ever used any open source tool in its first year, you know exactly the vibe. It's a relationship built on faith and GitHub issues.

Knowing when to intervene. Sometimes an agent goes down a path that's technically correct but completely misses the intent. And the longer you let it run, the more work gets thrown away when you course-correct. I haven't found a reliable rule for this — it's still feel-based. You kinda just... learn to read the early signals. Not very satisfying advice, I know.

The thing nobody talks about: orchestration overhead

Everyone talks about what AI agents can do. Nobody talks about the work it takes to make them do it well. (Probably because it's less fun to talk about.)

You're not just prompting. You're writing specs clear enough that an agent can execute without pinging you every five minutes. You're reviewing output and deciding what's good enough to ship vs. what needs another pass. You're debugging agent behaviour when something goes wrong — which is slower than debugging code because the failure modes are super fuzzy. You're maintaining the memory system, updating role definitions, rewriting rules because an agent picked up a habit you didn't want. Fun times.

It's team management. The team is faster and cheaper than humans, but it's still management. If you go in thinking you can throw prompts at it and walk away, you're gonna get garbage output and wonder what happened. If you've managed humans before, a tonne of the same instincts apply — write a clear brief, set expectations, review the work. The medium changed. The job didn't.

The spec I write for Arch before a coding task looks a lot like a well-scoped GitHub issue. The brief I give Morpheus for a planning session looks like a proper PRD stub. More upfront clarity means less rework later. Boring but true.
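As a rough illustration of what "well-scoped" means in practice, here's a sketch of a spec as a small data structure. The field names and the `TaskSpec` class are hypothetical, not a format any of these tools require.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskSpec:
    """A coding brief shaped like a GitHub issue (hypothetical structure)."""
    title: str
    goal: str
    acceptance_criteria: List[str]
    out_of_scope: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the spec as plain text to hand to the agent."""
        lines = [self.title, "", "Goal: " + self.goal, "", "Acceptance criteria:"]
        lines += ["- " + c for c in self.acceptance_criteria]
        if self.out_of_scope:
            lines += ["", "Out of scope:"]
            lines += ["- " + o for o in self.out_of_scope]
        return "\n".join(lines)
```

Forcing yourself to fill in acceptance criteria and an out-of-scope list is most of the value. The agent has far fewer ways to be technically correct and still miss the intent.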

Where it's going

Smart model routing. Right now model selection is mostly manual, which is kinda dumb. I want Kody to route tasks to the right tier automatically based on complexity and cost — Haiku for triage, Sonnet for most work, Opus when depth matters. Should be doable. Haven't done it yet.
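Since this isn't built yet, the following is purely a sketch of what a first pass might look like. The `Task` fields, the 2,000-token cutoff, and `route_tier` are all assumptions for illustration; only the Haiku/Sonnet/Opus split comes from the plan above.

```python
from dataclasses import dataclass


@dataclass
class Task:
    kind: str          # e.g. "triage", "general", "research"
    needs_depth: bool  # long reasoning, specs, strategy
    est_tokens: int    # rough size of the job


def route_tier(task: Task) -> str:
    """Pick the cheapest tier that fits the task (hypothetical rules)."""
    if task.kind == "triage" and task.est_tokens < 2_000:
        return "haiku"   # short, cheap, high-frequency
    if task.needs_depth:
        return "opus"    # quality matters more than speed or cost
    return "sonnet"      # default for most work
```

Rules this crude would probably misroute a fair amount, but they make the cost decision explicit and easy to tune.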

A proper knowledge base. The next meaningful layer is semantic search over everything I read and save — articles, videos, PDFs, transcripts. The architecture is a vector store fed by a capture pipeline, all self-hosted. Any saved content becomes semantically queryable, and that endpoint becomes a tool any agent can call. Oracle did the research on how to build this. Near-term priority. We'll see if it actually stays near-term (these things have a habit of slipping).
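The core of that architecture fits in a few lines once the embedding model is abstracted away. Below is a toy sketch: `embed` uses plain word counts as a stand-in for a real embedding model, and `VectorStore` is a hypothetical in-memory class, not a specific vector database.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real pipeline would call a model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class VectorStore:
    """In-memory store: capture adds documents, agents call search."""

    def __init__(self) -> None:
        self._docs: Dict[str, Counter] = {}

    def add(self, doc_id: str, text: str) -> None:
        self._docs[doc_id] = embed(text)

    def search(self, query: str, k: int = 3) -> List[Tuple[str, float]]:
        q = embed(query)
        ranked = sorted(((doc_id, cosine(q, vec))
                         for doc_id, vec in self._docs.items()),
                        key=lambda item: item[1], reverse=True)
        return ranked[:k]
```

Exposing `search` behind an endpoint is what would turn saved content into a tool any agent can call.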

Deeper ByteRover integration. The context tree gets more useful the more consistently it's updated. I want curation to happen automatically — agents storing learnings and decisions as they work, not as an afterthought. Right now it's still too manual.

So — is this worth it? Yeah. But not because it's magic or whatever the AI hype machine is selling this week.

It's a force multiplier. And like most force multipliers, you only get the benefit if you put the work in to set it up right. The hype version of AI agents is autonomous systems that replace thinking. What I actually have is a team that handles execution so I can focus on direction. That's genuinely useful — just a different thing than what most people are selling. Less sexy, more real.

Anyway. Still building, still breaking things, still figuring out the edges. But it works. And it's getting better. Which is sort of all you can ask for.