We Built a 6-Agent Dev Team That Ships Code 24/7
How we replaced traditional development workflows with 6 autonomous AI agents that write code, review PRs, merge, deploy, and QA — running on 30-minute cron cycles with zero human intervention.
I woke up this morning to 14 commits across three repositories, two merged pull requests, a deployed API update, and a QA report confirming everything passed. I didn’t write any of it. I didn’t review any of it. I didn’t even know it was happening until I checked my phone.
Six AI agents did all of that while I slept.
This isn’t a concept demo. It’s not a weekend hack. It’s our production development pipeline, and it’s been running autonomously for weeks. Here’s how we built it, what it actually looks like in practice, and the stuff that broke along the way.
The Problem: One Developer, Three Products, Zero Time
I run three active products: a full-stack life management platform (LifeOS), a trading intelligence system (TradeSmartAI), and a local services business app (ProAppliance). Each has its own codebase, its own API, its own frontend. Combined, we’re talking about 50+ backend modules, 692 API endpoints, and enough frontend screens to make your eyes bleed.
I’m also a working appliance installer. I spend my days under kitchen sinks and behind refrigerators. The traditional developer workflow — sit at a desk for 8 hours, write code, review code, deploy code — doesn’t work when your hands are full of copper fittings at 2 PM.
So I built a team that never sleeps.
The Architecture: Who Does What
The system runs on OpenClaw, an open-source agent orchestration platform. Think of it as the operating system for AI agents — it handles scheduling, communication, tool access, memory, and inter-agent messaging. Each agent has a specific role, a specific model, and a specific set of responsibilities.
Here’s the roster:
Goose — The COO
Goose is the executive layer. It coordinates between all the other agents, manages priorities, handles deployment pipelines, and makes judgment calls about what ships and what doesn’t. When a builder finishes a PR, Goose reviews the scope, checks if it aligns with current priorities, and either approves the merge or kicks it back with notes.
Goose runs on Claude Opus and has access to SSH (for deploying to our production server), GitHub (for PR management), and direct messaging channels to every other agent.
Elon — The CTO
Yes, we named our CTO agent after that Elon. The CTO handles architecture decisions, technical debt prioritization, and the PR review pipeline. Every hour, a cron job triggers the CTO to scan all repositories for open pull requests, review the code, post comments on GitHub, and either approve or request changes.
The CTO runs on Claude Opus with a 200K context window — large enough to hold entire codebases in memory while reviewing changes. It maintains a running technical context that carries across sessions, so it remembers last week’s architectural decisions when reviewing this week’s code.
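The hourly PR scan is straightforward in shape. Here is a minimal sketch of that loop, assuming the GitHub REST API for listing open pull requests; the repo names, token handling, and the `scan` helper are illustrative stand-ins, not the actual OpenClaw implementation:

```python
import json
import urllib.request

# Hypothetical repo list -- the real scan covers all three product repos.
REPOS = ["example-org/lifeos", "example-org/openclaw-workspace"]

def open_prs(repo, token=None):
    """Return open pull requests for a repo via the GitHub REST API."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    req = urllib.request.Request(
        f"https://api.github.com/repos/{repo}/pulls?state=open",
        headers=headers,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def scan(repos, fetch=open_prs):
    """One hourly scan: collect every open PR so the reviewer can triage it."""
    found = []
    for repo in repos:
        for pr in fetch(repo):
            found.append((repo, pr["number"], pr["title"]))
    return found
```

In the real pipeline, each collected PR would then be fed to the model along with the repo's running technical context before a review comment is posted.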
Builder 1 & Builder 2 — The Developers
Two autonomous builder agents pick up tasks from a prioritized backlog and write code. Builder 1 focuses on infrastructure and platform work (OpenClaw itself, DevOps, CI/CD). Builder 2 handles product development (LifeOS features, mobile app screens, API endpoints).
Each builder runs on a 30-minute heartbeat cycle. Every 30 minutes, the builder checks its task queue, picks the highest-priority item, writes the code, creates a feature branch, commits, pushes, and opens a pull request. Then it moves to the next task.
The builders run on GPT-5.3 Codex as the primary model (optimized for code generation) with Claude Opus as a fallback for tasks that need deeper reasoning or larger context.
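The heartbeat cycle described above can be sketched as a single function. Everything here is a simplified stand-in: the task queue shape, the `write_code` model call, and the branch naming are assumptions for illustration, not OpenClaw's actual internals:

```python
import subprocess

def run_cmd(*cmd):
    """Run a shell command, raising if it fails."""
    subprocess.run(cmd, check=True)

def heartbeat(queue, write_code, run=run_cmd):
    """One 30-minute tick: pop the top task, implement it, push a branch.

    `queue` is a priority-ordered list of task dicts; `write_code` is the
    model call that edits files for a task (both hypothetical stand-ins).
    """
    if not queue:
        return None                      # nothing to do this cycle
    task = queue.pop(0)                  # highest-priority item first
    branch = f"feat/{task['id'].lower()}"
    run("git", "checkout", "-b", branch)
    write_code(task)                     # the agent edits files here
    run("git", "add", "-A")
    run("git", "commit", "-m", f"{task['id']}: {task['title']}")
    run("git", "push", "-u", "origin", branch)
    # Opening the PR would follow, e.g. via the GitHub API.
    return branch
```

The `run` parameter is injectable so the cycle can be dry-run without touching git, which is also how you'd unit-test an agent like this.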
QA — The Inspector
After a PR is merged and deployed, the QA agent runs verification. It hits API endpoints, checks response codes, validates data shapes, and confirms that the deployment didn’t break anything. If something fails, it posts a detailed report and alerts the CTO.
QA runs on an hourly cycle and can verify endpoints across all three products. It has caught deployment issues that would have gone unnoticed for hours.
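The endpoint verification pass can be sketched like this. The endpoint map and `fetch` helper are hypothetical; the real QA agent covers eight categories per product, but the check per endpoint is the same idea — status code plus data shape:

```python
import json
import urllib.request

# Hypothetical endpoint map -- the real pass covers 8 categories per product.
ENDPOINTS = {
    "tasks": "https://api.example.com/tasks",
    "habits": "https://api.example.com/habits",
}

def fetch_json(url):
    """GET a URL and decode the JSON body, returning (status, data)."""
    with urllib.request.urlopen(url) as resp:
        return resp.status, json.load(resp)

def verify(endpoints, fetch=fetch_json):
    """Hit each endpoint; report a count on success, an error on failure."""
    report = {}
    for name, url in endpoints.items():
        try:
            status, data = fetch(url)
            ok = status == 200 and isinstance(data, list)
            report[name] = f"{len(data)} returned" if ok else f"unexpected {status}"
        except Exception as exc:
            report[name] = f"error: {exc}"
    return report
```

A failing entry in the report is what triggers the detailed GitHub post and the escalation to the CTO.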
SRE — Site Reliability
The SRE agent monitors infrastructure health — Docker container status, service uptime, SSL certificates, disk space, and API response times. It runs continuous health checks and escalates issues when containers crash or services go unhealthy.
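Two of those checks — container status and disk space — are cheap enough to sketch in a few lines. This is an illustrative version, assuming the standard `docker ps` format flags; the real SRE agent also tracks uptime, SSL expiry, and response times:

```python
import shutil
import subprocess

def container_states():
    """Map container name -> status string from `docker ps -a` output."""
    out = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(line.split("\t", 1) for line in out.splitlines() if line)

def health_report(states, free_gb, disk_min_gb=5):
    """List escalation-worthy issues: downed containers and low disk."""
    issues = [f"{name}: {status}" for name, status in states.items()
              if not status.startswith("Up")]
    if free_gb < disk_min_gb:
        issues.append(f"low disk: {free_gb:.1f} GB free")
    return issues

# One tick of the health loop might look like:
# health_report(container_states(), shutil.disk_usage("/").free / 1e9)
```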
A Day in the Life (Actual Logs)
Here’s what happened in the last 24 hours. This is pulled directly from our session logs and GitHub activity:
11:00 PM (Thursday): Builder 2 picks up B-263 — wiring 12 new API modules into the LifeOS mobile app. The task involves rewriting how the mobile app communicates with the backend, switching from a legacy memory-based system to direct API calls.
11:42 PM: Builder 2 finishes. 17 files changed: 1,009 lines added, 601 removed. Opens PR #10 on the LifeOS repo.
12:00 AM (Friday): CTO’s hourly scan picks up PR #10. Reviews the code, posts 4 non-blocking notes about future improvements, marks it as approved.
12:15 AM: Goose merges the PR, pulls the changes to the production server, rebuilds the Docker container, runs a health check. API returns 200.
12:30 AM: QA verifies the deployment. Hits 8 endpoint categories: tasks (15 returned), habits (5 returned), goals (5 returned), achievements (20 returned), quotes (15 returned), reminders (4 returned), pets (1 returned), bookmarks (8 returned). All pass.
Meanwhile, Builder 1 is working on B-261 — a smart memory compactor that manages how all 17 agents in the organization handle their own memory files. 461 lines of Python, complete with markdown-aware parsing, deduplication, staleness detection, and safety protections.
2:00 AM: Builder 1 opens PR #61 on the OpenClaw workspace repo.
3:00 AM: CTO reviews. Approves with one flag — a hardcoded API token in a spec file that needs removal before merge.
4:00 AM: QA runs verification. Confirms the token issue, posts a detailed review on GitHub. PR blocked until fixed.
All of this happened while I was asleep. The agents identified work, wrote code, reviewed each other’s code, deployed, tested, and caught a security issue — without a single human keystroke.
The Cron Engine: How Autonomy Actually Works
The secret sauce isn’t the AI models. It’s the scheduling. Every agent runs on a cron-based heartbeat — a timer that fires at regular intervals and asks the agent: “Do you have work to do?”
Here’s our actual cron schedule:
- Builders: Every 30 minutes — check task queue, pick up work, write code
- CTO PR Scan: Every hour — scan repos for open PRs, review code, approve or reject
- CTO Deploy: Every 2 hours — merge approved PRs, deploy, run health checks
- QA Verification: Every hour — verify recent deployments, test endpoints
- SRE Health: Every 15 minutes — check container health, service uptime
- Event Dispatcher: Every 15 minutes — route events between agents
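As a plain crontab, the schedule above would look roughly like this. The `openclaw` CLI name, subcommands, and flags are assumptions for illustration — the actual invocation depends on how the platform exposes its agent runner:

```
# Hypothetical crontab -- command names are stand-ins for the real agent runner
*/30 * * * *  openclaw run builder-1          # check queue, write code, open PRs
*/30 * * * *  openclaw run builder-2
0 * * * *     openclaw run cto --scan-prs     # hourly PR review
0 */2 * * *   openclaw run cto --deploy       # merge + deploy every 2 hours
0 * * * *     openclaw run qa --verify        # hourly endpoint checks
*/15 * * * *  openclaw run sre --health       # container + uptime checks
*/15 * * * *  openclaw run dispatcher         # route inter-agent events
```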
This means the entire development pipeline cycles continuously. A task can go from backlog to deployed in under 3 hours with zero human involvement. In practice, most tasks complete in 1-2 cycles.
What Breaks (And How We Handle It)
Let’s be honest — autonomous agents aren’t magic. Here’s what goes wrong:
Context drift. After enough conversation turns, agents start losing track of earlier decisions. We handle this with structured memory files (MEMORY.md) that persist between sessions and a smart compactor that keeps them from growing unbounded.
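The compactor's two cheapest passes — deduplication and staleness — can be sketched like this. This is a simplified stand-in assuming date-stamped bullets of the form `- [YYYY-MM-DD] note`; the real compactor also does markdown-aware section parsing and safety checks before rewriting anything:

```python
import re
from datetime import date, timedelta

def compact(lines, today, max_age_days=30):
    """Drop exact duplicate bullets and date-stamped entries past max_age."""
    seen, kept = set(), []
    cutoff = today - timedelta(days=max_age_days)
    for line in lines:
        key = line.strip()
        if key and key in seen:
            continue                      # deduplicate repeated notes
        stamp = re.match(r"- \[(\d{4})-(\d{2})-(\d{2})\]", key)
        if stamp and date(*map(int, stamp.groups())) < cutoff:
            continue                      # staleness: drop old dated entries
        if key:
            seen.add(key)
        kept.append(line)
    return kept
```

Running this on every agent's MEMORY.md on a schedule is what keeps the files from growing unbounded between sessions.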
Scope creep. Give a builder agent a simple task and sometimes it’ll “helpfully” refactor three other files while it’s in there. We enforce scope through detailed task specs and CTO review that checks diff size against expected scope.
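The diff-size check is simple to express. A minimal sketch, assuming the task spec lists the files it expects to change (the function name and report shape are illustrative, not the actual review code):

```python
def scope_check(changed_files, expected_files, max_extra=0):
    """Flag a PR whose diff touches files outside the task spec.

    `expected_files` comes from the task spec; anything beyond it (plus a
    small allowance) is treated as scope creep and kicked back for review.
    """
    extra = sorted(set(changed_files) - set(expected_files))
    return {"ok": len(extra) <= max_extra, "unexpected": extra}
```

A failed check does not block the PR outright — it just forces the CTO review to look at the unexpected files before approving.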
Merge conflicts. Two builders working in the same repo can step on each other. Our pipeline enforces a one-PR-per-cycle merge policy — only one PR gets merged per deployment cycle, and the other builder’s PR gets rebased.
Model differences. GPT-5.3 Codex writes code differently than Claude Opus. Switching between them mid-task can introduce inconsistencies. We pin each builder to a primary model and only fall back when the primary is unavailable or rate-limited.
False confidence. AI agents don’t say “I’m not sure about this.” They’ll write code that looks correct, passes basic tests, but has subtle logical errors. That’s why QA verification is non-negotiable — automated endpoint testing catches what code review misses.
The Numbers
Since we started running this pipeline:
- Commits per day: 10-20 across all repos
- PRs opened per day: 3-5
- Average time from task to deployed: 2-4 hours
- Human intervention rate: ~20% of PRs need manual attention
- Build cost: Under $5/day in API costs (most agents run on free-tier models)
That last number is the one that matters. We’re running a development team that ships production code 24/7 for less than the cost of a fancy coffee. The builders use GPT-5.3 Codex (primary) and Claude Opus (fallback). The review agents use Claude Opus. The monitoring agents run on Gemini 3 Pro (free tier).
What This Means
I’m not going to tell you that AI agents will replace developers. That’s a lazy take and it’s wrong. What they replace is the downtime. The hours between when you stop coding and when you start again. The weekends. The nights. The time spent on boilerplate, routine refactoring, and deployment chores.
My agents handle the 80% of development work that’s well-defined and repeatable. I handle the 20% that requires taste, judgment, and understanding the actual business problem. That split lets one person with a day job run three production products simultaneously.
The code is open source. The orchestration platform is OpenClaw. If you want to build something similar, start with one agent doing one thing on a cron cycle. Get that working. Then add another. The complexity is in the coordination, not the individual agents.
Build in public. Ship in your sleep. That’s the whole idea.
This is a build log from AI Tool Dojo. We build AI-powered products and document what actually works. No sponsored takes, no press releases — just real projects from the trenches. Get the weekly briefing for tools worth your time and the ones that aren’t.