Operational Workflow for AI Coding

Vibe coding ships fast for the first hour. After that, it stops shipping anything you understand.

Here is how I drive Claude Code through a long-running project without skill atrophy. It builds on Personal Command Language for AI and Driving Coding Agents from Chat. The question here is not how to talk to an agent or how to run many, but how to ship through one agent over months without becoming dependent on it.

Why I built this

A few months into a long code rewrite I noticed something uncomfortable. The agent’s PRs were merging cleanly. Tests passed. Linter was happy. The code read well on first scan. But if someone had pulled my keyboard away and asked me to write the same module without the agent, I would have stalled in twenty minutes.

That is the failure mode nobody warned me about. It is not that the code is wrong. It is that you slowly stop owning it.

The pattern was always the same. A task would feel hard or unfamiliar or expensive in tokens, and the reflex was to skip a step. Skip the plan. Skip the review. Stub out the test for now. Suppress the lint to unblock. Each one felt cheap. The bill arrived later, in debugging sessions where I could not quite remember why a certain shape was chosen, because I had not been fully there when it was decided.

I wanted a workflow that does three things:

Lets one coding agent ship one PR per session.
Keeps me in the mental model so I can explain every merged hunk in 30 seconds.
Makes the discipline cheaper to follow than to skip.

The result: eight slash commands, one of them a meta command that walks the rest, and a pre-push gate that refuses unfinished work.

The cycle

/session  (meta, walks the rest)

  /orient        message 1: read STATUS, summarize task, list files. No code.
    |
  /teach <x>     optional. Only if the domain concept is unfamiliar. 3 turns: teach, probe, implement.
    |
  /plan-task     plan mode. Propose API, files, tests, risks. No code.
    |
  (execute)      write code. Function by function. Pause for review.
    |
  ci local       format, analyze, layer boundaries, tests, drift check.
    |
  /review        fresh-context review of the branch diff vs main.
    |
  /commit-push-pr   Conventional commit, push, draft PR.
    |
  /retro         capture learning, set the next session's first prompt.

One PR per session. Five hand-edited lines minimum (an anti-atrophy floor). One new domain concept internalized when a /teach was needed.

Here is the full command surface:

Command	When	What it does
`/session`	Session start	Meta. Walks orient through retro.
`/orient`	Message 1	Read STATUS, summarize task, list files, surface risks. No code.
`/teach <concept>`	Before unfamiliar work	3-turn pattern: teach, probe, implement. The anti-atrophy keystone.
`/plan-task`	After orient	Plan mode. Propose API, files, tests, risks, alternatives. No code.
`/review`	Before PR ready	Fresh-context Writer/Reviewer pass on the branch diff.
`/commit-push-pr`	After tests pass	Conventional commit, push, draft PR with template.
`/retro`	Session end	Capture learning, propose LEARNINGS append, set next prompt.
`/refresh-agents-md`	Monthly	Drift detection between AGENTS.md and the code.

What a slash command actually is

Each slash command is a Markdown file in .claude/commands/. When I type /orient in Claude Code, the body of that file gets injected into the agent’s context as its instructions for that turn. No DSL, no plugin system, no SDK. Plain text the agent reads top to bottom.

Here is the body of orient.md, the one that runs first every session:

You are starting a fresh Claude Code session. Orient yourself before
any action.

Procedure:
  1. Read docs/STATUS.md - current phase, next step.
  2. Skim docs/LEARNINGS.md - refresh accumulated rules.
  3. Locate the named task in STATUS.md or the referenced ADR.
  4. Read the most-relevant ADR.
  5. Read .claude/SESSION.md if present (written by SessionStart).

Output format:
  ORIENT
  ======
  Phase:          <name + in-progress/complete>
  Task:           <one sentence>
  Files to touch: <list, paths from repo root>
  ADR ref:        <number + title>
  Open questions: <list or "none">
  Risks:          <relevant in-flight risks>

Guardrails:
  - Do not write code in this turn.
  - Do not propose a plan. That is /plan-task.
  - Do not assume scope. Ask if ambiguous.
  - Cite file paths and line numbers.
  - Surface discrepancies between docs and operator instructions.

That’s the entire pattern. The other commands look the same: a procedure, an output format, a list of guardrails. The hard part is not implementing the commands. The hard part is figuring out what to write in them.
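Because every command file follows the same three-part shape, that shape can be checked mechanically. Here is a minimal sketch of such a check, assuming the section headings appear literally as "Procedure:", "Output format:" and "Guardrails:" at the start of a line, as they do in orient.md. The function names are hypothetical and this check is not part of the workflow described here.

```python
from pathlib import Path

# Section headings each command file is expected to contain, following the
# shape of orient.md. Matching these exact strings is an assumption.
REQUIRED_SECTIONS = ("Procedure:", "Output format:", "Guardrails:")


def missing_sections(body: str) -> list[str]:
    """Return the required headings that do not start any line of body."""
    lines = [line.strip() for line in body.splitlines()]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(line.startswith(section) for line in lines)
    ]


def lint_commands(commands_dir: Path) -> dict[str, list[str]]:
    """Map each command file name to its missing sections (files with gaps only)."""
    report = {}
    for path in sorted(commands_dir.glob("*.md")):
        gaps = missing_sections(path.read_text(encoding="utf-8"))
        if gaps:
            report[path.name] = gaps
    return report


if __name__ == "__main__":
    for name, gaps in lint_commands(Path(".claude/commands")).items():
        print(f"{name}: missing {', '.join(gaps)}")
```

A check like this only verifies the skeleton. Whether the guardrails are the right ones is still the hard part.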

A real session

Here is the shape of a session I ran today.

I open the agent. The SessionStart hook drops a session file with the current phase, git state, tool versions, and the in-flight risks I should be holding in my head.

First message: /session. The meta command walks the cycle and reminds me of the hard rules.

Then /orient. The agent reads the project STATUS file, the relevant ADR, and the current branch state, and produces a structured block: phase, task, files to touch, ADR reference, open questions, risks. No code yet.

The concept I am about to touch is unfamiliar, so I run /teach. The agent does Turn 1 only: explanation, source, stop. I read it carefully. Decide not to probe further.

/plan-task comes next. Public API, file list, test plan, risks ranked by likelihood, three design alternatives. I edit the plan inline, send back “go”.

The agent writes code, pausing for review per unit. Tests pass locally. The linter catches things the analyzer did not bother me with earlier: cross-file doc-comment references it could not resolve, and a const constructor I forgot. We iterate until clean. This bit took longer than I expected.

/review. A fresh-context subagent reads the diff against main as if it had never seen the code. Today: APPROVE, no blockers, five comments parked as followups for a later phase.

/commit-push-pr writes the Conventional Commits message, runs the pre-push gate (eight sequential checks), opens a draft PR with the templated body. I squash-merge and delete the branch.

/retro captures the non-obvious learnings, what we deliberately did NOT do and why, the verbatim first prompt for tomorrow’s session, and proposes a LEARNINGS entry. I accept.

I walk away.

Fifteen minutes later I open the merged PR on GitHub and cold-read it. Every hunk I can explain out loud in 30 seconds stays. Anything I cannot explain gets reverted. Today nothing got reverted, but the discipline only works if I actually do the check.

About two and a quarter hours including the cold-read. One PR. One concept internalized. One LEARNINGS entry. The numbers do not always look this clean, but the structure is the same.

The keystone: teach, probe, implement

Use this before any non-trivial change touching a concept you have not internalized. Three turns, no shortcuts.

Turn 1: TEACH

I send /teach <concept>. The agent answers in a fixed format:

TEACH: <concept>
================
What it is:      precise definition for my engineering level
How it works:    mechanism, with notation or equations if relevant
Why it matters:  concrete behavior in my project that depends on it
Source:          file:line, paper citation, or ADR

Then stops. No code. No implementation plan.

A filled-in example, for a concept I had to internalize before touching the model-provider layer:

TEACH: streaming-token-types
============================
What it is:
  The taxonomy of events a chat() Stream can emit. In this codebase
  it is a sealed class with exactly three variants:
    1. TextDelta      — incremental text from the model
    2. ToolCallDelta  — partial tool-call payload; may arrive as
                        many deltas sharing one callId
    3. FinishMessage  — terminal event carrying FinishReason + Usage

How it works:
  A discriminated union over a stream of partials. OpenAI chat
  completions SSE, Anthropic messages.stream, and Vertex
  streamGenerateContent all converge on this shape. FinishMessage
  is the only terminal event; consumers must accumulate
  ToolCallDelta fragments with the same callId into a complete
  JSON args object before it arrives.

Why it matters:
  Three behaviors hinge on this being a token-granular sealed union
  rather than a single "final string" response:
    1. An output-safety gate intercepts forbidden patterns BEFORE
       they hit the user's screen. A "give me the final string" API
       would have already leaked.
    2. A tool-call safety gate runs over assembling JSON before the
       call fires, not after.
    3. Cost telemetry needs the terminal event for usage accounting.

  ChatChunk is sealed. Add a fourth variant and every exhaustive
  switch across the codebase breaks at compile time. That is the
  point.

Source:
  packages/shared/lib/src/model_response.dart:1-67
  docs/adr/0015-output-safety-gate.md
  docs/adr/0003-tier-routing-policy.md
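To make the accumulation rule in that TEACH block concrete, here is a rough Python stand-in for the three-variant union and a consumer that folds the stream. It is a sketch under assumptions, not the project's Dart code: the field names (call_id, args_fragment, reason, usage) and the collect function are hypothetical, and Python does not enforce exhaustiveness at compile time the way a sealed class does, so the sketch raises at runtime on an unknown variant instead.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Union


@dataclass(frozen=True)
class TextDelta:
    text: str


@dataclass(frozen=True)
class ToolCallDelta:
    call_id: str
    args_fragment: str  # a slice of the JSON args; not valid JSON on its own


@dataclass(frozen=True)
class FinishMessage:
    reason: str
    usage: dict


ChatChunk = Union[TextDelta, ToolCallDelta, FinishMessage]


def collect(chunks: Iterable[ChatChunk]):
    """Fold a chunk stream into (text, tool_calls, finish)."""
    text_parts = []
    fragments = {}  # call_id -> list of fragments, in arrival order
    finish = None
    for chunk in chunks:
        if finish is not None:
            raise ValueError("chunk received after FinishMessage")
        if isinstance(chunk, TextDelta):
            text_parts.append(chunk.text)
        elif isinstance(chunk, ToolCallDelta):
            fragments.setdefault(chunk.call_id, []).append(chunk.args_fragment)
        elif isinstance(chunk, FinishMessage):
            finish = chunk
        else:
            raise TypeError(f"unknown chunk variant: {chunk!r}")
    if finish is None:
        raise ValueError("stream closed without FinishMessage")
    # Fragments are parsed only once the stream has finished, so each
    # call_id's JSON is complete before json.loads sees it.
    tool_calls = {cid: json.loads("".join(parts)) for cid, parts in fragments.items()}
    return "".join(text_parts), tool_calls, finish
```

Feeding it two text deltas interleaved with two fragments of one tool call yields the joined text and a single parsed args object. A stream that closes without a FinishMessage raises instead of returning a partial result.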

Turn 2: PROBE

After reading carefully, I ask:

What is the one thing here that is non-obvious or commonly mis-implemented?

The agent answers that question specifically. Names the failure mode. Cites where it bites, ideally with a real example.

Turn 3: IMPLEMENT

Now I say “implement”, or run /plan-task first if the implementation is non-trivial. The agent writes code, function by function, pausing for review.

Cost: 5 to 15 minutes per concept. Payoff: I actually understand my own product. Over months that compounds into the difference between someone who can ship without the agent and someone who cannot.

The hard rules

Five rules. Every session.

1. The 30-second rule

Only merge code I can explain out loud in 30 seconds. If my eyes glaze on a hunk, I rewind or shrink scope. This is the anti-atrophy floor.

2. Session token cap

Sessions cap at about 120k input tokens. Quality drops measurably past that, even within the model’s 200k window. Compact at 60% context. /clear on topic change. New terminal for a new PR.
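The two numbers in this rule line up: 60% of a 200k window is 120k tokens. A small sketch of that arithmetic, assuming the 60% refers to the full 200k window. The names are illustrative only.

```python
CONTEXT_WINDOW = 200_000  # model window cited in the rule
COMPACT_PERCENT = 60      # "compact at 60% context"


def compact_threshold(window: int = CONTEXT_WINDOW, percent: int = COMPACT_PERCENT) -> int:
    """Token count at which compaction is due (integer math avoids float drift)."""
    return window * percent // 100


def should_compact(input_tokens: int) -> bool:
    """True once the session has reached the compaction threshold."""
    return input_tokens >= compact_threshold()
```

Under that assumption the compaction point and the session cap are the same 120,000-token line.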

3. Model discipline

Three tiers:

Haiku 4.5 for doc work, orientation, schema additions, mechanical refactors. Roughly 60 to 80% cheaper than Sonnet on doc-shaped workloads.
Sonnet 4.6 for code edits. The default.
Opus 4.7 only for planning mode and gnarly debugging via /model opusplan, which auto-switches back to Sonnet for execution.

Using Opus for routine work wastes roughly 40% of the spend. The pricing arithmetic is published, and the savings I actually see match the published numbers within rounding.

4. One concern per PR

Refactors land separately from features. Bug fixes do not bundle. The discipline is more important than it sounds: a fresh-context review on a single-concern PR is meaningful; on a ten-concern PR it is theater.

5. No merge without /review

Author and reviewer are the same person on this project. The only protection against rubber-stamping my own writing is a fresh-context subagent that never saw the code being written. Every PR runs /review before squash-merge.

What the harness blocks

Prose in AGENTS.md can be ignored. A pre-tool hook that exits with status 2 cannot. So the things I really need enforced are wired as hooks, not as rules.

Hook	When	What it does
SessionStart	New session	Writes a session file with phase, git, tools, risks
PreToolUse	Every Edit/Write	Blocks edits to generated files and load-bearing invariant files
PostToolUse	After Edit/Write	Runs `dart format` (or your formatter) on the touched file. Advisory.
PreCompact	Before compaction	Snapshots context to a precompact file so post-compact reads from facts, not a degraded summary
SessionEnd	Session close	Appends an audit JSONL entry
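As an illustration of the PreToolUse row, here is a minimal sketch of a blocking hook. It assumes the hook receives a JSON payload on stdin with tool_name and tool_input.file_path fields, and that exit status 2 blocks the tool call. The protected patterns are hypothetical placeholders, not the project's real list.

```python
import fnmatch
import json
import sys

# Hypothetical patterns: which files count as generated or load-bearing
# is project-specific and not taken from the original setup.
PROTECTED_PATTERNS = ("*.g.dart", "*.freezed.dart", "docs/adr/*.md")


def is_protected(file_path: str, patterns=PROTECTED_PATTERNS) -> bool:
    """True if the path matches a protected pattern, relative or absolute."""
    normalized = file_path.replace("\\", "/")
    return any(
        fnmatch.fnmatch(normalized, pattern)
        or fnmatch.fnmatch(normalized, "*/" + pattern)
        for pattern in patterns
    )


def decide(payload: dict) -> int:
    """Return the exit status for one hook invocation: 2 blocks, 0 allows."""
    if payload.get("tool_name") not in ("Edit", "Write"):
        return 0
    file_path = payload.get("tool_input", {}).get("file_path", "")
    return 2 if is_protected(file_path) else 0


def main() -> int:
    """Read the hook payload from stdin and explain any block on stderr."""
    payload = json.load(sys.stdin)
    status = decide(payload)
    if status == 2:
        print("blocked: this file is protected; change its source instead",
              file=sys.stderr)
    return status

# When installed as a hook script, finish with: sys.exit(main())
```

Keeping the decision in a pure function means the rule can be unit-tested without starting an agent session.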

On top of the hooks, a pre-push git hook runs eight sequential gates: workflow YAML lint, format check, static analysis, layer-boundary check, tests, codegen drift, markdown lint, dependency CVE scan. The push fails locally if any gate fails. There is no remote CI for this project (private repo, free GitHub plan, manual workflow_dispatch only).

The pre-push gate runs in roughly 60 seconds. Past 90 seconds I would stop running it. 60 is fine.

What it looks like when something is wrong:

$ git push -u origin feat/streaming-decoder

running pre-push hooks
  workflow-lint  ✔
  format         ✔
  analyze        ✗
  layers         ✔
  test           ✗
  drift          ✔
  markdown       ✔
  osv-scanner    ✔

✗ analyze
  packages/shared/lib/src/decoder.dart:42:15
    The named parameter 'streamId' isn't defined.

✗ test
  decoder_test.dart: yields terminal event on stream close
    Expected: FinishMessage(reason: stop, ...)
    Actual:   null

pre-push hook aborted. fix the failures and re-push.
error: failed to push some refs to 'origin'

The push never reaches the remote. The branch stays local until I either fix the failures or explicitly choose to abandon the work. There is no “I will clean this up tomorrow” path.
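The output above shows later gates still running after analyze fails, so the gate appears to collect failures instead of stopping at the first one. A minimal sketch of that behaviour follows, assuming each gate is a command whose exit status signals pass or fail. The runner and function names are hypothetical, and the three example commands cover only the gates with well-known Dart invocations, not the project's real configuration.

```python
import subprocess
from typing import Callable, Sequence

# Example gates only. The remaining five gates are omitted because their
# commands are project-specific.
EXAMPLE_GATES = [
    ("format", ["dart", "format", "--output=none", "--set-exit-if-changed", "."]),
    ("analyze", ["dart", "analyze"]),
    ("test", ["dart", "test"]),
]


def run_command(cmd: Sequence[str]) -> bool:
    """Run one gate command; True means it exited with status 0."""
    try:
        return subprocess.run(cmd, capture_output=True).returncode == 0
    except FileNotFoundError:
        return False  # a missing tool counts as a failed gate


def run_gates(gates, runner: Callable[[Sequence[str]], bool] = run_command) -> list[str]:
    """Run every gate in order and return the names of the ones that failed."""
    failed = []
    for name, cmd in gates:
        ok = runner(cmd)
        print(f"  {name:<14} {'✔' if ok else '✗'}")
        if not ok:
            failed.append(name)
    return failed


def exit_status(failed: list[str]) -> int:
    """A pre-push hook aborts the push by exiting with a non-zero status."""
    return 1 if failed else 0
```

Running all gates costs a few extra seconds on a failing push, but it reports every problem in one pass instead of one per attempt.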

What I deliberately do not do

Pattern	Why I skip it
Spec-Kit’s full `/specify -> /plan -> /tasks -> /implement` chain	Documented 10x slower than direct prompting for solo work. The ADR + AGENTS.md harness already gives the structure.
Multi-agent orchestration on a single project	Surveys report 40% of multi-agent pilots fail within six months. Subagent-heavy workflows incur 4 to 7x token overhead.
Background or overnight autonomous runs	Published horror stories range from a deleted production database to a $47k single-session bill. No upside for a deliberate rewrite.
Auto-generated AGENTS.md	An early-2026 empirical study found LLM-generated context files degrade performance by 3%. Hand-craft AGENTS.md.
More than three MCP servers	Each MCP adds about 18k tokens of tool definitions per turn. Five MCPs is 90k tokens of pure overhead before any real work.
Code-graph context (Neo4j, Graphify, etc.)	Pay-back floor is around 50k LoC. My project is below.
Custom MCP server “for this project”	Premature until I want the same tool from two different clients. Slash command first; MCP only when the slash-command pain is real.

I tried most of these for at least one session. The reasons I dropped them are personal, not theoretical.
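The MCP row is the easiest one to check by arithmetic. Here is a small sketch using the table's 18k-per-server estimate and the roughly 120k session cap from the hard rules. Treating tool definitions as counting against that cap is my assumption for the sake of the comparison.

```python
TOKENS_PER_MCP = 18_000  # approximate tool-definition cost per server, from the table
SESSION_CAP = 120_000    # session input-token cap from the hard rules


def mcp_overhead(servers: int) -> int:
    """Tokens spent on tool definitions alone for a given number of servers."""
    return servers * TOKENS_PER_MCP


def share_of_cap(servers: int) -> float:
    """Fraction of the session cap consumed before any real work."""
    return mcp_overhead(servers) / SESSION_CAP
```

With five servers the overhead is 90,000 tokens, three quarters of the cap. With three it is 54,000, just under half.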

When the workflow breaks

A specific session where this bit me

Three PRs into a recent week, Claude Code kept asking me to confirm git commit --amend and git push --force-with-lease. I had set defaultMode: "bypassPermissions" in .claude/settings.local.json early in the project. I thought I was in YOLO mode. What had happened: every time I clicked “Yes, don’t ask again” on a specific prompt, Claude Code had been rewriting the file, replacing the bypass directive with a narrow per-command allow list that drifted across sessions. The bypass had been gone for at least a day. I noticed because I got tired of clicking, not because the harness told me. The fix took 60 seconds. The lesson cost more than 60 seconds of trust: harness state is not write-once, and the workflow only works if you treat its config as something that gets edited by surprise.
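A drift like this is cheap to detect once you know to look for it. Here is a minimal sketch of a check that could run at session start. It assumes defaultMode sits either at the top level of the settings file or under a permissions object, and that per-command approvals land in a permissions.allow list. Both are assumptions about the file layout, and the function names are hypothetical.

```python
import json
from pathlib import Path

EXPECTED_MODE = "bypassPermissions"


def permission_drift(settings: dict, expected_mode: str = EXPECTED_MODE) -> list[str]:
    """Return human-readable findings; an empty list means no drift detected."""
    permissions = settings.get("permissions", {})
    # Assumed layout: look under "permissions" first, then at the top level.
    mode = permissions.get("defaultMode", settings.get("defaultMode"))
    findings = []
    if mode != expected_mode:
        findings.append(f"defaultMode is {mode!r}, expected {expected_mode!r}")
    allow = permissions.get("allow", [])
    if allow:
        findings.append(f"{len(allow)} per-command allow entries present")
    return findings


def check_file(path: Path) -> list[str]:
    """Load a settings file and report drift, treating a missing file as drift."""
    if not path.exists():
        return [f"{path} is missing"]
    return permission_drift(json.loads(path.read_text(encoding="utf-8")))
```

Printing the findings into the session file would have surfaced the missing bypass on the first session after it disappeared, not a day later.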

Three other failure modes I have hit:

“I knew the task; I skipped /orient”

The orient step is not for the agent. It is for me. Skipping it because the task feels obvious is the most reliable way to start drifting. The act of orienting forces the mental-model load. Cost is five minutes. Benefit is calibration on what is actually open.

“I argued with the agent for 20 minutes”

Arguing pollutes the context. The right move is /rewind (or /clear and reprompt). I lose the argument history and gain a clean slate. Cheaper than trying to talk the agent back on track.

“I forgot to /retro”

The retro is what makes sessions compound. Skipping it once is forgivable. Skipping it three times in a row means the LEARNINGS file goes stale, and the next session repeats a 4-hour debugging episode that someone (me) already solved last week.

Closing thought

None of this is novel. The patterns come from Anthropic’s own docs, a handful of practitioner blogs, and a few painful sessions I should have done differently. What is mine is the specific assembly: which slash commands matter, which hooks enforce what, which rules are load-bearing for me.

One contrarian note before the cross-links: the eight commands above will be wrong for your project. Most of the skill libraries and slash-command packs floating around GitHub are cargo-culted, and the ones that aren’t reflect someone else’s decisions. The harness only earns its keep when it encodes decisions you have actually made. Steal the structure if it helps. Don’t import the choices.

The keystone is /teach. The floor is the 30-second rule. The compounding mechanism is /retro writing to LEARNINGS. The rest is plumbing.

The reason I do it this way is simple. I want the project to still belong to me in six months, not to the agent. The discipline is what makes that possible.