Closedloop.ai

Closedloop.ai Student Workbook

Claude Code One-Day Intensive Student Workbook

A hands-on follow-along workbook with repo-backed examples, file locations, commands, exercises, and desk-reference patterns for the full five-hour Claude Code intensive.

Format: Five-hour live intensive

Audience: IB Global Engineering

Output: Reusable operating artifacts

How to Use This Workbook

This workbook turns the one-day agenda and instructor presentation guide into a readable companion. You want more than slide bullets: a practical operating model, concrete artifact templates, and enough explanation to keep applying the ideas after today’s session ends. This is the operating model our team arrived at the hard way; it is what works for us, not a finished science. If a section is already second nature to you, skim it and help the people around you; if something here can be sharpened, tell us.

This session assumes mandatory setup pre-work. You should arrive with Claude Code working, the repository cloned, demo commands verified, editor and terminal ready, and your baseline tool-permission posture understood. Do not spend live time repairing local setup; spend it practicing primitive design, planning, investigation, review, and workflow design.

Course outcomes: By the end of today, you should have a Claude Code primitive kit, Implementation Plan, Explore findings, Request Changes, Review, Workflow, and one Workflow retro. A primitive kit is simply the collection of small, reusable Claude Code building blocks (a tool posture, a command, a skill, a subagent, and a plugin decision) that you craft in Module 1 and reuse for the rest of the day.

Why this pays off: None of these artifacts are paperwork; each one is leverage. A primitive kit stops the team re-solving solved problems. A compact plan lets the model land a change in one pass instead of five. An Explore-findings memo prevents the expensive regression before it ships. Token discipline lowers your cost-to-serve on every run. The return is concrete: fewer wasted sessions, faster time-to-shipped, and reusable assets that compound each time another engineer picks them up.

One running example: A single scenario threads the whole day: an evidence-first triage and fix of a refresh-token rotation bug. The artifacts build on each other by pointing at the ones before them: the Implementation Plan (Module 2) frames the Explore findings and Request Changes (Module 3), which feed the Review (Module 4), which the Workflow (Module 5) ties into one repeatable loop. By the end you have walked a complete evidence-first bug fix, end to end.

ClosedLoop’s operating model (worth absorbing even if you never use the product): agents produce Documents (an Implementation Plan, a Review, and so on) and humans govern them at milestones: Draft → In Review → Approved → Executed → Done. A Workflow is the named sequence that orchestrates those steps with human gates. The artifacts and habits in this workbook use that same nomenclature, so by the end you are already thinking in this operating model.

Course Map

Modules

Start Here: Student Follow-Along Guide
How to Use This Workbook
Module 1: Build Claude Code Primitives
Module 2: Planning and Context Management
Module 3: Intent Recovery and Dynamic Evidence
Module 4: Review, Test, and Verify
Module 5: Workflow Design and Minimal Improvement Loop
Desk Reference and Repo Links
Concrete Examples, File Locations, and Repo Links
Token Efficiency Throughout the Course
How the Workbook Reinforces the Course
Token Savings Field Guide

One-day timeboxes

Module	Time	Hands-on center	Must leave with
1. Build Claude Code primitives	70 min	Create tool posture, command, skill, subagent, plugin note	Claude Code primitive kit
2. Planning and context management	55 min	Convert messy intake into Implementation Plan	Implementation Plan and context map
3. Intent recovery and dynamic evidence	60 min	Recover why from git/issues/logs and build Request Changes	Explore findings and Request Changes
4. Review, test, and verify	60 min	Review diff against Implementation Plan and produce Review	Findings and Review
5. Workflow and improvement loop	45 min	Design named workflow with handoffs, gates, stop conditions	Workflow and improvement note

The recurring critique pattern

Every module uses the same critique pattern because the course is artifact-first. Do not ask only whether an artifact looks polished. Ask whether it can serve as a downstream input.

What would you improve about this artifact? What would make it better as a downstream input? Could another operator use this without reopening the whole problem?

This pattern is intentionally simple. It works for primitive kits, Implementation Plans, Explore findings, Request Changes, Reviews, and Workflows. The goal is to train you to judge artifacts by operational usefulness rather than by surface completeness.

Start Here: Student Follow-Along Guide

Your goal: Leave with seven small artifacts you can reuse at work: a primitive kit, Implementation Plan, Explore findings, Request Changes, Review, Workflow, and one Workflow retro.

This workbook is the student version of the one-day intensive. It removes instructor-only delivery notes and turns the course into a practical follow-along reference. Keep the GitHub repository open while you read, because the examples, anti-patterns, pre-work, and commands are part of the exercise flow.

Repository: closedloop-ai/claude-code-expert-training
Pre-work checklist: Get your machine ready (~15 min)
Setup lab: Guided walk-through of the same setup
Tool permissions examples: Permissions posture examples
Demo artifacts index: Good vs anti-pattern examples

How to use each module

Read the framing and vocabulary before the live block starts.
Open the linked example and anti-pattern artifacts from the repo.
Build the named artifact for that block.
Use the critique prompt: What would make this better as a downstream input?
Keep your final artifacts short enough for another operator to use without reopening the whole conversation.

Module 1 · 70 minutes

Build Claude Code Primitives

If you have only ever used Claude Code as one long chat box, this is where that habit breaks. The move that separates a casual user from a team-scale operator is small but total: instead of pouring everything into a single endless conversation, you start reaching for the right building block (a command, a skill, a subagent, a hook) for each kind of work. Do not try to memorize the catalog yet. By the end of this module you will have built one of each with your own hands, and the judgment for which to reach for starts to feel obvious once you have made the call a few times.

Module outcome: You leave Module 1 with a Claude Code primitive kit: a safe tool posture, one custom command, one skill skeleton, one subagent, and one plugin decision note. The concept recap comes after the build, not before it.

Follow-along repo links: Primitive framework · Tool and permissions examples · Skill template · Primitive anti-pattern to critique

Hands-on: build the Claude Code primitive kit

Create a four-line tool posture: allowed reads, allowed shell commands, writes requiring approval, and dangerous actions that stay blocked.
Create a custom command at .claude/commands/summarize-failing-test.md.
Create a skill skeleton at .claude/skills/flaky-test-investigation/SKILL.md using the course template.
Create a subagent at .claude/agents/history-investigator.md or .claude/agents/security-reviewer.md.
Write a plugin decision note: not yet, team-local, or package later.
Peer review: could another operator use this kit without reopening the whole problem?
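One way to make the four-line tool posture from the first step concrete is to write it as data and render it into a permissions block. The sketch below assumes the allow / ask / deny keys and the Tool(specifier) rule strings used by .claude/settings.json; every rule shown is illustrative rather than a recommendation, so compare it with the repo's tool-permissions examples before adopting it.

```python
import json

# Hypothetical four-line posture. Every rule string is an illustrative assumption.
POSTURE = {
    "allowed_reads": ["Read(./src/**)", "Read(./tests/**)"],
    "allowed_shell": ["Bash(pnpm test:*)", "Bash(git log:*)", "Bash(git diff:*)"],
    "writes_need_approval": ["Edit", "Write"],
    "blocked": ["Bash(git push:*)", "Bash(rm:*)", "Read(./.env)"],
}


def to_settings(posture: dict) -> dict:
    """Map the four posture lines onto allow / ask / deny permission lists."""
    return {
        "permissions": {
            "allow": posture["allowed_reads"] + posture["allowed_shell"],
            "ask": posture["writes_need_approval"],
            "deny": posture["blocked"],
        }
    }


# Render the block so it can be reviewed before it is pasted anywhere.
print(json.dumps(to_settings(POSTURE), indent=2))
```

Keeping the posture as four named lines, and deriving the settings from them, makes the peer-review question easier: a reviewer reads four lists instead of reverse-engineering intent from a flat rule file.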

Token-efficient operating habit

Build the smallest useful abstraction. A direct prompt is cheaper than a skill, a skill is cheaper than an always-loaded rule, and a subagent is cheaper than polluting the main thread with broad exploration.

Why primitive design is the first skill

Most failed AI-assisted engineering sessions do not fail because the model is incapable. They fail because the work is routed to the wrong primitive. A developer asks for a broad implementation when the real need is investigation. A team writes a permanent rule into a one-off prompt. A repeated checklist stays trapped in someone’s memory instead of becoming an executable skill. A heavyweight model is used for cheap file discovery, while a complex design decision is handed to a fast model with insufficient reasoning depth.

A Claude Code operating model therefore begins with vocabulary. Not vocabulary for its own sake, but vocabulary that gives the team a shared way to decide where work should live. Once the team can say “this is a command,” “this is a skill,” “this belongs in memory,” “this should be a subagent,” “this is a hook,” or “this needs a headless run,” the tool stops being mysterious and starts becoming an engineering system.

The primitive build lab

Think of Claude Code as a workbench with several surfaces. The interactive session is where a human and model negotiate a task. Tools are the model’s hands: reading files, running commands, searching repositories, editing code, and interacting with configured integrations. Commands are named entry points that standardize common activities. Skills package reusable procedures. Agents and subagents isolate specialized work. Plugins and marketplace content extend what is available across projects or organizations. Headless execution turns the same operating model into automation.

The practical question is not “which feature is coolest?” The practical question is: where should this work be represented so that another engineer can reuse it without rediscovering it?

Primitive	Best use	Team-scale signal	Common mistake
Prompt	One-off instruction, exploration, or clarification	Useful when the work is new, ambiguous, or conversational	Using prompts repeatedly for stable procedures
Command	A named local action or workflow entry point	Useful when the same operation should start the same way every time	Packing too much reasoning or policy into a command that should be a skill
Skill	A reusable procedure, checklist, transformation, or review pattern	Useful when a workflow should be invoked on demand and updated centrally	Putting always-needed facts in a skill instead of memory or project instructions
Agent	A specialized role with its own instructions and tool boundaries	Useful when a class of work needs a consistent expert lens	Creating broad agents with vague responsibilities
Subagent	Isolated investigation or parallel work unit	Useful when exploration would pollute the main context window	Letting broad file reads accumulate in the main conversation
Tool	A capability the model calls: a built-in, a CLI binary run via Bash, or an MCP server’s tool	Useful when work needs real actions: run, read, search, call an API	Writing a “tool definition” for a CLI Claude could just run under permissions
Hook	A shell command wired to a lifecycle event (PreToolUse, PostToolUse, Stop, and others)	Useful when a rule must run deterministically, not only when the model remembers	Using a hook for soft guidance that belongs in CLAUDE.md, or vice versa
Plugin / marketplace package	Reusable bundle of the above (commands, skills, subagents, hooks, tools) distributed beyond one repo	Useful when teams need a shared, versioned extension point	Packaging before the workflow has stabilized
Headless run	Non-interactive execution in CI, scripts, or automation	Useful when the work has clear inputs, outputs, and stop conditions	Automating work that still requires human judgment
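The Hook row is the one primitive in the table that is a program rather than an instruction. Here is a minimal sketch of a PreToolUse check, assuming the hook receives the pending tool call as JSON on stdin (with tool_name and tool_input fields) and that exit code 2 blocks the call. Both are assumptions to confirm against the hooks documentation, and the blocked fragments are hypothetical.

```python
import json
import sys

# Hypothetical deny-list for illustration; align it with your own tool posture.
BLOCKED_FRAGMENTS = ("git push", "rm -rf")


def decide(payload: dict) -> tuple:
    """Return (exit_code, message) for one pending tool call.

    Assumption: exit code 0 lets the call proceed and exit code 2 blocks it.
    """
    if payload.get("tool_name") != "Bash":
        return 0, ""
    command = payload.get("tool_input", {}).get("command", "")
    for fragment in BLOCKED_FRAGMENTS:
        if fragment in command:
            return 2, f"Blocked by tool posture: '{fragment}' is not allowed."
    return 0, ""


def main(stdin_text: str) -> int:
    """Parse the hook payload, report any block reason on stderr, return the exit code."""
    code, message = decide(json.loads(stdin_text))
    if message:
        print(message, file=sys.stderr)
    return code

# In a real hook script the final line would be: sys.exit(main(sys.stdin.read()))
```

The point of the sketch is the contrast with CLAUDE.md guidance: this check runs on every matching tool call whether or not the model remembers the rule.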

Interactive versus headless work

Interactive mode is for discovery, negotiation, and judgment. A human can interrupt, correct assumptions, ask for alternatives, and decide whether the model’s next step is safe. Headless mode, Claude Code’s non-interactive print mode invoked with claude -p (--print), is for work that has already been bounded. It needs clear inputs, allowed tools, expected outputs, and stop conditions. If the task still requires a human to decide what the task is, it is not ready for headless execution.

A good rule is this: interactive sessions produce artifacts; headless sessions consume artifacts. During a live session you might create an Implementation Plan, Explore findings, or Review. Once those artifacts are stable, a headless run can implement a bounded change, run a review, or generate a report from known inputs.

Checkpoint: Before automating a Claude Code workflow, ask whether another operator could execute it from the artifact alone. If the answer is no, the workflow is still too implicit.
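The checkpoint above can be turned into a small gate: refuse to build a headless command until the inputs, allowed tools, expected output, and stop condition are all written down. This is a sketch under assumptions. The HeadlessSpec type and the PLAN.md file name in the usage note are hypothetical, and the --allowedTools and --max-turns flags should be checked against your installed CLI version.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HeadlessSpec:
    task: str
    inputs: List[str] = field(default_factory=list)         # artifacts the run consumes
    allowed_tools: List[str] = field(default_factory=list)  # permission rule strings
    expected_output: str = ""
    stop_condition: str = ""
    max_turns: int = 10


def build_command(spec: HeadlessSpec) -> List[str]:
    """Return an argv list for a bounded run, or raise if the spec is still implicit."""
    required = ("task", "inputs", "allowed_tools", "expected_output", "stop_condition")
    gaps = [name for name in required if not getattr(spec, name)]
    if gaps:
        raise ValueError("Not ready for headless: missing " + ", ".join(gaps))
    prompt = (
        f"{spec.task}\n"
        f"Inputs: {', '.join(spec.inputs)}\n"
        f"Expected output: {spec.expected_output}\n"
        f"Stop when: {spec.stop_condition}"
    )
    # Assumed flags: -p (print mode), --allowedTools, --max-turns.
    return [
        "claude", "-p", prompt,
        "--allowedTools", ",".join(spec.allowed_tools),
        "--max-turns", str(spec.max_turns),
    ]
```

For example, HeadlessSpec(task="Review the diff", inputs=["PLAN.md"], allowed_tools=["Read"], expected_output="findings-first review", stop_condition="no important findings remain") builds a command, while HeadlessSpec(task="Fix auth") raises and names the missing fields.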

Goal mode and debate mode

Two conversational modes matter today. Goal mode is useful when you want the model to drive toward a defined outcome. Debate mode is useful when you want the model to challenge a plan before code is written. Goal mode helps with forward motion; debate mode helps with error prevention. Both are most useful when paired with artifacts.

For example, a developer might ask Claude to draft an Implementation Plan in goal mode. Once the plan exists, the developer can switch into a debate posture: “Challenge this plan. Identify hidden assumptions, underspecified acceptance criteria, and likely regression risks.” The output of the debate should not be a wandering conversation. It should be a better plan.

Model selection for the job

Model selection is part of primitive design. Cheap, fast models are appropriate for bounded lookup, file discovery, and summarizing known material. Balanced models are appropriate for routine implementation and review. The strongest reasoning models are appropriate for architecture, tricky debugging, multi-file refactors, and decisions where a bad plan is more expensive than slower planning.

Job	Preferred primitive design	Why
Find relevant files	Explore subagent on a fast model	Broad search stays out of the main context window
Design a refactor	Strong reasoning model for planning, then balanced model for execution	The plan is the expensive part to get wrong
Apply a known checklist	Skill or command	The procedure should be stable and repeatable
Review a security-sensitive change	Specialized review agent with high effort	The lens and depth matter more than speed
Generate a one-time explanation	Interactive prompt	The work is conversational and may not need persistence
Run a recurring report	Headless workflow over a stable spec	The inputs, outputs, and schedule are known

Building the Claude Code primitive kit

The primitive kit is the first durable artifact of the day. It is the set of primitives you have actually crafted (a tool posture, a command, a skill, a subagent, and a plugin decision), each written down with explicit boundaries. Give every primitive a compact spec so another engineer can pick it up without guessing: what it does, which mode and model it runs in, what it needs as input, what it produces, and when it should stop. Keep the kit small enough to live in a repo, onboarding guide, or team operating doc, and concrete enough to describe the actual next version of the team’s workflow rather than an aspiration.

The reference example for this section is a real, working code-review plugin that bundles every primitive in one place. Walk through each piece in turn: the plugin manifest that ties them together, a command, a skill, a subagent, a hook, and a bundled tool (a CLI script).

It is one connected plugin, not six unrelated files: the /review-pr command runs the pr-review skill and delegates to the pr-reviewer subagent, which runs the check-diff tool and calls the github MCP server. That is how primitives stitch into a workflow. The layout mirrors a real plugin: closedloop-ai/claude-plugins/plugins/code.

A note on tools: you rarely “define” a tool. Claude can already call its built-in tools and any CLI binary on your PATH (run via Bash under your permission rules), so it usually needs no extra definition to use them. The two ways to extend the set are to ship a helper script in the plugin (like check-diff.sh) or to add an MCP server, which exposes entirely new tools.

Task: Review a PR for auth regressions
Primitive: security-reviewer agent + /code-review command
Mode: interactive for local development; headless only after the rule set is stable
Model: balanced model for normal review, stronger reasoning for high-risk auth changes
Inputs: diff, Implementation Plan, REVIEW.md, relevant auth rules
Output: findings-first review with severity, evidence, and recommended fix
Stop condition: no important findings or explicit residual risk accepted by human
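A spec card in this Key: value shape is easy to check mechanically before it goes into the kit. The sketch below assumes the seven keys used in the card above are the required set; the function names are hypothetical.

```python
REQUIRED_KEYS = ("Task", "Primitive", "Mode", "Model", "Inputs", "Output", "Stop condition")


def parse_spec(card: str) -> dict:
    """Split 'Key: value' lines into a dict, ignoring blank or valueless lines."""
    spec = {}
    for line in card.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            spec[key.strip()] = value.strip()
    return spec


def missing_keys(card: str) -> list:
    """Return the required keys that the card does not fill in."""
    spec = parse_spec(card)
    return [key for key in REQUIRED_KEYS if key not in spec]
```

A card that omits its stop condition comes back as ["Stop condition"], which is usually the field people forget.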

What good looks like

A good primitive kit has three qualities. First, it is specific. “Use Claude for coding” is not useful; “use an explorer subagent to identify files before reading them into the main session” is useful. Second, it is bounded. Each entry says what the primitive should and should not do. Third, it is teachable. A new engineer should be able to read the kit and make the same primitive design decision as a senior engineer most of the time.

Exercise: Pick five recurring engineering activities from your team. For each, decide whether it belongs as a prompt, command, skill, agent, subagent, plugin, or headless workflow. Then add the model choice, required input artifact, expected output artifact, and stop condition.

Anti-patterns

The mega-prompt: A long prompt that mixes stable policy, one-time instructions, codebase facts, and a workflow checklist. Split it into memory, project instructions, skill, and task prompt.
The everything-agent: A custom agent named “senior engineer” that can do anything. Specialized agents should have a narrow lens, clear tools, and a predictable output shape.
Premature plugin packaging: A workflow is packaged before the team has run it enough times to know its inputs, failure modes, and stop conditions.
Headless ambiguity: A non-interactive run is launched with vague goals and no acceptance criteria. Headless work should consume a plan, not invent one.

From primitives to workflows

Crafting primitives is only half of Module 1. The other half is noticing that primitives are meant to be stitched together. A real task rarely uses one primitive in isolation: an explorer subagent finds the files, a command kicks off the change, a review agent checks the diff, a skill packages the verification. The sequence that connects them is a workflow, and Module 5 is where you design one deliberately.

That immediately raises the question this course spends the rest of the day answering: once you can stitch primitives into a workflow, how do you tell that workflow what to accomplish? A workflow with no clear goal, no boundaries, and no definition of done will wander no matter how good its primitives are. That is exactly the problem Module 2 picks up.

Module recap

Notice what you just did: you stopped asking "how do I word this prompt?" and started asking "where should this work live?" That is the whole shift. Prompt cleverness fades the moment the conversation ends; a primitive you crafted keeps paying out every time you or a teammate reaches for it. Carry that instinct into the rest of the day, because every module from here builds on primitives you now know how to make.

Module 2 · 55 minutes

Planning and Context Management

Module 1 ended on a question: once you can stitch primitives into a workflow, how do you tell it what to accomplish? Planning is the answer. In Claude Code, planning is not a ceremonial step before coding; it is the act of shaping context so the next operator (human, model, agent, or headless workflow) can act without reopening the entire problem.

Module outcome: You create an Implementation Plan with facts, assumptions, open questions, context map, bounded work packages, and acceptance criteria.

Follow-along repo links: One-day intensive source guide · Live agenda · Pre-work checklist

Hands-on: write an Implementation Plan

Choose a task.
Separate facts, assumptions, open questions, constraints, non-goals, and acceptance criteria.
Add a context map with file pointers and evidence commands.
Write a guided compaction prompt: /compact focusing on...

Token-efficient planning habit

Every Implementation Plan should include a context budget: what must load, what can defer, what should delegate, what should not load, and what must survive compaction.

The plan is a compression artifact

Claude Code sessions can accumulate enormous context: pasted requirements, file contents, terminal output, attempted fixes, test failures, screenshots, and corrections from the human. Without deliberate compression, the session becomes expensive and fragile. The model is forced to infer what still matters from a long transcript. Humans are forced to remember why earlier decisions were made. Downstream operators inherit noise instead of a plan.

A useful plan is not a transcript. It is a lossily compressed representation of the work. It preserves the facts, decisions, constraints, risks, and next actions that matter. It discards the conversational path that produced them. That is why this session treats planning as context management.

Separate facts, assumptions, and open questions

The simplest improvement to most AI coding sessions is to stop blending known facts with guesses. Models are very good at continuing a confident narrative. If the prompt says “the auth middleware probably owns refresh-token invalidation,” the model may proceed as if that is true. A disciplined plan separates evidence from inference.

Category	Definition	Example	How to handle it
Fact	A statement backed by direct evidence	`src/auth/middleware.ts` validates JWTs before route handlers run	Can be used directly in the plan
Assumption	A plausible statement not yet proven	Refresh token invalidation is probably handled in the session store	Must be tested or called out as risk
Open question	A decision or unknown that blocks confident execution	Should expired refresh tokens be deleted or retained for audit?	Resolve before implementation or explicitly defer
Constraint	A boundary the solution must respect	Do not change public API response shape	Use as acceptance criteria and review rule

Checkpoint: Find the single most load-bearing assumption in your plan. If it turns out to be wrong, does the whole approach collapse? If so, verify it before writing code, not after.

The Implementation Plan

The Implementation Plan is the central planning artifact. It should fit on one or two pages, yet be complete enough for another operator to execute. The plan is not just a summary; it is an instruction-bearing artifact with a clear contract: here is the problem, here is the known context, here are the boundaries, here is the proposed path, and here is how we will know whether the work is done.

# Implementation Plan

## Goal
Implement refresh token rotation without changing the existing login response contract.

## Known facts
- JWT validation happens in src/auth/middleware.ts.
- Session persistence is implemented in src/auth/session-store.ts.
- Existing tests cover login success and expired access tokens.

## Assumptions to verify
- Old refresh tokens are not currently invalidated after rotation.
- Token reuse should be treated as suspicious but not immediately lock the account.

## Open questions
- Should reuse detection emit an audit event?
- Is token family tracking already present in the database schema?

## Context map
- Auth middleware: request validation and session lookup
- Session store: token persistence and expiration
- Test suite: integration tests under tests/auth/

## Work packages
1. Verify current token rotation behavior.
2. Add invalidation logic or token-family tracking.
3. Extend integration tests for old-token reuse.
4. Produce Review.

## Acceptance criteria
- New refresh token is issued on rotation.
- Previous refresh token cannot be reused.
- Existing login response shape is unchanged.
- Tests demonstrate success, expiration, and reuse behavior.
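Because the plan is a contract, it can be linted like one. The sketch below assumes the seven "## " headings from the template above are the required sections and reports any that are missing or empty; the function names are hypothetical.

```python
import re

REQUIRED_SECTIONS = (
    "Goal", "Known facts", "Assumptions to verify", "Open questions",
    "Context map", "Work packages", "Acceptance criteria",
)


def section_bodies(plan_markdown: str) -> dict:
    """Map each '## Heading' to the non-blank lines beneath it."""
    sections, current = {}, None
    for line in plan_markdown.splitlines():
        match = re.match(r"^##\s+(.*\S)\s*$", line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections


def plan_gaps(plan_markdown: str) -> list:
    """Return required sections that are missing or have no content."""
    sections = section_bodies(plan_markdown)
    return [name for name in REQUIRED_SECTIONS if not sections.get(name)]
```

An empty list means the plan has the right shape, not that it is correct; the debate review and the peer exercise still have to judge the content.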

Context maps

A context map tells the model where to look and why. It is not a full dump of file contents. It is a pointer layer: directories, files, functions, commands, external systems, and documents that are likely relevant. Good context maps reduce token use because Claude can read the right files in the right order instead of scanning the repository blindly.

A context map should include both primary and secondary context. Primary context is required to make the change. Secondary context helps review risk, verify behavior, or understand why the system is shaped the way it is.

## Context map
Primary:
- src/auth/middleware.ts: request authentication boundary
- src/auth/session-store.ts: refresh token persistence
- db/schema.sql: session and token tables
- tests/auth/refresh-token.test.ts: integration behavior

Secondary:
- docs/security/auth-model.md: intended auth posture
- .claude/rules/api-security.md: project-specific security rules
- recent PRs touching auth middleware: intent and regression context

Watch for staleness: A context map is a snapshot, not a contract. Files move, functions get renamed, and pointers rot. A stale map is worse than no map, because it sends the model confidently to the wrong place. Keep maps short so they are cheap to refresh, store them next to the code they describe (in the Implementation Plan or a CLAUDE.md pointer), and treat “verify the map still resolves” as the first step whenever you reopen one. Prefer durable anchors (stable file and directory roles) over line numbers, which drift fastest.
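The "verify the map still resolves" step can be a few lines of script. This sketch assumes map entries follow the "- path: role" shape shown above and treats anything containing spaces (such as "recent PRs touching auth middleware") as a prose pointer to skip; the function names are hypothetical.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def pointers_from_map(context_map: str) -> List[str]:
    """Pull path-like pointers out of '- path: role' lines; skip prose pointers."""
    pointers = []
    for line in context_map.splitlines():
        line = line.strip()
        if not line.startswith("- "):
            continue
        candidate = line[2:].split(":", 1)[0].strip()
        if " " not in candidate and ("/" in candidate or "." in candidate):
            pointers.append(candidate)
    return pointers


def stale_pointers(
    pointers: Iterable[str],
    root: str = ".",
    exists: Callable[[Path], bool] = Path.exists,
) -> List[str]:
    """Return the pointers that no longer resolve under root."""
    return [p for p in pointers if not exists(Path(root) / p)]
```

Run it from the repository root when you reopen a plan; any path it returns should be fixed or removed before the map is handed to the model. The exists parameter is injectable only so the check can be exercised without touching the filesystem.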

Debate review before coding

Before code is written, ask Claude to attack the plan. The goal is not to win the debate; the goal is to improve the artifact. A debate review should look for ambiguous goals, missing constraints, unsupported assumptions, hidden coupling, risky files, weak acceptance criteria, and likely regression paths.

Review this Implementation Plan as a skeptical senior engineer. Identify unsupported assumptions, missing context, and acceptance criteria that would fail to catch a regression. Do not implement. Return a revised plan outline and a list of questions that must be answered before coding.

The output of debate review should be folded back into the plan. If the debate produces useful insights that remain trapped in conversation history, the next operator still cannot use them. Artifact-first planning means the artifact is the durable memory.

What makes a plan reusable downstream?

A reusable plan has clear boundaries. It names the exact goal, the non-goals, the files likely involved, the evidence already gathered, the assumptions still open, and the stop condition. It also includes enough review criteria to prevent the model from declaring success too early.

Goal: What outcome should exist after the work is complete?
Non-goals: What tempting adjacent work should not be done?
Facts: What has been directly observed?
Assumptions: What might be true, but needs verification?
Context map: Where should the next operator look first?
Work packages: What are the smallest implementation units?
Acceptance criteria: What evidence will prove the work is complete?
Review focus: What risks should review emphasize?

Exercise: Take a messy intake request from your team and compress it into an Implementation Plan. Then ask another participant whether they could execute it without reopening the original discussion. Any question they ask is either an open question or a missing fact.

Module recap

Planning in Claude Code is not about slowing down. It is about preserving momentum by producing a compact artifact that can survive compaction, handoff, review, and automation. The better the plan, the less the model has to infer and the easier it is for humans to hold the work accountable.

Module 3 · 60 minutes

Intent Recovery and Dynamic Evidence

A plan tells the model what to do; this module makes sure the plan is built on why the code looks the way it does. When a codebase is old enough, the current code is rarely the whole story. You will recover intent from Git history, issues, PRs, logs, tests, command traces, and screenshots before asking Claude to change behavior.

Map this to your existing SDLC

None of this is a new ceremony bolted onto your process; it is the work you already do, with the roles shifted. Planning (Module 2) is the architecture diagram or RFC an engineer would normally write by hand, except Claude drafts it and you edit it. Intent recovery (this module) is the same digging a careful engineer does before touching legacy code: reading the original PR, the linked ticket, and the test that encodes a past incident. And review (Module 4) is the pull-request review you already run, except the human reviewer is now governing an AI-produced change. The artifacts are the same artifacts; the model produces the first draft and the engineer stays accountable for the decision.

Module outcome: You create Explore findings and a Request Changes review that distinguish evidence from inference and make the next model turn more accurate.

Follow-along repo links: Explore findings example · Request Changes example · Explore findings anti-pattern · Request Changes anti-pattern

Hands-on: produce Explore findings and a Request Changes review

Use at least two evidence types.
Label evidence, inference, and unknowns.
Rewrite weak feedback into expected vs actual, command output, file pointer, and next ask.

Token-efficient investigation habit

Do not paste the whole investigation trail. Compress it into evidence, inference, unknowns, and pointers to files, lines, commits, commands, or screenshots.

For example, instead of pasting a 400-line test log plus three files into the session:

Bloated (costs tokens twice: once now, again on every later turn):
  [pastes full pnpm test output, full middleware.ts, full session-store.ts]

Compressed (same signal, a fraction of the tokens):
  Evidence:  refresh-token.test.ts:42, old token reuse returns 200, expected 401
  Evidence:  PR #184 added rotation; diff shows no invalidation step
  Inference: invalidation likely belongs in session-store.ts, not middleware.ts
  Unknown:   does the schema already track token families?
  Pointer:   src/auth/session-store.ts:88 (rotate fn)

Find the why before the what

Claude can usually explain what code does from the current files. The harder question is why it does that. Was a strange branch added for a customer-specific edge case? Did a test encode a production incident? Was a confusing abstraction introduced to support a migration that has since finished? Current code often hides the reason for its own shape.

Intent recovery is the discipline of gathering enough historical and runtime evidence to avoid undoing deliberate behavior. It is especially important when the requested change appears simple. Simple changes are dangerous when they cut across hidden intent.

Static code is not enough

Reading the current file gives one kind of evidence. Git history gives another. Tests reveal expected behavior. Issues and PRs reveal tradeoffs. Logs reveal runtime reality. Screenshots reveal UI states that code alone may not make obvious. CLI traces reveal exact failure modes. Documentation and MCP-backed systems can provide external context that is not stored in the repository.

Evidence source	Question it answers	Risk if omitted
Current code	What does the system do now?	The model may miss hidden coupling outside the local file
Git blame and commits	Why was this line introduced or changed?	The model may remove a deliberate workaround
PR discussion	What tradeoffs were accepted?	The model may re-litigate settled decisions
Issues / tickets	What user or incident motivated the behavior?	The model may solve the wrong problem
Tests	What behavior is currently protected?	The model may pass local reasoning but break expected behavior
Logs and traces	What happens in real executions?	The model may optimize for imagined behavior
Screenshots	What does the user actually see?	The model may miss visual or state-machine issues
Docs and runbooks	What standards should govern the change?	The model may violate team conventions

Evidence versus inference

A strong Explore-findings memo labels its evidence. It does not say “the bug is caused by stale cache” unless there is direct evidence. It says “the failure appears after the cache read path; logs show cache hit with outdated value; no write-through event appears in the trace; inference: stale cache is likely.” This distinction matters because the next model turn will use the memo as context. If guesses are written like facts, the model will build on them.

# Explore Findings
# Builds on the Module 2 Implementation Plan: "refresh token rotation".

## Question
Why is an old refresh token still accepted after rotation issues a new one?

## Evidence gathered
- Reproduces: pnpm test tests/auth/refresh-token.test.ts: old-token reuse returns 200, expected 401.
- Server logs show the session store resolves the previous token after rotation.
- Git history shows rotation added in PR #184 with no invalidation step.
- src/auth/middleware.ts validates tokens but never deletes the prior record.

## Inferences
- Rotation issues a new token but does not invalidate the previous one.
- Invalidation likely belongs in the session store, not the middleware.

## Open questions
- Where should the previous token be invalidated: session store or middleware?
- Should reuse of an old token emit an audit event?

## Recommended next step
Trace refresh-token writes and invalidation in src/auth/session-store.ts before changing code.

Checkpoint: Read your memo back and underline every sentence stated as fact. For each one, can you point to the command, line, or commit that proves it? Anything you cannot point to is an inference wearing a fact’s clothing; label it as such before the model builds on it.
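Part of this checkpoint can be mechanized. The following is a minimal sketch, assuming a hypothetical `Claim` shape and a handful of illustrative pointer patterns (neither comes from the course repo): any claim labeled as evidence that lacks a file:line, PR, commit, or command pointer is flagged so it can be relabeled as an inference.

```typescript
// Hypothetical shape for one line of an Explore-findings memo (an assumption, not a course format).
type Label = "evidence" | "inference" | "unknown";

interface Claim {
  label: Label;
  text: string;
  pointer?: string; // e.g. "refresh-token.test.ts:42", "PR #184", "pnpm test ..."
}

// Illustrative patterns only: a pointer counts if it looks like one of these.
const POINTER_PATTERNS: RegExp[] = [
  /\S+\.\w+:\d+/,           // file:line
  /\bPR #\d+\b/,            // pull request reference
  /\b[0-9a-f]{7,40}\b/,     // commit hash
  /^(pnpm|npm|git|curl)\s/, // command provenance
];

function hasPointer(claim: Claim): boolean {
  const pointer = claim.pointer ?? "";
  return POINTER_PATTERNS.some((pattern) => pattern.test(pointer));
}

// Returns the claims written as evidence that nothing actually backs.
function unbackedEvidence(claims: Claim[]): Claim[] {
  return claims.filter((claim) => claim.label === "evidence" && !hasPointer(claim));
}

const memo: Claim[] = [
  { label: "evidence", text: "Old token reuse returns 200, expected 401", pointer: "refresh-token.test.ts:42" },
  { label: "evidence", text: "Rotation added with no invalidation step", pointer: "PR #184" },
  { label: "evidence", text: "Rotation does not invalidate the previous token" }, // no pointer: really an inference
  { label: "inference", text: "Invalidation likely belongs in session-store.ts" },
];

const toRelabel = unbackedEvidence(memo);
console.log(toRelabel.map((claim) => claim.text));
```

A check like this only proves that a pointer exists, not that it supports the claim; reading the pointer is still the operator's job.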

Dynamic evidence sources

Dynamic evidence is information that changes with execution: logs, test runs, database state, browser behavior, CLI output, screenshots, observability traces, and external tool responses. It is powerful because it grounds the model in reality. It is also noisy. A good operator does not paste raw dynamic output indiscriminately. They capture the relevant excerpt, label how it was produced, and explain why it matters.

When collecting dynamic evidence, include command provenance. The model should know not only the output, but the command, environment, timestamp or branch, and whether the result was reproducible.

Anti-pattern: raw dump	Better: labeled excerpt with provenance
Pastes 200 lines of test runner output with no framing. The model has to guess which line is the real failure and what produced it.	Quotes the one failing assertion, names the command and branch that produced it, and states the interpretation. The model knows exactly what is proven.
`it broke, here's the log: ...`	The structured capture below: command, branch, result, relevant output, interpretation.

Command: pnpm test tests/auth/refresh-token.test.ts --runInBand
Branch: refresh-token-rotation
Result: failed, 1 test
Relevant output:
  expected old refresh token reuse to return 401
  received 200
Interpretation:
  Existing implementation issues a new refresh token but does not invalidate the previous token.
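The capture above can be produced consistently with a small helper. This is a sketch under stated assumptions: the `EvidenceCapture` type, the `excerpt` filter, and the sample runner log are all hypothetical, invented here to show how a long raw output shrinks to a labeled excerpt with provenance.

```typescript
// Hypothetical record for one piece of dynamic evidence; fields mirror the capture layout above.
interface EvidenceCapture {
  command: string;
  branch: string;
  result: string;
  relevantOutput: string[];
  interpretation: string;
}

// Keep only the lines that match, so a long log shrinks to the excerpt that matters.
function excerpt(rawOutput: string, pattern: RegExp, maxLines = 5): string[] {
  return rawOutput
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => pattern.test(line))
    .slice(0, maxLines);
}

// Render the capture in the labeled layout used in this workbook.
function formatCapture(capture: EvidenceCapture): string {
  return [
    `Command: ${capture.command}`,
    `Branch: ${capture.branch}`,
    `Result: ${capture.result}`,
    "Relevant output:",
    ...capture.relevantOutput.map((line) => `  ${line}`),
    "Interpretation:",
    `  ${capture.interpretation}`,
  ].join("\n");
}

// Made-up runner output standing in for a much longer log.
const rawLog = [
  "PASS tests/auth/login.test.ts",
  "FAIL tests/auth/refresh-token.test.ts",
  "  expected old refresh token reuse to return 401",
  "  received 200",
  "Tests: 1 failed, 12 passed",
].join("\n");

const capture: EvidenceCapture = {
  command: "pnpm test tests/auth/refresh-token.test.ts --runInBand",
  branch: "refresh-token-rotation",
  result: "failed, 1 test",
  relevantOutput: excerpt(rawLog, /^(expected|received)\b/),
  interpretation: "Rotation issues a new refresh token but does not invalidate the previous token.",
};

console.log(formatCapture(capture));
```

The interpretation stays a hand-written field on purpose: the helper can trim output, but deciding what the output proves is a judgment the operator labels explicitly.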

Why “that didn’t work” fails as feedback

The phrase “that didn’t work” is almost useless to the model. It omits what was attempted, what was expected, what actually happened, what evidence was observed, and what changed between attempts. Good feedback is a bundle: action, expectation, observation, evidence, hypothesis, and next constraint.

Weak feedback	Better feedback
That did not work.	After applying the patch, `pnpm test tests/auth/refresh-token.test.ts` still fails. Expected old token reuse to return 401; actual response is 200. Relevant log shows session lookup succeeds for the old token. Focus next on invalidation in session-store, not middleware.
The UI is broken.	Clicking Save leaves the modal open. Browser console shows no error. Network tab shows PATCH /settings returns 204. The likely issue is local modal state not closing after success.
Try again.	Revise the approach without changing the public API response shape. Preserve existing success tests and add one regression test for duplicate submission.

The Request Changes pattern

A Request Changes review is a structured correction that improves the next model turn. It should be short, but it must contain enough evidence for the model to update its plan. Use it whenever Claude’s first implementation fails, when a test result contradicts an assumption, or when a human reviewer spots a gap.

Where does it live? A Request Changes note is usually ephemeral working context, not a committed deliverable. The most common pattern is to paste it directly into the chat as your next turn, so it steers the immediate retry and then ages out of the session. If the change spans multiple sessions or hands off to another engineer, write it to a scratch file you do not commit (for example .claude/scratch/request-changes.md or a path covered by .gitignore), and delete it once the fix lands. Reserve committed files (a PR comment, REVIEW.md) for review feedback that the team needs as a durable record. Rule of thumb: commit the outcome (the Review, the PR), keep the back-and-forth (Request Changes) out of version control.

# Request Changes

## Attempted change
Added old-token invalidation logic in src/auth/middleware.ts.

## Expected result
Old refresh token reuse should return 401.

## Actual result
Old refresh token reuse returns 200.

## Evidence
Test: pnpm test tests/auth/refresh-token.test.ts
Failure: expected 401, received 200
Trace: old token still resolves in session-store lookup.

## Updated hypothesis
Middleware is not the right layer for invalidation. The session store accepts both token records.

## Next instruction
Inspect session-store token persistence and invalidation. Do not change response shape.
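One way to keep the bundle honest is to check a note for empty parts before sending it. A minimal sketch, assuming a hypothetical `RequestChanges` type whose field names simply mirror the template headings above:

```typescript
// The six parts of the feedback bundle. Field names are this sketch's own, not a course API.
interface RequestChanges {
  attemptedChange: string;
  expectedResult: string;
  actualResult: string;
  evidence: string[];
  updatedHypothesis: string;
  nextInstruction: string;
}

// Returns the names of the parts that are absent or empty.
function missingParts(note: Partial<RequestChanges>): string[] {
  const required: (keyof RequestChanges)[] = [
    "attemptedChange", "expectedResult", "actualResult",
    "evidence", "updatedHypothesis", "nextInstruction",
  ];
  return required.filter((key) => {
    const value = note[key];
    if (value === undefined) return true;
    return typeof value === "string" ? value.trim() === "" : value.length === 0;
  });
}

const weak: Partial<RequestChanges> = { actualResult: "That did not work." };

const strong: RequestChanges = {
  attemptedChange: "Added old-token invalidation in src/auth/middleware.ts.",
  expectedResult: "Old refresh token reuse should return 401.",
  actualResult: "Old refresh token reuse returns 200.",
  evidence: [
    "pnpm test tests/auth/refresh-token.test.ts: expected 401, received 200",
    "Trace: old token still resolves in session-store lookup",
  ],
  updatedHypothesis: "Middleware is not the right layer for invalidation.",
  nextInstruction: "Inspect session-store invalidation. Do not change response shape.",
};

console.log(missingParts(weak));   // five parts missing
console.log(missingParts(strong)); // []
```

"That did not work" fills only one of the six slots, which is a compact way to see why it gives the next turn so little to act on.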

Module recap

Intent recovery prevents well-intentioned regressions. Dynamic evidence prevents hallucinated debugging. Request-Changes reviews convert failure into useful context. Together, they turn Claude Code from a code generator into an evidence-driven collaborator.

The running example, end to end

Here is how the refresh-token bug has moved through the day's artifacts so far, each one feeding the next:

Implementation Plan (Module 2) framed the goal (rotate refresh tokens without changing the login response) and flagged the load-bearing assumption: old tokens may not be invalidated.
Explore findings (this module) tested that assumption with evidence: old-token reuse in refresh-token.test.ts returns 200 instead of 401, PR #184 added rotation with no invalidation step, and the inference points at session-store.ts.
Request Changes (this module) caught the first failed attempt (invalidation was added in the wrong layer, middleware) and redirected the next turn to the session store without changing response shape.
Review (Module 4, next) will check the resulting diff against the plan's acceptance criteria and make residual risk explicit.

No single artifact is large, but each one lets the next operator (or the next model turn) start from a settled position instead of re-deriving it. That chain is the whole point.

Exercise: Choose a bug or confusing behavior from the demo app. Produce a one-page Explore-findings memo with at least three evidence sources and a Request Changes review that would help Claude recover from a failed first attempt.

Module 4 · 60 minutes

Review, Test, and Verify

The work is not done when code changes. It is done when the diff has been reviewed against the plan, the verification evidence is explicit, and the residual risk is clear enough for a human to accept or reject.

Module outcome: You produce a Review that makes findings, evidence, scope drift, regression risk, and PR-readiness explicit.

Follow-along repo links: Review · Review anti-pattern

Hands-on: build a Review

Review a diff against the plan.
Name scope drift and regression risk.
List verification evidence and what was not verified.
End with residual risk and PR handoff.

Token-efficient review habit

Review against the plan and the diff first. Expand only when a finding needs more context, and keep verification output to command, result, excerpt, and residual risk.

Review the diff against the plan

A Claude-assisted review should not ask “does this code look good?” That question is too broad and too subjective. The stronger question is: “does this diff satisfy the Implementation Plan without violating constraints or introducing unacceptable risk?” The plan becomes the review contract.

Findings-first review means the reviewer leads with issues, not narrative. Each finding should state severity, evidence, affected file or behavior, why it matters, and a recommended fix. If there are no blocking findings, the review should still describe what was checked and what residual risk remains.

# Review Finding

Severity: Important
Area: Refresh token invalidation
Evidence: tests/auth/refresh-token.test.ts covers successful rotation but not reuse of the old token.
Why it matters: The acceptance criteria require old refresh tokens to be rejected.
Recommended fix: Add a regression test that attempts reuse of the previous token after rotation and expects 401.
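A findings-first report can be assembled mechanically once each finding carries the five fields. A minimal sketch: the `Finding` type follows the template above, while the three-level severity scale and the sample findings other than the token-invalidation one are assumptions made for illustration.

```typescript
// Assumed scale; the workbook's template only shows "Important".
type Severity = "Blocking" | "Important" | "Minor";

interface Finding {
  severity: Severity;
  area: string;
  evidence: string;
  whyItMatters: string;
  recommendedFix: string;
}

const SEVERITY_RANK: Record<Severity, number> = { Blocking: 0, Important: 1, Minor: 2 };

// Most severe first; findings with no evidence go to a separate bucket instead of the report.
function triage(findings: Finding[]): { reportable: Finding[]; needsEvidence: Finding[] } {
  const reportable = findings
    .filter((finding) => finding.evidence.trim() !== "")
    .sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity]);
  const needsEvidence = findings.filter((finding) => finding.evidence.trim() === "");
  return { reportable, needsEvidence };
}

const findings: Finding[] = [
  { severity: "Minor", area: "Naming", evidence: "session-store.ts mixes prev and previous",
    whyItMatters: "Readability", recommendedFix: "Pick one name" },
  { severity: "Blocking", area: "Response contract", evidence: "login.test.ts snapshot changed",
    whyItMatters: "The plan forbids response shape changes", recommendedFix: "Revert the shape change" },
  { severity: "Important", area: "Refresh token invalidation",
    evidence: "refresh-token.test.ts covers rotation but not reuse of the old token",
    whyItMatters: "Acceptance criteria require old tokens to be rejected",
    recommendedFix: "Add a regression test expecting 401 on reuse" },
  { severity: "Important", area: "Audit logging", evidence: "",
    whyItMatters: "Reuse may need an audit event", recommendedFix: "Confirm with the plan owner" },
];

const { reportable, needsEvidence } = triage(findings);
console.log(reportable.map((finding) => `${finding.severity}: ${finding.area}`));
console.log(needsEvidence.map((finding) => finding.area));
```

Holding back evidence-free findings mirrors the Module 3 rule: a claim without a pointer is an inference, and it should not reach the report dressed as a finding.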

Scope drift

Scope drift is any change that is not required by the brief. Some drift is harmless cleanup. Some drift is dangerous because it changes behavior the team did not intend to change. Claude can drift when it sees adjacent improvements, especially if the prompt rewards broad helpfulness. Review must therefore compare the diff against explicit goals and non-goals.

Drift type	Example	Review response
Benign cleanup	Renaming a local variable for clarity	Accept if low risk and local
Adjacent refactor	Changing session-store interfaces while adding one token behavior	Challenge unless required by the brief
Behavior expansion	Adding account lockout on token reuse when not requested	Reject or move to follow-up
Contract change	Changing login response shape while implementing rotation	Block
Test-only expansion	Adding regression tests for directly related edge cases	Usually accept
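A first-pass drift check can be done on file paths alone before anyone reads the diff. The sketch below is illustrative: `PlanScope`, its two fields, and the `login-response.ts` and billing paths are hypothetical names invented here, and the changed-file list is assumed to come from something like `git diff --name-only`.

```typescript
// Hypothetical plan fields: path prefixes the plan expects to change, and ones it puts off-limits.
interface PlanScope {
  inScope: string[];
  nonGoals: string[];
}

type Verdict = "in-scope" | "drift" | "blocked";

function classify(file: string, scope: PlanScope): Verdict {
  const under = (prefix: string) => file.indexOf(prefix) === 0;
  if (scope.nonGoals.some(under)) return "blocked"; // touches something the plan forbids
  if (scope.inScope.some(under)) return "in-scope";
  return "drift";                                   // not mentioned by the plan at all
}

function reviewScope(changedFiles: string[], scope: PlanScope): Record<Verdict, string[]> {
  const report: Record<Verdict, string[]> = { "in-scope": [], drift: [], blocked: [] };
  for (const file of changedFiles) report[classify(file, scope)].push(file);
  return report;
}

const scope: PlanScope = {
  inScope: ["src/auth/session-store.ts", "tests/auth/"],
  nonGoals: ["src/auth/login-response.ts"], // made-up path standing in for the login response contract
};

const changed = [
  "src/auth/session-store.ts",
  "tests/auth/refresh-token.test.ts",
  "src/billing/invoice.ts",
  "src/auth/login-response.ts",
];

const report = reviewScope(changed, scope);
console.log(report);
```

Path matching only catches drift at file granularity. Behavior expansion inside an in-scope file, such as an unrequested account lockout, still needs a reviewer reading the diff against the non-goals.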

Testing is one gate, not the only gate

Passing tests are necessary evidence, but they are not proof of correctness. Tests only cover what they assert. A Review should include test results, manual checks when relevant, static review, diff review, command outputs, and residual risk. The point is not to create paperwork. The point is to prevent the phrase “tests pass” from hiding an unreviewed assumption.

For LLM-assisted work, verification should also include provenance: what files changed, what commands were run, what evidence was observed, and what the model did not check. This gives the human reviewer a clear map of confidence and uncertainty.

Checkpoint: When the model says “tests pass,” ask what the tests do not cover. A green suite proves the asserted behavior, not the absent assertion. Name one thing that could break that no current test would catch.

The Review

# Review

## Plan alignment
Goal: Refresh token rotation rejects old token reuse.
Status: Implemented and tested.

## Changed files
- src/auth/session-store.ts: invalidates previous refresh token on rotation
- tests/auth/refresh-token.test.ts: adds old-token reuse regression test

## Evidence
- pnpm test tests/auth/refresh-token.test.ts: passed
- pnpm test tests/auth/login.test.ts: passed
- Manual API check: old refresh token returns 401 after rotation

## Scope review
No public response contract changes observed.
No unrelated auth routes modified.

## Residual risk
Database cleanup of invalidated token records is not addressed. Existing retention behavior remains unchanged.

## PR handoff note
Reviewer should focus on token-store concurrency and whether invalidated token retention meets audit expectations.

Regression risk

Regression risk is not just the probability that something breaks. It is the product of likelihood, blast radius, and how hard a failure would be to detect. A small likelihood with a huge blast radius still deserves attention. A likely bug with easy rollback may be acceptable if the release path is safe. Claude can help enumerate risks, but the team must decide what risk is acceptable.

Risk question	Why it matters
What user-visible behavior changed?	Identifies blast radius
What existing tests protect this path?	Identifies current safety net
What did we not test?	Prevents false confidence
What external systems depend on this behavior?	Finds hidden contracts
How would we detect failure in production?	Separates known risk from invisible risk
How would we roll back?	Determines operational readiness
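The three-factor view of regression risk can be turned into a rough ranking. This is a sketch under stated assumptions: the 1-to-5 scales are invented here, detectability is scored as detection difficulty so that a higher number is always worse, and both sample risks are hypothetical.

```typescript
// Each factor on an assumed 1-5 scale; higher is worse.
interface RiskFactors {
  likelihood: number;          // 1 = unlikely to break, 5 = likely
  blastRadius: number;         // 1 = one internal path, 5 = every user
  detectionDifficulty: number; // 1 = an alert fires at once, 5 = silent failure
}

interface Risk extends RiskFactors {
  name: string;
}

function riskScore(risk: RiskFactors): number {
  return risk.likelihood * risk.blastRadius * risk.detectionDifficulty;
}

// Highest score first, without mutating the input list.
function rankRisks(risks: Risk[]): Risk[] {
  return risks.slice().sort((a, b) => riskScore(b) - riskScore(a));
}

const risks: Risk[] = [
  { name: "Typo in an admin-only label", likelihood: 4, blastRadius: 1, detectionDifficulty: 1 },
  { name: "Concurrent rotation leaves an old token valid", likelihood: 1, blastRadius: 5, detectionDifficulty: 4 },
];

for (const risk of rankRisks(risks)) console.log(riskScore(risk), risk.name);
// 20 Concurrent rotation leaves an old token valid
// 4 Typo in an admin-only label
```

The unlikely but wide and silent failure outranks the likely but trivial one. The number orders the conversation; deciding what is acceptable to ship remains the team's call.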

What makes a handoff PR-ready?

A PR-ready handoff gives the reviewer the shortest path to an informed decision. It should contain the problem statement, brief link or summary, changed files, review focus, verification evidence, known non-goals, and residual risk. The reviewer should not need to reconstruct the story from chat history.

PR-ready handoff formula

Problem → Approach → Changed files → Verification → Review focus → Residual risk. If any of those pieces is missing, the PR is not fully handoff-ready.

Going further · Level 4

Review as a fleet, not a reviewer

Everything above describes one reviewer making one pass. That is the right default, and for most diffs it is enough. But a single reviewer has a single blind spot, and on a high-risk change (auth, a migration, anything touching money or user data) one pass is a gamble. The advanced move is to stop treating "the review" as one agent and start orchestrating a small fleet.

Two ideas do most of the work. The first is lens diversity: instead of one general review, fan out several reviewers in parallel, each pinned to a single question, then merge what they find. A reviewer told only "hunt for security holes" catches things a generalist skims right past.

Parallel reviewer	The one question it owns
Correctness	Does the diff do what the plan says, on the happy path and the edges?
Security	Injection, broken authz, secret exposure, unsafe input at the boundaries?
Scope drift	Does anything here exceed the brief's goals and non-goals?
Regression	What existing behavior or test could this silently break?
Test coverage	Does a test now fail if this exact bug comes back?

The second idea is adversarial verification, and it is the one that keeps Claude honest. A model will state a finding with total confidence whether or not it is real. So before you act on a finding, spawn a skeptic whose only job is to refute it, and keep the finding only if the skeptic cannot. For a critical change, use a small panel and keep a finding only when it survives a majority. This is exactly how a serious automated review pipeline is built, and how Claude Code's own /code-review works under the hood: parallel hunters and an auditor surface candidates, a verifier pass then tries to falsify each one, and synthesis reports only what survived.

Orchestrated review (shape, not syntax):

  fan out  →  [correctness] [security] [scope] [regression] [tests]   one lens each, in parallel
  collect  →  merge and dedupe the raw findings
  verify   →  for each finding, spawn a skeptic that tries to refute it
  keep     →  only the findings the skeptic could not refute
  report   →  severity-ranked, each with the evidence that survived
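The collect, verify, and keep steps can be sketched in a few lines. Everything here is a stand-in: in a real pipeline each skeptic would be a model call, whereas this sketch uses plain functions, and the `Candidate` type, the skeptic rules, and the sample findings are hypothetical.

```typescript
interface Candidate {
  id: string;
  claim: string;
  evidence: string[];
}

// A skeptic returns true when it managed to refute the finding.
type Skeptic = (candidate: Candidate) => boolean;

// Keep a finding only when a strict majority of the panel could not refute it.
function survivesPanel(candidate: Candidate, panel: Skeptic[]): boolean {
  const refutations = panel.filter((skeptic) => skeptic(candidate)).length;
  const upheld = panel.length - refutations;
  return upheld * 2 > panel.length;
}

function verify(candidates: Candidate[], panel: Skeptic[]): Candidate[] {
  const seen: Record<string, boolean> = {};
  const unique = candidates.filter((candidate) => {
    if (seen[candidate.id]) return false; // collect: dedupe raw findings by id
    seen[candidate.id] = true;
    return true;
  });
  return unique.filter((candidate) => survivesPanel(candidate, panel));
}

// Stand-in skeptics with deliberately simple rules.
const panel: Skeptic[] = [
  (c) => c.evidence.length === 0,                         // refute: no evidence at all
  (c) => !c.evidence.some((e) => /\S+\.\w+:\d+/.test(e)), // refute: nothing cites file:line
  (c) => /\bmight\b/i.test(c.claim),                      // refute: speculative wording
];

const raw: Candidate[] = [
  { id: "reuse-200", claim: "Old token reuse returns 200", evidence: ["tests/auth/refresh-token.test.ts:42 fails with 200"] },
  { id: "log-leak", claim: "Login might leak tokens in logs", evidence: [] },
  { id: "reuse-200", claim: "Old token reuse returns 200", evidence: ["tests/auth/refresh-token.test.ts:42 fails with 200"] },
  { id: "both-records", claim: "Session store keeps both token records", evidence: ["trace shows old token resolves"] },
];

const kept = verify(raw, panel);
console.log(kept.map((candidate) => candidate.id));
```

The duplicate is merged, the speculative finding is refuted by all three skeptics, and the trace-backed finding survives two to one even though one skeptic objected.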

Checkpoint: Fan-out is not free; every extra agent costs tokens and latency. Reserve the fleet for changes where a missed bug is expensive, and keep a single-lens review for routine diffs. The skill is matching review depth to blast radius, not running the fleet on everything.

Module recap

Review and verification discipline turns model output into engineering evidence. The goal is not to make Claude “sound confident.” The goal is to make the work auditable: what changed, why it changed, how it was checked, and what remains uncertain.

Exercise: Review a provided diff or demo artifact against its Implementation Plan. Produce three findings if problems exist; otherwise produce a Review and residual-risk note that would be acceptable in a PR description.

Module 5 · 45 minutes

Workflow Design and Minimal Improvement Loop

Module 1 introduced workflows as the way to stitch primitives together; this is where you design one deliberately. The session closes by turning the day’s isolated practices into a single named workflow (with handoffs, gates, stop conditions, and one lightweight Workflow retro) so the next run can improve without expanding into a full operating-system redesign.

Module outcome: You leave with a Workflow and one credible next-run improvement checklist or metric.

Follow-along repo links: Workflow retro · Workflow Retro anti-pattern · Demo artifacts index

Hands-on: design one workflow

Name the workflow trigger and roles.
Define artifact handoffs, gates, and stop conditions.
Choose which parts are commands, skills, agents, or human review.
Add one next-run improvement.

Token-efficient workflow habit

Each agent handoff should define what context crosses the boundary, what stays behind, and where compaction happens before the next phase.

From good sessions to repeatable workflows

By now the individual moves should feel routine, so this module asks more of you: stop thinking about a single good session and start engineering a system that produces good sessions on demand. A brilliant one-off that no one can reproduce is, organizationally, a dead end. The discipline here is to capture the sequence that made the session work (how the work was framed, what artifacts were produced, which agents or skills carried each phase, where a human had to sign off, and what evidence counted as done) and harden it into something named, bounded, and reusable. Get this right and the payoff is not linear: a named workflow runs again next week, transfers to the next project, and onboards a new teammate without you re-teaching the whole system. That is the leverage that compounds.

This module is intentionally compact. The goal is not to design a full engineering operating system; it is to leave with one workflow you can try next week and improve after one run.

A compact multi-agent workflow

Before the design, a word on what a workflow physically is. In Claude Code a workflow is not a special file format or a piece of product UI; it is just a markdown file that describes the sequence: the trigger, the roles, the handoffs, the gates, and the stop condition. You make it runnable by saving it as a slash command (.claude/commands/<name>.md) or as a skill (.claude/skills/<name>/SKILL.md), at which point invoking it replays the whole sequence. That is the entire mechanism: the workflow below is the content of one such markdown file.

Multi-agent workflow design should begin with work boundaries, not agent names. Each agent or subagent should own a distinct lens or phase. If two agents need the same broad context and produce overlapping output, the workflow is probably not decomposed well.

# Workflow

Name: Evidence-first bug fix

Trigger:
A bug report has enough detail to reproduce or investigate.

Artifacts:
1. Explore findings
2. Compact Implementation Plan
3. Implementation diff
4. Review

Roles:
- Explorer subagent: identify relevant files, history, and evidence sources
- Planner: compress evidence into Implementation Plan
- Implementer: make bounded code changes from the brief
- Reviewer: compare diff against brief and produce findings

Gates:
- Do not implement until facts, assumptions, and open questions are separated.
- Do not review until acceptance criteria are explicit.
- Do not hand off until verification evidence and residual risk are written.

Stop condition:
The change is PR-ready or blocked by a named open question.

Handoffs

A handoff is where one operator’s output becomes another operator’s input. In Claude Code workflows, handoffs should be artifact-based. The explorer hands off Explore findings. The planner hands off an Implementation Plan. The implementer hands off a diff plus notes. The reviewer hands off findings and verification evidence. If a handoff requires the next operator to read the entire chat transcript, the handoff failed.

Handoff	Input	Output	Quality bar
Investigation → Planning	Evidence, traces, history, open questions	Implementation Plan	Facts and assumptions are separated
Planning → Implementation	Implementation Plan and context map	Bounded diff	Non-goals and acceptance criteria are respected
Implementation → Review	Diff and brief	Findings or approval with residual risk	Review is evidence-based
Review → Handoff	Findings, fixes, verification commands	PR-ready packet	Reviewer can decide without chat history

Gates and stop conditions

Gates prevent premature motion. Stop conditions prevent infinite motion. A gate says what must be true before the workflow can advance. A stop condition says when the workflow is complete, blocked, or unsafe to continue. Claude workflows need both because models tend to continue helping unless told what “done” means.

Good gates are observable. “Make sure the plan is good” is not a gate. “The Implementation Plan includes goal, non-goals, facts, assumptions, open questions, context map, work packages, and acceptance criteria” is a gate. Good stop conditions are explicit. “Continue until fixed” is vague. “Stop when the Review shows the acceptance criteria pass, or when an open question blocks safe implementation” is actionable.

A useful test for a good gate: could a teammate who was not in the room mark it pass or fail without asking you? The examples below all pass that test.

Weak (a preference)            →  Strong (an observable gate)
"the plan is good"             →  "the Implementation Plan names goal, non-goals,
                                   facts, assumptions, open questions, context map,
                                   work packages, and acceptance criteria"
"investigation is done"        →  "every claim in the Explore findings cites a
                                   file:line, command, or commit"
"it's been reviewed"           →  "the Review lists each acceptance criterion as
                                   pass/fail with evidence, and residual risk is named"
"tests look fine"              →  "pnpm test exits 0 and a regression test for the
                                   reported bug exists and passes"
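An observable gate is one a program could evaluate. As an illustration, here is a sketch of the first and last gates above; it assumes the Implementation Plan is a markdown file with one `## ` heading per section, which is a convention invented for this example, as are the gate names.

```typescript
// Section names come from the gate text above; the "## Heading" layout is an assumption.
const REQUIRED_SECTIONS = [
  "Goal", "Non-goals", "Facts", "Assumptions",
  "Open questions", "Context map", "Work packages", "Acceptance criteria",
];

interface GateResult {
  gate: string;
  pass: boolean;
  detail: string;
}

function headings(markdown: string): string[] {
  return markdown
    .split("\n")
    .filter((line) => /^## /.test(line))
    .map((line) => line.slice(3).trim().toLowerCase());
}

function planGate(planMarkdown: string): GateResult {
  const present = headings(planMarkdown);
  const missing = REQUIRED_SECTIONS.filter((section) => present.indexOf(section.toLowerCase()) === -1);
  const pass = missing.length === 0;
  return { gate: "plan-complete", pass, detail: pass ? "all sections present" : `missing: ${missing.join(", ")}` };
}

function testGate(exitCode: number, regressionTestPassed: boolean): GateResult {
  const pass = exitCode === 0 && regressionTestPassed;
  const detail = pass
    ? "suite green and regression test passes"
    : exitCode !== 0 ? `test command exited ${exitCode}` : "no passing regression test for the reported bug";
  return { gate: "tests", pass, detail };
}

// The workflow advances only when every gate passes.
function evaluate(gates: GateResult[]): { advance: boolean; failed: string[] } {
  const failed = gates.filter((gate) => !gate.pass).map((gate) => `${gate.gate}: ${gate.detail}`);
  return { advance: failed.length === 0, failed };
}

const draftPlan = [
  "# Implementation Plan",
  "## Goal", "## Non-goals", "## Facts", "## Assumptions", "## Acceptance criteria",
].join("\n");

const outcome = evaluate([planGate(draftPlan), testGate(0, false)]);
console.log(outcome);
```

A heading check confirms a section exists, not that its content is good. It is still a useful floor, because it fails the same way for everyone.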

Checkpoint: Look at each gate in your workflow and ask whether a new teammate could tell, without you, whether it has been met. If a gate needs your judgment to evaluate, it is a preference, not a gate; rewrite it as something observable.

Going further · Level 4

Orchestration patterns for bigger workflows

The Evidence-first workflow above is linear: explorer, then planner, then implementer, then reviewer, one after another. That is the right shape to learn on. But once a workflow grows past a couple of agents, how they run starts to matter as much as what each one does, and a handful of patterns separate a workflow that merely works from one that is fast, thorough, and reproducible.

Pattern	What it does	Reach for it when
Fan-out / fan-in	Run independent agents at the same time, then merge their results	Phases do not depend on each other: several explorers searching different ways, or the review lenses from Module 4
Pipeline vs barrier	In a pipeline each item flows through every stage on its own; a barrier makes a stage wait for all items before the next begins	Default to a pipeline; use a barrier only when the next stage truly needs every prior result (to dedupe, or to stop early on zero)
Loop-until-dry	Keep spawning finders until several rounds in a row surface nothing new	Open-ended discovery (all bugs, all edge cases) where a fixed count would miss the tail
Judge panel	Generate several independent attempts, score them with parallel judges, synthesize from the winner	The solution space is wide and one attempt-then-iterate tends to anchor too early
Completeness critic	A final agent whose only job is to ask "what did we miss?"	Before you trust a "done" from a fan-out, to catch the angle nobody ran
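Of these patterns, loop-until-dry is the easiest to get wrong, because without a cap it can run forever. A minimal sketch, assuming a hypothetical `Finder` function that stands in for spawning a finder agent; the scripted results are made up.

```typescript
// Stand-in for spawning a finder agent; returns whatever that round surfaced.
type Finder = (round: number) => string[];

interface LoopLimits {
  dryRoundsToStop: number; // consecutive rounds with nothing new before stopping
  maxRounds: number;       // hard cap so a noisy finder cannot loop forever
}

function loopUntilDry(finder: Finder, limits: LoopLimits): { found: string[]; rounds: number } {
  const found: string[] = [];
  let dryStreak = 0;
  let rounds = 0;
  while (dryStreak < limits.dryRoundsToStop && rounds < limits.maxRounds) {
    rounds += 1;
    let newThisRound = 0;
    for (const item of finder(rounds)) {
      if (found.indexOf(item) === -1) { // count only findings not seen before
        found.push(item);
        newThisRound += 1;
      }
    }
    dryStreak = newThisRound === 0 ? dryStreak + 1 : 0;
  }
  return { found, rounds };
}

// Scripted results per round, standing in for real agent output.
const scripted: string[][] = [
  ["old token accepted after rotation", "no audit event on reuse"],
  ["no audit event on reuse", "race on concurrent refresh"],
  ["old token accepted after rotation"],
  [],
];
const finder: Finder = (round) => scripted[round - 1] ?? [];

const discovery = loopUntilDry(finder, { dryRoundsToStop: 2, maxRounds: 10 });
console.log(discovery.rounds, discovery.found);
```

The loop stops after round four: rounds three and four surfaced nothing new, which meets the two-dry-rounds condition well before the cap. Because the loop lives in code, every run follows the same stopping rule.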

Two rules keep orchestration from turning into a liability. First, put the control flow (the loops, the conditionals, the fan-out) in the workflow itself, not in the model's head; deterministic orchestration is reproducible, while model-driven orchestration is a fresh roll of the dice every run. Second, set limits before you fan out: cap how many agents run at once, give the run a token budget so a runaway loop cannot drain it, and when several agents edit files in parallel, give each its own git worktree so they do not clobber one another.

Checkpoint: Before adding parallelism, ask whether the phases are genuinely independent. Fanning out agents that all need the same context and produce overlapping output does not buy speed; it buys token cost and a merge headache. Parallelize work that is actually parallel, and keep the rest a clean pipeline.

Guarding the main branch against slop

Orchestration makes a workflow productive; this is what keeps it safe. The faster Claude produces code, the easier it is for plausible-looking slop (code that compiles, reads fine, and is quietly wrong, out of scope, or never actually verified) to slip onto a branch. The defense is two layers that back each other up: cheap automated gates that catch mechanical slop, and human gates that own the judgment a machine should not make.

The first layer is CI checks. A workflow's stop condition is a promise; CI is what enforces that promise when no one is watching. Past the table stakes (lint, typecheck, and tests must be green), wire in checks aimed squarely at AI output:

CI guard	Slop it catches
Plan-vs-diff drift check	Files or behavior changed that the Implementation Plan never mentioned: scope creep sneaking in
Headless review gate (`claude -p`)	Re-runs the Module 4 review fleet on the diff and fails the build if a blocking finding survives verification
Leftover-marker scan	`TODO`, `FIXME`, `console.log`, `debugger`, commented-out blocks left behind by a fast generation
Diff-size / blast-radius limit	A "small fix" that quietly rewrote 40 files: a signal to stop and look
Regression-test presence	A bug fix that ships without a test proving the bug stays dead
Hallucinated-dependency check	Imports of packages or APIs that do not exist in the lockfile or codebase
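The leftover-marker scan is the simplest of these guards to build. The sketch below is illustrative, not a course script: it assumes a unified diff as input (for example the output of `git diff`), looks only at added lines, and uses a small hypothetical marker list; the sample diff is made up.

```typescript
interface MarkerHit {
  file: string;
  marker: string;
  line: string;
}

const MARKERS: { name: string; pattern: RegExp }[] = [
  { name: "TODO", pattern: /\bTODO\b/ },
  { name: "FIXME", pattern: /\bFIXME\b/ },
  { name: "console.log", pattern: /\bconsole\.log\(/ },
  { name: "debugger", pattern: /^\s*debugger\b/ },
];

// Scans only added lines, so markers that already existed or were removed are ignored.
function scanAddedLines(unifiedDiff: string): MarkerHit[] {
  const hits: MarkerHit[] = [];
  let file = "(unknown)";
  for (const raw of unifiedDiff.split("\n")) {
    const header = /^\+\+\+ b\/(.+)$/.exec(raw);
    if (header) {
      file = header[1] ?? file; // track which file the following hunks belong to
      continue;
    }
    if (!/^\+/.test(raw) || /^\+\+\+/.test(raw)) continue; // skip context, removals, headers
    const added = raw.slice(1);
    for (const marker of MARKERS) {
      if (marker.pattern.test(added)) hits.push({ file, marker: marker.name, line: added.trim() });
    }
  }
  return hits;
}

// Made-up diff for illustration.
const sampleDiff = [
  "diff --git a/src/auth/session-store.ts b/src/auth/session-store.ts",
  "--- a/src/auth/session-store.ts",
  "+++ b/src/auth/session-store.ts",
  "@@ -88,3 +88,5 @@",
  " export function rotate(token: string) {",
  '+  console.log("rotating", token);',
  "+  // TODO: invalidate previous token",
  "-  // FIXME: old note removed by this change",
  "   return issue(token);",
  " }",
].join("\n");

const hits = scanAddedLines(sampleDiff);
console.log(hits);
if (hits.length > 0) console.log(`leftover markers found: ${hits.length}`);
```

In CI the script would exit non-zero when `hits` is non-empty. Scanning only added lines keeps the guard from failing builds over markers that predate the change.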

Because a workflow is just a markdown file (the point from earlier in this module), you can run the review itself headless in CI: feed claude -p the diff plus the Implementation Plan, and fail the pipeline if the change drifts from the plan or a blocking finding cannot be verified away. That turns the review fleet into a merge gate that runs on every push, not a courtesy a human remembers to perform.

# .github/workflows/ai-review.yml (shape, not a full config)
- run: pnpm run lint && pnpm run typecheck && pnpm test   # table stakes
- run: scripts/check-plan-drift.sh                          # diff matches the plan's scope
- run: |                                                    # headless review gate
    # Quote the tool list: unquoted parentheses are a shell syntax error.
    # The prompt cannot set the exit code; the next step turns findings into a red build.
    claude -p "Review this diff against PLAN.md. Output BLOCKING findings only,
    each verified against the code. If none survive, output exactly NONE." \
      --allowedTools "Read,Grep,Bash(git diff:*)" > review.txt
- run: scripts/fail-if-blocking.sh review.txt               # red build on surviving slop

The second layer is human guards, and no amount of automation removes it. ClosedLoop's milestone model (Draft → In Review → Approved → Executed → Done) exists precisely so a person signs off before an AI-produced change advances. Some calls stay human on purpose:

Accepting residual risk. CI can surface the risk; a person has to decide it is acceptable to ship. That decision carries a name.
No agent self-merge. An agent can open the PR and make it green, but a human approves the merge. The machine does the work; the human owns the outcome.
Scope changes. If the diff exceeds the brief, that is a conversation, not an auto-approval, no matter how green the build is.
High-blast-radius changes. Auth, migrations, money, and anything user-visible get a named approver, even when every automated gate passes.

Checkpoint: Sort your gates into "a machine can decide this" and "a human must decide this." Push everything mechanical into CI so it runs for free on every change, and reserve human attention for the judgment calls: accepting risk, approving scope, and signing off on blast radius. Slop gets through when those two categories blur and a human rubber-stamps what they assume CI already checked.

One lightweight Workflow retro

A Workflow retro is a short, structured retrospective you run after a workflow completes, asking what worked, what did not, and the single thing to change before the next run. It is the same idea as a sprint retro, but pointed at the workflow itself (its artifacts, evidence, and gates), not at the team. Keep the loop small enough that the team actually uses it: one retro, completable in five minutes after a run. It should measure the workflow, not the model’s personality: were the artifacts reusable, was the evidence sufficient, was context controlled, did review catch issues, and what should change next run?

# Workflow Retro

Workflow name:
Date:
Task:

1. Was the Implementation Plan usable without reopening the original conversation? 0 / 1 / 2
2. Did the Explore findings separate evidence from inference? 0 / 1 / 2
3. Did the implementation stay inside the stated scope? 0 / 1 / 2
4. Did verification include more than passing tests? 0 / 1 / 2
5. Was residual risk explicit? 0 / 1 / 2

One thing to keep:
One thing to change next run:
One artifact or rule to update:
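The scores are most useful when they point at one thing to change. A small sketch, assuming the 0 / 1 / 2 answers are recorded as data; the `RetroAnswer` type, the rule that the lowest-scoring question is the candidate to fix, and the sample scores are this example's own choices.

```typescript
type Score = 0 | 1 | 2;

interface RetroAnswer {
  question: string;
  score: Score;
}

// Totals the retro and picks the lowest-scoring question (first one on ties).
function summarize(answers: RetroAnswer[]): { total: number; max: number; weakest: RetroAnswer | undefined } {
  let total = 0;
  let weakest: RetroAnswer | undefined;
  for (const answer of answers) {
    total += answer.score;
    if (weakest === undefined || answer.score < weakest.score) weakest = answer;
  }
  return { total, max: answers.length * 2, weakest };
}

// Made-up scores from one run of the evidence-first bug fix workflow.
const answers: RetroAnswer[] = [
  { question: "Was the Implementation Plan usable without reopening the original conversation?", score: 2 },
  { question: "Did the Explore findings separate evidence from inference?", score: 1 },
  { question: "Did the implementation stay inside the stated scope?", score: 2 },
  { question: "Did verification include more than passing tests?", score: 0 },
  { question: "Was residual risk explicit?", score: 1 },
];

const retro = summarize(answers);
console.log(`${retro.total} / ${retro.max}`); // 6 / 10
console.log(`Change next run: ${retro.weakest?.question}`);
```

The total is only a trend line across runs. The weakest question is the actionable part, because it names the single artifact or gate to adjust.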

What to improve on the next run

Do not try to improve everything after the first run. Choose one improvement. Maybe the context map was too vague. Maybe the review agent needs a narrower rubric. Maybe the Implementation Plan omitted non-goals. Maybe verification evidence was too thin. The Workflow retro turns that observation into a small change: update a template, add a rule, refine a skill, or adjust a gate.

Closing synthesis

Step back and look at the whole arc you just walked. You started by reaching for the right primitive instead of one long chat. You learned to compress a problem into a plan, to recover the intent behind code before changing it, to review a diff against that plan, and finally to wire the sequence that worked into a workflow you can run again. Each skill fed the next; together they form one loop. That loop is small enough to learn in a day and strong enough to carry a whole team. Run it on a real change this week, keep the artifacts, and you will feel the difference on the very next one: less rework, faster shipping, and a growing library of assets that make every future run cheaper and more reliable. That is what moving from operating Claude Code to mastering it actually looks like.

What “exceptional” actually looks like: the modules teach the shape of each artifact, but the bar is set by the worked examples, not the prose. Before you call your own kit done, read the good-versus-anti-pattern pairs end to end and copy the level of specificity in the “good” column.

The single refresh-token bug runs through every one of those exemplars, so reading them in order shows the same change graduating from evidence to plan to fix to review. That is the standard to match.

Exercise: Name one workflow your team will run in the next week. Fill out the Workflow, define at least three gates, and choose one Workflow retro question that will determine what you improve after the first run.

Appendix: Student Desk Reference and Repo Links

Use this as your one-page desk reference: durable facts go in CLAUDE.md; repeatable procedures become skills or workflows; broad exploration goes to subagents; PR handoffs require evidence.

Core commands and settings

Area	Use this	When it matters
Setup and health	`/doctor`, `claude --safe-mode`, `CLAUDE_CODE_SAFE_MODE=1`	Validate install health or troubleshoot by disabling customizations.
Memory and context	`CLAUDE.md`, `@docs/file.md`, `.claude/rules/*.md`, `/memory`, `/compact focusing on ...`, `/clear`	Make important context durable, modular, scoped, inspectable, and cheap to carry forward.
Model selection	`/model`, `/model opusplan`, `/effort low|medium|high|xhigh|max`, `/fast`, `fallbackModel`, `--fallback-model`	Use deeper reasoning where mistakes are expensive; use faster/cheaper paths for bounded work.
Delegation	`/agents`, `claude agents`, project `.claude/agents/`, background subagents	Keep broad exploration isolated and return concise findings to the main session.
Review and cleanup	`/code-review high`, `/code-review --fix`, `/simplify`, `REVIEW.md`	Review against the plan, make risk explicit, and clean up before handoff.
Governance	`availableModels`, `enforceAvailableModels`, `requiredMinimumVersion`, `Tool(specifier)` permission rules, `disableBundledSkills`	Keep teams on approved models, versions, tools, and extension surfaces.

Context pointers

Search before reading: use grep/file search to find paths and line numbers first.
Read narrowly: use offsets and limits instead of loading entire files.
Label evidence: command run, source, timestamp, relevant excerpt, and why it matters.
Separate facts, assumptions, open questions, and inferences.
Compact between tasks and tell compaction what to preserve.
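One way to make the first two habits concrete is a small search helper that returns pointers instead of file contents. The sketch below is illustrative only: the function name `search_pointers`, the suffix filter, and the example pattern are assumptions rather than part of the course repo, and in a live session Claude's own search tools do this job.

```python
import re
from pathlib import Path


def search_pointers(root: str, pattern: str,
                    suffixes: tuple[str, ...] = (".py", ".ts", ".md")) -> list[str]:
    """Return 'path:line' pointers for matching lines; file bodies are never kept."""
    regex = re.compile(pattern)
    pointers = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and file types we did not ask for (assumed filter).
        if not path.is_file() or path.suffix not in suffixes:
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if regex.search(line):
                    pointers.append(f"{path.relative_to(root).as_posix()}:{number}")
    return pointers
```

Handing the returned `path:line` strings to the next prompt keeps the read narrow: only the lines you then ask for enter the context.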

Primary repo links

Concrete Examples, File Locations, and Repo Links

Use this section while practicing: each reusable Claude Code concept below is paired with the location where you would store it in a project and the repo document that backs it.

Tools and permissions

Tools are the model's action surface: reading, searching, editing, running commands, fetching context, and calling integrations. The course examples emphasize matching tool access to task risk rather than allowing everything by default.

Course backing doc

docs/TOOL-PERMISSIONS-EXAMPLES.md

Safe exploration, controlled implementation, shared repo guardrails, and workflow-specific permission posture.

Where this lives

settings.json, managed settings, permission dialogs, MCP/plugin policies, and task-specific approval choices.

Permission rules use the Tool(specifier) form, for example Bash(npm run test:*) or WebFetch(domain:example.com).
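To reason about a rule, it can help to split it into its tool and its specifier. The sketch below parses only the written Tool(specifier) form; `PermissionRule` and `parse_rule` are hypothetical names, and the code does not attempt to reproduce how Claude Code actually matches a rule against a tool call.

```python
import re
from typing import NamedTuple, Optional


class PermissionRule(NamedTuple):
    tool: str
    specifier: Optional[str]  # None when the rule is a bare tool name


# Tool name, optionally followed by a parenthesized specifier.
_RULE = re.compile(r"^(?P<tool>[A-Za-z_][\w-]*)(?:\((?P<spec>.*)\))?$")


def parse_rule(text: str) -> PermissionRule:
    """Split 'Tool(specifier)' into its two parts, or raise ValueError."""
    match = _RULE.match(text.strip())
    if match is None:
        raise ValueError(f"not in Tool(specifier) form: {text!r}")
    return PermissionRule(match["tool"], match["spec"])
```

For example, `parse_rule("Bash(npm run test:*)")` yields the tool `Bash` with the specifier `npm run test:*`, which is the part worth reviewing when you decide how wide a rule really is.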

Permission prompt pattern:
What must Claude read?
What may Claude write?
What requires approval?
What would be dangerous if Claude guessed?
What evidence is required before widening permissions?

Try it

Choose one live task.
Write allowed reads, allowed writes, and approval-required actions.
Compare your posture to the safe exploration and controlled implementation examples.

Commands

Commands are for short, prompt-shaped, directly invoked operations. Use them when the behavior repeats but does not need a full method, bundled assets, or a specialist role.

Course backing doc

docs/PLUGINS-SKILLS-COMMANDS-AND-MODELS.md

Includes command examples such as /review-pr-risk, /summarize-failing-test, and /draft-pr-body.

Where this lives

Common project pattern: .claude/commands/<name>.md.

Built-ins appear in the slash menu, such as /model, /agents, /mcp, /permissions, and /compact.

# .claude/commands/summarize-failing-test.md
Summarize the failing test evidence in this shape:
1. Command run
2. First failing assertion or error line
3. Relevant file and line pointer
4. Likely failure category
5. Next narrow read or command

Do not propose a fix until the failure category is grounded in evidence.

Skills

Skills are for repeatable methods with structure: required inputs, context gathering, workflow, output artifact, verification checklist, and safety rules.

Skill starter template

templates/skill-template.md

The template defines the minimum sections students should fill in for a first-pass skill.

Where this lives

Common project pattern: .claude/skills/<skill-name>/SKILL.md.

Use skills for procedures and reusable artifact production, not always-on project facts.

# .claude/skills/flaky-test-investigation/SKILL.md
# Skill: flaky-test-investigation

## Purpose
Investigate a flaky test using evidence before proposing a fix.

## Required inputs
- failing command
- test name or file
- relevant CI/local output

## Workflow
1. Capture exact command and failure excerpt
2. Classify the failure mode
3. Identify dynamic evidence needed
4. Produce Explore findings
5. Propose the next narrow action

## Outputs
Explore findings with evidence, inference, and next step.

Agents and subagents

Agents and subagents are bounded workers with a mission, explicit tools, output shape, and stop condition. Use them when you need separate context or specialist review.

Course backing doc

docs/PLUGINS-SKILLS-COMMANDS-AND-MODELS.md

Defines agent and subagent usage, including context isolation and bounded missions.

Where this lives

Common project pattern: .claude/agents/<agent-name>.md.

Use /agents or claude agents to manage sessions where supported.

# .claude/agents/security-reviewer.md
---
name: security-reviewer
description: Review code changes for security vulnerabilities. Use proactively.
tools: Read, Grep, Glob, Bash
model: sonnet
maxTurns: 10
---

You are a security specialist. For every code change:
1. Check for injection vulnerabilities
2. Verify input validation at system boundaries
3. Check for exposed secrets or API keys
4. Verify authentication and authorization checks

Report findings by severity. Do not edit files unless explicitly asked.

Plugins and marketplace evaluation

Plugins are a distribution abstraction. A plugin can package multiple reusable units such as commands, subagents, MCP servers, hooks, skills, or workflow assets. Evaluate a plugin as dependency surface area, not as a shortcut.

Course backing doc

Marketplace evaluation checklist

Use the checklist before installing or promoting a package.

Where this appears

Use /plugin flows and /plugin list where available.

Prefer local commands or skills when the behavior is still small or unstable.

Demo artifacts and anti-patterns

Module	Good example	Anti-pattern
Operating surface	Primitive framework and examples	Primitive kit anti-pattern
Investigation	Explore findings	Explore findings anti-pattern
Feedback	Request Changes	Request Changes anti-pattern
Review	Review	Review anti-pattern
Workflow	Workflow retro	Workflow retro anti-pattern

Setup and supporting files

Token Efficiency Throughout the Course

Student goal: Learn to spend context where it changes the outcome, not where it merely makes the session feel busy.

Token efficiency is not a separate optimization topic. It is the connective tissue across primitive design, planning, investigation, review, and workflow design, and it is money: tokens are cost and context is speed, so disciplined context means lower cost-to-serve and more work shipped per hour. The habits below should show up in every exercise.

Route narrowly

Start with the smallest useful abstraction: direct prompt, command, skill, subagent, agent, then plugin only when the reuse unit truly deserves packaging.

Search before read

Use file lists, grep results, line ranges, and exact excerpts before asking Claude to ingest full files or logs.

Delegate exploration

Broad search belongs in a subagent with isolated context and bounded output: finding, pointer, confidence, and next action.

Compact between phases

Use guided /compact after planning, investigation, or review so decisions survive while exploration noise falls away.

Compaction is lossy; guide it or it costs you. Compaction is not free cleanup; it rewrites the session into a summary, and a blind /compact can silently drop the exact facts the next phase depends on (a file path, a failing assertion, a decision and its reason). The failure mode is quiet: the model keeps going, but now reasons from a thinner, fuzzier context and starts re-asking settled questions or re-reading files it already understood. Two rules keep it safe:

Always compact with a focus. Prefer /compact focusing on decisions made, evidence pointers, files touched, unresolved questions, and residual risk over a bare /compact.
Compact at phase boundaries, not mid-task. Compacting in the middle of an investigation can erase the half-built chain of evidence you are actively using. Wait until a phase produces a durable artifact (a plan, findings, a review), then compress.

When in doubt, write the decisions into an artifact first and use /clear to start the next phase clean, rather than trusting compaction to preserve them.
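If you want the focused form to be the default, a tiny helper can build the prompt from the list of things that must survive and refuse to produce a bare /compact. `compact_prompt` is a hypothetical helper for your own notes or scripts, not a Claude Code feature.

```python
def compact_prompt(preserve: list[str]) -> str:
    """Build a guided /compact prompt from the items that must survive."""
    items = [item.strip() for item in preserve if item.strip()]
    if not items:
        # A bare /compact is exactly the failure mode described above.
        raise ValueError("name at least one thing that must survive compaction")
    if len(items) == 1:
        focus = items[0]
    elif len(items) == 2:
        focus = f"{items[0]} and {items[1]}"
    else:
        focus = ", ".join(items[:-1]) + f", and {items[-1]}"
    return f"/compact focusing on {focus}."
```

Writing the list down first is the useful part: if you cannot name what must survive, the phase has probably not produced a durable artifact yet.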

Module-by-module token habits

Module	Token-efficient behavior	Student artifact
Primitive Build Lab	Choose the smallest primitive that controls context and reuse.	Primitive kit with primitive, permissions, model, output, stop condition.
Planning	Add a context budget: must load, can defer, delegate, do not load, preserve.	Compact Implementation Plan with context map.
Investigation	Compress evidence into claims with source pointers instead of transcripts.	Explore findings and Request Changes.
Review	Review the diff against the plan before expanding context.	Findings-first Review.
Workflow	Put context gates between roles and agents.	Workflow with handoff limits and compaction points.

Token-efficient prompt patterns

Search first. Return only matching file paths and line numbers. Do not read full files yet.

Read only <file> lines <start>-<end>. Summarize the relevance in 3 bullets.

Delegate to a subagent. Return only: finding, evidence pointer, confidence, next action. Limit to 10 bullets.

/compact focusing on decisions made, evidence pointers, files touched, unresolved questions, and residual risk.

Review this diff against the compact plan. Return findings only with severity, file pointer, and suggested next action.

The same patterns, shown as the wasteful version next to the disciplined one:

Token-heavy prompt	Token-efficient rewrite
Read these files and tell me how auth works. (loads whole files into context)	Search first. Return only matching file paths and line numbers for JWT validation. Do not read full files yet.
Here is the full 600-line log, what went wrong? (pastes everything)	Summarize this log into failing assertions, evidence pointers, and the next command to run. Cap at 10 bullets.
Go explore the repo and report back. (unbounded reads)	Delegate to a subagent. Return only: finding, evidence pointer, confidence, next action.
Review the whole PR. (re-reads everything)	Review this diff against the compact plan. Return findings only, with severity and file pointer.

Hands-on: token budget your current task

Write what Claude must load.
Write what can be deferred.
Write what should be delegated to a subagent.
Write what should not be loaded.
Write what must survive compaction.
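The five answers fit naturally into one small record you can paste into a plan. This is a minimal sketch with hypothetical names (`ContextBudget` and the example entries are assumptions); the five fields mirror the five lines above.

```python
from dataclasses import dataclass, field


@dataclass
class ContextBudget:
    must_load: list[str] = field(default_factory=list)
    can_defer: list[str] = field(default_factory=list)
    delegate: list[str] = field(default_factory=list)
    do_not_load: list[str] = field(default_factory=list)
    preserve: list[str] = field(default_factory=list)

    def conflicts(self) -> list[str]:
        """Entries listed as both must-load and do-not-load."""
        return sorted(set(self.must_load) & set(self.do_not_load))

    def render(self) -> str:
        """Plain-text block suitable for pasting into an Implementation Plan."""
        sections = [
            ("Must load", self.must_load),
            ("Can defer", self.can_defer),
            ("Delegate to a subagent", self.delegate),
            ("Do not load", self.do_not_load),
            ("Must survive compaction", self.preserve),
        ]
        lines = []
        for title, items in sections:
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items or ["(none yet)"])
        return "\n".join(lines)
```

An empty section renders as "(none yet)", which makes an unanswered question visible instead of silently missing.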

How the Workbook Reinforces the Course

Tone and stance: These modules are practical, operational, and opinionated. They are not a feature tour; the goal is not to memorize every Claude Code command, setting, or flag. Instead they teach reusable engineering leverage: better context, lower token burn, and better first-pass code.

Four ideas worth carrying through the whole day

Give Claude better context

Persistent memory, targeted rules, subagent delegation, and guided compaction reduce repeated instructions and keep the session focused.

Burn fewer tokens for more impact

Use RTK (Rust Token Killer, a CLI proxy that compresses noisy command output), pointers, search-before-read, skills, and compaction so the team spends context where it changes the outcome.

Land better code the first time

Route by model, effort, speed mode, and review mechanism. Treat /code-review, /simplify, and REVIEW.md as quality controls.

Standardize the patterns

The organizational win comes from encoding good behavior in CLAUDE.md, REVIEW.md, commands, skills, agents, hooks, and workflow artifacts.

Coverage checklist

Reader point	Student-facing follow-along action	Concrete backing
Memory layers	Decide what belongs in managed, user, project, and local memory.	`CLAUDE.md`, `~/.claude/CLAUDE.md`, `CLAUDE.local.md`
Auto-memory	Write one durable correction Claude should not need to be told again.	`/memory` and readable markdown memory files
Path-scoped rules	Draft one rule that should load only for a file path pattern.	`.claude/rules/*.md`
@ imports	Plan one shared standard that should be imported instead of pasted.	`@docs/coding-standards.md`
Subagents	Delegate a broad search and return only findings, pointers, confidence, and next action.	`.claude/agents/*.md`, `/agents`
Custom agents	Review a named specialist with frontmatter, tools, model, max turns, and severity output.	Security reviewer example in the deck
Guided compaction	Write a `/compact focusing on...` prompt after a phase boundary.	`/compact` vs `/clear`
RTK and measurement	Name noisy commands worth compressing and how savings would be measured.	`rtk gain`, `rtk gain --history`, `rtk discover`
Pointers over full text	Replace one long file/log dump with path, line range, exact excerpt, and next command.	Search-before-read exercise
Skills	Promote one repeated prompt into a skill with inputs, workflow, outputs, verification, safety.	templates/skill-template.md
Model selection	Choose model and effort by job, not habit.	`/model`, `/model opusplan`, `/effort`
Fast mode	Name one situation where lower latency is worth the per-token tradeoff.	`/fast`
Review and simplify	Run correctness review before cleanup, then verify residual risk.	`/code-review`, `/code-review --fix`, `/simplify`, `REVIEW.md`
Multi-model workflow	Plan, implement, explore, review, simplify, and fast-iterate using the right control for each phase.	Workflow design module and desk reference

Concrete examples to keep open

Token Savings Field Guide

Student goal: Use token efficiency as an engineering operating habit: search first, point precisely, delegate noisy work, compact at phase boundaries, and turn repeated work into reusable artifacts.

Token savings are not about making Claude think less. They are about keeping the context window focused on the material that changes the outcome. Read the nine moves below as a rough progression, not a checklist of rules: the early ones are reflexes you build in your first week, the later ones are habits you grow into as your sessions get more ambitious. You will not do all nine on every task, but the more advanced your work gets, the more of them you will reach for without thinking.

1. Enable RTK for noisy command output

RTK (Rust Token Killer) is a CLI proxy that compresses noisy command output before it reaches Claude's context, so a run that would have streamed hundreds of lines into the session lands as a compact, structured summary instead. It is useful when shell output is large, repetitive, or noisy. The impact compounds because command output does not just cost tokens once; it remains in the session and can be re-read on later turns.

rtk gain              # cumulative token savings this session
rtk gain --history    # per-command breakdown with savings
rtk discover          # find missed compression opportunities

Good RTK targets:
- test output
- build logs
- git status and diff noise
- package manager output
- long formatter or type-checker traces

Practice habit: If a command regularly emits hundreds or thousands of lines, compress it, script it, or summarize it before it enters the main Claude context.
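Where RTK is not available, the same habit can be approximated with a few lines of scripting. The sketch below is a stand-in, not RTK and not a description of how RTK works internally: `summarize_output` and its failure pattern are assumptions you would tune to your own test runner.

```python
import re

# Assumed heuristic: lines containing these markers are worth keeping.
FAILURE_HINT = re.compile(r"FAIL|Error|ERROR")


def summarize_output(output: str, max_lines: int = 10) -> str:
    """Reduce long command output to a line count plus the flagged lines."""
    lines = output.splitlines()
    hits = [(number, text.strip())
            for number, text in enumerate(lines, start=1)
            if FAILURE_HINT.search(text)]
    summary = [f"{len(lines)} lines, {len(hits)} flagged"]
    summary.extend(f"line {number}: {text}" for number, text in hits[:max_lines])
    if len(hits) > max_lines:
        summary.append(f"... {len(hits) - max_lines} more flagged lines omitted")
    return "\n".join(summary)
```

The line numbers are kept on purpose: they are the evidence pointers that let a later narrow read go back to the raw log only where it matters.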

2. Move repeated tasks to executable scripts

The third time you type roughly the same instructions, treat it as a signal rather than a chore. Every repeat costs you twice: you re-describe the workflow, and Claude re-infers what you meant, sometimes differently than last time. Pinning that sequence into a stable script makes it cheaper, faster, and far more predictable, and it frees the conversation for the judgment only you can supply.

Repeated prompt pattern	Token-efficient replacement
“Run the usual checks.”	`scripts/check-pr-ready.sh` or `npm run check:pr`
“Do the full release validation we always do.”	`scripts/release-validate.sh` plus a skill that explains when to use it.
“Look for the standard security issues.”	A security-reviewer subagent plus a compact severity output contract.
“Please remember our migration checklist.”	A skill or command checked into the repo, with verification and safety rules.

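#!/usr/bin/env bash
# Example: scripts/check-pr-ready.sh
# Stop at the first failing step, unset variable, or failed pipeline stage.
set -euo pipefail
npm run lint        # style and static checks
npm run typecheck   # type errors
npm test            # test suite
git status --short  # surface uncommitted or untracked files

# Prompt Claude:
# Run scripts/check-pr-ready.sh and summarize only failures,
# evidence pointers, and the next command to run.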
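The same idea can be sketched in Python when you want the failures already summarized before Claude sees them. `run_checks` is a hypothetical helper, and unlike the bash script above it keeps going after a failure so that one run reports every failing step.

```python
import subprocess


def run_checks(commands: list[list[str]], tail: int = 5) -> list[str]:
    """Run each command and return one summary line per failing command."""
    failures = []
    for command in commands:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode == 0:
            continue  # passing steps add nothing to the summary
        output = (result.stdout + result.stderr).strip().splitlines()
        excerpt = " | ".join(output[-tail:])  # keep only the last few lines
        failures.append(f"{' '.join(command)} exited {result.returncode}: {excerpt}")
    return failures
```

A call such as `run_checks([["npm", "run", "lint"], ["npm", "test"]])` returns an empty list when everything passes, which is the cheapest possible thing to put into context.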

3. Do not say “explore” without intention

Watch your own verbs for a day and you will catch yourself doing this. Words such as “explore,” “look around,” “understand this,” or “check the repo” quietly tell Claude to read broadly, and it will. Sometimes that is exactly what you want, when you are genuinely onboarding to an unfamiliar area. The skill is noticing the difference and choosing on purpose, rather than reaching for a vague verb out of habit and paying for a wide read you did not need.

Token-heavy prompt	Better prompt
Explore the auth system.	Find the entry points for JWT validation. Return only file paths, line numbers, and a one-sentence role for each. Do not read full files yet.
Look through the tests.	Search for tests that mention refresh tokens. Return matching files and test names only.
Understand why this broke.	Use git history, failing test output, and the changed files. Return claims with evidence pointers and confidence.
Review this whole PR.	Review the diff against the Implementation Plan. Return blocking findings first, each with severity and file pointer.

4. Use permission posture as a productivity lever, not a safety shortcut

Permissions shape token efficiency because unnecessary approval prompts interrupt loops, but broad permissions can create risk. Start with read-heavy, write-light permissions, then widen only when the task and verification path are clear.

{
  "permissions": {
    "allow": [
      "Read",
      "Grep",
      "Glob",
      "Bash(git status:*)",
      "Bash(npm test:*)"
    ],
    "defaultMode": "acceptEdits"
  }
}

Important: Lead with a scoped allow-list like the one above, not with bypassPermissions. Treat bypassPermissions as a sandbox-only accelerator for trusted, disposable training repos or isolated environments. Do not use it as the default posture in production-adjacent repos, repos with secrets, deployment scripts, CI mutation, migrations, or broad network access. Prefer auto mode or scoped allow rules when available.
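A quick lint over settings.json can catch a posture that is wider than intended before it is committed. The checks below encode an assumed team policy, not built-in Claude Code validation: `posture_warnings` is a hypothetical name, and treating a bare `Bash` or `WebFetch` rule as too broad is one illustration of what "scoped allow rules" could mean.

```python
import json

# Assumed policy: these tools should always carry a specifier.
UNSCOPED_RISKY_TOOLS = {"Bash", "WebFetch"}


def posture_warnings(settings: dict) -> list[str]:
    """Return human-readable warnings for a parsed settings.json dict."""
    permissions = settings.get("permissions", {})
    warnings = []
    if permissions.get("defaultMode") == "bypassPermissions":
        warnings.append("defaultMode is bypassPermissions; keep that for disposable sandboxes")
    for rule in permissions.get("allow", []):
        if rule in UNSCOPED_RISKY_TOOLS:
            warnings.append(f"allow rule {rule!r} has no specifier; scope it, e.g. Bash(npm test:*)")
    return warnings


# The scoped allow-list from the example above produces no warnings.
SCOPED = json.loads("""
{
  "permissions": {
    "allow": ["Read", "Grep", "Glob", "Bash(git status:*)", "Bash(npm test:*)"],
    "defaultMode": "acceptEdits"
  }
}
""")
```

Running a check like this in CI or a pre-commit hook is a design choice for shared repos: it turns the posture into something reviewable rather than something each engineer remembers.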

5. Enable and curate memory

Memory saves tokens when it prevents repeated corrections. It wastes tokens when it becomes stale, vague, or bloated.

Use /memory to inspect what Claude has learned.
Put durable project conventions in CLAUDE.md, not in chat history.
Use path-scoped rules for standards that only apply to certain files.
Use @ imports for shared docs instead of repeated pasted text.
Prune stale assumptions and overly broad memories.

6. Share context and investigation across tasks

The cheapest context is the context already summarized well. Do not make the next task rediscover what the last task proved.

Reusable artifact	What it should preserve	Why it saves tokens
Explore findings	Claim, evidence pointer, confidence, open question.	Future tasks read findings instead of raw history.
Request Changes	What failed, exact evidence, correction request, desired output.	Prevents “that didn’t work” follow-up churn.
Review	Findings, severity, diff pointer, residual risk.	Review stays anchored to facts instead of re-reading everything.
Verification	Commands run, outputs summarized, unverified areas.	Next operator knows what is proven and what is not.
Workflow handoff	Role, input artifact, output artifact, gate, stop condition.	Agents and humans can continue without reopening the whole problem.
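A finding is easier to reuse when it has a fixed shape. The sketch below models the Explore findings row from the table (claim, evidence pointer, confidence, open question); the `Finding` class, the rendering format, and the example values used with it are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_LEVELS = ("high", "medium", "low")


@dataclass(frozen=True)
class Finding:
    claim: str
    evidence: str                       # file:line or commit SHA
    confidence: str                     # one of CONFIDENCE_LEVELS
    open_question: Optional[str] = None

    def __post_init__(self) -> None:
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"confidence must be one of {CONFIDENCE_LEVELS}")
        if not self.evidence.strip():
            # Without a pointer this is an assumption, not a finding.
            raise ValueError("a finding needs an evidence pointer")


def render_findings(findings: list[Finding]) -> str:
    """Render findings as compact bullets, highest confidence first."""
    rank = {level: index for index, level in enumerate(CONFIDENCE_LEVELS)}
    lines = []
    for item in sorted(findings, key=lambda f: rank[f.confidence]):
        lines.append(f"- [{item.confidence}] {item.claim} ({item.evidence})")
        if item.open_question:
            lines.append(f"  open question: {item.open_question}")
    return "\n".join(lines)
```

The rendered block is what the next task should load: a few bullets with pointers, in place of the session history that produced them.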

7. Bound subagent output

Subagents are powerful because they can absorb noisy exploration without polluting the main context. That benefit disappears if the subagent returns a transcript-sized report.

Delegate this investigation to a subagent.

Scope: auth middleware and token refresh only.
Tools: Read, Grep, Glob, Bash for git history.
Do not edit files.

Return exactly:
1. Findings, max 8 bullets
2. Evidence pointers: file:line or commit SHA
3. Confidence: high/medium/low
4. Recommended next action
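Because the value of a subagent depends on the size of what it returns, it can be worth checking a report against the contract before pasting it into the main session. This is a rough sketch: `check_report`, its regular expressions, and the sample wording are assumptions, and a real report may need looser matching.

```python
import re

# file:line pointers such as src/auth/refresh.ts:42, or a 7-40 character hex SHA.
POINTER = re.compile(r"[\w./-]+:\d+|\b[0-9a-f]{7,40}\b")
CONFIDENCE = re.compile(r"^\s*(?:\d+\.\s*)?confidence:\s*(high|medium|low)\b", re.I | re.M)
NEXT_ACTION = re.compile(r"^\s*(?:\d+\.\s*)?(?:recommended )?next action:", re.I | re.M)


def check_report(report: str, max_findings: int = 8) -> list[str]:
    """Return the ways a subagent report breaks the output contract."""
    problems = []
    bullets = [line for line in report.splitlines()
               if line.lstrip().startswith(("- ", "* "))]
    if len(bullets) > max_findings:
        problems.append(f"{len(bullets)} bullets exceeds the cap of {max_findings}")
    if not POINTER.search(report):
        problems.append("no evidence pointer (file:line or commit SHA)")
    if not CONFIDENCE.search(report):
        problems.append("missing confidence: high/medium/low")
    if not NEXT_ACTION.search(report):
        problems.append("missing recommended next action")
    return problems
```

An empty list means the report fits the contract; anything else is a reason to send the subagent back with the contract restated rather than to accept a transcript-sized reply.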

8. Prompt with a token budget

Return paths only. Do not read files yet.

Read 40 lines around the match, not the whole file.

Summarize logs into failures, evidence, and next command.

Cap the answer at 10 bullets unless a blocking risk requires more.

/compact focusing on decisions made, evidence pointers, files touched,
unresolved questions, and residual risk.
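The second prompt above, reading a window around a match rather than the whole file, looks like this when done by hand. `read_window` is a hypothetical helper; a radius of 20 gives roughly the 40 lines the prompt asks for.

```python
from itertools import islice


def read_window(path: str, line: int, radius: int = 20) -> str:
    """Return the numbered lines from line - radius to line + radius (1-indexed)."""
    start = max(1, line - radius)
    stop = line + radius
    with open(path, encoding="utf-8", errors="replace") as handle:
        # islice stops reading once the window ends, so the rest of the file is never loaded.
        window = islice(handle, start - 1, stop)
        return "".join(f"{number}: {text}"
                       for number, text in enumerate(window, start=start))
```

Keeping the original line numbers in the excerpt means the result can be cited directly as an evidence pointer.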

9. End-of-task token checklist

Before asking Claude to work	Before continuing the session
Have I named the artifact I want?	Did I save decisions into an artifact?
Have I bounded files, paths, commands, and output length?	Did I compress noisy evidence into pointers?
Have I chosen command, skill, script, or subagent?	Did I update memory, `CLAUDE.md`, or a skill if needed?
Have I set the right permission posture?	Should I use guided `/compact` before the next task?

Hands-on: make one task 50% cheaper

Pick one broad prompt from your current workflow.
Rewrite it with a target, search boundary, output contract, and token budget.
Move any repeated command sequence into a script or skill.
Decide whether noisy discovery belongs in a subagent.
Write the compaction prompt you will use after the task.