Effective CLI Tools for the AI Era

These are my recommendations for building command-line tools that AI agents can discover, understand, and operate effectively, along with some architectural decisions that make them easier to use.

Motivation

I’m on holiday. Each holiday, I pick something simple but interesting to analyse, read and write about, and maybe even build a small PoC for. Sometimes these ideas see the light of day; sometimes they don’t. Handle this text with caution: it is not intended to be a definitive guide, just some notes and guidelines on command-line interfaces (CLIs) and AI, from the perspective of someone who is still learning.

Abstract

Why are we here? AI has changed the way developers interact with tools, and it will keep doing so. AI coding agents, such as Claude Code, Cursor and Codex, are becoming important consumers of developer tools. CLIs have re-emerged as one of the lowest-friction interfaces between a model and an external system.

It’s not my intention to stop you from writing Model Context Protocol (MCP) servers. Do it; they are great. There is enough room in this (AI) world for MCP servers, CLIs and whatever we’ll invent in the future.

The Token Economy

I believe one of the most important aspects of AI today is economics. Not only because of the cost of GPUs, memory, and inference, but also because developers and systems are dealing with a limited resource in their day-to-day work: tokens.

In many current MCP integrations, tool schemas are loaded into the model context before use, which can create significant token overhead. Newer approaches such as dynamic discovery, tool indexing, and schema compression can reduce this, but they are not universally available.

CLIs support progressive disclosure naturally: the model can inspect only the command, subcommand, or help text it needs. The agent can run --help if it needs to know how to invoke a command, and it pays only for that. In published benchmarks and engineering write-ups, MCP-style tool exposure has been reported to consume anywhere from thousands of extra tokens per turn to tens of times more tokens than equivalent CLI workflows, depending on the number of tools and how discovery is implemented.
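To make progressive disclosure concrete, here is a minimal sketch using Python's argparse. The `acme` tool, its subcommands, and its flags are all hypothetical; the point is only that the top-level help stays small and the details of a subcommand are paid for when the agent asks for them.

```python
import argparse
import contextlib
import io


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical `acme` CLI with two subcommands."""
    parser = argparse.ArgumentParser(prog="acme", description="Manage acme resources.")
    sub = parser.add_subparsers(dest="command", metavar="COMMAND")

    ls = sub.add_parser("list", help="List resources.")
    ls.add_argument("--limit", type=int, default=20, help="Maximum items to return.")
    ls.add_argument("--json", action="store_true", help="Emit JSON instead of a table.")

    rm = sub.add_parser("delete", help="Delete a resource.")
    rm.add_argument("resource_id", help="Identifier of the resource to delete.")
    rm.add_argument("--dry-run", action="store_true", help="Preview without deleting.")
    return parser


def help_text(argv: list[str]) -> str:
    """Capture what `acme <argv>` prints to stdout for a --help request."""
    buffer = io.StringIO()
    # argparse prints help and then raises SystemExit; both are handled here.
    with contextlib.redirect_stdout(buffer), contextlib.suppress(SystemExit):
        build_parser().parse_args(argv)
    return buffer.getvalue()


top_level = help_text(["--help"])            # command names and one-line summaries
delete_only = help_text(["delete", "--help"])  # flags for `delete` only
```

An agent that only needs to delete something reads `top_level`, then `delete_only`, and never loads the flags of `list`.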

Because tokens are also reasoning capacity, leaner tool surfaces leave more of the context window available for solving the actual problem.

Shell interactions are abundant across public repositories, tutorials, and Q&A sites, so command-style usage is a well-represented pattern for code-capable models. Agents are already familiar with git, curl, jq, grep, docker, and kubectl, and can compose commands without a schema being spelled out first. Coding agents also benefit from one of the greatest inventions of the Unix philosophy: pipes. Agents can chain one tool’s output into another’s input to solve problems elegantly and efficiently, something a GUI cannot offer.

Should I build a CLI instead of an MCP server?

After reading the articles listed in the bibliography section, my short answer is: build a CLI when you need deterministic, developer-style work. MCP may be a better fit when you need a managed integration surface around authentication, permissions, hosted services, policy enforcement, and auditing (assuming those capabilities are actually implemented by the host, server, or surrounding platform).

Agent-Native CLIs

When writing CLIs, I used to focus on Developer Experience, or Human Developer Experience to be more specific: I tended to give the (human) user enough tooling to make up for the lack of a GUI. I think the game has now changed, and we have to focus on Agent Developer Experience too.

  • The first thing I found out is that the CLI is the docs, not that six-month-old README file that was never updated after v1.
  • The same command should produce the same output across different environments. Be predictable, not clever.
  • Every irrelevant or superfluous field or parameter costs tokens, and we need to save tokens and reasoning capacity as much as possible. Let the caller ask for more if it needs to.
  • Prepare for the unexpected. Humans sometimes hallucinate; agents do this constantly. Every input should be validated accordingly.
  • Fail in a machine-readable way. An error the agent can parse and understand, and that lets it retry intelligently, is far better than letting the agent guess.
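As an illustration of validating agent-supplied input, here is a small sketch for one argument type, a relative resource path. The function name, the `InputRejected` error type, and the policy itself (reject control characters, double-encoded values, absolute paths, and `..` segments) are assumptions chosen for the example, not a complete hardening scheme.

```python
import unicodedata
from pathlib import PurePosixPath
from urllib.parse import unquote


class InputRejected(ValueError):
    """Raised when an argument fails validation (hypothetical error type)."""


def validate_resource_path(raw: str) -> str:
    """Validate a relative resource path supplied by an untrusted caller."""
    decoded = unquote(raw)
    # If a second round of percent-decoding still changes the value,
    # the input was encoded twice (e.g. "%252e" -> "%2e" -> ".").
    if unquote(decoded) != decoded:
        raise InputRejected("double-encoded input")
    # Unicode category "Cc" covers NUL, newlines, escapes and other controls.
    if any(unicodedata.category(ch) == "Cc" for ch in decoded):
        raise InputRejected("control character in input")
    path = PurePosixPath(decoded)
    if path.is_absolute() or ".." in path.parts:
        raise InputRejected("path traversal")
    return str(path)
```

Raising a dedicated exception type keeps the rejection reason available, so the caller can turn it into a structured, machine-readable error rather than a stack trace.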

Some recommendations

Here are the conventions I’d start with when making a CLI easier for agents to use. You don’t need to add everything at once; the first ones will give you most of the value.

  • Structured output: Every command that returns data should support a JSON mode, either with --output json or just --json. Keep the human-formatted default for operators.
  • Runtime introspection: Let the agent discover capabilities at runtime rather than relying on documentation. A --describe, --help or schema command that shows the command tree, parameters, types, and required scope as JSON makes the CLI self-describing. I believe this is the single highest-leverage feature for discoverability.
  • Shaping the response: APIs and tools often return large blobs; let the agent request only the fields it needs.
  • Pagination: For large result sets, implement pagination. For streaming or very large outputs, consider NDJSON. By paginating, the agent does not need to buffer a giant array into context.
  • Safe mutation: Some CLI operations are destructive. A --dry-run flag matters more than ever, because it lets an agent validate a destructive operation before committing it.
  • Idempotent mutations: Agents retry aggressively, and the dangerous case is the ambiguous failure: a timeout where the write may already have landed, so a blind retry duplicates it. Let the caller pass an idempotency key so repeats are safe, and structure errors so the agent can tell “definitely did not apply” (retry) from “outcome unknown” (verify first).
  • Structured errors: As said before, a structured object describing an error is a gold mine for an agent. An error code, a message, a hint, and even a retryable field enable graceful failure handling and intelligent retries.
  • Non-interactive by default: Agents should not have to answer an “Are you sure? [y/N]” prompt interactively, so remove or bypass interactive prompts. Some tools may choose to formalise this as an explicit “agent mode” alongside a deterministic “human mode”.
  • Input hardening: Reject control characters, path traversals, unexpected query parameters, and double-encoded strings. Fuzz your inputs with the kinds of mistakes agents actually make. Treat the agent as an untrusted operator because it is.
  • Headless auth: Support non-interactive authentication via API keys, tokens, environment variables, or service accounts. Avoid authentication flows that require a browser redirect, for example.
  • Exit codes: Last but not least, honour standard exit codes. 0 is success; non-zero is failure. Keep stdout for data and stderr for diagnostics, so output pipes cleanly into the next command.
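Three of the conventions above (structured output, structured errors, and exit codes with a clean stdout/stderr split) fit into one short sketch. The `acme` program name, the in-memory `ITEMS` data, the `NOT_FOUND` code, and the choice of exit code 1 are all illustrative assumptions.

```python
import argparse
import json
import sys

ITEMS = [{"id": "a1", "name": "alpha"}, {"id": "b2", "name": "beta"}]  # stand-in data


def run(argv, stdout=None, stderr=None) -> int:
    """Look up one item; return the process exit code (0 success, 1 failure)."""
    stdout = stdout or sys.stdout
    stderr = stderr or sys.stderr
    parser = argparse.ArgumentParser(prog="acme")
    parser.add_argument("item_id")
    parser.add_argument("--json", action="store_true", help="Emit JSON.")
    args = parser.parse_args(argv)

    match = next((item for item in ITEMS if item["id"] == args.item_id), None)
    if match is None:
        error = {
            "code": "NOT_FOUND",
            "message": f"no item with id {args.item_id!r}",
            "hint": "list the available ids and try again",
            "retryable": False,
        }
        # Diagnostics go to stderr so stdout stays clean for pipes.
        if args.json:
            print(json.dumps({"error": error}), file=stderr)
        else:
            print(f"error: {error['message']} ({error['hint']})", file=stderr)
        return 1

    if args.json:
        print(json.dumps(match), file=stdout)
    else:
        print(f"{match['id']}\t{match['name']}", file=stdout)  # human default
    return 0
```

A real entry point would end with `sys.exit(run(sys.argv[1:]))`, so the return value becomes the exit code the agent branches on.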

I’ve turned these conventions into a checklist you can actually run against a CLI; see the conformance checklist near the end of this post.

Discovering

Good flags make a CLI usable; the layers below make it discoverable declaratively.

Originally introduced by OpenAI for Codex, AGENTS.md is now an open format stewarded by the Agentic AI Foundation under the Linux Foundation. It has been adopted across tens of thousands of repositories and is read by Cursor, GitHub Copilot, Gemini CLI, and others. Claude Code is a notable exception in practice: its documented project-memory mechanism is CLAUDE.md, although teams can still bridge the two formats manually. Keep the file concise: it holds instructions that must fit in the model’s attention window, not full documentation.

Agent Skills, originally introduced by Anthropic, take a complementary approach: they use SKILL.md as an open format for packaging instructions plus runnable scripts that an agent loads on demand. The discovery model is progressive disclosure, in three stages:

  1. Discovery — the agent loads only each skill’s name and description.
  2. Activation — when a task matches, it reads the full SKILL.md.
  3. Execution — it follows the instructions, loading referenced files or running bundled code as needed.

A minimal skill is a folder with a SKILL.md (YAML frontmatter + Markdown body) and optional scripts/, references/, and assets/. Shipping your CLI with a skill file may become an important distribution path for agent-friendly tooling.
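The folder layout described above can be scaffolded in a few lines. This is a sketch only: the `acme-cli` skill name, the description, and the body text are hypothetical, and the frontmatter is limited to the `name` and `description` fields mentioned in the discovery step.

```python
from pathlib import Path

# Hypothetical SKILL.md content: YAML frontmatter followed by a Markdown body.
SKILL_TEMPLATE = """\
---
name: {name}
description: {description}
---

# Using {name}

Run `acme --help` to list commands, then `acme COMMAND --help` for details.
Prefer `--json` output, and pass `--dry-run` before any mutating command.
"""


def scaffold_skill(root: Path, name: str, description: str) -> Path:
    """Create <root>/<name>/ with SKILL.md and the optional subfolders."""
    skill_dir = root / name
    for sub in ("scripts", "references", "assets"):
        (skill_dir / sub).mkdir(parents=True, exist_ok=True)
    skill_file = skill_dir / "SKILL.md"
    skill_file.write_text(
        SKILL_TEMPLATE.format(name=name, description=description),
        encoding="utf-8",
    )
    return skill_file
```

Calling `scaffold_skill(Path("skills"), "acme-cli", "Operate the acme CLI safely.")` would produce a folder an agent can discover by name and description, then read in full only when a task matches.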

One more thing

CLI execution typically inherits the invoking session’s permissions. Give each agent the narrowest credential and tool set its task requires. An agent doing code review does not need filesystem-wide write access.
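One small, partial way to narrow what a spawned CLI inherits is to pass it an allow-listed environment instead of the full session environment. The sketch below assumes credentials travel in environment variables; the variable names in the usage note are hypothetical, and this does nothing about filesystem or network permissions, which need OS-level or sandbox controls.

```python
import os
import subprocess
from typing import Optional


def scoped_env(allowed: set[str], extra: Optional[dict[str, str]] = None) -> dict[str, str]:
    """Copy only allow-listed variables from the session, then add task-specific ones."""
    env = {key: value for key, value in os.environ.items() if key in allowed}
    env.update(extra or {})
    return env


def run_scoped(cmd: list[str], allowed: set[str],
               extra: Optional[dict[str, str]] = None) -> subprocess.CompletedProcess:
    """Run cmd with the narrowed environment and capture its output as text."""
    return subprocess.run(
        cmd,
        env=scoped_env(allowed, extra),
        capture_output=True,
        text=True,
        check=False,  # let the caller inspect the exit code
    )
```

For a code-review task, something like `run_scoped(["acme", "review"], {"PATH", "HOME"}, {"ACME_TOKEN": read_only_token})` would hand the tool a read-only token and nothing else from the session.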

My Agent-Native CLI conformance checklist

This is a starting rubric I’m experimenting with, not a standard. Treat the tiers as one opinionated way to prioritise, not gospel.

How the tiers gate the verdict

  • Critical — a FAIL makes the tool Non-conformant; an unverified PARTIAL makes it Provisional.
  • Important — anything less than PASS yields Conformant (with advisories).
  • Recommended — never blocks the verdict; counted separately as quality extras.

Mark each: [x] pass · [~] partial (advertised, unverified) · [ ] fail · [-] N/A.

Marking rule for behaviour you cannot observe directly. Some controls can only be confirmed with a stateful test harness (e.g. proving --dry-run actually suppressed the side effect, or that an idempotency key prevented a duplicate). When the Verify step can only confirm that the feature is advertised — not that it behaves correctly — default to [~], not [x]. A control is only [x] when its effect has been observed.
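The gating rules above are simple enough to express as a function. This is one reading of them, with two assumptions of mine: N/A marks are ignored, and a tool with no failures or partials anywhere in Critical and Important is plain "Conformant".

```python
from enum import Enum


class Mark(Enum):
    PASS = "x"
    PARTIAL = "~"
    FAIL = " "
    NA = "-"


def score(critical: list[Mark], important: list[Mark],
          recommended: list[Mark]) -> tuple[str, int]:
    """Return (verdict, number of Recommended extras passed)."""
    extras = sum(1 for mark in recommended if mark is Mark.PASS)  # never blocks
    if Mark.FAIL in critical:
        return "Non-conformant", extras
    if Mark.PARTIAL in critical:
        return "Provisional", extras
    # Assumption: N/A is neither a pass nor a shortfall, so it is skipped here.
    if any(mark in (Mark.PARTIAL, Mark.FAIL) for mark in important):
        return "Conformant (with advisories)", extras
    return "Conformant", extras
```

Because of the marking rule, a control that is advertised but not observed should be passed in as `Mark.PARTIAL`, which is what moves a tool to Provisional or to an advisory.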

Critical — must pass for conformance

  • [ ] MO-1 · JSON / structured output mode (Machine output) — Every data-returning command offers --json / --output json. Verify: run the JSON command; stdout parses as JSON/NDJSON, exit 0.
  • [ ] MO-2 · JSON output is valid & clean on stdout (Machine output) — JSON mode emits only parseable JSON to stdout — no log noise. Verify: JSON command stdout is valid JSON with exit 0.
  • [ ] DI-1 · --help exits 0 with usage on stdout (Discoverability) — Help is requestable, succeeds, and prints to stdout (not stderr). Verify: --help exits 0 with non-trivial stdout.
  • [ ] SA-1 · Rejects unknown flags with a non-zero exit (Safety) — Unknown input fails fast — no silent accept, no hang. Verify: a bogus flag returns a non-zero exit code.
  • [ ] NI-1 · Never blocks on interactive input (Non-interactive) — Runs to completion without waiting on a TTY/prompt. Verify: blocking-stdin probe does not time out.
  • [ ] EX-1 · Honours exit-code convention (Exit codes) — Success exits 0; any failure exits non-zero, so callers and pipes can branch on the result. Verify: a known-valid command exits 0; a known-invalid command exits non-zero.
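Four of the Critical checks can be observed from the outside with nothing more than a subprocess. The sketch below is a rough probe, not a full harness: the function names are mine, the caller supplies a known-valid and a known-invalid invocation, and "non-trivial stdout" is reduced to "not empty". The JSON checks (MO-1, MO-2) are left out because they depend on which command returns data.

```python
import subprocess


def run_closed(cmd, timeout):
    """Run cmd with stdin closed and capture stdout/stderr as text."""
    return subprocess.run(
        cmd, stdin=subprocess.DEVNULL, capture_output=True,
        text=True, timeout=timeout, check=False,
    )


def completes_with_open_stdin(cmd, timeout):
    """NI-1: keep stdin open but silent; a tool that waits for input hits the timeout."""
    proc = subprocess.Popen(
        cmd, stdin=subprocess.PIPE,
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        return False
    finally:
        proc.stdin.close()


def probe_critical(base_cmd, valid_args, invalid_args, timeout=10.0):
    """Return a pass/fail map for DI-1, SA-1, NI-1 and EX-1."""
    help_run = run_closed(base_cmd + ["--help"], timeout)
    bogus_run = run_closed(base_cmd + ["--definitely-not-a-real-flag"], timeout)
    valid_run = run_closed(base_cmd + valid_args, timeout)
    invalid_run = run_closed(base_cmd + invalid_args, timeout)
    return {
        "DI-1": help_run.returncode == 0 and bool(help_run.stdout.strip()),
        "SA-1": bogus_run.returncode != 0,
        "NI-1": completes_with_open_stdin(base_cmd + valid_args, timeout),
        "EX-1": valid_run.returncode == 0 and invalid_run.returncode != 0,
    }
```

Usage would look like `probe_critical(["acme"], ["list", "--json"], ["get", "missing-id"])`, where `acme` and its arguments stand in for the tool under test.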

Important — failing these triggers advisories

  • [ ] DI-2 · Machine-readable self-description (Discoverability) — A --describe / --help --json that emits the command tree as JSON. Verify: such a command returns valid JSON.
  • [ ] DI-3 · Reports --version (Discoverability) — Agents can pin behaviour to a version. Verify: --version exits 0 and prints a version.
  • [ ] SA-2 · --dry-run for mutating operations (Safety) — Destructive actions can be validated before committing. Verify: the dry-run command exits 0 and previews the action. (Suppression of the real side effect needs a stateful harness — default [~] until observed.)
  • [ ] SA-3 · Explanatory, structured errors (Safety) — Errors are clear, on stderr, and structured in JSON mode — ideally with a code, a message, a hint, and a retryable field. Verify: invalid input yields an explanatory stderr message; in JSON mode the error is structured.
  • [ ] SA-4 · Idempotent / retry-safe mutations (Safety) — Mutations accept an idempotency key (or are safe to repeat by construction), and errors let the agent tell “definitely did not apply” (safe to retry) from “outcome unknown” (verify first). Verify: the tool accepts --idempotency-key (or documents idempotent semantics); re-running the same keyed mutation produces one effect. (Needs a stateful harness — default [~] until observed.)
  • [ ] SE-1 · Secrets never leak through the surface (Security) — Credentials are accepted via env var, file, or stdin — not required as positional args — and never echoed to stdout or printed in help. Verify: the secret value does not appear in stdout or --help; a non-arg input path (env/file/stdin) exists.
  • [ ] MO-5 · Bounded default output (Machine output) — A default invocation does not dump unbounded data into context; it caps results (a default limit) or paginates, and signals when output was truncated. Verify: default output of a list/query command is bounded and/or carries a truncation or next-page indicator.
  • [ ] UX-1 · Diagnostics → stderr, data → stdout (Conventions) — Stdout stays clean and pipeable; logs/errors go to stderr. Verify: error output appears on stderr, not stdout.
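SA-2 and SA-4 are the two controls that need a stateful harness, so it helps to see what the behaviour under test looks like. Below is a minimal in-memory sketch; the `Store` class, the result shape, and the `replayed` field are assumptions for illustration, and a real tool would persist the key-to-result mapping on the server side.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Store:
    records: list = field(default_factory=list)
    applied: dict = field(default_factory=dict)  # idempotency key -> first result


def create_record(store: Store, payload: dict, idempotency_key: str,
                  dry_run: bool = False) -> dict:
    """Create a record at most once per idempotency key."""
    # A repeated key returns the original result instead of writing again.
    if idempotency_key in store.applied:
        return {**store.applied[idempotency_key], "replayed": True}
    # Dry run: describe the change but leave all state untouched.
    if dry_run:
        return {"dry_run": True, "would_create": payload}
    record = {"id": str(uuid.uuid4()), **payload}
    store.records.append(record)
    result = {"created": record, "replayed": False}
    store.applied[idempotency_key] = result
    return result
```

A harness can then observe the effect directly: run the same keyed mutation twice and count records, or run with `dry_run=True` and confirm the count did not change. Only after that observation does the control move from [~] to [x].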

Recommended — quality extras (non-blocking)

  • [ ] MO-3 · Field selection / response shaping (--fields) (Machine output) — Callers can request only the fields they need. Verify: a fields command returns a reduced JSON object.
  • [ ] MO-4 · Streamable pagination / NDJSON (Machine output) — Large result sets stream as one JSON object per line. Verify: the NDJSON command emits multi-line NDJSON.
  • [ ] NI-2 · Non-interactive flags (--yes / --no-input) (Non-interactive) — An explicit switch guarantees no prompting in automation. Verify: the non-interactive command completes without blocking.
  • [ ] UX-2 · Concise, context-friendly help (Conventions) — Help is complete but small enough to not bloat agent context. Verify: help on stdout is roughly ≤ ~1,500 tokens.
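MO-3 and MO-4 combine naturally: shape each record down to the requested fields, then emit one JSON object per line. A minimal sketch follows, with hypothetical function names and the assumption that unknown field names are silently skipped rather than rejected.

```python
import json
from typing import Iterable, Iterator, Optional


def select_fields(record: dict, fields: list[str]) -> dict:
    """Keep only the requested keys, in the order the caller asked for them."""
    return {key: record[key] for key in fields if key in record}


def to_ndjson(records: Iterable[dict],
              fields: Optional[list[str]] = None) -> Iterator[str]:
    """Yield one compact JSON document per record (NDJSON lines)."""
    for record in records:
        shaped = select_fields(record, fields) if fields else record
        yield json.dumps(shaped, separators=(",", ":"))
```

Because `to_ndjson` is a generator, a command can print each line as it is produced, so neither the tool nor the agent has to hold the whole result set at once.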

Conclusion

The goal is not to make CLIs replace MCP servers, APIs, or GUIs. The goal is to recognise that CLIs remain one of the most practical interfaces for agentic software work. If we make them structured, discoverable, deterministic, and safe to retry, they become not only better tools for humans, but better tools for agents too.

Bibliography

  1. MindStudio. MCP vs CLI in Agentic Workflows: 35x Token Overhead and 72% vs 100% Reliability. 10 May 2026. mindstudio.ai
  2. MindStudio. Claude Code MCP Servers and Token Overhead: What You Need to Know. 2 April 2026. mindstudio.ai
  3. xpander.ai. MCP vs CLI for AI Agents. 29 March 2026. xpander.ai
  4. Anthropic. Code execution with MCP: building more efficient AI agents. 4 November 2025. anthropic.com
  5. OpenAI. OpenAI co-founds the Agentic AI Foundation under the Linux Foundation. 9 December 2025. openai.com
  6. Anthropic. Equipping agents for the real world with Agent Skills. 16 October 2025. anthropic.com
