Engineering

AI command generation for SSH: lessons from 10,000 users

What we learned about prompt design, context windows, and trust after a year of shipping AI-powered command suggestions to mobile SSH users.

CC Chen Chen· Founder·May 14, 2026·9 min read

Why SSH is harder than chat

Asking an LLM to write a shell command sounds like a small problem. It isn't. In a chat product, a wrong answer means the user re-reads or asks again. In a shell, a wrong answer means a command runs against a production server. The threshold for "good enough" is much higher when the output gets executed.

We shipped the AI command helper in TermAI v0.4 and have watched a year of telemetry. About a third of users use it daily; about half use it weekly; a quarter rarely touch it. The daily users are the ones we listen to most carefully — they spotted every paper cut and have opinions.

Context: how much, and what

The first version sent the user's question and nothing else. The model then guessed at a shell context that often didn't match — assuming Ubuntu when you were on Alpine, assuming bash when you were on zsh, assuming you were root when you weren't.

So we added context. But how much?

The cheap thing to do is send the full terminal scrollback. We didn't, for three reasons:

  1. Privacy. Scrollback contains everything — file contents, output, secrets that scrolled past. Shipping that to a model by default felt wrong.
  2. Token cost. Long contexts are expensive. AI helper usage is in the free tier; we couldn't afford a 10× cost per call.
  3. Quality. Counter-intuitively, more context often hurt answers. The model would key off some old error in the scrollback that wasn't relevant to the new question.

The shipped behavior: the helper sees the working directory, the last command's exit code, and the last 5 lines of output. That's enough to ground the answer in this session without leaking the rest. The proxy strips IPs, hostnames, and obvious secrets before forwarding. Users can opt out of context entirely if they want bare prompts.

Prompt design that works

Prompt engineering deserves its own essay, but the pattern that converged after a year of iteration looks like this:

System: You are a shell command generator. Output ONLY a runnable
command, no explanation. If the request is ambiguous, generate the
most common interpretation. Do not include backticks or markdown.
Target shell: {detected_shell}
OS: {detected_os}

User: {question}

Context (last 5 lines of output):
{context_lines}

Two things to notice. First, we tell the model to output the command and only the command — no preamble. Mobile users want the result, not a tutorial. Second, the system prompt says "common interpretation" instead of asking for clarification. Asking for clarification through chat is fine in ChatGPT; in a terminal it's an extra round-trip that breaks flow. If the user wanted something more specific, they'll edit the result.

Confirm-before-execute

Generated commands never auto-run. The helper produces a suggestion; you tap to send to the terminal. This sounds obvious, but it's the single most important UX decision in the whole product. Users trust the helper precisely because it doesn't surprise them.

Three actions on every suggestion: Run sends it as-is, Edit drops it into the input line for tweaking, Dismiss closes the suggestion. We considered an "auto-run on simple commands" mode and shipped it briefly. Users hated it. Even ls needs the tap, because the next thing the helper suggests might not be ls.

Pasting errors as context

The most-used flow turned out to be error analysis, not command generation. The pattern: user runs a command, it fails, they select the error output and tap "Ask AI about this". The helper gets the error, the failing command, and the last few lines of scrollback as context — and explains what's wrong.

This single feature is responsible for maybe 40% of helper usage. Error messages are gnarly and intimidating; an LLM is genuinely good at parsing them. We added a fast-path that detects when the input looks like an error (matches known patterns from sshd, nginx, systemd, docker, etc.) and routes to a smaller model with a tighter error-explanation prompt. Faster, cheaper, often better.

When we route to which model

This is the part that changes most often. By the time you read this, the specifics will be different. The principle won't.

We use multiple models. Cheap fast models for the high-volume cases (simple commands, error parsing), slower smarter models for the hard cases (complex multi-step requests, "set up nginx with TLS and a redirect"). A small classifier on the client side decides which to route to based on token count, presence of certain keywords, and the user's tier.

Free-tier users get the fast model for everything (it's good enough for 80%+ of cases). Pro users get the smarter model by default with the fast model as a fallback during peak hours.

Five lessons we learned the hard way

  1. Latency beats quality. A 200ms decent answer is better than an 800ms great answer. Mobile users will dismiss anything that feels slow and just type the command themselves.
  2. Context is a knob, not a maximum. More context often hurts. We tune per query type.
  3. Never auto-run. The product is built on trust. One surprise is enough to lose it.
  4. Explain failures, don't just translate them. A good error analysis says "this means X, try Y" — not "this error means Y/n prompt was unanswered".
  5. Limit the free tier in calls, not features. 20 calls/day with the full feature set converts much better than unlimited calls with a watered-down model. Users who'd benefit from upgrading do; users who don't, don't feel pinched.
Try TermAI

Free on iOS and Android. 3 SSH connections + 20 AI calls/day on the free tier.

CC
Chen Chen — Founder of TermAI

Writes about mobile DevOps, terminal UX, and the surprising depth of "boring" infrastructure.

💬 Discuss this article: Hacker News · Reddit · V2EX
Was this useful? ← Back to blog