Writing My Own Coding Agent Part I - Streaming

Claude Code is genuinely great. It holds a multi-file change in its head across dozens of tool calls, streams its reasoning while it works, and hands control back at the right moments instead of steamrolling. That’s the bar I’m trying to clear. But I’m building my own, and this series is the running journal.

Why build another one

The first reason is usage limits. I keep hitting "You've used 90% of your session limit on Opus" every few hours, and when that happens I don't want to stop working; I want to switch models. My own harness lets me flip from Opus to Sonnet to Qwen to Minimax without leaving the flow, and the cheaper models are often good enough for the task in front of me.

The second is customization that actually sticks. Claude Code's memory doesn't reliably persist the small preferences I care about. I've asked it, more than once, not to add the Co-authored-by: Claude trailer in PR descriptions and to always squash-and-delete the branch after merge. It remembers for a while, then forgets. Building the harness myself means I control what goes into the system prompt: the preferences live in code, not in a file the model might or might not honor.

The third is that I can add security guardrails on bash tools — whatever allowlist, denylist, or sandboxing policy I want, enforced at the tool layer instead of relied on by prompt.

The plan is to mirror the Claude Code shape: terminal UI first, then an IDE plugin — the same ramp a normal dev workflow takes, so I can eat my own dogfood at each step.

Agent framework

There are plenty of third-party agent frameworks (LangChain, LangGraph, the Vercel AI SDK), and I've used some of them. The trouble is it's hard to see what's actually happening under the hood. And it turns out writing your own isn't hard. Trust me: it's really simple. An agent is just a while loop between LLM API calls and tool calls, with an adapter layer for each provider you support. That's it.

Streaming APIs

The two providers I’m wiring up first both expose streaming over server-sent events, but the event shapes differ enough that the adapter layer earns its keep.

The stream is usually made up of three kinds of content:

  • Text. The assistant’s reply arrives chunk by chunk.
  • Tool use. Tool-call blocks arrive incrementally, often as partial JSON, before the agent decides whether to execute them.
  • Thinking. Some providers stream the model’s reasoning separately from the final answer, with its own start/end markers.

Providers mark these boundaries with dedicated events. Anthropic, for example, emits message_start, content_block_start, content_block_delta (whose payload is a text_delta, thinking_delta, or input_json_delta), content_block_stop, and message_stop, so the client can rebuild the final message incrementally.

Most of the engineering work is in handling those provider-specific events and normalizing them into a single consistent stream before they reach the rest of the agent.
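To make that concrete, here is one possible shape for a normalized event union. Every name in it is a placeholder of my own, not a settled API; the adapter's job is then a switch that maps each provider event onto one of these:

// A sketch of a provider-agnostic event union (all names are placeholders).
type AgentEvent =
  | { type: "message_start" }
  | { type: "text_delta"; text: string }
  | { type: "thinking_delta"; text: string }
  | { type: "tool_use_start"; id: string; name: string }
  | { type: "tool_input_delta"; partialJson: string }
  | { type: "message_end"; stopReason: "end_turn" | "tool_use" | "max_tokens" };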

EventStream

The job of the adapter is to take a provider-specific stream and expose it as a single consistent interface to the rest of the agent. Here’s what Anthropic’s stream looks like out of the box:

import { Anthropic } from "@anthropic-ai/sdk";
const client = new Anthropic();

const stream = client.messages.stream({
  model: "claude-opus-4-7",
  messages: [{ role: "user", content: "Hello" }],
  max_tokens: 256
});

for await (const event of stream) {
  // handle the event
}
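Each event is a small object discriminated on event.type. As a minimal sketch, a handler that prints assistant text as it streams and ignores everything else might look like this:

for await (const event of stream) {
  switch (event.type) {
    case "content_block_delta":
      // text_delta carries the next chunk of assistant text
      if (event.delta.type === "text_delta") {
        process.stdout.write(event.delta.text);
      }
      break;
    case "message_stop":
      // the assistant message is complete
      break;
  }
}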

The for await (const event of stream) line is syntactic sugar for the AsyncIterable protocol — it lets you iterate over values that arrive asynchronously over time.
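Desugared, that loop is roughly the following (the early-exit path through iterator.return() is omitted):

// Roughly what for await does under the hood.
const iterator = stream[Symbol.asyncIterator]();
while (true) {
  const { value: event, done } = await iterator.next();
  if (done) break;
  // handle the event
}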

I want to keep that for await ergonomics for the consumer, so the adapter wraps the provider stream in another AsyncIterable — call it EventStream — that emits canonical events instead of provider-specific ones:

class EventStream<T, R> implements AsyncIterable<T> {
  push(event: T): void;
  close(): void;
  abort(error: Error): void;
  setFinalOutput(output: R): void;
  getFinalOutput(): R;
  [Symbol.asyncIterator](): AsyncIterator<T>;
}

class AssistantMessageEventStream extends EventStream<AnthropicEvent, AssistantMessage> {}

AssistantMessageEventStream is just a typed alias so the rest of the app can work with one stream abstraction. For now it still carries Anthropic's raw event type; mapping those onto a canonical event union is the adapter's next job.
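The declaration above only shows the shape. Here is a minimal sketch of how the internals could work, buffering events when no reader is waiting and handing them straight to a pending reader otherwise; abort() and any backpressure policy are left out:

// A sketch, not the final implementation: a FIFO of buffered events plus
// a list of parked readers waiting for the next event.
class EventStream<T, R> implements AsyncIterable<T> {
  private buffer: T[] = [];
  private waiters: Array<(r: IteratorResult<T>) => void> = [];
  private done = false;
  private finalOutput?: R;

  push(event: T): void {
    // Hand the event to a waiting reader if there is one, otherwise buffer it.
    const waiter = this.waiters.shift();
    if (waiter) waiter({ value: event, done: false });
    else this.buffer.push(event);
  }

  close(): void {
    this.done = true;
    // Wake up anyone still waiting and tell them the stream is finished.
    for (const waiter of this.waiters.splice(0)) {
      waiter({ value: undefined, done: true });
    }
  }

  setFinalOutput(output: R): void {
    this.finalOutput = output;
  }

  getFinalOutput(): R {
    // Assumes setFinalOutput was called before the stream was drained.
    return this.finalOutput!;
  }

  [Symbol.asyncIterator](): AsyncIterator<T> {
    return {
      next: (): Promise<IteratorResult<T>> => {
        if (this.buffer.length > 0) {
          return Promise.resolve({ value: this.buffer.shift()!, done: false });
        }
        if (this.done) {
          return Promise.resolve({ value: undefined, done: true });
        }
        // Nothing buffered yet: park the reader until push() or close() settles it.
        return new Promise((resolve) => this.waiters.push(resolve));
      },
    };
  }
}

The one invariant that matters: every call to next() either drains the buffer, resolves immediately as done, or parks a resolver that a later push() or close() will settle.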

In the adapter, we read from the provider’s stream and push each event into our own:

const anthropicStream = client.messages.stream({
  model: "claude-opus-4-1",
  messages: [{ role: "user", content: "Hello" }],
  max_tokens: 256,
});
const eventStream = new AssistantMessageEventStream();
for await (const event of anthropicStream) {
  eventStream.push(event);
}
return eventStream;

And the consumer keeps using for await without caring which provider is on the other end:

for await (const event of eventStream) {
  // handle the event
}

The async IIFE trick

There’s one subtlety that’s easy to miss. The snippet above looks fine, but if you wrap it in a function as written, you’ll break streaming. Here’s the broken version:

async function getStream() {
  const eventStream = new AssistantMessageEventStream();
  const anthropicStream = client.messages.stream({
    model: "claude-opus-4-1",
    messages: [{ role: "user", content: "Hello" }],
    max_tokens: 256,
  });
  for await (const event of anthropicStream) {
    eventStream.push(event);
  }
  return eventStream;
}

The problem: for await blocks. The function won’t return eventStream until the provider stream is fully drained, which defeats the whole point of streaming — the consumer sees nothing until the model is done generating.

The fix is an async IIFE (immediately invoked function expression) — a function that’s defined and called in one go:

// standard IIFE
(function () { /* … */ })();

// async IIFE
(async () => { /* … */ })();

We use it to kick off the async work without awaiting it, so the outer function can return the EventStream synchronously while events keep flowing into it in the background:

function getStream() {
  const eventStream = new AssistantMessageEventStream();

  (async () => {
    const anthropicStream = client.messages.stream({
      model: "claude-opus-4-1",
      messages: [{ role: "user", content: "Hello" }],
      max_tokens: 256,
    });
    for await (const event of anthropicStream) {
      eventStream.push(event);
    }
    // finalMessage() resolves to the fully assembled assistant message
    eventStream.setFinalOutput(await anthropicStream.finalMessage());
    eventStream.close();
  })().catch((err) => eventStream.abort(err));

  return eventStream;
}

Now the caller gets the EventStream immediately and can start iterating. The IIFE runs in the background, pushing events as the provider produces them, recording the assembled final message, and closing the stream when the provider is done; if anything throws, the catch aborts the stream so the consumer sees the error instead of hanging.

Agent loop

With streaming sorted, the agent loop itself is almost anticlimactic. Each turn is a while loop between LLM calls and tool calls: if the stop reason is tool_use, run the tool and feed the result back. Anything else — end_turn (the final answer) or max_tokens — and the turn ends.

// One agent turn: keep calling the model until it stops asking for tools.
async function* run(input: string) {
  messages.push({ role: "user", content: [{ type: "text", text: input }] });

  while (true) {
    // Stream one assistant reply, forwarding every event to the caller.
    const stream = provider.stream({ messages, tools });
    for await (const event of stream) yield event;

    const reply = stream.getFinalOutput();
    messages.push(reply);
    // end_turn or max_tokens: nothing left to execute, so the turn is over.
    if (reply.stop_reason !== "tool_use") break;

    // Run each requested tool and feed the results back as a user message.
    const results = [];
    for (const block of reply.content) {
      if (block.type !== "tool_use") continue;
      const output = await callTool(block.name, block.input);
      results.push({ type: "tool_result", tool_use_id: block.id, content: output });
    }
    messages.push({ role: "user", content: results });
  }
}
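Driving a turn from the outside is then one more for await. A hypothetical caller, assuming the adapter still passes Anthropic's events straight through:

// Hypothetical caller: print assistant text as it streams in.
for await (const event of run("add a --dry-run flag to the CLI")) {
  if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
    process.stdout.write(event.delta.text);
  }
}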

What’s next

In Part 2, I’ll cover making the terminal UI work and the challenges of streaming markdown in a terminal.