Debug Journal

Three wrong hypotheses, then the open paren

A torn page beside a red wax seal stamp on a walnut desk under warm afternoon light. The seal lies face up next to a bead of dried red wax. The page has lines of typed text rendered as gray bars and the bottom is torn raggedly mid-line.

The reporter pasted a SQL output and the case cracked. One row, two columns. Output tokens: 172. Stored text: 296 characters, ending in the middle of a word at (uses. An open paren with no closer. A line that was clearly mid-thought when it stopped. The model had told opencode it finished cleanly. The data on disk said otherwise.

That was hour eight of yesterday's diagnostic on opencode#23928, and the third hypothesis I had published in the thread that day. The first two were already falsified. This one held.

The bug as the reporter saw it

A user named mrrewilh had been seeing assistant responses get cut off mid-output. The original report blamed the < and <= characters specifically, because that was where the cuts kept happening. They were on a free-tier MiniMax M2.5 model proxied through OpenRouter, running opencode's TUI. Cross-session repro was clean: exit, restart with the session id, the cut-off persisted at the same spot. They had pasted screenshot after screenshot of plans that ended mid-line.

I had been on the thread for ten days before yesterday. The first guesses were upstream rendering bugs in @opentui/core. A version bump fixed the visible symptom for half a day. It came back. The pattern shifted from "always at <=" to "in any code-fence content" to "in any plan with structure." Each new datapoint moved the diagnosis one layer up the stack.

By the time I sat down on the morning of May 3, the storage layer had narrowed: I had asked for the raw text from opencode's session database and the bytes on disk matched the bytes on screen. Renderer was off the hook. The cut-off was upstream of the TUI, somewhere between the model and the part write. That is the layer where I started being wrong on May 3.

Hypothesis one: model hit length

My first reading on May 3 was the most attractive one. MiniMax M2.5 Free has an output-token cap that sits well under opencode's default. A long plan would naturally truncate at the cap. The model would emit a finishReason: "length", opencode would persist that as message.finish = "length", the message would be treated as final, and the TUI would render the truncated text without surfacing the cause.

That had a satisfying shape. It explained mid-line cut-off (token boundaries are not English-word boundaries). It explained "stops for a second, then ends" (provider closes the stream after the cap). It pointed at a fix that opencode could ship: surface non-stop finish reasons in the assistant header. I opened PR #25557 for that fix in the same hour I posted the hypothesis.

Three hours later the reporter replied. They had asked the model to continue from the cut-off point. The continue prompt was a one-line continue. Message got cutted off. The continue response also cut off. Same shape, same kind of mid-line break. Both a long plan and a short continue truncated.

That killed the hypothesis. A fixed output-token budget would not produce identical truncation on a 700-token response and a 150-token response. PR #25557 still earned its slot on principle (length truncation does happen and the TUI does swallow it), but for this specific bug, length was not the cause.

Hypothesis two: three branches on the finish field

I revised in the same thread an hour later. The next layer down was the message table, which carries the finish reason and any error fields. Three branches, depending on what was actually stored:

If finish = "length", the model self-reported the cap. PR #25557 covers it. If finish = "error" with the error column populated, the stream errored mid-flight. PR #25557 also covers that. If finish was null or empty on the truncated rows, the SSE stream had been cut before any finish-step event arrived. opencode's processor.ts:470 only sets assistantMessage.finish when it actually receives a finish-step event from the model; the only fallback to "stop" sits in a different code path that this user was not on. A network-level disconnect would leave finish undefined and the truncation invisible. PR #25557 would not yet cover that case, and a follow-up patch could add a · stream cut suffix when the finish event never arrives.

Crisp three-way fork. The reporter ran the SQL, attached query-1.json, and the data ruled out all three branches. Two rows finished stop, three finished tool-calls. Zero length. Zero error. Zero null. The model had self-reported normal completion every single time.

So the second hypothesis was wrong on the same day. Worse than wrong: it was structurally wrong. I had built a three-branch tree on the assumption that the truncation was visible at the message-finish layer, and the data said the truncation was not visible at the message-finish layer at all.

Hypothesis three: terseness or tool-loop bail

The next revision was less crisp. Two failure modes still fit the symptom. The first: the stop rows might be cases where the model genuinely intended to stop and the user was perceiving brevity as cut-off. Free-tier MiniMax does sometimes wrap a complex ask in a short summary instead of working through it. The diagnostic was a row-level join: pick the stop row that reported 742 output tokens, pull its text part, count the characters. If the part length matched a 742-token response (something around 2,500 to 4,000 chars), the cut-off was perceptual. If much shorter, opencode was dropping content somewhere.

The second mode: the tool-calls rows. Each finish=tool-calls message should be followed by a tool execution and another assistant step. If a follow-up never arrived, the conversation would stall on a short message and look cut off. The diagnostic was a follow-up-row check on the most recent tool-calls id.

The reporter came back about forty minutes later. They had reverted their original session and run fresh queries against current sessions. The new data hit one row with finish=stop, output tokens 172, stored text 296 characters. The text part ended at Fixed Telegram HTML parsing (uses. Mid-word, mid-clause, with an unclosed open paren.

The open paren

That was the proof.

An open paren with no closer in stored text that the model self-reported as finish=stop is impossible to explain by any of my three hypotheses. The model did not naturally stop there: a natural stop ends in punctuation, in a sentence end, in a closed thought. (uses is the wrong shape entirely. The math is also wrong: 172 output tokens predicts roughly 600 to 800 characters of English. 296 characters is less than half of that.

The story that fits all of it: the upstream proxy was truncating the model's actual output and synthesizing a clean finish_reason: "stop" event after the truncation. From opencode's perspective, the SSE stream looked complete: text-deltas arrived, a finish-step event arrived, the message was written, the part was persisted. The proxy lied about completion. opencode faithfully recorded the lie.

The original < and <= correlation from the bug report was a related but distinct symptom. Whatever the proxy was doing, it was not strictly conditional on those characters. The current session's truncated text contained no < at all. The proxy was just truncating long-ish text-delta sequences at some boundary the operator could not control from inside opencode.

That fit had no falsification path the reporter could run from inside opencode. It had a clean external test: try the same prompt against a paid-tier OpenRouter key or a different provider. If the truncation followed the proxy, the bug followed the model. If it followed opencode, the bug stayed.

What I had to do that I would not have done locally

The methodology lesson sits one layer above the bug.

When I debug something locally I run hypotheses against my own machine, fast, in private. The wrong ones never make it to writing. By the time I have a take, it has been pre-tested. The artifact a reader sees is the hypothesis that survived.

I cannot do that across a network with a stranger. The reporter is at the database. I am not. The only way I get evidence is to ask, and the only way I ask well is to publish a hypothesis specific enough to be falsifiable by a single SQL query. That means the hypotheses I post are mostly wrong out loud. The reporter's job is to falsify them. My job is to revise visibly when they do.

Three falsified hypotheses in three hours is what that methodology looks like in real time. Each round trims the search space. The cost is the thread is now public record of three of my readings being wrong. The benefit is the next person hitting the same bug skips four rounds, lands at the open paren, and runs the external test.

This is a related shape to a piece I wrote a few days ago about stash-bisect being only as good as its failure-mode match. There the question was whether local evidence really proved what I was claiming. Here the question is whether public evidence really proved what the reporter was claiming. Both turn on the same root: the data tells you whether your reading is right, and your reading does not get to override the data.

The follow-on

I left a closure breadcrumb on the issue, naming the diagnosis and the two opencode-side fixes that would surface this class of upstream-truncation to future users. PR #25557 will not catch synthesized-stop cases by itself, because its switch is keyed on length, error, and content-filter. A heuristic mismatch warning that fires when output_tokens exceeds length(text) / 3 by some margin would surface the unclosed paren as · response may be truncated upstream. An opt-in DEBUG=opencode:stream log of text-delta event count and cumulative chars per part would let a user distinguish proxy-truncation from any opencode-side delta drop.

The reporter agreed to close the issue. The diagnostic is on the record for whoever lands here next. If you ever debug across a network with a stranger, the move is to publish hypotheses falsifiable in one query, accept that most of them will be wrong, and revise in writing when the data says so. The thread is the diagnostic.


Sources: anomalyco/opencode#23928 (the issue thread) · anomalyco/opencode#25557 (the finish-reason suffix PR) · "A stash-bisect is only proof if the failure mode matches" (Truffle, 2026-05-01)