Project update
What only the pixels knew.
At 05:53 on Friday morning, a session on Easel got asked a simple question: "What's that image?" The agent answered honestly. It located both images on the board by coordinate, described where each sat, and then said the quiet part: "I can only see their file references, not the pixels themselves." Three hours later, at 08:21, a different session on a different board caught a title that was visually clipped, widened the text box so the full line showed, and left a sticky note describing what it had seen. Same agent. Same model. The difference was a screenshot.
Easel is a shared canvas where an agent works the board live: stickies, text, frames, generated images, all in one JSON document the browser and the agent mutate through the same versioned API. Until Friday morning the agent's entire knowledge of a board was that document. Element types, positions, sizes, z-order, text content. A coordinate model. And a coordinate model is a furniture inventory, not a room. It tells you a text element exists at x:120 with width 260. It cannot tell you whether the glyphs fit.
The fact that lived nowhere in the document
The proof session ran on the demo board. The prompt asked the agent to judge the board with its eyes and fix anything it could see. It took a screenshot, and the screenshot showed the board title rendering as "Midnight Bakery —" with the rest of the line cut off by its own box. Nothing in the document was wrong. The element existed, the width was a positive number, the text was intact in the JSON. Whether that text survives the trip through font metrics, line wrapping, and CSS overflow is a fact that exists only at render time, only in pixels. The agent widened the box, took another look to confirm the full line showed, and wrote an observation sticky. Forty-six seconds, thirty cents.
That is the whole argument for vision in one bug. Overlap, misalignment, crowding, clipping, a generated image that came back too dark to read against: these are render-time facts. An agent that arranges a visual surface from coordinates alone is doing interior design from a spreadsheet.
How the eyes work
The mechanics are deliberately boring. The site exposes a read-only render route that mirrors a board as plain HTML, no JavaScript, same CSS as the live canvas. The bridge that runs the agent session mints a token for that route per session: an HMAC of the board id, keyed on the bridge secret, truncated to 32 hex characters. The token is board-scoped and read-only, so the subprocess doing the looking never holds anything that can write, and never holds the master bearer at all. No token gets a 403. A wrong token gets a 403. The minted token gets the board.
The agent's screenshot_board tool drives a Playwright browser running as a sibling container, navigates to the tokenized render route, screenshots the stage as a JPEG, and passes the image block straight through to the model. The budget is five shots per session, which turns out to be plenty: the working rhythm that emerged is look, move, look again. Think with the document, judge with the pixels.
Why a real browser and not a cheaper picture
The tempting shortcut is to skip the browser: rasterize the board server-side from the JSON, or just describe the layout to the model in words. Both are the same mistake. They are a second renderer, and a second renderer drifts from the first. The clipped title existed precisely because of how the real CSS wrapped real glyphs at a real width; a homemade rasterizer would have to reproduce that wrapping bug-for-bug to be worth anything. The browser is the only honest witness to what the user sees, so the browser is what the agent looks through. The render route exists to make that look cheap, stable, and safe to authorize.
There is a quieter benefit too. Because the screenshot is of the same surface the user has open, the agent and the user are arguing about the same picture. When it leaves a sticky saying the title was clipped, you can scroll up and see exactly the clipping it means. The evidence is shared.
The lesson, stated once
An agent that operates a visual surface needs two channels, not one. The document model is for mutation: precise, versioned, diffable. The pixels are for judgment: the only place where render-time truth lives. Easel had the first channel from day one and shipped useful sessions with it. But the 05:53 session, politely confessing it could not see, was the product telling me what it was missing. The 08:21 session was the answer.
The board where the agent caught the clipped title is public: open it and the green observation sticky is still there, in the agent's own words. The substrate that runs all of this, including the bridge that mints the tokens and owns the subprocess, is open at github.com/ghostwright/phantom.