How I evaluate AI features in my CMS

A three-gate check for every AI feature in Squilla: specificity, an eval set, and a real week with friendly users. Here is how each gate works, what passed, what died, and the small Go binary I use to keep the bar honest.

Every AI feature in my CMS goes through a three-question gate before it ships. Not after. Not during the rollout call. Before. If a feature cannot answer those three questions cleanly, it does not get built, or it gets built smaller, or it gets quietly killed in a branch and nobody notices.

This sounds like process theater. It is not. I learned it the hard way, by shipping AI features that demoed well, looked clever in a screenshot, and then sat in the admin panel for six months collecting zero clicks. The gate is what stopped that pattern. I want to walk through it, because I think the same three gates apply to almost any AI feature in a content workflow, not just mine.

Gate one: specificity

The first question I ask is the rudest. Does this feature do one thing well, or is it a chat-with-your-CMS soup?

The reason it is rude is that the chat-with-your-CMS pitch is irresistible. You imagine a little drawer in the corner of the admin. The editor types what they want. The CMS understands. Posts get written, images get tagged, SEO gets fixed, the menu reorganizes itself. It is a beautiful story and it is also, in my experience, a way to ship something nobody uses.

Generic chat is generic. People do not know what to ask for. They open the drawer, type something vague, get a response that is somewhere between useful and confusing, and then close the drawer. The feature has no shape, so it cannot be good at anything, so it ends up being mediocre at everything. Editors are not going to learn the prompting tricks needed to coax a useful answer out of it. They just want the thing to work.

So gate one rejects anything that is a generic chat. The replacement is a feature that does exactly one verb. Suggest alt text. Rewrite this heading shorter. Generate three excerpt candidates. Find duplicate slugs. One verb, one input, one output. If I cannot describe the feature in five words or fewer without using the word "assistant", it fails gate one.

Five words sounds arbitrary. It is not. Five words is the length of a button label. If you cannot put it on a button, your editors will not find it.
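As a rough sketch, the five-word rule can be written down as a check in Go. The function name and the sample labels are made up for illustration, and the real gate is a judgment call rather than a string test, so treat this as a way to state the rule precisely, not as code from the CMS.

```go
package main

import (
	"fmt"
	"strings"
)

// maxLabelWords is the five-word limit from gate one.
const maxLabelWords = 5

// passesSpecificity is a hypothetical helper: it reports whether a feature
// description fits in five words or fewer and avoids the word "assistant".
func passesSpecificity(label string) bool {
	words := strings.Fields(label)
	if len(words) == 0 || len(words) > maxLabelWords {
		return false
	}
	for _, w := range words {
		if strings.Contains(strings.ToLower(w), "assistant") {
			return false
		}
	}
	return true
}

func main() {
	// Illustrative labels only; the first two come from the examples above.
	labels := []string{
		"Suggest alt text",
		"Rewrite this heading shorter",
		"Your AI assistant for content",
		"Chat with your CMS about anything at all",
	}
	for _, label := range labels {
		fmt.Printf("%-45q %v\n", label, passesSpecificity(label))
	}
}
```

A label that passes this check can still be a bad feature. The check only catches descriptions that are too long or too vague to fit on a button.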

Gate two: the eval set

The second gate is the one most teams skip, and the one that produces almost all the leverage. Before I write the production code, I write 20 realistic test inputs. The feature has to pass at least 16 of them.

The word "realistic" is doing the heavy lifting here. Realistic does not mean random. It means I take 20 actual cases from the actual content in the actual CMS. Real images for alt text. Real headings for the rewriter. Real excerpts that need shortening. I label what a good output looks like for each one, in my own words, before the model ever sees the input. That labeling step is where most of the design work happens, because writing down what "good" looks like for 20 cases forces me to be specific in a way I cannot wave away.
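One way to hold yourself to the label-first rule is to make the rubric a required field on every case. The sketch below assumes a simple three-field shape for an eval case. The type, the field names, and the sample data are all hypothetical, and the actual file layout of the eval set is not shown in this post.

```go
package main

import (
	"fmt"
	"strings"
)

// EvalCase is one labeled example. The field names are assumptions.
type EvalCase struct {
	ID     string // stable identifier for the case
	Input  string // real content taken from the CMS
	Rubric string // what a good output looks like, written before any model run
}

// unlabeled returns the IDs of cases that do not have a rubric yet.
func unlabeled(cases []EvalCase) []string {
	var missing []string
	for _, c := range cases {
		if strings.TrimSpace(c.Rubric) == "" {
			missing = append(missing, c.ID)
		}
	}
	return missing
}

func main() {
	// Made-up sample data for illustration.
	cases := []EvalCase{
		{ID: "heading-01", Input: "A Very Long Heading That Wraps Onto Three Lines", Rubric: "Shorter, keeps the main topic."},
		{ID: "heading-02", Input: "Another heading", Rubric: ""},
	}
	if ids := unlabeled(cases); len(ids) > 0 {
		fmt.Println("write a rubric first for:", ids)
	}
}
```

Refusing to run the feature while any case is unlabeled keeps the definition of "good" from drifting toward whatever the model happens to produce.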

Then I run the feature against the set and grade the outputs. Not automatically. By eye. AI eval automation is a topic I respect, but for a feature with 20 inputs and a clear quality bar, I am faster and more accurate than any judge model. I score each output pass or fail against my own pre-written rubric.

The bar is 16 out of 20. Not 20 out of 20, because that bar is fake and it pushes you to game the eval set. Not 10 out of 20, because that is a coin flip and your users will hate you. 16 out of 20 means the feature is right four times out of five, which is roughly where editors stop second-guessing it and start trusting it.
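The arithmetic behind the bar is small enough to sketch. The function below counts hand-assigned grades and compares the total to a minimum. The names are illustrative, and the grades in main are invented to show an 18 out of 20 run.

```go
package main

import "fmt"

// clearsBar counts passing grades and reports whether the count
// reaches minPass. Each grade is a manual pass (true) or fail (false).
func clearsBar(grades []bool, minPass int) (passed int, ok bool) {
	for _, g := range grades {
		if g {
			passed++
		}
	}
	return passed, passed >= minPass
}

func main() {
	// Invented grades: 18 passes followed by 2 fails.
	grades := make([]bool, 20)
	for i := 0; i < 18; i++ {
		grades[i] = true
	}
	passed, ok := clearsBar(grades, 16)
	fmt.Printf("%d/%d passed, bar cleared: %v\n", passed, len(grades), ok)
	// prints "18/20 passed, bar cleared: true"
}
```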

The tooling for this is a small Go binary I keep in the repo. About 300 lines. It loads the eval set from a YAML file, calls the feature endpoint for each input, writes the outputs to disk, shows each output next to the one from the previous run, and lets me mark pass or fail with a keystroke. At the end it prints a diff against the last run. Regressions show up immediately. New passes show up immediately. I can see, in one screen, whether the change I just made helped or hurt.

It is not fancy. It is not a framework. It is the smallest thing that lets me answer the question "did this get better or worse" with evidence instead of vibes.
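The run-to-run diff is the part most worth copying, so here is a minimal sketch of the idea. It is not the actual binary. It assumes each run is stored as a map from case ID to a manual grade, and the type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Run maps a case ID to its hand-assigned grade (true means pass).
type Run map[string]bool

// diffRuns returns the case IDs that regressed (pass to fail) and the
// ones that newly pass (fail to pass). Cases absent from prev are skipped.
func diffRuns(prev, curr Run) (regressions, newPasses []string) {
	for id, now := range curr {
		before, seen := prev[id]
		if !seen {
			continue
		}
		switch {
		case before && !now:
			regressions = append(regressions, id)
		case !before && now:
			newPasses = append(newPasses, id)
		}
	}
	// Map iteration order is random in Go, so sort for stable output.
	sort.Strings(regressions)
	sort.Strings(newPasses)
	return regressions, newPasses
}

func main() {
	// Invented grades for two runs of the same three cases.
	prev := Run{"alt-01": true, "alt-02": false, "alt-03": true}
	curr := Run{"alt-01": true, "alt-02": true, "alt-03": false}
	regressions, newPasses := diffRuns(prev, curr)
	fmt.Println("regressions:", regressions)
	fmt.Println("new passes:", newPasses)
}
```

Keying runs by a stable case ID rather than by position means the diff stays meaningful when cases are added or reordered.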

Gate three: the field

The third gate is the one that kills the most features, and that is exactly why it exists. After a feature passes the eval set, I ship it to one or two friendly users for a week. Real users, real content, real workflow. No special treatment. If they stop using it inside the week, the feature did not earn its keep, and I pull it.

The reason this gate is brutal is that an eval set tells you the feature works. The field tells you the feature is wanted. Those are two completely different questions. A feature can pass the first and fail the second, and I cannot predict in advance which features will do that.

One that lived, one that died

An AI alt-text suggestion feature passed all three gates and stayed. Specificity: easy. One verb, suggest alt text. Eval set: 18 out of 20 once I tightened the prompt to forbid the phrases "image of" and "picture showing". Field: my two friendly editors used it on every single image upload for the whole week and complained on day eight when I broke it during a refactor. That complaint is the strongest possible signal that a feature is real. People only complain about things they were relying on.
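The fix described above was made in the prompt. A post-hoc check for the same two phrases is a separate, hypothetical addition that could help while grading, since it flags outputs where the model ignored the instruction. The function name and the sample strings below are invented.

```go
package main

import (
	"fmt"
	"strings"
)

// forbidden lists the two phrases the prompt was tightened to exclude.
var forbidden = []string{"image of", "picture showing"}

// forbiddenPhrases returns every forbidden phrase found in altText,
// ignoring case. An empty result means the output is clean.
func forbiddenPhrases(altText string) []string {
	lower := strings.ToLower(altText)
	var hits []string
	for _, phrase := range forbidden {
		if strings.Contains(lower, phrase) {
			hits = append(hits, phrase)
		}
	}
	return hits
}

func main() {
	// Invented model outputs for illustration.
	outputs := []string{
		"Image of a red bicycle leaning on a brick wall",
		"Red bicycle leaning on a brick wall",
	}
	for _, out := range outputs {
		fmt.Printf("%q -> %v\n", out, forbiddenPhrases(out))
	}
}
```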

A summarize-my-last-10-posts feature died on gate three. It passed gate one, technically, because the verb was clear. It passed gate two with a solid 17 out of 20. The summaries were good. They were accurate. They were well written. And nobody ever clicked the button. Not once, in the whole week. When I asked why, the answer was the most honest possible feedback. Nobody had ever needed to know what their last 10 posts said. The feature solved a problem that did not exist. I deleted it the same afternoon.

Set the bar

If I had to compress everything I have learned about shipping AI into one sentence, it would be this: most failed AI features failed because nobody set the bar before shipping.

The eval set is the bar for quality. The field test is the bar for relevance. The specificity gate is the bar for shape. None of them are about the model. All of them are about whether the feature deserves a place in the workflow.

Set the bar. Then ship the things that clear it.