Dscout: Why researchers should lead AI evaluations
Published in January 2026 on Dscout’s People Nerds blog, this article by Nathan Reiff — a Senior UX Researcher and Product Manager at Dscout — argues that the most pressing place for researchers to apply their skills right now is not in studying users of AI products, but in evaluating AI systems themselves. As teams race to ship AI features, Reiff contends that researchers who join evaluation workflows early can prevent significant product failures that automated metrics alone will not catch.
The argument
AI evaluation — often called “evals” in engineering contexts — typically involves three methods: human evaluations, LLM-as-a-judge evaluations, and code-based evaluations. Each measures whether an AI system is producing accurate, useful, or safe outputs. Most teams treat these as purely technical tasks. Reiff challenges that assumption.
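A minimal sketch of how the three methods differ in practice, assuming a generic text-generation feature; the function names and the `call_llm` stub are illustrative assumptions, not Dscout's tooling or anything defined in the article:

```python
def code_based_eval(output: str, expected_substring: str) -> bool:
    """Code-based eval: a deterministic check, here a simple substring match."""
    return expected_substring.lower() in output.lower()


JUDGE_PROMPT = (
    "You are grading an AI assistant's answer.\n"
    "Question: {question}\nAnswer: {answer}\n"
    "Is the answer accurate AND appropriate for a non-expert user? Reply PASS or FAIL."
)

def llm_as_judge_eval(question: str, answer: str, call_llm) -> bool:
    """LLM-as-a-judge eval: a second model applies criteria a researcher helped define.
    call_llm is a stand-in for whatever model client the team already uses."""
    verdict = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return verdict.strip().upper().startswith("PASS")


def human_eval_record(question: str, answer: str) -> dict:
    """Human eval: a person supplies the rating and, crucially, the notes on why."""
    return {"question": question, "answer": answer, "rating": None, "notes": ""}
```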
Human evaluations require someone to read or listen to AI outputs and assess their quality — a task that looks simple but demands exactly the skills researchers use daily: recognizing when an answer is technically accurate but contextually misleading, identifying edge cases that fall outside the training distribution, and formulating criteria that reflect what real users actually need from the system. Engineers can write code-based tests efficiently, but they are often poorly positioned to define what quality means from a user's perspective.
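As a hedged illustration of what such researcher-defined criteria might look like once written down, here is a hypothetical rubric for a support-assistant feature; the criteria and field names are invented for the example, not drawn from the article:

```python
# Hypothetical rubric a researcher might hand to human evaluators.
HUMAN_EVAL_RUBRIC = [
    {"criterion": "Contextual accuracy",
     "question": "Is the answer not just factually correct, but correct for this user's situation?"},
    {"criterion": "Edge-case handling",
     "question": "Does the output degrade gracefully on inputs unlike anything in the training data?"},
    {"criterion": "User-need fit",
     "question": "Could the person who asked this actually act on the answer?"},
]
```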
Bridging evaluation and UX research
The article proposes that researchers treat evals as a bridge rather than a separate discipline. A researcher embedded in an AI team can begin with human evaluations during early model development, gathering qualitative signal about which outputs feel off and why. That signal then informs the criteria used to build LLM-as-a-judge systems and eventually the structured datasets used in code-based evaluation. This progression turns the research function from a consumer of model outputs into a shaper of model behavior.
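A rough sketch of that progression, with invented examples: qualitative notes from early human review are distilled into judge criteria, which later seed a structured dataset for repeatable checks. None of these fields or examples come from the article; they only illustrate the hand-off Reiff describes.

```python
# 1. Qualitative signal from early human evaluations (hypothetical notes).
human_notes = [
    {"output_id": 17, "issue": "correct policy quote, but tone reads as dismissive"},
    {"output_id": 23, "issue": "invented a deadline the user never mentioned"},
]

# 2. Criteria distilled from those notes, reusable as LLM-as-a-judge instructions.
judge_criteria = [
    "Does the answer stay within facts the user actually provided?",
    "Would a frustrated, non-expert user find the tone respectful?",
]

# 3. Structured dataset for repeatable, code-based evaluation runs.
eval_dataset = [
    {"input": "Can I still get a refund after 30 days?",
     "must_contain": "30-day window",
     "must_not_contain": "guaranteed"},
]
```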
Reiff is direct about why this matters beyond individual product quality: teams that treat AI features as engineering deliverables rather than user experiences are producing systems that function correctly by internal metrics but frustrate or mislead the people using them. A researcher acting as what Reiff calls a “benevolent dictator” in early evaluations can redirect that trajectory before it becomes expensive to change.
Who this is useful for
The article is most valuable for UX researchers who feel marginalized in AI development cycles — brought in after decisions are made, asked to test products rather than shape them. Reiff offers a concrete entry point: request to participate in evaluation workflows, even informally, and demonstrate how research criteria differ from engineering criteria. For research managers, the article makes a business case for embedding researchers in AI teams from the beginning of model development rather than at the point of user testing.
Product teams building AI-driven features will also find this useful as a framing for where researcher involvement creates the most leverage — not in the final usability test, but in the evaluations that determine what the AI does before any user touches it.