Product Growth: AI evals masterclass — the most important new skill for product managers in 2026

What the video covers

Published on February 19, 2026, on Aakash Gupta’s Product Growth channel, this masterclass features Ankit Shukla, the founder of HelloPM. The topic is AI evaluations — how product managers should design, implement, and scale the systems that test whether AI features actually perform as intended before and after launch.

The premise is that most AI feature failures trace back to a single gap: teams ship AI without a reliable method for measuring output quality. Traditional QA and A/B testing are not sufficient for non-deterministic systems, where the same input can produce different outputs. Evaluations fill this gap by defining what “good” looks like and testing against those definitions systematically.

Who it’s for

The video is aimed at product managers building or owning AI features, regardless of engineering background. The content is practical rather than theoretical: each concept comes with a step-by-step implementation approach. It is particularly useful for PMs who need to establish evaluation processes from scratch, or who have been asked to define quality standards for an AI system without prior experience in the area.

Key takeaways

Evaluations fall into three categories that serve different purposes. Offline evals run before launch against curated test datasets; online evals monitor production traffic in real time; and human evals provide periodic spot-checks on output quality. Each type covers blind spots the others miss, and relying on only one category is not sufficient for AI features with meaningful consequences for users.
A useful evaluation rubric starts with scenarios, not metrics. The process begins by identifying specific user scenarios, then writing 4 to 6 scoring categories for each, with reference examples that illustrate what scores of 1, 3, and 5 look like in practice. Inter-rater reliability testing — having two evaluators score the same outputs independently and comparing results — verifies that the rubric produces consistent results across reviewers.
The right metric depends on the task type. Retrieval systems such as search and recommendations require precision and recall measurements. Open-ended text generation is better assessed with semantic similarity metrics like BERTScore. Highly specific tasks may need custom metrics tied to concrete outcomes rather than generic quality proxies.
LLM judges can automate evaluation at scale once calibrated. An LLM judge uses a language model to score other model outputs. Calibration means comparing the judge’s scores against a human-annotated baseline to confirm alignment, then running periodic tests to detect drift. An uncalibrated LLM judge creates an illusion of evaluation rigour without the substance.
Production monitoring requires three distinct tracking layers. System metrics cover latency and error rates. Quality metrics track automated evaluation scores on live outputs. Business metrics capture task completion rates and user satisfaction. Automatic alerts and human review queues for flagged outputs complete the monitoring loop — without all three layers, problems in one dimension can go undetected while the others look healthy.
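
Two of the measurements named in the takeaways can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method shown in the video: the function names, document ids, and scores are all hypothetical, and Cohen's kappa is one common agreement statistic chosen here for the example (the summary does not name a specific one). The same agreement check applies both to two human raters scoring a rubric and to an LLM judge compared against a human-annotated baseline.

```python
from collections import Counter


def precision_recall(retrieved, relevant):
    """Precision and recall for a single retrieval query (collections of ids)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def cohen_kappa(scores_a, scores_b):
    """Unweighted Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(scores_a)
    # Fraction of items on which the two raters gave the same score.
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Agreement expected if each rater scored at random with their own label mix.
    freq_a, freq_b = Counter(scores_a), Counter(scores_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical retrieval result for one query.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc2", "doc4", "doc7"]
precision, recall = precision_recall(retrieved, relevant)

# Hypothetical rubric scores (1, 3, or 5) from a human and an LLM judge.
human_scores = [5, 3, 1, 3, 5, 1, 3, 5]
judge_scores = [5, 3, 1, 5, 5, 1, 3, 3]
kappa = cohen_kappa(human_scores, judge_scores)

print(f"precision={precision:.2f} recall={recall:.2f} kappa={kappa:.2f}")
```

Unweighted kappa treats the scores as unordered labels, so a 3 versus 5 disagreement counts the same as a 1 versus 5 disagreement. For ordinal rubric scores, a weighted variant may be a better fit. What kappa value counts as acceptable agreement is a team decision and is not specified in the summary.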

Worth watching if…

You are preparing to launch an AI feature and have no formal evaluation process in place, or your team is struggling to define what “good output” means for your specific use case. It is also useful when preparing a business case for evaluation infrastructure, as the frameworks described are concrete enough to translate directly into resourcing conversations.