
How to conduct heuristic evaluation: a practical guide with AI prompts

What is heuristic evaluation?

Heuristic evaluation is an expert-led inspection method where 3–5 trained evaluators independently walk through an interface and judge it against a fixed set of usability principles, most often Jakob Nielsen’s 10 heuristics. Each evaluator records every observed violation, tags it to a specific heuristic, and rates its severity; the team then merges the individual lists into a single prioritized backlog of issues. Heuristic evaluation is the cheapest and fastest way to surface systemic usability problems in a product before money is spent on engineering or formal user testing — a single full pass with three evaluators on a focused flow takes one to three days and typically catches around 60% of the issues that would otherwise show up later in usability testing.

What question does it answer?

  • Where does this interface violate well-known usability principles, and which violations are the most severe?
  • Which screens or flows have the most concentrated usability problems, and which are the safest to ship as-is?
  • What are the obvious issues we should fix before running an expensive usability test, so that test sessions focus on real questions instead of catching obvious bugs?
  • Which design choices contradict industry conventions in ways that will cost us learnability for new users?
  • Which patterns of usability debt repeat across the product (poor feedback, missing error prevention, inconsistent terminology) and deserve a systemic fix rather than a one-off patch?
  • For an existing live product, where are the highest-impact places to invest design effort given a limited engineering budget?

When to use heuristic evaluation

  • Early in the design cycle, when wireframes or mockups exist but the product has not yet been built — heuristic evaluation catches obvious problems while changes are still cheap to make.
  • Before any usability test, to clear out the obvious violations so that test participants spend their time on the questions only real users can answer.
  • When auditing a live product whose UX has degraded over time and the team needs a structured way to prioritize debt without recruiting users.
  • When entering a project as an external consultant or agency and you need a fast, defensible read on the state of the product within a few days.
  • When the budget or timeline rules out usability testing entirely and the team still needs an evidence-based list of usability problems to act on.
  • When training junior designers or PMs to develop UX instincts — running heuristic evaluations on real products is one of the fastest ways to internalize usability principles.

Not the right method when the question is about the deeper why behind user behavior, motivation, or unmet needs — heuristic evaluation only finds violations of known principles, not new categories of problem the heuristics do not cover. It also misses issues that depend on the specific mental models of a real user group (specialized terminology, domain workflows, accessibility needs the heuristics do not surface). Heuristic evaluation is a complement to usability testing, not a replacement; teams that ship a product on heuristic evaluation alone consistently miss the failures that only show up when real users try to complete real tasks. Finally, it is not an audit of accessibility or content quality — those need their own dedicated checks (WCAG audit, content review).

What you get (deliverables)

  • Heuristic evaluation workbook: a structured document per evaluator with every observation tagged to the violated heuristic, the screen or flow where it appeared, and a recommended fix.
  • Consolidated issue list: the merged output of all evaluators, with duplicates removed, contradictions discussed, and each issue assigned a final severity rating.
  • Severity-rated backlog: every issue scored on a standard scale (typically cosmetic / minor / major / critical, or Nielsen’s 0–4 scale) so that the team can prioritize fixes against engineering capacity.
  • Cluster map: issues grouped by theme (navigation, feedback, terminology, errors, forms) so that systemic problems become visible instead of looking like a flat list of one-off bugs.
  • Annotated screenshots: each major issue illustrated with the relevant screen and a callout pointing at the violation, so engineers and PMs can see what the evaluator saw.
  • Action plan: the top 5–15 issues prioritized by severity, frequency, and business impact, with concrete recommended fixes and a rough effort estimate.
  • Readout deck or brief: 5–10 page document or short presentation walking stakeholders through the method, the headline issues, and the recommended next steps.

Participants and duration

  • Participants: none directly — heuristic evaluation is an expert inspection. The “participants” are 3–5 trained evaluators who walk through the interface independently.
  • Evaluators: 1 evaluator catches around 35% of usability problems, 3 evaluators catch around 60%, 5 evaluators catch around 75% (Nielsen 1994). Three is the most common sweet spot for product teams.
  • Setup time: 0.5–1 day to define scope, brief the evaluators on heuristics, prepare screenshots or test access, and choose the logging tool.
  • Independent evaluation: 2–4 hours per evaluator for a focused flow, more for a complex product. Each evaluator works alone first, without seeing other evaluators’ findings.
  • Consolidation: half a day to a full day to merge findings, discuss disagreements, calibrate severity, and cluster issues into themes.
  • Synthesis and writing: 0.5–1 day to write the brief, build annotated screenshots, and prioritize the action plan.
  • Total wall-clock time: 1–3 days end to end for a typical product flow with three evaluators.

How to conduct heuristic evaluation

1. Define the scope and the user group

Decide exactly what part of the product you will evaluate before you brief any evaluators. Pick 1–3 critical user flows that map to high-value tasks (onboarding, checkout, the core job-to-be-done) rather than trying to cover the entire product. State explicitly which screens, states, devices, and user types are in scope and which are out. A narrow scope produces specific, actionable findings; a wide scope produces a vague, opinion-heavy report. Write down the user type the evaluators should adopt — a novice trying the product for the first time and a power user running a routine task will spot very different issues.

2. Choose the heuristic set and brief the evaluators

For most product teams, Jakob Nielsen’s 10 usability heuristics are the right starting point because they are well-documented, broadly applicable, and instantly recognizable to anyone who has worked in UX. For specialized domains add a domain-specific set on top — Bastien & Scapin for ergonomic depth, Gerhardt-Powals for cognitive engineering, the WCAG quick reference for accessibility. Brief each evaluator on the chosen set before they start. If the team has never run a heuristic evaluation together, run a 30-minute practice round on a third-party product (a weather app, a small SaaS landing page) so that everyone calibrates on what counts as a violation.

3. Decide how to log findings

Choose one shared template and require every evaluator to use it. The standard format is a row per observation with columns for screen or flow, what breaks, the heuristic violated, severity, and a recommended fix. Avoid loose tools like Figma comments or sticky notes — they make consolidation a nightmare. A shared spreadsheet (Google Sheets, Airtable) works for most projects; for visual evaluations, NN/g’s free heuristic evaluation workbook PDF or a Miro board with one workspace per evaluator both work well. Whatever the tool, every evaluator must log their findings privately first; sharing during the independent pass biases the whole evaluation.
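
If you prefer to hand evaluators a file rather than a blank sheet, here is a minimal sketch in Python that writes the one-row-per-observation format to a CSV any spreadsheet tool can import (the field names and the example issue are illustrative, not a required schema):

```python
import csv
from dataclasses import dataclass, asdict

@dataclass
class Observation:
    evaluator: str        # who logged it; kept private until consolidation
    screen: str           # screen or flow where the issue appears
    element: str          # exact element or interaction
    what_goes_wrong: str  # observable behavior, not opinion
    heuristic: str        # primary heuristic violated
    severity: str         # cosmetic | minor | major | critical
    recommended_fix: str

rows = [
    Observation(
        evaluator="evaluator-1",
        screen="Billing > Invoices",
        element="Submit button",
        what_goes_wrong="Stays grey with no error message when the email field is empty",
        heuristic="#9 Help users recognize, diagnose, and recover from errors",
        severity="major",
        recommended_fix="Show an inline error under the email field",
    ),
]

# One row per observation; import the file into Google Sheets or Airtable.
with open("heuristic_eval_evaluator-1.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(asdict(rows[0])))
    writer.writeheader()
    writer.writerows(asdict(r) for r in rows)
```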

4. Walk through the interface twice

Each evaluator walks through every flow at least twice. The first pass is for orientation — get familiar with the product, learn the basic vocabulary, complete each task once without trying to find issues. The second pass is for evaluation — go back through the same flows slowly, this time deliberately checking each screen against the heuristics. Most evaluators find more problems on the second pass than the first because the orientation pass takes the cognitive load of “what does this product even do” off the table. Plan 2–4 hours per evaluator for this step.

5. Log every observation as a structured issue

For each problem, write down five things: where the issue appears (screen, flow, exact element), what specifically goes wrong (the observable behavior, not the opinion), which heuristic it violates (one primary, optionally a secondary), why it matters (the impact on task completion or user trust), and one concrete recommended fix. Quote the exact text or describe the exact interaction rather than paraphrasing. Take a screenshot. The discipline of writing in this format is what separates a real heuristic evaluation from a “this feels off” design roast — every issue should be reproducible by someone who was not in the room.

6. Rate severity against a fixed scale

After the independent evaluation but before consolidation, each evaluator scores every issue on the same severity scale. The most common is a 4-point scale: cosmetic (visual polish, no impact on task), minor (noticeable friction but task succeeds), major (significant friction, likely to cause errors or hesitation), critical (blocks task completion or causes irreversible mistakes). Nielsen’s original 0–4 scale (no problem, cosmetic, minor, major, catastrophe) works equally well. The decision rule is simple: if the issue blocks someone from doing their job, it is critical; if it just annoys, it is minor or cosmetic. Consistent severity is what makes the consolidated backlog actionable.
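
One way to keep the scale explicit is to write the decision rule down where every evaluator can see it. A minimal sketch of the 4-point rubric as code (the flag names are assumptions; adapt them to your own definitions):

```python
from enum import IntEnum

class Severity(IntEnum):
    COSMETIC = 1   # visual polish, no impact on the task
    MINOR = 2      # noticeable friction, but the task succeeds
    MAJOR = 3      # significant friction, likely to cause errors or hesitation
    CRITICAL = 4   # blocks task completion or causes irreversible mistakes

def rate(blocks_task: bool, likely_to_cause_errors: bool, noticeable_friction: bool) -> Severity:
    """Encodes the decision rule from the text; edge cases still need evaluator judgment."""
    if blocks_task:
        return Severity.CRITICAL
    if likely_to_cause_errors:
        return Severity.MAJOR
    if noticeable_friction:
        return Severity.MINOR
    return Severity.COSMETIC
```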

7. Consolidate and merge the lists

Bring all evaluators into one room (or one call) with their independent lists. Walk through every issue together, group duplicates, discuss disagreements, and recalibrate severity where evaluators rated the same issue differently. The discussion is where the value compounds — one evaluator caught a navigation issue, another caught an error prevention issue on the same screen, and the discussion reveals that they are facets of the same underlying problem. Capture the merged list in one place and tag each issue with the heuristic, severity, screen, and a recommended fix. If a consensus on severity cannot be reached, escalate one notch; the cost of fixing a non-issue is lower than the cost of shipping a real critical bug.
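
Pre-grouping the raw lists before the meeting saves time without replacing the discussion. A small sketch, assuming each observation has been read back from the shared sheet as a dict with the columns from step 3:

```python
from collections import defaultdict

def candidate_duplicates(observations):
    """Bucket observations from all evaluators by (screen, heuristic) so the
    consolidation meeting starts from likely-duplicate groups rather than raw
    lists. Whether a bucket is truly one issue remains a human call."""
    buckets = defaultdict(list)
    for obs in observations:
        key = (obs["screen"].strip().lower(), obs["heuristic"].strip().lower())
        buckets[key].append(obs)
    return {key: group for key, group in buckets.items() if len(group) > 1}
```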

8. Cluster issues into themes

Group the consolidated issues into 3–7 thematic clusters: navigation and information architecture, feedback and system status, error prevention and recovery, terminology and content, forms and inputs, visual design and hierarchy, accessibility. Clustering turns 30–60 individual issues into a handful of headline patterns that stakeholders can hold in their heads and act on. A team that walks out of a readout meeting remembering “we have a feedback problem and a navigation problem” will ship better fixes than a team handed a 47-row spreadsheet.
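
Once the consolidated issues carry a theme tag, the headline clusters fall out of a simple count. A sketch, assuming a 'theme' field assigned during consolidation:

```python
from collections import Counter

def headline_patterns(issues, top_n=3):
    """Count consolidated issues per theme and return the clusters that should
    lead the readout, largest first."""
    return Counter(issue["theme"] for issue in issues).most_common(top_n)
```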

9. Prioritize and write the action plan

Score each cluster (or each issue, if the list is short) on three dimensions: severity (how badly it hurts the user), frequency (how often users hit it), and business impact (how much it affects activation, retention, conversion, or support cost). Sum the scores or use a simple high/medium/low matrix. Pick the top 5–15 items to push into the next sprint. End with concrete recommended fixes and a rough effort estimate; a brief that ships only diagnoses without recommendations is much harder for product teams to act on.
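
A minimal sketch of the scoring step, assuming each issue carries a 1–3 rating (low/medium/high) on each of the three dimensions; the equal weighting is an assumption to adjust for your context:

```python
def priority_score(issue):
    """Combined score across severity, frequency, and business impact."""
    return issue["severity"] + issue["frequency"] + issue["business_impact"]

def action_plan(issues, top_n=10):
    """Return the top N issues for the next sprint, highest combined score first."""
    return sorted(issues, key=priority_score, reverse=True)[:top_n]
```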

10. Write the brief and present to stakeholders

Produce a 5–10 page brief or a short deck that opens with the scope and method, summarizes the headline findings on the first page, walks through each major theme with annotated screenshots and quotes from evaluators, and ends with the prioritized action plan. Present to the product, design, and engineering leads in person rather than emailing the document; the discussion in the readout is where stakeholders calibrate severity, agree on priorities, and commit to fixes. Keep the full backlog and the consolidated spreadsheet attached as a reference for the team that will actually implement the changes.

How AI changes heuristic evaluation

AI compatibility: partial — Modern multimodal LLMs can read screenshots and run a credible first pass against Nielsen’s 10 heuristics, but recent peer-reviewed studies (Zhong, McDonald, Hsieh 2025; Vasiu et al. 2026) show that AI evaluators catch a different and partially overlapping set of issues compared to human experts, with notable blind spots on context-dependent problems. The realistic split is that AI handles the routine surface checks (loading states, button labels, form errors, color contrast, copy clarity) while humans handle the judgment calls about scope, business context, severity, and the issues that depend on the user’s actual mental model.

What AI can do

  • Run a first-pass review of screenshots: Multimodal models like Claude, GPT-4o, and Gemini can take a set of screenshots, walk through each, and produce a structured list of candidate violations against Nielsen’s 10 heuristics in minutes. This is usually about 50–70% as thorough as a human expert on routine flows and a useful starting point for the human evaluator. A minimal call sketch follows this list.
  • Generate the heuristic checklist for a specific flow: Given a flow description and a target user, an LLM can produce a tailored set of yes/no checklist questions for each of the 10 heuristics, customized to the screens being reviewed. This compresses the prep work for a human team from a few hours to a few minutes.
  • Tag and consolidate findings across evaluators: When multiple evaluators submit free-text observations, an LLM can read the raw lists, deduplicate similar issues, suggest cluster names, and propose a merged backlog with severity ratings — work that previously took half a day of consolidation meetings.
  • Score severity consistently across evaluators: Calibration drift between human evaluators is a chronic problem. An LLM applying a fixed severity rubric to every issue produces consistent, defensible scores that human evaluators can then override case by case rather than rating from scratch.
  • Cross-reference findings against accessibility (WCAG) at the same time: A model can apply Nielsen’s heuristics and WCAG 2.2 criteria in the same pass, flagging issues that violate both and producing a combined risk score. Tools like Heurilens, Stark AI, and the Figma heuristic evaluation plugins automate this cross-reference.
  • Draft the readout brief: Given a consolidated issue list, an LLM can produce a first-draft brief organized by theme with headline findings, annotated examples, and recommended fixes. The human researcher then rewrites it for tone and sharpens the prioritization.
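
A minimal sketch of the screenshot first pass from the first bullet above, using the Anthropic Python SDK (the model name, prompt wording, and output format are assumptions; any multimodal model with vision input can run the same loop):

```python
import base64
import anthropic  # pip install anthropic

FIRST_PASS_PROMPT = """You are a senior UX researcher running a heuristic evaluation.
Walk through the attached screenshot against Nielsen's 10 usability heuristics.
For each candidate violation, return one JSON object per line with: element,
what_goes_wrong, heuristic (1-10), severity (cosmetic/minor/major/critical),
recommended_fix. Flag only observable problems; do not guess at states you cannot see."""

def first_pass(screenshot_path: str, model: str = "claude-sonnet-4-20250514") -> str:
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode()
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    message = client.messages.create(
        model=model,
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
                {"type": "text", "text": FIRST_PASS_PROMPT},
            ],
        }],
    )
    return message.content[0].text  # candidate issues only; a human evaluator verifies every one
```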

What requires a human researcher

  • Defining scope and the user perspective: AI will gladly run an evaluation against any flow you point at, but choosing which flows matter most, which user type to adopt, and what to leave out is product judgment that depends on knowing the business. Get this wrong and the AI’s findings are accurate but useless.
  • Catching context-dependent violations: AI systematically misses problems where the violation depends on the user’s mental model, the surrounding workflow, or domain conventions the model has never seen. Specialized B2B tools, regulated industries, and culturally specific flows are the hardest blind spots.
  • Calibrating severity against business reality: A model can score severity against a generic rubric, but knowing that “billing-page hesitation” is a critical churn signal in this specific company while “settings-page inconsistency” is cosmetic requires knowing the product’s revenue model and customer base.
  • Distinguishing real violations from intentional design tradeoffs: Hamburger menus violate “recognition rather than recall” but are the right call on mobile. AI evaluators consistently flag these as real issues; a human knows when to defend the tradeoff. Without that filter, the AI report becomes noise.
  • The consolidation discussion: The value of multiple evaluators comes from the meeting where they reconcile their lists, not from the lists themselves. AI can dedupe and cluster, but the conversation where one evaluator’s “navigation issue” turns out to be another evaluator’s “error prevention issue” is where the headline findings emerge.

AI-enhanced workflow

Before AI, a heuristic evaluation on a focused flow with three evaluators took 2–3 days end to end: half a day of prep, four hours per evaluator working independently, half a day to consolidate and rate severity, half a day to write the brief. The bottleneck was the independent evaluation pass — three smart designers each walking through the same screens and writing down the same routine issues.

With AI in the workflow, the same project compresses to one to two days. The lead researcher spends an hour framing the scope and the user type, then feeds screenshots of each screen to a multimodal model with a custom prompt that walks through Nielsen’s 10 heuristics. The model returns a structured first-pass list of candidate issues per screen in minutes. Two human evaluators then take that first pass as a starting point, walking through the same flows themselves to confirm, expand, override, and add the context-dependent issues the model missed. The evaluators spend their time on judgment, not on repeated mechanical checks. Consolidation runs faster because the issues already share a structure and a vocabulary; severity calibration runs faster because the model has pre-scored everything against the same rubric. The brief gets a first draft from the model and a final pass from the human researcher.
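
The consolidation half of that workflow can reuse the same model with a text-only prompt. A sketch of the kind of prompt that works, with the rubric wording and the placeholders as assumptions to adapt:

```python
CONSOLIDATION_PROMPT = """You will receive heuristic-evaluation observations from
{n_evaluators} evaluators, one JSON object per line, each with: evaluator, screen,
element, what_goes_wrong, heuristic, severity, recommended_fix.

1. Merge observations that describe the same underlying problem; record which
   evaluators reported each merged issue.
2. Re-score severity for every merged issue against this rubric: cosmetic = no
   impact on the task; minor = friction but the task succeeds; major = likely
   errors or hesitation; critical = blocks the task or causes irreversible mistakes.
3. Propose 3-7 thematic clusters and assign each issue to one.
4. Return the merged list as JSON ordered by severity, and flag any issue where
   evaluators disagreed on severity so the team can discuss it.

Observations:
{observations}
"""
```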

The catch is the same as for AI-assisted desk research: the time saved depends on a real human verification pass on the AI’s output. Studies that measured AI-only heuristic evaluations against human-only evaluations found false positives (the model flagged issues that were not real), false negatives (the model missed issues that mattered), and a tendency to over-flag visual minutiae while under-weighting context-dependent problems. The researchers that get the most value from AI here treat it as a thorough but naive junior evaluator: useful for the first pass, never trusted as the final answer, always paired with a human who knows the product.

Tools

Heuristic evaluation workbooks and templates: NN/g’s free heuristic evaluation workbook PDF, Maze’s heuristic evaluation template, the Eleken UX heuristic evaluation form, the Figma heuristic evaluation plugin, AIPrm’s CSIR prompt templates.

Spreadsheet and document logging: Google Sheets, Airtable, Notion databases — the standard format for logging issues with one row per observation and consistent columns.

Visual workspace and consolidation: Miro, Mural, FigJam — useful for the consolidation phase where team members cluster sticky notes by theme; create one workspace per evaluator before the consolidation meeting.

AI-assisted heuristic evaluation: Claude, GPT-4o, Gemini for screenshot-based first-pass evaluation; Heurilens for automated evaluation against Nielsen’s heuristics; specialized GPTs trained on Nielsen’s heuristics; the AI components of Lookback, Marvin, and Dovetail for synthesizing multi-evaluator output.

Accessibility cross-checks: WAVE, axe DevTools, Stark for Figma, Lighthouse, and the Microsoft Accessibility Insights extension — pair these with the heuristic evaluation when accessibility is in scope.

Annotated screenshot tools: CleanShot X, Snagit, Loom (for video walkthroughs), Markup.io, Shottr — used to attach the visual evidence each issue needs in the readout.

Works well with

  • Usability Testing Moderated (Ut): Heuristic evaluation is the canonical preflight for moderated usability testing. Run heuristic evaluation first to clean up the obvious issues, then run usability testing to focus on the questions only real users can answer. The Nielsen Norman Group has documented this pairing for thirty years.
  • Cognitive Walkthrough (Cw): Both methods are expert inspections, but they ask different questions — heuristic evaluation checks the design against principles, cognitive walkthrough checks whether a first-time user can figure out the next step at each screen. Running them together gives a fuller picture for early-stage designs.
  • Accessibility Testing (At): Heuristic evaluation and accessibility testing both inspect the interface for compliance with rules; pairing them in the same pass catches issues that violate both Nielsen’s heuristics and WCAG, and produces a single combined backlog instead of two separate ones.
  • Content Analysis: When a heuristic evaluation surfaces a “Content and terminology” cluster, follow up with content analysis on real user feedback (support tickets, reviews) to confirm whether the terminology issues the evaluators flagged actually trip real users.
  • Survey (Sv): A short post-task survey (e.g., SUS or SUPR-Q) on real users complements the heuristic evaluation by quantifying the perceived severity of the issues from the user side. The two together produce a much more credible business case than either alone.

Example from practice

A B2B SaaS company shipped a new pricing-and-billing dashboard for their enterprise customers and started seeing a 30% spike in support tickets about “I can’t find my invoice” within two weeks of launch. The product manager wanted to understand the cause before authorizing a redesign and had a one-week budget. Running a usability test would have taken three weeks once recruitment was factored in, and the support team needed an answer faster than that.

The lead researcher ran a heuristic evaluation in three days. She defined the scope as “enterprise admins trying to find and download their last three months of invoices,” picked Nielsen’s 10 heuristics plus a quick WCAG 2.2 pass, and recruited two designers from adjacent teams as the second and third evaluators. Each evaluator walked through the flow twice (a one-hour orientation pass followed by a two-hour evaluation pass) and logged findings in a shared Google Sheet using a strict format: screen, element, heuristic, what goes wrong, severity, recommended fix. She also fed screenshots into Claude with a custom prompt that ran a first pass against the same heuristics, then used the model output as a fourth “junior evaluator” — accepting some findings, overriding others where Claude misread context.

Consolidation took half a day and surfaced 38 issues across six themes. The headline finding was unambiguous: the redesign had moved invoice download from a primary action on the billing landing page to a three-click sub-flow inside the “Account history” tab, which violated heuristics #6 (recognition rather than recall) and #7 (flexibility and efficiency of use) and explained nearly all the support tickets. The top recommendation was to surface a “Download last invoice” button on the billing landing page; the engineering effort was estimated at half a sprint. The fix shipped two weeks later, support tickets dropped back to baseline within a week of release, and the heuristic evaluation cost roughly 18 hours of researcher time across the team — versus the 60+ hours a comparable usability test would have required.

Beginner mistakes

Skipping the independent pass

The biggest single failure mode is letting evaluators see each other’s findings before the independent walkthrough is done. The whole statistical case for using 3–5 evaluators is that each one catches a different subset of problems; if they synchronize early, you collapse three perspectives into one and the coverage drops to roughly what one evaluator would have caught alone. Always run the independent pass in private notebooks or separate sheets, and only share lists at the consolidation meeting.

Logging opinions instead of observations

Notes like “this feels confusing” or “the layout is messy” are useless in a consolidated backlog because no one can act on them. Every issue must include the specific element, the observable behavior, the heuristic it violates, and a concrete recommended fix. The discipline of writing “the Submit button stays grey with no error message when the email field is empty (#9 — error recovery)” instead of “the form is buggy” is what separates a heuristic evaluation that drives fixes from one that gets ignored.

Treating every heuristic violation as a real problem

Heuristics are guidelines, not laws. Hamburger menus violate “recognition rather than recall” but are the right call for most mobile designs. Confirmation modals violate “user control and freedom” but are essential for irreversible actions. A heuristic evaluator who flags every textbook violation without checking whether the tradeoff is intentional produces a noisy report that loses credibility with the design team. Always check the design intent before logging a violation as a real issue.

Using a vague or inconsistent severity scale

Severity is the lever that turns 38 issues into a list of 5 things to fix this sprint. If two evaluators are working with different mental models of “minor” vs. “major,” the merged backlog is meaningless. Pick one explicit scale (the 4-point cosmetic/minor/major/critical scale or Nielsen’s 0–4 scale), define each level with a one-sentence test (“does it block task completion?”), and recalibrate during the consolidation meeting whenever evaluators disagree.

Skipping the prioritization step

A heuristic evaluation that ends with a 47-row spreadsheet and no action plan rarely drives any fix at all. The team gets the document, scrolls through it once, and goes back to the existing roadmap. Always end the project with an explicit top 5–15 prioritized list scored on severity, frequency, and business impact, and present it in person rather than emailing the spreadsheet. The conversation in the readout is where the fixes get committed to.

AI prompts for this method

4 ready-to-use AI prompts with placeholders — copy-paste and fill in with your context. See all prompts for heuristic evaluation →.