
How to run error analysis in UX research: a practical guide with AI prompts

What is error analysis?

Error analysis is a focused inspection method that takes the moments where users get something wrong — wrong clicks, failed form submissions, abandoned flows, support tickets, rage clicks — and turns them into a structured, prioritized backlog of usability issues. Instead of asking the broad question “did users succeed,” error analysis asks “where did users fail, why did they fail, and which failures cost the business the most.” A single error analysis pass on a focused flow takes two to four days, produces a categorized list of every distinct failure mode with frequency and severity, and tells the team which issues to fix first when engineering capacity is limited.

What question does it answer?

  • Where in this flow are users making mistakes, and how often does each mistake happen?
  • Are the failures we see caused by the design, by the user’s mental model, by content, or by a system bug?
  • Which failures block task completion entirely, and which are just friction the user can recover from?
  • Which two or three failure modes account for the majority of support tickets, and what would it cost to fix them?
  • After we shipped the redesign, did the failure rate on the main task actually go down, and did new failure modes appear?
  • Which errors are concentrated on specific user segments, devices, or browsers, and which hit everyone equally?

When to use error analysis

  • After a moderated or unmoderated usability test, when you have video, notes, or self-reported issues that need to be coded into a structured failure list before you can prioritize fixes.
  • When support tickets, in-app feedback, or rage-click logs suggest a pattern of failure on a specific flow but no one has yet quantified which failures matter most.
  • After a launch or a redesign, when the team needs to know whether the new version reduced or shifted the failure rate before committing to the next round of work.
  • When analytics show a sharp drop-off at a specific step but the numbers alone do not explain why users are leaving — error analysis turns the “what” into a “why.”
  • When an LLM-powered feature is in production and the team needs to understand its specific failure modes (hallucinations, retrieval errors, off-topic answers) before defining evaluation metrics.
  • When an executive asks “is this product getting better,” and the team needs a defensible, repeatable measurement that can be tracked across releases.

Not the right method when there is no data yet — error analysis depends on observable failures from real sessions, logs, or support channels, and it cannot generate insights from a blank slate. It is also the wrong call for early discovery questions about user motivation or unmet needs; those need interviews or diary studies. Error analysis tells you where the current product breaks, not what to build next. Finally, do not use it as a substitute for a usability test on a brand-new product — without baseline tasks and observation, there is no error stream to analyze in the first place.

What you get (deliverables)

  • Failure mode taxonomy: a small set of named, mutually exclusive categories (six to ten is typical) that every observed error can be slotted into, with a one-line definition for each category.
  • Coded error log: a row per observed failure tagged with category, severity, the screen or step where it occurred, the user segment, and a verbatim quote or screenshot when available.
  • Frequency table: how often each failure mode appears, broken down by task and by user segment, so the team can see which failures are common and which are edge cases.
  • Severity matrix: every distinct failure scored on impact (does it block the task, is it recoverable, is it cosmetic) so that the team can prioritize against engineering capacity.
  • Root-cause notes: for each high-priority failure mode, a short paragraph explaining the most likely cause — design, content, mental model, technical bug — based on the evidence.
  • Recommended fixes: concrete change proposals tied to each high-priority failure mode, with a rough effort estimate so the team can plan the next sprint.
  • Readout brief: a five to ten page document or short deck with the headline failures, the categorized backlog, the action plan, and the method so that future analyses can be compared against this baseline.

Participants and data

  • Participants: none recruited directly. Error analysis is a secondary analysis on data already collected from another source — usability test sessions, session recordings, support tickets, error logs, in-app feedback, or self-reported survey items.
  • Data volume: for moderated usability testing data, a focused flow with five to twelve participants generates enough errors for a meaningful pass. For log or ticket data, fifty to two hundred observations per flow is the practical floor; more is better for rare failure modes.
  • Setup time: 0.5–1 day to define scope, gather the source data, and choose the coding tool.
  • Open coding pass: 0.5–1 day to read every observation, write a free-text label for each failure, and resist the urge to force the labels into categories too early.
  • Taxonomy building: 0.5 day to cluster the free-text labels into a small set of named categories and write the one-line definitions.
  • Structured coding pass: 0.5–1 day to re-read every observation and assign the final taxonomy label, severity, and segment tag.
  • Synthesis and writing: 0.5–1 day to build the frequency table, score severity, write the root-cause notes, and produce the brief.
  • Total wall-clock time: 2–4 days end to end for a focused flow with fifty to a few hundred observations.

How to conduct error analysis (step-by-step)

1. Define the scope and the data source

Before touching any data, write down exactly which flow, which task, which time window, and which user segment you are analyzing. A scope like “checkout failures on mobile web for new users in the last 30 days” produces actionable findings; a scope like “errors in the product” produces a vague report no one can act on. Then pick the data source that matches the scope: video and notes from a moderated usability test, screen recordings from an unmoderated study, support tickets from the help desk, error logs from the front end, in-app feedback widgets, or some combination. Different sources catch different failures, so be explicit about which ones you are using and which ones you are not.

2. Gather a representative sample

Pull a sample of observations large enough to surface the common failure modes and a few rare ones. For moderated data, this is everything you collected — five to twelve sessions. For logs or tickets, aim for fifty to two hundred observations on the flow being analyzed; for high-volume products, sample by segment rather than at random so that smaller user groups are represented. When sampling, deliberately include some sessions where users succeeded as well as ones where they failed — the contrast is what makes the failure patterns visible. Save the sample to one place (a folder, an annotation queue, a spreadsheet) so that every coding pass works from the same set.
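
If the sample lives in a spreadsheet or a DataFrame, the stratified pull can be scripted. A minimal sketch in pandas, assuming one row per session and illustrative `segment` and `outcome` column names (both are placeholders to adapt):

```python
import pandas as pd

def sample_observations(df: pd.DataFrame, per_segment: int = 40,
                        success_share: float = 0.2,
                        seed: int = 7) -> pd.DataFrame:
    """Cap rows per segment and deliberately keep some successful sessions."""
    def take(group: pd.DataFrame) -> pd.DataFrame:
        n_success = int(per_segment * success_share)
        failures = group[group["outcome"] == "failed"]
        successes = group[group["outcome"] == "completed"]
        return pd.concat([
            failures.sample(min(len(failures), per_segment - n_success),
                            random_state=seed),
            successes.sample(min(len(successes), n_success),
                             random_state=seed),
        ])
    return (df.groupby("segment", group_keys=False)
              .apply(take)
              .reset_index(drop=True))

# Usage, assuming observations.csv has one row per session with
# columns like session_id, segment, outcome ("failed" / "completed"):
# sample = sample_observations(pd.read_csv("observations.csv"))
```

Capping per segment rather than sampling the whole flow at random is what keeps smaller user groups visible in the coded log.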

3. Open coding: label what you see, not what you think

Walk through every observation in your sample and write a free-text comment describing the first place where something went wrong. Stay descriptive: “user clicked the wrong product card and did not notice for 30 seconds” or “form submission failed with no error message visible.” Do not yet try to slot the failure into a category — that comes later. Open coding is the step where you let the data speak; forcing categories too early will make you miss the failure modes you did not expect. If a session contains multiple errors, focus on the first failure that mattered, since downstream errors are usually consequences of the upstream one.

4. Cluster the labels into a taxonomy

After you have free-text labels for every observation, lay them out and group similar ones together. The goal is a small set — six to ten categories is the sweet spot — of mutually exclusive failure modes with a one-line definition for each. Common categories include slips (user intended the right action but misclicked), mistakes (user followed the wrong mental model), confusions (interface ambiguity led to hesitation or backtracking), system errors (the product itself broke), content failures (terminology or copy was unclear), and abandonments (user gave up without an explicit error). For LLM products, the taxonomy looks different — hallucinations, retrieval failures, off-topic responses, formatting issues, missing follow-up. Do not invent categories that the data does not support; if you only saw one example of “form validation,” it lives under “system errors,” not its own row.
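
One lightweight way to keep the taxonomy stable through the next pass is to store it as data alongside its one-line definitions. The categories below simply restate the examples above and should be replaced with whatever your own clustering produces:

```python
# Illustrative taxonomy; the names and definitions are examples, not a standard.
TAXONOMY: dict[str, str] = {
    "slip": "User intended the right action but misclicked or mistyped.",
    "mistake": "User followed the wrong mental model for the task.",
    "confusion": "Interface ambiguity caused hesitation or backtracking.",
    "system_error": "The product itself broke (exception, failed request).",
    "content_failure": "Terminology or copy was unclear or misleading.",
    "abandonment": "User gave up without an explicit error event.",
}

def validate_label(label: str) -> str:
    """Fail fast if a coding pass emits a label outside the taxonomy."""
    if label not in TAXONOMY:
        raise ValueError(f"Unknown failure mode: {label!r}")
    return label
```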

5. Structured coding: re-label everything against the taxonomy

Now go back through every observation and assign the canonical taxonomy label, the severity, and the segment tag. This is the pass that turns raw notes into a quantifiable dataset. For each observation, capture five fields: where (the screen or step), what (the observable behavior), category (from your taxonomy), severity (cosmetic / minor / major / critical or a 0–4 scale), and segment (user type, device, browser, time of day). If you find observations that do not fit any taxonomy category, you have two options: extend the taxonomy if the misfit is a real pattern, or add an “other” bucket and review it at the end. Resist the urge to expand the taxonomy beyond ten categories — anything more becomes too granular to be useful.
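
A minimal schema for the coded log, mirroring the five fields above; the field names and the 0–4 severity scale are illustrative rather than a standard:

```python
from dataclasses import dataclass

@dataclass
class CodedObservation:
    """One row of the coded error log (field names are illustrative)."""
    where: str      # screen or step, e.g. "checkout/address"
    what: str       # observable behavior, verbatim where possible
    category: str   # one label from the taxonomy, or "other"
    severity: int   # 0 = cosmetic ... 4 = critical
    segment: str    # user type, device, browser, time of day

obs = CodedObservation(
    where="checkout/address",
    what="tapped autofilled field, modal opened, hit back, lost cart",
    category="system_error",
    severity=4,
    segment="mobile / new user",
)
```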

6. Build the frequency table and the severity matrix

Tally how often each failure mode appears, broken down by task, by segment, and by severity. Count the number of distinct users who hit each failure, not just the number of occurrences — one user hitting the same bug ten times tells a different story from ten users each hitting it once. Build a simple matrix with categories on one axis and severity on the other; the cells where critical failures cluster are where the team should focus first. Where possible, weight each failure by its likely business impact: a failure on the checkout page that blocks revenue is worth more than a failure on the settings page that no one sees.
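
Both artifacts fall out of the coded log in a few lines of pandas. A sketch, assuming hypothetical user_id, category, and severity columns; counting distinct users per cell is the detail worth preserving:

```python
import pandas as pd

# One row per observed failure, with illustrative columns:
# user_id, category, severity (0-4), segment.
log = pd.read_csv("coded_log.csv")

# Frequency: distinct users per failure mode, not raw occurrences.
frequency = (log.groupby("category")["user_id"]
                .nunique()
                .sort_values(ascending=False))

# Severity matrix: categories on one axis, severity on the other,
# each cell counting distinct affected users.
severity_matrix = pd.pivot_table(
    log, index="category", columns="severity",
    values="user_id", aggfunc="nunique", fill_value=0,
)

print(frequency)
print(severity_matrix)
```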

7. Diagnose root causes for the top failures

For the top three to seven failure modes, write a short paragraph explaining why this failure happens. Distinguish between four causes: a design problem (the layout, labeling, or affordances misled the user), a mental-model mismatch (the user expected a different model from the one the product offered), a content failure (the wording or terminology was unclear), or a technical bug (the product itself broke). Each of these has a different fix and a different team to assign it to, so the diagnosis matters for action. Use direct evidence — quote the verbatim observation, link to the screenshot or session recording — rather than interpretation, so that the design and engineering leads can verify the call for themselves.

8. Prioritize and recommend fixes

Score each high-priority failure on three dimensions: severity (how badly it hurts the user when it happens), frequency (how often a real user hits it), and business impact (effect on activation, retention, conversion, support cost). Sum the three scores to get a priority rank, then pick the top five to ten failures to push into the next sprint. For each one, draft a concrete recommended fix and a rough effort estimate (small, medium, large). End with a small set of failures that should be tracked but deferred — usually low-severity issues that would be cheap to fix as a side effect of a larger change.
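
The scoring itself can be as simple as three 0–4 ratings summed into a rank. A sketch; the scales, the equal weighting, and the example failures are illustrative defaults to tune against your own business context:

```python
from dataclasses import dataclass

@dataclass
class FailureScore:
    name: str
    severity: int         # 0 = cosmetic ... 4 = blocks the task
    frequency: int        # 0 = single user ... 4 = most users hit it
    business_impact: int  # 0 = negligible ... 4 = blocks revenue

    @property
    def priority(self) -> int:
        # Equal weighting; weight the terms if one dimension matters more.
        return self.severity + self.frequency + self.business_impact

failures = [
    FailureScore("address modal loses cart context", 4, 3, 4),
    FailureScore("unclear promo-code field label", 2, 2, 1),
]
for f in sorted(failures, key=lambda f: f.priority, reverse=True):
    print(f"{f.priority:2d}  {f.name}")
```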

9. Write the brief and present to stakeholders

Produce a five to ten page brief or a short deck. Open with the scope, the data source, and the headline finding on the first page. Walk through each major failure mode with the frequency, severity, root cause, evidence, and recommended fix. End with the prioritized action plan and a one-page method appendix so the next analysis can replicate the approach. Present in person to the product, design, and engineering leads rather than emailing the document; the conversation in the readout is where stakeholders calibrate severity and commit to fixes. Keep the full coded log attached as a reference for the team that implements the changes.

How AI changes error analysis

AI compatibility: partial — The bulk of an error analysis is mechanical pattern matching across noisy text data, and modern LLMs are very good at exactly that work. The catch is that the most valuable judgment calls — defining the right scope, distinguishing a real failure mode from a one-off, calibrating severity against business impact, and explaining the root cause for each top failure — still need a human who knows the product. AI compresses the open coding and clustering passes from days into hours, which frees the researcher to spend more time on diagnosis and prioritization.

What AI can do

  • Cluster free-text observations into a taxonomy: Given a few hundred raw error notes from an unmoderated test or a support ticket export, an LLM can group similar observations, propose six to ten coherent failure categories, and write the one-line definitions in a few minutes — the same work that would take a researcher half a day of card sorting (a minimal sketch of this clustering call follows this list).
  • Run the structured coding pass at scale: Once a taxonomy exists, an LLM can re-label every observation against it with high consistency and flag edge cases for human review. This is the step that historically scaled poorly with sample size; AI removes the bottleneck and lets the analysis cover hundreds or thousands of observations instead of dozens.
  • Score severity against a fixed rubric: A model applying a one-paragraph severity rubric to every failure produces consistent scores that the human researcher can spot-check rather than rate from scratch. Calibration drift between human coders is a chronic problem in this method; AI fixes it for the routine cases.
  • Mine support tickets and session-recording transcripts for failure patterns: Tools like Dovetail, Marvin, Notably, and Gong’s research features can ingest large volumes of unstructured feedback, surface recurring themes, and link each theme to verbatim quotes — turning a haystack of free-text complaints into a structured failure list.
  • Analyze LLM-app failure modes: For products that include their own LLM features, evaluation platforms (Langfuse, Humanloop, Braintrust, Promptfoo) automate the open-coding workflow on production traces, cluster the failures into named modes, and let the team track failure rates across model and prompt changes.
  • Draft the readout brief: Given a coded log and a frequency table, an LLM can produce a first-draft brief organized by failure mode with headline findings, frequency tables, and recommended fixes. The human researcher then rewrites it for tone, sharpens the prioritization, and removes the inevitable false certainty.
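
For the clustering step in the first bullet, a minimal sketch using the Anthropic Python SDK; the prompt wording, model id, and output format are placeholders to adapt, and the draft taxonomy it returns still needs the human review described below:

```python
import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the environment

CLUSTERING_PROMPT = """You are coding usability errors. Below are free-text
labels, one per line, each describing the first failure in a session.
Group them into 6-10 mutually exclusive failure modes. For each mode, give
a short name, a one-line definition, and the label numbers it covers.
Place every label; use an "other" bucket only as a last resort.

Labels:
{labels}"""

def draft_taxonomy(labels: list[str],
                   model: str = "claude-sonnet-4-20250514") -> str:
    """Return a draft failure-mode taxonomy for human review."""
    client = anthropic.Anthropic()
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(labels))
    message = client.messages.create(
        model=model,  # substitute whichever model id you have access to
        max_tokens=2000,
        messages=[{"role": "user",
                   "content": CLUSTERING_PROMPT.format(labels=numbered)}],
    )
    return message.content[0].text
```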

What requires a human researcher

  • Defining scope and choosing the data source: AI will cheerfully cluster anything you give it, but deciding which flow matters, which time window to pull, and which sources to combine is product judgment that depends on knowing the business context. Get this wrong and the analysis is technically correct but useless.
  • Diagnosing root causes from limited evidence: Distinguishing a design problem from a mental-model mismatch from a content failure from a bug requires knowing how the product works, who the user is, and what the team has already tried. Models will guess at causes from text alone and produce plausible-but-wrong diagnoses.
  • Calibrating severity against business reality: A one-paragraph severity rubric cannot capture that a hesitation on the billing page is a critical churn signal in this specific company while a hesitation on the settings page is cosmetic. Severity needs to be re-anchored to business impact by someone who knows both.
  • Spotting failure modes the model has not seen: AI clustering gravitates toward common categories and quietly drops the rare edge cases that often matter most — the failure that hits one percent of users on a high-revenue flow, or the new failure mode introduced by the last release. A human researcher needs to read the outliers manually before trusting the taxonomy.
  • The conversation in the readout: Stakeholders commit to fixes in the meeting, not in the document. AI can produce a credible draft, but the live discussion where product, design, and engineering negotiate priorities and trade-offs is where the report actually drives change.

AI-enhanced workflow

Before AI, an error analysis on a hundred observations took three to five days of analyst time: a day to read every observation and write open-coded notes, half a day to build a taxonomy on sticky notes, a day to re-code everything against the taxonomy, half a day to build the frequency table, and the rest to write the brief. The bottleneck was the two coding passes — slow, repetitive, and prone to drift as the analyst’s attention waned around observation eighty.

With AI in the workflow, the same project compresses to one to two days. The researcher spends an hour framing scope and choosing the data source, then feeds the raw observations into an LLM with a prompt that produces a first-draft taxonomy in minutes. The researcher reviews the taxonomy, edits it where the model missed context, and adds the failure modes the model collapsed away. The structured coding pass then runs against the cleaned taxonomy at machine speed, with the researcher spot-checking ten to twenty percent of the labels and overriding where needed. The frequency table and severity matrix are generated automatically. The researcher’s time goes into the steps that matter most: scope, root-cause diagnosis, and the readout conversation.
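
A sketch of that spot-check, assuming each AI-coded row is a dict carrying hypothetical ai_label and human_label keys; a low agreement rate on the audit sample means the taxonomy or the coding prompt needs work before the remaining labels can be trusted:

```python
import random

def audit_sample(coded_rows: list[dict], share: float = 0.15,
                 seed: int = 7) -> list[dict]:
    """Draw the 10-20% sample of AI-assigned labels for human review."""
    rng = random.Random(seed)
    k = max(1, int(len(coded_rows) * share))
    return rng.sample(coded_rows, k)

def agreement_rate(audited: list[dict]) -> float:
    """Share of AI labels the human reviewer confirmed."""
    agree = sum(row["ai_label"] == row["human_label"] for row in audited)
    return agree / len(audited)
```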

The catch is the same as for AI-assisted desk research and heuristic evaluation: the speed-up is real only when a human verifies the AI’s output instead of trusting it. Studies of LLM-driven qualitative coding show that models cluster the easy patterns reliably but quietly drop the rare and ambiguous ones, and that severity ratings drift toward the middle of the rubric unless a human anchors them. The researchers who get the most value from AI here treat it as a fast junior coder — useful for the bulk pass, never trusted as the final answer, always paired with a human who reads the outliers and owns the diagnosis.

Example from practice

A mid-sized e-commerce company saw a 12% drop in mobile checkout completion after a redesign that consolidated the address, payment, and review screens into a single long page. Analytics showed the drop clearly but did not explain what was happening between the start of the page and the failed submissions. The product manager had two weeks to recover the lost conversion before the holiday season.

The lead researcher pulled three data streams over a 14-day window: 180 session recordings filtered to mobile users who started checkout but did not complete, 240 support tickets tagged with checkout-related keywords, and the front-end error log from Sentry for the same period. She loaded the recordings into Dovetail, the tickets into a Google Sheet, and the Sentry exceptions into a separate tab. She ran an open-coding pass on a sample of 60 recordings and 60 tickets, writing free-text labels for the first observable failure in each. She then fed all 120 free-text labels into Claude with a clustering prompt and received a draft taxonomy of seven failure modes — three of which she kept verbatim, three she edited for clarity, and one she split into two because it conflated a content failure with a real bug. She then re-coded all 420 observations (180 recordings + 240 tickets) against the cleaned taxonomy in two days.

The frequency table revealed that 41% of all failures came from a single mode: users were tapping the autofilled address field, expecting to edit it inline, and instead getting a modal address picker that lost their cart context when they hit “back.” The Sentry log confirmed a JavaScript error firing on the modal close handler. The recommended fix was to remove the modal entirely and let the address autofill be edited inline. Engineering shipped the change in three days, the failure mode disappeared from the next sample, and the checkout completion rate recovered 9 points in the first week — about three-quarters of the original loss. The analysis took the researcher four days of work, less than half of what a follow-up moderated usability test would have required.

AI prompts for this method

4 ready-to-use AI prompts with placeholders — copy-paste and fill in with your context.