How to run a moderated usability test: a practical guide with AI prompts
An e-commerce company redesigned its mobile checkout flow to reduce cart abandonment, which was running at 72%. The product team was confident the new design was an improvement — it consolidated the previous 5-step checkout into a single scrolling page. Before launching, the UX research team ran moderated usability tests with 6 smartphone users, asking each to add a specific item to cart, apply a promo code, and complete the purchase.
Five of the six participants completed the purchase, but the sessions revealed problems that completion rates alone would not have caught. Three participants missed the promo code field entirely — it was positioned below the fold and participants scrolled past it looking for the “Place Order” button. When prompted to find the promo code field, they scrolled up and down multiple times before locating it, and one participant said “I assumed the promo code was on a different screen.” Two participants hesitated at the payment step because the total did not update in real time after entering the promo code — they were unsure whether the discount had been applied and considered abandoning the purchase.
The team made two changes based on the test findings: they moved the promo code field above the order summary so it appeared before the total, and they added an inline confirmation that showed the discount amount and updated total immediately after the code was applied. A follow-up unmoderated test with 50 users showed promo code usage increased from 12% to 31%, and the checkout completion rate rose from 28% to 41%. The moderated test took one week; the insight it produced would not have surfaced from analytics or unmoderated testing because the root cause — a scrolling position problem combined with missing visual feedback — required observing real users and asking “what were you expecting to see?”
That is what moderated usability testing produces: not just whether users can complete a task, but why they succeed, struggle, or fail — and what to change about the design to fix it.
What moderated usability testing actually is
Moderated usability testing is a research method in which a facilitator observes participants as they attempt specific tasks on a product or prototype, asking follow-up questions in real time to understand why they succeed, struggle, or fail. Unlike unmoderated testing, the facilitator’s presence enables probing beneath surface behavior — catching the moment a user hesitates, misinterprets a label, or recovers from an error — and turning those observations into actionable findings about interface design, information architecture, and interaction patterns.
What questions it answers
Moderated usability testing addresses questions about whether users can accomplish their goals with the product:
- Can users complete the key tasks this product is designed to support, and where exactly do they get stuck or fail?
- Why do users make the errors they make — what are they expecting to see, and how does the interface violate those expectations?
- Which parts of the interface are confusing, ambiguous, or invisible to users, even if the design team considers them obvious?
- How do users recover when they go down the wrong path — can they find their way back, or do they get lost?
- What do users think the interface does at each step, and how does that mental model differ from the designer’s intent?
- How does this design compare to the user’s current solution in terms of effort, speed, and satisfaction?
When to use
- When a prototype or working product exists and the team needs to know whether users can actually complete key tasks before launching or shipping to a wider audience.
- When the team suspects specific usability problems exist but needs to observe real users to pinpoint where and why the breakdowns happen, rather than speculating from analytics or heatmaps.
- When the design involves complex workflows, multi-step forms, or unfamiliar interaction patterns where the reasoning behind user errors matters as much as the errors themselves.
- When the team is early in the design process and wants to test a low-fidelity or mid-fidelity prototype to catch major problems before investing in high-fidelity design and development.
- When accessibility, safety, or compliance requirements demand that the interface be tested with representative users under observation — not just validated through automated checks.
- When quantitative usability data from unmoderated tests has identified problem areas (low completion rates, high time-on-task) but the team needs to understand the root cause before redesigning.
Not the right method when the team needs large-sample quantitative data (completion rates, time-on-task across 100+ users) — unmoderated testing is faster and cheaper for that. Also not appropriate when there is no prototype or product to interact with — if the question is “do users want this?” rather than “can users use this?”, concept testing is the better method. Moderated usability testing requires a skilled facilitator, dedicated time per participant, and higher incentives than unmoderated tests, so it should be reserved for situations where the qualitative depth justifies the investment.
What you get (deliverables)
- Usability findings report: a prioritized list of problems, each with severity rating (critical, moderate, low), frequency (how many participants encountered it), description of what happened, and a screenshot or video clip illustrating the issue.
- Task completion data: binary success/failure for each task, plus partial completion tracking (completed with errors, completed with help, abandoned).
- Time-on-task measurements: how long each participant took for each task, flagging tasks where time significantly exceeds expectations.
- Post-task and post-session ratings: Single Ease Question (SEQ) scores per task and System Usability Scale (SUS) or SUPR-Q scores for overall impression (the SUS scoring arithmetic is sketched just after this list).
- Highlight reel: a short video compilation (5-10 minutes) of the most illustrative usability failures and successes, for stakeholder presentations.
- Design recommendations: specific, actionable changes tied to each finding — not just “fix the navigation” but “move the ‘Save draft’ button from the overflow menu to the primary toolbar because 4 of 5 participants did not find it.”
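For the ratings deliverable above, here is a minimal sketch of the standard SUS scoring arithmetic (ten items, responses on the usual 1-5 scale; the sample responses are hypothetical):

```python
def sus_score(responses):
    """Compute a System Usability Scale score (0-100) from ten 1-5 responses.

    Standard SUS scoring: odd-numbered items contribute (response - 1),
    even-numbered items contribute (5 - response); the sum is multiplied by 2.5.
    """
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses on a 1-5 scale")
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # items 1, 3, 5, ... sit at even indexes
        for i, r in enumerate(responses)
    ]
    return sum(contributions) * 2.5

# One participant's raw responses to the ten SUS items (hypothetical data)
print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # -> 85.0
```

SEQ, by contrast, needs no transformation: report the mean of the 1-7 ratings per task.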
Participants and duration
Participants: 5-8 per round of testing. NNGroup’s research shows that 5 users uncover approximately 85% of usability problems for a given set of tasks and user profile. Test with 5, fix the critical issues, then test again with 5 more to verify the fixes and catch remaining problems. If testing with multiple user segments (e.g., novice vs. expert), recruit 5 per segment.
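That figure comes from the Nielsen-Landauer problem-discovery model, in which the expected share of problems found by n participants is 1 - (1 - λ)^n, with λ the chance that a single participant hits a given problem (about 0.31 on average in their data). A minimal sketch of the arithmetic, with 0.31 as the assumed default:

```python
def problems_found(n_participants, discovery_rate=0.31):
    """Expected share of usability problems uncovered by n participants,
    using the Nielsen-Landauer model: 1 - (1 - lambda)^n."""
    return 1 - (1 - discovery_rate) ** n_participants

for n in (1, 3, 5, 8):
    print(n, round(problems_found(n), 2))
# 1 0.31, 3 0.67, 5 0.84, 8 0.95
```

The curve flattens quickly, which is the arithmetic behind testing in two rounds of five rather than one round of ten.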
Session length: 45-60 minutes per participant. Allow 10 minutes for introduction and rapport, 25-35 minutes for tasks, and 10 minutes for post-session questions and wrap-up.
Facilitator preparation: 1-2 days for writing the test plan, creating tasks, and preparing the prototype. Add half a day for a pilot test with 1-2 internal participants.
Analysis and reporting: 2-3 days for reviewing recordings, coding findings, rating severity, and writing the report.
Total timeline: 1-2 weeks (preparation: 2-3 days; recruitment: 2-3 days in parallel; testing: 2-3 days for 5-8 sessions; analysis and report: 2-3 days).
How to run a moderated usability test (step-by-step)
1. Define research questions and success criteria
Start with the decisions the test needs to inform. “Can users complete the checkout flow?” is too broad. “Can users find the promo code field, apply a discount, and confirm the updated total?” is testable. For each task, define what success looks like (the user reaches the confirmation screen with the correct total) and what failure looks like (the user abandons, uses the wrong field, or does not notice the discount was applied). Set a threshold: if fewer than 4 of 5 participants complete the task without assistance, the design needs revision.
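One lightweight way to hold yourself to those criteria is to write them down in a structured form before the sessions and score the results against them afterwards. The sketch below is illustrative only; the field names and data are hypothetical, not a standard format:

```python
# Illustrative task definitions with explicit success criteria and a completion threshold.
tasks = {
    "apply_promo_code": {
        "success": "reaches confirmation screen with the discounted total",
        "failure_modes": ["abandons checkout", "uses wrong field", "misses discount confirmation"],
        "min_unassisted_completions": 4,  # out of 5 participants
    },
}

# results[task][participant] -> True if completed without assistance (hypothetical data)
results = {"apply_promo_code": {"P1": True, "P2": True, "P3": False, "P4": True, "P5": False}}

for task_id, spec in tasks.items():
    completions = sum(results[task_id].values())
    total = len(results[task_id])
    verdict = "pass" if completions >= spec["min_unassisted_completions"] else "needs revision"
    print(f"{task_id}: {completions}/{total} unassisted completions -> {verdict}")
```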
2. Write realistic task scenarios
Write 3-6 task scenarios that cover the most important user journeys. Frame each task as a realistic situation, not an instruction: “You bought a jacket online last week but it doesn’t fit. Find out how to return it and start the process” rather than “Click on Returns and fill out the form.” Do not use words from the interface in the task description — if the button says “Returns & Exchanges,” do not say “return” in the scenario. This avoids giving participants a findability clue that real users would not have.
3. Prepare the prototype and test environment
Ensure the prototype covers not just the happy path but also error states and alternative paths. If a user clicks the wrong button, the prototype should respond rather than dead-ending — otherwise you lose the observation of how users recover from errors. For remote tests, verify that screen sharing works with the prototype tool. For in-person tests, set up recording equipment for both screen capture and audio. Run a complete pilot session with a colleague to catch technical issues and refine task wording.
4. Recruit representative participants
Recruit 5-8 participants who match the product’s target users in terms of domain experience, technical skill, and context of use. A B2B invoicing tool should be tested with people who actually process invoices, not with design students. Use a screening questionnaire to filter for relevant experience. Over-recruit by 20% to account for no-shows — if you need 5, schedule 6. Offer appropriate incentives (typically $75-150 per hour for consumer products, higher for specialized B2B users).
5. Run the session
Open with 5-10 minutes of rapport building. Explain the process: “We’re testing the product, not you. There are no wrong answers. If something is confusing, that’s the product’s fault, not yours.” Ask the participant to think aloud as they work through each task — narrating what they see, what they expect, and what they are trying to do. When they fall silent, prompt gently: “What are you thinking right now?” or “What are you looking for?” Do not help, hint, or react emotionally to their actions. If they ask “Am I doing this right?”, respond with “What would you do if I weren’t here?” Record both the screen and the audio.
6. Probe between tasks, not during
After each task, ask the participant to reflect: “How was that for you?” “What did you expect to happen when you clicked that?” “Was there anything confusing?” Use the Single Ease Question (SEQ): “On a scale of 1 to 7, how easy or difficult was that task?” This retrospective probing technique, recommended by MeasuringU, avoids interrupting the natural task flow while the participant is working and captures their reasoning while the experience is still fresh.
7. Close with overall impressions
Reserve the final 10 minutes for zoomed-out questions: “How was the overall experience?” “What was the most confusing part?” “How does this compare to what you use today?” “Would you use this product? Why or why not?” Administer a standardized questionnaire — SUS for software products, SUPR-Q for websites — to get a comparable overall usability score. Thank the participant and send the incentive promptly.
8. Debrief immediately after each session
Within 30 minutes of each session ending, write a summary: key observations, surprises, and any hypotheses to explore in the next session. If a note-taker was present, compare notes. This immediate debrief prevents the details of one session from blurring into the next and allows the facilitator to adjust task wording or add probing questions for subsequent sessions.
9. Analyze findings and rate severity
After all sessions are complete, review recordings and notes to compile a list of usability findings. For each finding, record: what happened, how many participants encountered it, the severity (critical — blocks task completion; moderate — causes difficulty but does not prevent completion; low — minor friction), and a screenshot or video clip. Group findings by area (navigation, forms, labels, feedback) and prioritize by severity and frequency. A problem encountered by 4 of 5 participants at critical severity is the top priority.
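A minimal sketch of that prioritization step, assuming each finding is logged with a severity label and a participant count; the findings shown are hypothetical, and the ordering rule simply mirrors the severity-then-frequency guidance above:

```python
# Hypothetical findings log; severity ranks: critical > moderate > low.
SEVERITY_RANK = {"critical": 3, "moderate": 2, "low": 1}

findings = [
    {"area": "forms", "issue": "promo code field missed below the fold", "severity": "critical", "participants": 3},
    {"area": "feedback", "issue": "total does not update after promo code applied", "severity": "moderate", "participants": 2},
    {"area": "labels", "issue": "'Returns & Exchanges' label not understood", "severity": "low", "participants": 1},
]

# Sort by severity first, then by how many participants encountered the problem.
prioritized = sorted(
    findings,
    key=lambda f: (SEVERITY_RANK[f["severity"]], f["participants"]),
    reverse=True,
)

for rank, f in enumerate(prioritized, start=1):
    print(f"{rank}. [{f['severity']}] {f['issue']} ({f['participants']}/5 participants)")
```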
10. Report with evidence and recommendations
Write a report that leads with the top-priority findings, each illustrated with a screenshot or video clip and paired with a specific design recommendation. Do not bury findings in a 50-page document — stakeholders will not read it. Create a one-page summary of the 3-5 most critical findings, a detailed findings table for the design team, and a 5-10 minute highlight reel for stakeholder presentations. Schedule a findings review meeting where the design team watches key clips together and discusses solutions.
How AI changes this method
AI compatibility: partial — AI can transcribe sessions in real time, generate session summaries, identify recurring themes across participants, and draft findings reports. However, the core of moderated usability testing — running the session, reading participant behavior in real time, and making judgment calls about when and how to probe — requires human presence and cannot be automated. The facilitator’s ability to notice a two-second hesitation and ask “what just happened?” is what separates moderated testing from unmoderated testing, and no AI can replicate that in-session judgment.
What AI can do
- Real-time transcription: Tools like Otter.ai, Grain, or Looppanel transcribe sessions as they happen, freeing the note-taker to focus on observations rather than verbatim recording.
- Session summary generation: After each session, an LLM can process the transcript and produce a structured summary — tasks attempted, completion status, key quotes, and areas of confusion — reducing the post-session debrief from 30 minutes to 10 minutes of review and correction (a sketch of this step follows the list).
- Cross-session pattern detection: Given transcripts from all sessions, AI can identify which problems appeared across multiple participants, which interface elements triggered the most negative comments, and which tasks had the widest variation in completion time.
- Highlight clip selection: AI-powered tools like Grain and Looppanel can tag moments in session recordings where participants expressed confusion, frustration, or surprise, enabling the researcher to build a highlight reel in minutes instead of re-watching hours of video.
- Report drafting: An LLM can generate a first-draft findings report from structured session notes — listing each finding with severity, frequency, and supporting quotes — which the researcher refines, adds screenshots to, and adjusts for the audience.
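A minimal sketch of the session-summary step, assuming the OpenAI Python SDK (v1+) with an API key set in the environment; the model name and prompt wording are illustrative, and any other LLM API could be substituted:

```python
# Minimal sketch: turn a raw session transcript into a structured draft summary with an LLM.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SUMMARY_PROMPT = """You are a UX research assistant. From the usability-test transcript below,
produce: (1) tasks attempted and their completion status, (2) moments of confusion or hesitation
with short supporting quotes, and (3) anything the participant said they expected but did not find.
Do not infer problems that are not supported by the transcript."""

def summarize_session(transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system", "content": SUMMARY_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

# draft = summarize_session(open("session_p3_transcript.txt").read())
# The researcher reviews and corrects the draft; it is a starting point, not the record.
```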
What requires a human researcher
- Running the session: The facilitator must build rapport, read the participant’s body language and tone, decide in real time which thread to follow, and maintain the delicate balance between prompting the participant to think aloud and not leading them toward a particular action. This requires social intelligence and moment-to-moment judgment that AI cannot perform.
- Deciding when and how to probe: When a participant pauses on a screen for three seconds, the facilitator decides whether to wait (the participant may be reading) or prompt (they may be lost). This judgment depends on context — what the participant said 30 seconds ago, what their overall skill level appears to be, and what the task requires. No automated system can make this call.
- Rating severity and prioritizing findings: Whether a usability problem is critical, moderate, or low depends on factors AI cannot assess: how the task fits into the broader user journey, how often real users would encounter this path, and what the business cost of failure is. A finding that 3 of 5 participants missed a button could be critical (if the button leads to the only way to cancel a subscription) or low (if there is an alternative path). This judgment requires domain knowledge and design experience.
- Crafting actionable recommendations: Turning “4 participants didn’t see the Save button” into “move the Save button from the overflow menu to the primary toolbar, make it a filled button instead of a ghost button, and add a confirmation toast after save” requires design knowledge that goes beyond what the data alone can produce.
AI-enhanced workflow
The biggest time saving is in post-session work. Traditionally, a researcher spends 30-60 minutes after each session reviewing notes, timestamping key moments, and writing a summary. With AI-generated transcripts and summaries, the researcher reviews and corrects a draft summary in 10-15 minutes — a 60-70% reduction in post-session overhead. Across 5-8 sessions, this saves a full working day.
The analysis phase also accelerates. Instead of reading through 5-8 session transcripts to identify recurring findings, the researcher feeds all transcripts to an LLM that identifies patterns: “4 of 5 participants could not find the return policy link; 3 of 5 described the checkout flow as ‘confusing’ or ‘too many steps’; the word ‘discount’ appears in participant questions across all sessions but the interface uses ‘promo code’.” This cross-session synthesis, which traditionally takes 4-6 hours of transcript re-reading, can be drafted by AI in minutes and reviewed by the researcher in an hour.
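As an illustration of that cross-session synthesis, this is the kind of prompt a researcher might paste the transcripts into; the wording and bracketed placeholders are illustrative, not a tested template:

```
You are analyzing [N] moderated usability test transcripts for [product / flow].
For each problem that appears in more than one session, report: the problem in one
sentence, which participants encountered it (by ID), one short supporting quote per
participant, and the interface element involved. Also list any vocabulary mismatches
between participant language and interface labels. Flag anything that appears in only
one transcript as single-occurrence rather than presenting it as a pattern.
```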
The facilitation itself remains entirely human. No AI tool changes how the session is conducted — the researcher still sits with the participant, watches their screen, and asks follow-up questions in real time. AI’s role begins when the session ends: transcribing, summarizing, pattern-matching, and drafting. The researcher’s role shifts from mechanical post-processing (typing notes, re-watching video, compiling lists) to interpretive work (judging severity, crafting recommendations, deciding what to fix first).
Works well with
- Usability Testing — Unmoderated (Ur): Moderated tests identify problems and explain why they occur; unmoderated tests quantify how often those problems occur across a larger sample. Run moderated first to discover, then unmoderated to measure.
- Concept Testing (Ct): Concept testing validates whether users want the product; moderated usability testing validates whether they can use it. Run concept tests first (is this worth building?), then usability tests (can people actually use what we built?).
- In-depth Interview (Di): Opening and closing questions in a moderated usability session function as a short interview. For deeper exploration of user needs, run a dedicated interview before the usability test to understand context, then use those insights to write better task scenarios.
- Heuristic Evaluation (He): Expert heuristic evaluation catches predictable usability problems before testing with real users, reducing the number of obvious issues participants encounter and allowing the test to focus on less predictable problems that only real users reveal.
- Journey Mapping (Jm): Journey maps identify which stages of the user experience have the most pain points; moderated usability tests zoom into those specific stages to observe exactly where and why the friction occurs.
Beginner mistakes
Helping the participant
The most common facilitator mistake is stepping in when the participant struggles. Saying “try clicking the menu icon” or “that’s the wrong button” destroys the test’s validity — the purpose is to observe what happens without help, because real users will not have a facilitator sitting next to them. When participants ask for help, redirect: “What would you do if you were at home?” or “Where would you look for that?” The discomfort of watching someone struggle is the entire point of the method.
Asking leading questions
Questions like “Did you find that easy?” or “You noticed the save button, right?” tell the participant what they should have experienced. Ask neutral questions: “How was that for you?” “What did you expect to happen?” “Walk me through what you were thinking.” Leading questions produce data that confirms the team’s hopes rather than revealing the product’s problems.
Testing too many tasks in one session
Beginners pack 8-10 tasks into a 60-minute session, leaving no time for probing or reflection. The participant rushes, the facilitator skips follow-up questions, and the data is shallow. Three to six well-chosen tasks with thorough probing produce richer findings than ten tasks done at surface level. Each task needs time for the attempt, for recovery from errors, and for retrospective discussion.
Using interface language in task descriptions
Writing “Click on ‘My Account’ and go to ‘Order History’” tells the participant exactly what to look for, bypassing the findability test entirely. Write task scenarios in plain language that describe the goal without naming the interface elements: “You ordered something last week and want to check whether it has shipped.” This forces the participant to figure out where to go, which is exactly what you need to observe.
Not testing iteratively
Running one round of 10 tests, writing a massive report, and never retesting is a common but wasteful pattern. The most effective approach is iterative: test with 5 users, fix the critical issues, then test again with 5 new users to verify the fixes worked and to catch the problems that were hidden behind the bigger ones. Two rounds of 5 produce better outcomes than one round of 10.
AI prompts for this method
4 ready-to-use AI prompts with placeholders — copy-paste and fill in with your context. See all prompts for moderated usability testing.