How to run a tree test: a practical guide with AI prompts

A large university was redesigning its website, which had grown to over 4,000 pages across academic programs, student services, admissions, research, and campus life. Students reported difficulty finding financial aid deadlines, housing applications, and course registration instructions. The UX team had run a card sort with 40 participants and proposed a navigation structure with 8 top-level categories and 35 subcategories — but before committing to visual design, they ran a tree test with 60 participants.

Results showed 7 of 10 tasks exceeded the 70% success threshold, but three failed: financial aid deadline (48%), transfer housing availability (52%), and adding a minor (41%). Path analysis revealed that financial aid was split between two categories, housing was buried three levels deep, and degree modifications were hidden inside a category users associated with transcripts. The team consolidated financial aid into a single top-level category, elevated housing, and moved degree modifications into academic programs. A follow-up tree test confirmed all three tasks now exceeded 70% success — before any visual design work began.

That is what tree testing produces: evidence-based validation of a navigation structure at a fraction of the cost and time of a full usability test.

What tree testing is

Tree testing is a task-based research method that evaluates whether users can find information within a proposed navigation hierarchy by presenting them with a text-only version of a site’s category structure — no visual design, no layout, no content — and asking them to locate specific items. By stripping away everything except the labels and their nesting, tree testing isolates navigation structure from visual cues and reveals whether the information architecture itself makes sense to users or whether categories, labels, and groupings need to be reworked.

What questions it answers

Tree testing addresses questions about whether the navigation structure supports findability:

  • Can users find key resources and features using this navigation structure, and which categories cause the most confusion?
  • Do the category labels make sense to users, or do they expect different names for the same content?
  • Where in the hierarchy do users get lost — at the top level, one level down, or deeper in the structure?
  • Are there categories where users consistently navigate to the wrong location, suggesting a labeling or grouping problem?
  • How does one proposed navigation structure compare to an alternative?
  • Which areas of the navigation work well and should be preserved during a redesign?

When to use tree testing

  • When a proposed information architecture exists (from a card sort, stakeholder input, or design exploration) and the team needs to validate whether users can actually find content within it before building layouts and pages.
  • When redesigning a website or application’s navigation and the team wants to test multiple structural options quickly and cheaply before committing to one.
  • When analytics or user feedback suggest that users cannot find important content but the team is unsure whether the problem is the navigation structure, the visual design, or the page content — tree testing isolates the structure.
  • When the team needs a fast, lightweight method that can be run in a few days with no prototype, no visual design, and no content creation required.
  • When benchmarking an existing navigation structure to establish baseline findability metrics before a redesign.

Tree testing is not the right method when there is no proposed hierarchy to test — if the team is still generating ideas for how to organize content, card sorting is the appropriate method. It is also not appropriate when the research question involves visual design, interaction patterns, or content quality — tree testing strips all of that away intentionally. If users need to see a realistic interface to answer the research question, use a first-click test or moderated usability test instead.

What you get

  • Task success rates: the percentage of participants who found the correct location, broken down into direct success (went straight to the answer) and indirect success (found it after backtracking).
  • First-click correctness: whether participants’ first category choice was the correct path — a strong predictor of eventual success.
  • Time to complete: how long participants took to find or give up on each item.
  • Path analysis: the click-by-click sequence each participant followed, revealing exactly where users took wrong turns.
  • Pietree or treemap visualizations: visual representations of where all participants navigated, showing correct and incorrect paths at a glance.
  • Problem category report: a prioritized list of categories with low success rates, with recommendations for relabeling, regrouping, or restructuring.

Participants and duration

For qualitative tree testing (identifying problems), 5-8 participants provide enough signal to spot major findability issues and iterate. For quantitative tree testing (benchmarking, comparing two structures statistically), aim for 50+ participants per tree variant. NNGroup recommends qualitative testing for iterative improvement and quantitative testing for summative benchmarking.
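
When comparing two tree variants statistically, a two-proportion z-test on per-task success rates is a common choice. A minimal sketch in Python (the participant counts are invented for illustration):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference in task success rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)      # pooled proportion
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # pooled standard error
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))            # two-sided p-value
    return z, p_value

# Illustrative numbers: 38/60 succeeded on tree A, 49/60 on tree B.
z, p = two_proportion_z_test(38, 60, 49, 60)
print(f"z = {z:.2f}, p = {p:.3f}")
```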

Unmoderated tree tests typically take 10-15 minutes per participant for a study with 8-10 tasks. Moderated sessions run 20-30 minutes. Setup takes 1-2 days, analysis takes 1-2 days, and the total timeline is 3-7 days from setup to report.

How to conduct a tree test (step-by-step)

1. Define the tree from your proposed navigation

Export your proposed information architecture into a flat, indented list — each line is one category, indentation represents nesting level. Include every level down to the location where the target resource lives. Do not include individual content pages — only navigational categories and subcategories. Most tree-testing tools accept a spreadsheet format where column A is the top level, column B is the second level, and so on.
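
If the proposed architecture lives in a document or diagram, a short script can flatten it into that column-per-level layout. A sketch, assuming the tree is captured as nested Python dicts with lists marking the deepest categories (the navigation fragment is hypothetical):

```python
import csv

# Hypothetical fragment of a university site's navigation tree.
tree = {
    "Admissions": {
        "Undergraduate": ["Apply", "Visit Campus"],
        "Transfer": ["Credit Evaluation"],
    },
    "Financial Aid": {
        "Scholarships": [],
        "Deadlines": [],
    },
}

def flatten(node, path=()):
    """Yield one row per category: column A = level 1, column B = level 2, ..."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield path + (label,)
            yield from flatten(child, path + (label,))
    else:  # a list of leaf-level categories
        for label in node:
            yield path + (label,)

with open("tree.csv", "w", newline="") as f:
    csv.writer(f).writerows(flatten(tree))
```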

2. Write task scenarios that avoid giving away the answer

Create 8-12 tasks covering the most important content types and the categories you are most uncertain about. Each task should describe what the user needs to find without using the exact category label. If the category is “Starting a Business,” the task should say “You are considering opening a lawn-care service. Find resources that can help you begin the process.” Include a mix: tasks targeting key business goals, tasks exploring potentially confusing categories, and one easy warmup task at the beginning. For each task, define the correct answer location(s).
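
It helps to keep each scenario and its correct answer location(s) together in one structure, so the study is easy to pilot, audit, and import. A minimal sketch (the wording and paths are illustrative):

```python
# Each task pairs a scenario (written without category labels) with the
# full path(s) that count as correct. Multiple paths are allowed when
# more than one location is a legitimate right answer.
tasks = [
    {
        "id": "warmup",
        "scenario": "You want to see when the campus library is open today.",
        "correct_paths": [["Campus Life", "Libraries", "Hours"]],
    },
    {
        "id": "fin-aid-deadline",
        "scenario": "You need to know the last day to submit your aid paperwork.",
        "correct_paths": [["Financial Aid", "Deadlines"]],
    },
]
```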

3. Choose between moderated and unmoderated

Moderated tree testing is best for qualitative, exploratory evaluation — the facilitator can ask “What did you expect to find here?” when a participant clicks the wrong category. Unmoderated tree testing is best for quantitative benchmarking — it scales to large sample sizes quickly. Choose based on whether you need to understand why users get lost (moderated) or measure how many get lost (unmoderated).

4. Set up the study in a tree-testing tool

Import the tree and tasks into your chosen tool (Treejack, Maze, UserZoom, or UX Metrics). Configure correct answers for each task. Enable task randomization (except the warmup). If comparing two tree variants, set up a between-subjects design where each participant sees only one tree.
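
Tree-testing tools handle variant assignment natively; if you distribute links yourself, hashing the participant ID gives a stable between-subjects split. A sketch under that assumption:

```python
import hashlib

def assign_variant(participant_id: str, variants=("tree-a", "tree-b")) -> str:
    """Deterministically assign a participant to one tree variant."""
    digest = hashlib.sha256(participant_id.encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

print(assign_variant("p-0042"))  # the same ID always gets the same tree
```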

5. Pilot the study

Run the pilot with 2-3 colleagues who were not involved in creating the navigation. Watch for ambiguous tasks, missing categories, and accidental label leakage. Fix issues before recruiting real participants. A pilot also gives you an estimate of completion time, which should stay under 15 minutes for unmoderated studies.

6. Recruit and launch

For unmoderated studies, distribute a link; participants complete it on their own time. For moderated studies, schedule 20-30 minute sessions. Recruit participants matching the actual audience — internal team members who designed the navigation are not valid participants.

7. Analyze results by task

For each task, review success rate, directness, first-click correctness, and time to complete. Flag tasks where success is below 70% or directness below 50%. Use path analysis to see exactly where participants went wrong: which categories attracted incorrect clicks, and at which level participants diverged from the correct path.
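
Tools report these metrics out of the box, but they are simple to reproduce from exported click paths, which is useful for custom cutoffs or combined reports. A sketch, assuming each attempt is recorded as the sequence of category paths a participant visited (the data format is illustrative):

```python
# One attempt = the sequence of category paths a participant visited,
# ending where they stopped. Direct success = correct with no wrong turns.
correct = ("Financial Aid", "Deadlines")
attempts = [
    [("Financial Aid",), ("Financial Aid", "Deadlines")],                   # direct
    [("Admissions",), ("Financial Aid",), ("Financial Aid", "Deadlines")],  # indirect
    [("Admissions",), ("Admissions", "Undergraduate")],                     # failure
]

def summarize(attempts, correct):
    n = len(attempts)
    success = sum(a[-1] == correct for a in attempts)
    # Direct: ended at the answer and every visited path was on the correct route.
    direct = sum(a[-1] == correct and all(p == correct[:len(p)] for p in a)
                 for a in attempts)
    first_click_ok = sum(a[0][0] == correct[0] for a in attempts)
    return {
        "success_rate": success / n,
        "directness": direct / n,
        "first_click_correct": first_click_ok / n,
    }

print(summarize(attempts, correct))
```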

8. Identify patterns and recommend changes

Look across tasks for patterns: same categories causing confusion in multiple tasks, one level consistently problematic, categories attracting many clicks but never being the correct answer. Classify problems as labeling (name is misleading), grouping (content is in the wrong parent), or depth (content is buried too deep). Recommend specific changes and plan a follow-up tree test to verify improvements.
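
A quick way to surface cross-task patterns is to tally incorrect first clicks over every task, which exposes categories that attract users without ever being the right answer. A sketch continuing the illustrative data above:

```python
from collections import Counter

# (task_id, first_click, correct_top_level) per attempt; data is illustrative.
records = [
    ("fin-aid-deadline", "Admissions", "Financial Aid"),
    ("transfer-housing", "Admissions", "Campus Life"),
    ("add-minor", "Student Records", "Academic Programs"),
    ("add-minor", "Admissions", "Academic Programs"),
]

wrong_attractors = Counter(
    first for _, first, correct in records if first != correct
)
# Categories that pull users in on multiple tasks are prime suspects
# for misleading labels or overly broad scope.
print(wrong_attractors.most_common())
```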

How AI changes tree testing

AI compatibility: partial — AI can help generate task scenarios, analyze path data, suggest alternative category labels, and draft reports, but the core decisions about information architecture require human understanding of the domain, the users, and the organizational context.

What AI can do

  • Generate task scenarios from a list of content items and target categories, producing realistic user needs without using exact category labels.
  • Analyze path data and identify patterns — which categories attract the most incorrect clicks, where backtracking occurs most frequently, which tasks have the highest failure rates.
  • Suggest alternative category labels for low-performing categories, proposing names that better communicate the category’s contents.
  • Draft findings reports from success rates, path analysis data, and moderated session notes.
  • Compare two tree variants side by side, summarizing which structure performed better overall and per task.

What requires a human researcher

  • Designing the tree structure itself — grouping content, creating categories, and deciding nesting depth requires domain knowledge and understanding of how real users think about the topic.
  • Writing tasks that are realistic and unbiased — AI-generated tasks often inadvertently include label words or create unrealistic scenarios.
  • Interpreting why users chose wrong paths — in moderated sessions, the facilitator asks follow-up questions that require real-time judgment.
  • Making final decisions about restructuring — deciding how to fix problems (relabel, regroup, flatten, split) requires judgment about content relationships and organizational constraints.

AI-enhanced workflow

Before AI, a tree-testing cycle involved manually writing each task scenario (carefully avoiding label words), manually reviewing path analysis one task at a time, and manually compiling findings by cross-referencing success rates with path data. Task writing alone could take half a day for a 10-task study, and analysis typically consumed a full day for a quantitative study with 50+ participants.

With AI, the cycle accelerates at both ends. A researcher provides the tree and content items, and the LLM produces draft scenarios in minutes — the researcher reviews for label leakage and realism, reducing task creation from hours to under an hour. For analysis, raw data (click paths, success rates, timing) fed to an LLM produces pattern identification in minutes: “Which categories attracted the most wrong clicks? Where do participants most commonly backtrack?” The researcher’s time shifts from data processing to interpretation and decision-making about what to change in the information architecture.
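
One practical detail: serializing per-task summaries into the prompt keeps the input compact and the questions focused, compared with pasting raw click logs. A sketch (field names and prompt wording are only an example):

```python
import json

# Per-task summaries mirroring the metrics above; values are illustrative.
task_summaries = [
    {"task": "fin-aid-deadline", "success": 0.48, "directness": 0.30,
     "top_wrong_first_clicks": {"Admissions": 14, "Student Services": 9}},
    {"task": "transfer-housing", "success": 0.52, "directness": 0.35,
     "top_wrong_first_clicks": {"Admissions": 12}},
]

prompt = (
    "You are analyzing tree test results. For each task below, identify the "
    "likely cause of failure (labeling, grouping, or depth) and suggest one "
    "specific change to the navigation tree.\n\n"
    + json.dumps(task_summaries, indent=2)
)
# Send `prompt` to your LLM of choice; review its suggestions against the
# actual tree before acting on them.
```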

Tools

Dedicated tree-testing platforms:

  • Optimal Workshop (Treejack) — the most established tree-testing tool, with pietree visualizations, path analysis, first-click data, and multi-tree comparison.
  • Maze — product research platform with built-in tree testing, integrated with card sorting and prototype testing.
  • UserZoom (now UserTesting) — enterprise research platform with tree testing, path visualization, and statistical comparison.
  • UX Metrics — lightweight tool for card sorting and tree testing.
  • Lyssna (formerly UsabilityHub) — tree testing alongside first-click tests and preference tests.

Complementary tools:

  • Miro / FigJam — for visualizing and editing the tree structure before importing.
  • Google Sheets / Excel — for preparing the tree in indented format.
  • Google Analytics / Mixpanel — for identifying which content users struggle to find, informing task design.

AI-assisted analysis:

  • ChatGPT / Claude — for task generation, path data analysis, label suggestions, and report drafting.

Beginner mistakes

Using category labels in task wording

The most common and most damaging mistake. If the task says “Find information about Starting a Business” and the tree has that exact category, the task tests reading comprehension, not findability. Always describe the user’s need without using the category label, and have someone unfamiliar with the tree review tasks for label leakage.
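
A lightweight script can flag obvious leakage before that human review by comparing task wording against every tree label. A sketch (word-overlap only, so it will miss synonyms and flag harmless matches; treat it as a first pass):

```python
import re

STOPWORDS = {"a", "an", "the", "to", "of", "and", "for", "you", "your", "find"}

def leaked_labels(task_text, labels):
    """Return tree labels whose content words all appear in the task wording."""
    task_words = set(re.findall(r"[a-z]+", task_text.lower())) - STOPWORDS
    leaks = []
    for label in labels:
        label_words = set(re.findall(r"[a-z]+", label.lower())) - STOPWORDS
        if label_words and label_words <= task_words:
            leaks.append(label)
    return leaks

labels = ["Starting a Business", "Financial Aid", "Housing"]
task = "Find resources that can help you begin starting a lawn-care business."
print(leaked_labels(task, labels))  # ['Starting a Business']
```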

Testing too few or too many tasks

Fewer than 5 tasks does not cover enough of the navigation to find problems. More than 15 causes participant fatigue in unmoderated studies. NNGroup recommends 8-12 tasks, each testing a different category or path.

Not including a warmup task

Participants unfamiliar with the accordion-style interface may be confused. An easy warmup task familiarizes them with the interaction and screens out inattentive participants. Without it, the first real task absorbs the learning curve, producing artificially low success rates.

Confusing tree testing with card sorting

Tree testing evaluates an existing structure; card sorting generates a new one. Running a tree test before having a proposed structure produces meaningless data. The correct sequence: card sort first, then tree test.

Ignoring path data and looking only at success rates

A task with 75% success might seem fine, but if 60% of participants backtracked multiple times, the experience is frustrating. Directness and first-click correctness matter as much as final success. Low directness with high success means the structure is recoverable but confusing.

Works well with

  • Card sorting: Card sorting generates candidate structures; tree testing evaluates them. Sort first to discover groupings, then test the tree to verify findability.
  • First-click testing: Adds visual context that tree testing removes. After tree testing confirms the structure, first-click testing checks whether users find the right entry point in the actual layout.
  • Usability testing (moderated): After tree testing validates structure, usability testing evaluates the full navigation experience including visual design and content.
  • Heatmaps and click maps: Reveal where users click in the current navigation, identifying categories that attract unexpected traffic or that users ignore — data for focusing tree test tasks.
  • Analytics and clickstream: Show which pages users reach from search (bypassing navigation) and which internal searches suggest navigation failures — inputs for task design.

AI prompts for this method

4 ready-to-use AI prompts with placeholders: copy-paste and fill in with your context. See all prompts for tree testing.