How to run a card sort: a practical guide with AI prompts
A B2B SaaS company selling project management software had a help center with over 400 articles, declining page views, and rising support ticket volume; users searched three times more often than they navigated, and 40% of searches returned zero results because users' terminology did not match the article titles. An open card sort with 25 customers surfaced five task-based categories with plain-language labels, and after the restructure, navigation use rose 60% and "how to" tickets fell 22%. (The full case appears under "Example from practice" below.)
That outcome is what card sorting produces: a shift from “we organized content the way we think about it” to “we organized content the way our users think about it.”
What card sorting actually is
Card sorting is a research method in which participants organize individually labeled cards into groups that make sense to them, revealing how they expect information to be structured. The method produces data that directly informs information architecture decisions — category names, navigation structures, and content groupings — by grounding them in users’ mental models rather than internal organizational logic.
What questions it answers
Card sorting addresses questions about how users expect information to be organized:
- How do users naturally categorize and group the content, features, or topics on our site or application?
- What labels and category names do users expect to see in the navigation — and which of our current labels confuse them?
- Does our existing information architecture match the way users think about our content, or are we organizing it by internal department structure?
- Which items do users consistently group together, and which items do they struggle to place — signaling ambiguous or poorly scoped content?
- How do different user segments (new vs. returning, expert vs. novice) differ in the way they organize the same set of items?
- Which category structure produces the highest agreement across participants, indicating a stable, user-centered architecture?
When to use
- When designing or redesigning the information architecture of a website, application, or intranet, and the team needs empirical data about how users expect content to be organized rather than relying on stakeholder opinions or org-chart logic.
- When the existing navigation is underperforming — users cannot find what they need, search usage is abnormally high, or support tickets frequently cite “I couldn’t find X” — and the team suspects the category structure is the root cause.
- When adding a significant amount of new content or features to an existing product and the current architecture may not accommodate it without restructuring.
- When merging content from multiple sources (e.g., after an acquisition, a platform migration, or consolidation of multiple microsites) and the team needs a unified structure that works for users of all the legacy systems.
- When the team has already conducted a card sort and built a new architecture, and now wants to validate whether the proposed categories work using a closed card sort.
- When there is internal disagreement about how to organize content and the team needs user data to resolve the debate rather than letting the loudest stakeholder win.
Not the right method when the question is about whether users can find specific items within an existing structure — that is tree testing, not card sorting. Card sorting generates structure; tree testing evaluates it. Also not appropriate when the content set is very small (fewer than 15 items) or very large (more than 80 items) — small sets produce trivially obvious groupings, while large sets cause participant fatigue and unreliable data. If the goal is understanding behavior, motivations, or attitudes rather than content organization, use interviews or contextual inquiry instead.
What you get (deliverables)
- Similarity matrix: a table showing how frequently each pair of cards was placed in the same group across all participants, revealing which items users see as strongly related (the sketch after this list shows how it is computed from raw sorting data).
- Dendrogram (cluster analysis): a tree diagram showing how items cluster together at different levels of agreement, helping the team decide where to draw category boundaries.
- Proposed category structure: a set of user-generated categories with labels, based on the most common groupings across participants.
- Category-naming data: the actual words participants used to label their groups, providing vocabulary for navigation labels that match user expectations.
- Standardized spreadsheet: raw sorting data for each participant, enabling further statistical analysis or comparison across segments.
- Disagreement inventory: items that participants placed in different groups with no clear consensus, flagging content that may need to be cross-linked, renamed, or split.
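The similarity matrix is straightforward to compute from raw sorting data. The sketch below is a minimal illustration, assuming each participant's sort is represented as a dictionary mapping their group label to a list of card names; the card names, group labels, and the two example participants are hypothetical, and dedicated tools such as OptimalSort produce this matrix automatically.

```python
from collections import defaultdict
from itertools import combinations

import pandas as pd

# Hypothetical raw data: one dict per participant, mapping each participant-created
# group label to the cards they placed in it. Real exports arrive as spreadsheets but
# reduce to the same structure.
sorts = [
    {"Getting started": ["Create a project", "Invite teammates"],
     "Tracking work": ["Gantt charts", "Kanban boards"]},
    {"Setup": ["Create a project", "Invite teammates", "Kanban boards"],
     "Progress": ["Gantt charts"]},
]

cards = sorted({card for sort in sorts for group in sort.values() for card in group})
pair_counts = defaultdict(int)

# Count how many participants placed each pair of cards in the same group.
for sort in sorts:
    for group in sort.values():
        for a, b in combinations(sorted(group), 2):
            pair_counts[(a, b)] += 1

# Convert the counts into a symmetric card-by-card matrix of agreement percentages.
matrix = pd.DataFrame(0.0, index=cards, columns=cards)
for (a, b), n in pair_counts.items():
    matrix.loc[a, b] = matrix.loc[b, a] = 100 * n / len(sorts)

print(matrix.round(0))
```

Each cell holds the percentage of participants who grouped that pair of cards together; the dendrogram and the proposed category structure are derived from this matrix.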
Participants and duration
Participants: 15-30 for an open card sort (NNGroup recommends 15+ for stable patterns; larger samples produce more reliable similarity matrices). 15-20 for a closed card sort. For remote unmoderated studies, recruit toward the higher end since some sessions will be incomplete or low-quality.
Session length: 15-30 minutes per participant for 30-50 cards. Sessions with more than 60 cards risk fatigue and superficial sorting. In-person moderated sessions can run 30-45 minutes because the researcher can prompt participants to think aloud.
Analysis time: 2-5 days depending on participant count and whether using dedicated analysis software (OptimalSort, UXtweak) or manual spreadsheet analysis.
Total timeline: 1-3 weeks (card preparation and pilot: 2-3 days; data collection: 3-7 days for remote, 1-2 days for in-person; analysis and report: 2-5 days).
How to run a card sort (step-by-step)
1. Choose the card sort type
Decide between three types based on your research question. Open card sort: participants create their own groups and name them — use when exploring how users naturally organize content and what labels they expect. Closed card sort: participants sort cards into predefined categories — use when validating whether an existing or proposed category structure works. Hybrid card sort: participants sort into predefined categories but can create new ones — use when testing a structure while leaving room for users to surface gaps.
2. Select and label the cards
Create 30-50 cards, each representing a page, feature, product, or content item. Write labels in plain language that participants will understand without domain expertise. Avoid repeating the same word across multiple cards: participants tend to match on surface wording, so "Toyota Camry" and "Toyota Corolla" will be grouped by the shared brand name rather than by vehicle type. Avoid jargon or internal terminology that users would not recognize. Each card should represent a single concept — if a label is ambiguous, participants will sort inconsistently and the data will be noisy.
3. Pilot test with 3-5 people
Run a small pilot before launching the full study. Watch for: cards that participants do not understand, cards that everyone puts in the same group (too obvious, consider removing), cards that everyone struggles with (may need relabeling), and sessions that take too long (reduce the card count). Adjust labels, remove unnecessary cards, and clarify instructions based on pilot results.
4. Set up the study and recruit participants
For remote studies, set up the sort in a tool like OptimalSort, UXtweak, or Maze. Add a brief introduction explaining the task without biasing how participants should sort. Include screener questions to ensure participants match the target audience. For in-person studies, print cards on index cards or sticky notes and prepare a table surface. Recruit 15-30 participants who represent the product’s actual users — not colleagues, not UX professionals, not stakeholders.
5. Conduct the sort
For remote unmoderated sorts, launch the study and monitor completion rates. Expect 20-30% dropout for longer sorts; plan recruitment accordingly. For in-person moderated sorts, ask participants to think aloud as they sort, and probe their reasoning: “Why did you put those together?” “What would you call this group?” “Where would you look for X?” Record audio and take notes. Do not help participants or suggest groupings — the point is to see their uninfluenced mental model.
6. Clean the data
Remove incomplete sessions (participants who sorted fewer than half the cards). Remove spam or random sorts (participants who completed in under 2 minutes for a 40-card sort). Standardize group labels: participants will use different words for the same concept (“Help” vs. “Support” vs. “Customer Service”). Map synonymous labels to a single canonical term for analysis.
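If the results are exported to a spreadsheet, this cleaning pass can be scripted. The sketch below assumes a hypothetical export with one row per participant per card and columns named participant_id, card, group_label, and duration_seconds; real tools use their own column names, so treat this as a template rather than a ready-made script.

```python
import pandas as pd

TOTAL_CARDS = 40
df = pd.read_csv("raw_sort_results.csv")  # hypothetical export file

# Drop incomplete sessions: participants who sorted fewer than half the cards.
cards_per_participant = df.groupby("participant_id")["card"].nunique()
kept = cards_per_participant[cards_per_participant >= TOTAL_CARDS / 2].index
df = df[df["participant_id"].isin(kept)]

# Drop likely spam: a 40-card sort finished in under 2 minutes.
too_fast = df.loc[df["duration_seconds"] < 120, "participant_id"].unique()
df = df[~df["participant_id"].isin(too_fast)]

# Standardize synonymous group labels to one canonical term before analysis.
label_map = {"Help": "Support", "Customer Service": "Support", "Customer Care": "Support"}
df["group_label"] = df["group_label"].replace(label_map)
```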
7. Analyze groupings and generate a category structure
Use the analysis tools in your card sorting software to generate: a similarity matrix (which cards were most frequently grouped together), a dendrogram (how clusters form at different agreement thresholds), and popular categories (the most common groupings with their participant-given names). Set a threshold for agreement — typically 60-70% — to decide which groupings are strong enough to become categories. Items below the threshold may need to appear in multiple categories or be cross-linked.
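Outside of dedicated tools, the same analysis can be reproduced with standard hierarchical clustering. The sketch below is illustrative: it hard-codes a tiny percentage similarity matrix of the kind computed earlier (the card names and values are made up), converts agreement into a distance, and cuts the resulting tree at a 70% agreement threshold using SciPy.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster  # dendrogram() in the same module draws the tree
from scipy.spatial.distance import squareform

# Illustrative percentage similarity matrix (how often each pair was grouped together).
cards = ["Create a project", "Invite teammates", "Gantt charts", "Kanban boards"]
matrix = pd.DataFrame(
    [[0, 90, 10, 20],
     [90, 0, 15, 25],
     [10, 15, 0, 80],
     [20, 25, 80, 0]],
    index=cards, columns=cards, dtype=float,
)

# Convert agreement into distance: cards grouped together by everyone are distance 0.
distance = 100 - matrix.to_numpy()
np.fill_diagonal(distance, 0)

# Average-linkage hierarchical clustering over the condensed distance matrix.
tree = linkage(squareform(distance, checks=False), method="average")

# Cut the tree at a 70% agreement threshold (a distance of 30) to get candidate categories.
AGREEMENT = 70
for card, cluster_id in zip(cards, fcluster(tree, t=100 - AGREEMENT, criterion="distance")):
    print(cluster_id, card)
```

Cards that only join a cluster well above the chosen threshold are the low-agreement items worth cross-linking, relabeling, or splitting.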
8. Translate findings into an information architecture
Convert the analysis into a proposed site structure. Use the strongest clusters as top-level categories. Use participant-generated labels as starting points for navigation names (validated through preference testing or A/B testing if needed). Document items with low agreement — these are the items that will cause findability problems regardless of where they are placed, and should be cross-linked or given secondary navigation paths.
9. Validate with tree testing
After building the proposed architecture from card sorting data, run a tree test to evaluate whether users can actually find items within the new structure. Card sorting tells you how users group items; tree testing tells you whether they can navigate to specific items within those groups. The two methods together — sort first, test second — produce an architecture that is both user-organized and user-navigable.
How AI changes this method
AI compatibility: partial — AI can accelerate analysis of card sorting data (similarity matrices, clustering, label standardization) and can even simulate preliminary sorts as a supplement to human participants. However, AI cannot replace real users sorting real cards because the goal of card sorting is to uncover how actual target users think — and an LLM’s categorization reflects its training data, not any specific user population’s mental model. MeasuringU’s research found that ChatGPT’s sorting results correlate with human results at a moderate level but diverge on ambiguous items — exactly the items where user data matters most.
What AI can do
- Label standardization: An LLM can process the raw group labels from all participants and map synonymous terms to canonical labels (e.g., grouping “Help,” “Support,” “Customer Care,” and “Get Assistance” under a single category), reducing hours of manual label cleaning; a prompt sketch follows this list.
- Cluster interpretation: After generating a dendrogram, AI can describe what each cluster represents in plain language, helping a team unfamiliar with dendrograms understand the analysis output.
- Preliminary sorting simulation: Before recruiting real participants, an LLM can sort the cards to identify obviously problematic labels (cards that AI also cannot categorize consistently), serving as a quick pre-pilot to improve card quality.
- Cross-segment comparison: Given sorting data from multiple user segments, AI can identify where segments agree and where they diverge, producing a structured comparison that would take hours of manual matrix analysis.
- Report drafting: AI can generate a first-draft report from the analysis data — summarizing strongest groupings, listing items with low agreement, and proposing a category structure — which the researcher then validates and refines.
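As an illustration of the label-standardization pass mentioned above, the sketch below builds a prompt from the raw label list and sends it to an LLM. The prompt wording, the example labels, and the choice of the OpenAI chat completions API with the gpt-4o model are all assumptions made for the example; any model works, and the researcher reviews every mapping before it feeds the analysis.

```python
from openai import OpenAI  # assumes the openai package and an OPENAI_API_KEY in the environment

raw_labels = ["Help", "Support", "Customer Care", "Get Assistance",
              "Getting started", "Setup", "Onboarding"]

prompt = (
    "You are helping analyze an open card sort. Below are group labels created by different "
    "participants. Map synonymous or near-synonymous labels to a single canonical label and "
    "return one line per original label in the form 'original -> canonical'. Do not invent "
    "labels that are not in the list.\n\n" + "\n".join(raw_labels)
)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # first-pass mapping: review and correct before analysis
```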
What requires a human researcher
- Representing real users: The entire point of card sorting is to capture how real users think. An LLM’s sort reflects averaged patterns from its training corpus, not the mental model of a specific user segment (e.g., first-time customers, elderly users, domain experts). Only human participants provide this data.
- Moderated session facilitation: In-person moderated sorts produce qualitative data — why participants grouped items the way they did — that no automated sort can capture. The researcher probes reasoning, notices hesitation, and follows up on unexpected groupings in real time.
- Ambiguous item decisions: Items that generate low agreement across participants are the most strategically important — they represent genuine confusion about where content belongs. Deciding how to handle these items (cross-link, rename, split, or restructure) requires design judgment informed by business context.
- Architecture decisions: Translating clusters into a navigable information architecture involves trade-offs between depth and breadth, between user expectations and business requirements, and between ideal structure and technical constraints. These are design decisions, not data problems.
AI-enhanced workflow
The most time-consuming phase of card sorting analysis is label cleaning. In a 30-participant open card sort with 40 cards, participants might create 150-200 unique group labels, many of which are synonyms or near-synonyms. A researcher traditionally spends 3-6 hours reading through every label, deciding which ones mean the same thing, and standardizing the list before any clustering analysis can begin. An LLM can produce a first-pass mapping of synonymous labels in minutes, reducing the researcher’s work to reviewing and correcting the AI’s output — typically 30-60 minutes instead of half a day.
The analysis phase also benefits. Tools like OptimalSort generate similarity matrices and dendrograms automatically, but interpreting them requires statistical literacy that many teams lack. An LLM can take the raw analysis output, describe the key clusters in plain language, and draft a recommended category structure that the team can discuss without needing to read the dendrogram directly. This makes card sorting results accessible to stakeholders who would otherwise disengage from a statistical presentation.
Where AI falls short is in the data collection itself. While it is tempting to skip recruitment and have an LLM sort the cards, this produces results that reflect general language patterns rather than the specific mental models of target users. A healthcare company’s patients categorize medical information differently from a general population — and that difference is precisely what card sorting is designed to reveal. AI-simulated sorts are useful as pre-pilots (to test card quality before real participants see them) but should never replace real participant data.
Works well with
- Mental Model Mapping (Mm): Mental model mapping reveals how users organize their thinking about a domain; card sorting operationalizes that understanding by testing how they organize specific content items within that domain. The mental spaces from a mental model diagram can inform which cards to include in the sort.
- In-depth Interview (Di): Moderated card sorts naturally generate interview-quality data when participants think aloud. Combining card sorting with follow-up interviews produces both structural data (groupings) and qualitative data (reasoning behind the groupings).
- Persona Building (Ps): Card sorting data segmented by persona type reveals whether different user groups organize information differently, informing whether the architecture should accommodate multiple mental models or optimize for the primary persona.
- Concept Testing (Ct): After card sorting produces a category structure, concept testing can evaluate whether users understand what each category contains based on its label alone — a quick validation before building the full architecture.
- Participatory Design (Pd): Card sorting is one of the simplest participatory design activities. Combining it with broader participatory design workshops lets users shape both the content organization and the interface that presents it.
Example from practice
A B2B SaaS company selling project management software noticed that its help center had over 400 articles but declining page views and rising support ticket volume. Analytics showed that users searched for help terms 3x more often than they navigated through the help center’s category structure, and that 40% of searches returned zero results — not because the article did not exist, but because users searched using different terminology than the article titles used.
The UX research team ran an open card sort with 25 customers, using 45 cards representing the most-visited and most-searched-for help topics. Participants consistently created groupings that differed from the existing structure in two ways: they organized by task (“Setting up my project,” “Inviting my team,” “Tracking progress”) rather than by feature (“Gantt charts,” “Kanban boards,” “Resource allocation”), and they used concrete, action-oriented labels rather than the abstract feature names the help center used. The dendrogram showed strong agreement (70%+) on five task-based categories that replaced the original eight feature-based categories.
The team restructured the help center using the card sort categories and adopted the most common participant-generated labels. They followed up with a tree test (42 participants) confirming that 85% of tasks were findable in the new structure vs. 52% in the old structure. After launch, help center navigation usage increased by 60%, search-to-navigate ratio dropped from 3:1 to 1:1, and support ticket volume for “how to” questions decreased by 22% within two months.
Beginner mistakes
Using too many or too few cards
Beginners either include every page on the site (producing a 100+ card sort that exhausts participants) or include so few cards (under 15) that the groupings are trivially obvious. The target range is 30-50 cards. Selecting the right cards is itself a research decision — choose items that represent the breadth of the content space and that you genuinely do not know how users would categorize. Exclude items with obvious homes (e.g., “Login” always goes in “Account”).
Biasing card labels with internal jargon
If the card labels use internal terminology or department names, participants will sort based on the words rather than on their understanding of the content. A card labeled “CRM Integration” will be sorted by people who know what CRM means but will confuse everyone else. Rewrite labels in plain user language: “Connect to your customer database” instead of “CRM Integration.” The card sort is testing how users think, not whether they know your acronyms.
Skipping the pilot
Running a card sort without a pilot means discovering problems (confusing labels, too many cards, unclear instructions) after 30 participants have already completed a flawed study. A 3-5 person pilot catches these issues when they are still cheap to fix. Watch the pilot participants sort — if they hesitate on the same cards, those labels need rewriting.
Treating the results as a finished architecture
Card sorting produces groupings, not a navigation structure. Beginners sometimes take the most common groupings and ship them as the final information architecture without considering depth/breadth trade-offs, cross-linking needs, business requirements, or navigability. Card sorting data is an input to architecture design, not the output. Always follow card sorting with tree testing to validate that the proposed structure actually works for finding specific content.
Running only a closed sort when an open sort is needed
Closed card sorts validate an existing structure — they cannot generate a new one. If the current architecture is the problem, a closed sort will only tell you whether users can sort into those broken categories. Start with an open sort to discover how users naturally organize, then use a closed sort to validate the resulting structure. Running a closed sort first is the most common misapplication of the method.
AI prompts for this method
Four ready-to-use AI prompts with placeholders: copy and paste them, then fill in your context. See all prompts for card sorting →.