How to conduct a desirability study: a practical guide with AI prompts
A desirability study measures users’ emotional and aesthetic reactions to a design by asking them to select words from a controlled vocabulary that best describe their experience. The method was introduced in 2002 by Joey Benedek and Trish Miner at Microsoft, who created 118 product reaction cards — physical cards, each printed with a single adjective — and asked participants to pick the five words that best captured their impression of a product. Desirability studies bridge the gap between usability testing (which tells you whether something works) and brand research (which tells you whether something feels right), providing structured data on the subjective qualities of a design.
What question does it answer?
- What emotional impression does this design create — does it feel trustworthy, exciting, confusing, or cheap?
- Do users associate this design with the brand attributes we intend to communicate (e.g., “professional,” “innovative,” “friendly”)?
- How do emotional reactions differ between two or more design alternatives?
- Does the visual design evoke different reactions from different user segments (e.g., novice vs. experienced, younger vs. older)?
- Has a redesign shifted the emotional perception of the product in the intended direction?
When to use
- After a visual redesign, to confirm that the new design conveys the intended brand personality and emotional tone before development begins.
- When comparing two or more design directions (e.g., minimalist vs. illustrated, warm palette vs. cool palette) and the team needs structured emotional feedback rather than vague opinions.
- During concept testing, alongside a usability evaluation, to capture both functional and emotional dimensions of the user experience — usability testing reveals what works, and a desirability study reveals how it feels.
- When stakeholders disagree about the “feel” of a design and the team needs participant data to move the conversation from personal preferences to evidence.
- As a longitudinal benchmark: run a desirability study on the current product, then again after a major update, to track whether emotional perception shifts in the desired direction.
Not the right method when the primary question is about functionality, task completion, or information findability — usability testing, first-click testing, or tree testing are better choices for those. A desirability study captures reaction to visual and experiential qualities, not the mechanics of interaction. Also not suitable as the sole evaluation method: emotional reaction without usability data can produce a design that looks appealing but frustrates users in practice. Always pair desirability results with functional testing.
What you get (deliverables)
- Word frequency table: a ranked list of the most-selected reaction words, with the percentage of participants who chose each word, revealing the dominant emotional impression.
- Positive/negative word ratio: the proportion of positive, negative, and neutral words selected, showing the overall emotional valence of the design.
- Brand alignment score: a comparison of selected words against the target brand attributes defined before the study, showing the percentage of participants who picked at least one brand-aligned word.
- Comparative word profiles: when testing multiple designs, a side-by-side breakdown of which words cluster around each alternative, making the emotional trade-offs between options visible.
- Venn diagram or overlap chart: a visual showing which words both user groups (or both designs) share and which are unique to each, highlighting points of emotional convergence and divergence.
- Qualitative follow-up summary: if participants were asked “why did you choose that word?”, a thematic summary of the reasoning behind the most-selected words.
- Recommendation deck: a short presentation linking the word data to design decisions — which design aligns best with brand goals, where mismatches exist, and what to adjust.
Participants and duration
- Participants: 20-30 per design variant for reliable word-frequency distributions. MeasuringU recommends at least 20 participants for stable percentage rankings. For a moderated variant with follow-up discussion, 8-12 participants provide sufficient qualitative depth.
- Session length: Unmoderated desirability studies take 5-10 minutes per participant (view the design, read the word list, select 5 words, optionally explain choices). Moderated sessions with think-aloud and follow-up run 15-25 minutes.
- Setup time: 2-4 hours to select or adapt the word list, prepare the design stimulus (screenshot, prototype, or live site), define brand-target words, and configure the survey or testing tool. Add 30 minutes for a pilot test with 2-3 people.
- Analysis time: 1-3 hours for a single design, 3-5 hours for a comparative study with two or more variants. Word frequency tabulation is fast; qualitative follow-up analysis adds time.
- Total timeline: 2-5 days from preparation through final report, depending on whether the study is moderated or unmoderated and how many variants are being compared.
How to conduct a desirability study (step-by-step)
1. Define the research question and brand-target words
Start by clarifying what you want to learn. Are you testing whether a new design feels “trustworthy” and “professional”? Comparing which of two designs feels more “approachable”? Before selecting any word list, write down 3-5 brand-target words that represent the emotional qualities the design is intended to convey. These will be your benchmark during analysis.
2. Select and adapt the word list
The original Microsoft set contains 118 words, which is too many for most remote studies. Reduce the list to 20-30 words based on your research question. Include words that map directly to your brand-target attributes, words that represent opposite or undesirable qualities, and neutral words that act as a baseline. Aim for at least 40% negative or neutral words to prevent acquiescence bias — a list of only positive words produces meaningless results. If your product has domain-specific qualities (e.g., a financial app might need “secure,” “transparent”), add custom words. Randomize the order of presentation to prevent position bias.
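A minimal Python sketch of the balance check and per-participant randomization is below. The 25 words and their valence tags are placeholders, not a recommended list; substitute the subset you adapt for your own product.
```python
import random

# Illustrative 25-word subset tagged by valence; the words and tags are
# placeholders -- adapt them to your brand-target attributes and domain.
WORD_LIST = {
    "trustworthy": "positive", "modern": "positive", "clean": "positive",
    "approachable": "positive", "professional": "positive", "friendly": "positive",
    "creative": "positive", "calm": "positive", "fresh": "positive",
    "efficient": "positive", "inviting": "positive", "secure": "positive",
    "plain": "neutral", "ordinary": "neutral", "busy": "neutral",
    "unconventional": "neutral", "simplistic": "neutral",
    "boring": "negative", "confusing": "negative", "cheap": "negative",
    "intimidating": "negative", "cluttered": "negative", "dated": "negative",
    "cold": "negative", "overwhelming": "negative",
}

# Check the "at least 40% negative or neutral" guideline before launching.
non_positive = sum(1 for v in WORD_LIST.values() if v != "positive")
share = non_positive / len(WORD_LIST)
assert share >= 0.40, f"Only {share:.0%} negative/neutral words, add more"

def word_order_for_participant(seed: int) -> list[str]:
    """Return a shuffled word order per participant to avoid position bias."""
    rng = random.Random(seed)  # seed with the participant ID for reproducibility
    words = list(WORD_LIST)
    rng.shuffle(words)
    return words

print(word_order_for_participant(seed=101)[:5])
```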
3. Prepare the design stimulus
Choose the format in which participants will see the design: a static screenshot, an interactive prototype, or the live product. For pure visual-emotional measurement, a high-fidelity screenshot is sufficient and avoids the noise of usability struggles during interaction. If you want reactions that incorporate the experience of using the product, let participants complete 2-3 tasks in a prototype before presenting the word list. Export the stimulus at the resolution and device context users would actually encounter.
4. Build the study in a survey or testing tool
Set up the study in your chosen platform (Maze, Lyssna, Google Forms, Typeform, Optimal Workshop, or a custom survey). Present the design stimulus first, then show the word list and ask participants to select exactly 5 words that best describe the design. If comparing two designs, use a between-subjects design (each participant sees only one) to prevent the first exposure from anchoring responses to the second. Add one open-ended follow-up question: “Why did you choose these words?” This qualitative layer adds depth to the frequency data.
5. Run a pilot test
Test with 2-3 colleagues or friends to verify that the design image is clear and at the right size, that the word list is readable, that instructions are unambiguous (participants understand they should pick exactly 5 words), and that the survey takes under 10 minutes. Check that randomization works so words do not appear in the same order for every participant.
6. Recruit participants and launch
Recruit participants who represent your target audience. For unmoderated studies, distribute the survey link through your panel, email list, or recruitment platform. For moderated sessions, schedule screen-sharing calls where you display the design, give the participant time to look at it (or interact with it), then present the word list on screen and ask them to select and explain their choices. Avoid priming: do not describe the design’s intended feel before presenting the word list.
7. Tabulate word frequencies and calculate key metrics
Once data collection is complete, count how many participants selected each word and convert to percentages. Rank the words from most to least selected. Calculate the positive/negative ratio. Check how many participants selected at least one of your brand-target words. If comparing two designs, create parallel frequency tables and look for words that are strongly associated with one design but not the other.
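The tabulation itself is simple enough to script. Here is a minimal Python sketch; the response rows, valence tags, and brand-target words are hypothetical stand-ins for your exported survey data.
```python
from collections import Counter

# Hypothetical raw data: one list of exactly 5 selected words per participant.
responses = [
    ["trustworthy", "clean", "calm", "professional", "boring"],
    ["modern", "trustworthy", "plain", "professional", "efficient"],
    ["trustworthy", "calm", "boring", "clean", "ordinary"],
    # ...one row per participant
]

BRAND_TARGETS = {"trustworthy", "modern", "approachable"}
VALENCE = {  # would normally come from the full study word list
    "trustworthy": "positive", "clean": "positive", "calm": "positive",
    "professional": "positive", "modern": "positive", "efficient": "positive",
    "plain": "neutral", "ordinary": "neutral", "boring": "negative",
}

n = len(responses)
counts = Counter(word for selection in responses for word in selection)

# Ranked frequency table: % of participants who selected each word.
for word, count in counts.most_common():
    print(f"{word:15s} {count / n:5.0%}")

# Overall valence split across all selected words.
valence_counts = Counter(VALENCE[w] for sel in responses for w in sel)
total = sum(valence_counts.values())
print({k: f"{v / total:.0%}" for k, v in valence_counts.items()})

# Brand-alignment score: share of participants who picked >= 1 target word.
aligned = sum(1 for sel in responses if BRAND_TARGETS & set(sel))
print(f"Brand alignment: {aligned / n:.0%}")
```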
8. Analyze patterns and generate insights
Look beyond individual words to patterns. Do the top-5 selected words cluster around a single emotional theme (e.g., “calm,” “clean,” “simple” — the design reads as minimalist) or scatter across contradictory emotions (e.g., “professional” + “boring” — competent but uninspiring)? Compare results against the brand-target words from step 1: does the design communicate what it was supposed to? If testing multiple designs, identify the one whose word profile most closely matches the brand target. Use a Venn diagram to visualize shared and unique words across designs or user segments.
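The shared/unique breakdown behind a Venn diagram is a plain set comparison. The sketch below uses hypothetical top-word sets for two design variants; the same sets can feed a charting library if you want the diagram drawn for you.
```python
# Top-selected words per design (hypothetical, e.g. the 6 most frequent
# words per variant from the step 7 tabulation).
bold_top = {"exciting", "modern", "creative", "overwhelming", "fresh", "busy"}
clean_top = {"trustworthy", "professional", "calm", "boring", "modern", "plain"}

shared = bold_top & clean_top      # emotional common ground
only_bold = bold_top - clean_top   # unique to the Bold direction
only_clean = clean_top - bold_top  # unique to the Clean direction

print("Shared:", sorted(shared))
print("Bold only:", sorted(only_bold))
print("Clean only:", sorted(only_clean))

# For the report, the same sets can be drawn as a Venn diagram, e.g. with
# the matplotlib-venn package: venn2([bold_top, clean_top], ("Bold", "Clean")).
```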
9. Report findings and recommend design actions
Write a report that leads with the top 5 selected words and the brand-alignment score. Include the full word-frequency table, the Venn diagram (if applicable), and key quotes from the qualitative follow-up. End with clear recommendations: which design to pursue, which emotional gaps to address in the next iteration, and which word-list adjustments to make if the study is repeated.
How AI changes this method
AI compatibility: partial — AI can handle word-frequency analysis, qualitative coding of follow-up responses, and report generation, but the research design decisions (choosing brand-target words, adapting the word list, interpreting cultural nuance in word choices) require human judgment. The core participant interaction — looking at a design and selecting words — must be done by real users.
What AI can do
- Adapt the word list to a specific domain: given the 118 original Microsoft reaction words and a product description, an LLM can suggest a 25-word subset tailored to the product category, flagging missing domain-specific terms and ensuring the positive/negative balance is maintained.
- Analyze open-ended follow-up responses: when participants explain why they chose certain words, AI can code these responses into themes (e.g., “color-driven,” “layout-driven,” “typography-driven”), cluster similar explanations, and surface the most common reasoning patterns.
- Generate word-frequency visualizations: given raw data, AI can produce ranked bar charts, word clouds weighted by selection frequency, and Venn diagrams comparing designs or user segments.
- Draft the findings report: from the frequency table and qualitative themes, AI can produce a structured report with key metrics, pattern analysis, and design recommendations, which the researcher reviews and edits.
- Translate the word list for cross-cultural studies: AI can translate reaction words into other languages while flagging words whose emotional connotation shifts across cultures (e.g., “sophisticated” carries different associations in American English vs. British English vs. Japanese).
What requires a human researcher
- Selecting brand-target words and adapting the word list: these decisions depend on understanding the product strategy, competitors, and the emotional territory the brand occupies — context that requires human judgment and stakeholder conversation.
- Deciding what participants see and how: choosing whether to show a static screenshot or let participants interact, selecting the right fidelity level, and framing the study context all require research design expertise.
- Interpreting word choices in cultural context: a word like “bold” might be positive for a startup audience and alarming for a banking audience. Understanding what a specific audience means when they select a particular word demands cultural and domain knowledge that AI cannot reliably provide.
- Moderating follow-up conversations: in moderated desirability studies, the researcher probes why a participant chose a word, follows unexpected threads, and reads non-verbal cues — skills that require real-time human presence.
AI-enhanced workflow
Before AI, a researcher would spend several hours manually reviewing hundreds of open-ended follow-up responses, grouping them into themes, and counting how often each theme appeared. With an LLM, the researcher pastes the full set of responses into a prompt, receives a thematic breakdown in minutes, and then spends time refining the themes and checking edge cases rather than doing the initial sort. This cuts qualitative analysis time by roughly 60-70%.
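As a sketch of that workflow, the snippet below sends the open-ended responses to an LLM for a first-pass thematic grouping. It assumes the OpenAI Python SDK; the model name, prompt wording, and sample responses are illustrative, and the output still needs the researcher’s review before it goes in a report.
```python
from openai import OpenAI  # any LLM SDK works; the model name below is illustrative

client = OpenAI()

follow_up_responses = [
    "The dark background and neon made it feel like a gaming app, not a bank.",
    "Lots of white space, so it felt organised and easy to trust.",
    # ...paste every "Why did you choose these words?" answer here
]

prompt = (
    "Below are participants' explanations for the reaction words they chose "
    "in a desirability study. Group them into themes (e.g. colour-driven, "
    "layout-driven, typography-driven), name each theme, count how many "
    "responses fall into it, and quote one representative response per theme.\n\n"
    + "\n".join(f"- {r}" for r in follow_up_responses)
)

completion = client.chat.completions.create(
    model="gpt-4o",  # substitute whichever model your team uses
    messages=[{"role": "user", "content": prompt}],
)
print(completion.choices[0].message.content)
```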
Word list adaptation used to be a slow, manual process of reading through all 118 Microsoft words, cross-referencing with brand guidelines, and debating which to keep. An LLM can propose a first-draft subset in seconds, which the researcher then adjusts — removing words that do not resonate in the product’s cultural context and adding ones the AI missed. The researcher’s role shifts from generating the list to editing and validating it.
For comparative studies involving multiple designs or user segments, AI-generated Venn diagrams and side-by-side frequency charts make visual differences immediately clear, so the researcher can focus the report on explaining what the data means rather than on producing the visualizations.
Tools
Survey and testing platforms:
- Maze — supports desirability studies with image stimulus, word-list selection, and built-in analytics.
- Lyssna (formerly UsabilityHub) — offers preference testing and first-click testing that can be adapted for desirability studies; includes panel access.
- Optimal Workshop — provides a suite of research tools, including card sorting (OptimalSort), first-click testing (Chalkmark), and surveys that can be adapted for reaction card studies.
- Google Forms / Typeform — for quick, low-cost desirability studies using a checkbox question with the word list; lacks built-in click maps but works for remote unmoderated studies.
- UXtweak — offers survey and testing tools with panel access and result visualization.
Analysis and visualization:
- Excel / Google Sheets — for word-frequency tabulation, percentage calculations, and creating ranked bar charts.
- Miro / FigJam — for creating Venn diagrams comparing word selections across designs or user groups.
- Dovetail — for coding and analyzing qualitative follow-up responses with tagging and theme clustering.
AI-assisted:
- ChatGPT / Claude — for word list adaptation, qualitative response coding, and report drafting.
- MonkeyLearn / Thematic — for automated sentiment and theme classification of open-ended responses at scale.
Works well with
- Usability Testing Moderated (Ut): Running a desirability study immediately after a moderated usability session captures both the functional and emotional dimensions of the same experience in one session — participants already have an informed reaction to the design.
- A/B Testing (Ab): A/B testing measures behavioral differences (clicks, conversions) while a desirability study measures emotional differences — combining both reveals whether the variant that performs better in metrics also feels better to users.
- Concept Testing (Ct): During early-stage concept evaluation, desirability data adds an emotional layer to feasibility and functionality feedback, helping teams choose concepts that not only work but resonate.
- Persona Building (Ps): Personas define the target audience’s values and emotional needs; desirability study results validated against persona profiles reveal whether the design speaks to the right emotional register for each audience segment.
- First Click Testing (Fc): First-click testing checks whether users start on the right path, and a desirability study checks whether the design feels right while they do — together, they cover both cognitive and emotional first impressions.
Example from practice
A European fintech company redesigned its mobile investment app to attract younger users (25-35) who described the existing version as “corporate” and “outdated” in customer support tickets. The product team created two design directions: “Bold” (dark backgrounds, neon accents, motion) and “Clean” (white space, muted palette, static layouts). Both passed usability testing with comparable task-completion rates, leaving the team without a clear winner on functional grounds.
The research team ran a desirability study with 48 participants from the target demographic, split evenly between the two designs. Each participant interacted with a prototype for three minutes, then selected five words from a 25-word adapted reaction list. The “Bold” design’s top words were “exciting” (67%), “modern” (58%), “creative” (46%), and “overwhelming” (38%). The “Clean” design’s top words were “trustworthy” (71%), “professional” (54%), “calm” (50%), and “boring” (33%). The brand-target words were “trustworthy,” “modern,” and “approachable.” Neither design scored well on “approachable” — it appeared in fewer than 10% of responses for both.
The team chose the “Clean” direction as its foundation because “trustworthy” was the highest-priority brand attribute for a financial product, then ran a targeted iteration to address the “boring” signal — introducing subtle color accents and micro-interactions. A follow-up desirability study with 24 participants showed “boring” dropping to 8% while “trustworthy” held at 67% and “approachable” rose to 29%. The redesigned app launched to a 22% increase in new-account activations from the 25-35 age group in the first quarter.
Beginner mistakes
Using only positive words in the word list
When the list contains only flattering adjectives (“elegant,” “innovative,” “delightful”), every participant response looks like praise, and the researcher has no way to detect problems. This happens because beginners worry that including negative words will offend stakeholders or bias participants. In reality, the opposite is true — Benedek and Miner’s original research recommended at least 40% negative or neutral words. Without them, the study produces data that cannot distinguish a beloved design from a mediocre one. Always include words like “confusing,” “cheap,” “boring,” and “intimidating” alongside the positive ones.
Letting participants interact too long before presenting the word list
If participants spend 20 minutes completing tasks in a prototype before seeing the word list, their reactions reflect the full task experience (including usability frustrations) rather than the emotional impression of the design. This muddies the data: a word like “frustrating” might be about a broken dropdown, not the visual design. For studies focused on visual and aesthetic reaction, limit interaction to a brief viewing (30-60 seconds for a screenshot) or a short, controlled task (2-3 minutes in a prototype). For studies that intentionally capture the full experience, make this scope explicit in the research question and report.
Not defining brand-target words before the study
Without pre-defined target words, analysis becomes subjective — the researcher looks at the results and retroactively decides which words are “good” and “bad.” This leads to confirmation bias, where the researcher interprets any result as supporting the design. Define 3-5 brand-target words before data collection, share them with stakeholders, and use them as the benchmark for analysis. If the top-selected words do not overlap with the target words, the design has an emotional gap that needs addressing.
Treating word selection as a vote
A beginner might report that “62% of participants chose ‘modern’” and conclude the design is modern, full stop. But in a desirability study, each participant selects 5 words — the data shows relative emphasis, not absolute judgment. The meaningful analysis is comparative: how does “modern” rank against other words? Is it the top word or the fifth? Does it appear alongside compatible words (“clean,” “fresh”) or contradictory ones (“confusing,” “cluttered”)? The pattern of word co-selection tells the story, not any single word’s percentage in isolation.
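A quick co-occurrence count makes that co-selection pattern visible. The sketch below uses hypothetical selections and tallies which words are chosen together with “modern”.
```python
from collections import Counter
from itertools import combinations

# Hypothetical selections: 5 words per participant.
selections = [
    ["modern", "clean", "fresh", "calm", "plain"],
    ["modern", "confusing", "cluttered", "busy", "creative"],
    ["modern", "clean", "trustworthy", "calm", "efficient"],
]

# Count how often each pair of words is selected together.
pairs = Counter()
for sel in selections:
    pairs.update(combinations(sorted(sel), 2))

# Pairs involving "modern" show which emotional company the word keeps.
for (a, b), count in pairs.most_common():
    if "modern" in (a, b):
        print(a, "+", b, ":", count)
```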
Running the study with too few participants
With fewer than 15 participants per design variant, word-frequency percentages swing wildly — a single participant’s choice can shift a word’s share by 7-10 percentage points and reorder the rankings. This makes it impossible to distinguish signal from noise. Beginners sometimes run the study with 5-8 participants because they treat it like a qualitative method. A desirability study produces quantitative frequency data that requires sample sizes of 20+ for stable rankings and meaningful comparisons.
AI prompts for this method
4 ready-to-use AI prompts with placeholders — copy-paste and fill in with your context. See all prompts for desirability studies →.