How to run a concept test: a practical guide with AI prompts

A fintech startup was developing a personal finance app aimed at freelancers who struggle to separate business and personal expenses. The product team had three competing concepts: (A) an AI-powered app that automatically categorizes every transaction as business or personal, (B) a dual-wallet system where freelancers manually route each payment to a “business” or “personal” wallet at the moment of purchase, and (C) a monthly reconciliation tool that shows all transactions at month-end and lets freelancers sort them in one sitting.

The team ran concept tests with 12 freelancers, presenting each concept as a one-page description with a simple wireframe mockup. Concept A (auto-categorization) scored highest on appeal (4.2/5) but lowest on trust — participants worried about misclassified expenses causing tax problems. Concept B (dual wallet) scored highest on comprehension but participants described it as “more work than I already do.” Concept C (monthly reconciliation) had moderate appeal but participants lit up during probing: “This is exactly what I do in a spreadsheet every month, but it takes me three hours.”

The team proceeded with a hybrid: automatic categorization (Concept A’s appeal) with a monthly review screen (Concept C’s workflow match) where freelancers verify and correct the AI’s work before finalizing. The trust concern from Concept A was addressed by framing the AI as a “first draft” rather than a final answer. The app launched with this hybrid approach and achieved a 68% Day-30 retention rate, significantly above the 40% industry benchmark for personal finance apps.

That outcome is what concept testing produces: a shift from “we think users will want this” to “we tested three directions, found what resonated, and combined the best elements before writing a single line of production code.”

What concept testing actually is

Concept testing is a research method in which early-stage ideas, designs, or product concepts are presented to target users to evaluate whether the concept is understood, valued, and worth pursuing before the team invests in building it. The method captures reactions to what a product could be — not how it works — producing go/no-go signals and directional feedback that shapes the concept before development begins.

What questions it answers

Concept testing addresses questions about desirability and comprehension:

  • Does our target audience understand what this concept is and what it does, based on how we have described it?
  • Does the concept address a real problem that users care about, or are we solving something that does not matter to them?
  • Which of several competing concepts resonates most with the target audience, and what drives that preference?
  • What concerns, objections, or confusion does the concept trigger, and what would need to change to make users willing to try it?
  • Is the perceived value of the concept high enough that users would pay for it, switch from their current solution, or change their behavior?
  • At what stage should we stop pursuing this concept because user reactions consistently signal low interest or fundamental misunderstanding?

When to use

  • When the team has one or more early-stage product concepts (descriptions, sketches, mockups, or low-fidelity prototypes) and needs to decide which to develop further based on user reactions rather than internal opinions.
  • When a concept is moving from discovery into design and the team needs confirmation that the core value proposition resonates with target users before committing development resources.
  • When stakeholders disagree about the direction of a product and the team needs user data to make the decision — concept testing provides evidence that is harder to dismiss than internal debate.
  • When entering a new market or launching a product for a new audience segment and the team cannot rely on assumptions about what that audience values.
  • When the cost of building is high enough that validating the concept first is worth the research investment — enterprise products, hardware, regulated industries.
  • When rebranding, repositioning, or redesigning an existing product and the team needs to verify that the new concept communicates the intended message.

Not the right method when a working prototype or product already exists and the question is about usability (whether users can complete tasks). Concept testing evaluates desirability and comprehension — whether users want this thing. Usability testing evaluates functionality — whether users can use this thing. Running a concept test on a finished product wastes the method’s strength, which is catching flawed ideas before they become expensive to fix. Also not appropriate when the concept is too abstract to be represented in any tangible form — if users cannot see, read, or interact with something concrete, their reactions will be speculative and unreliable.

What you get (deliverables)

  • Concept viability score: a quantitative measure (often a Likert scale or purchase-intent scale) indicating how strongly participants responded to each concept.
  • Comprehension check: whether participants correctly understood what the concept does and who it is for, based on their own descriptions (not leading questions).
  • Preference ranking: when testing multiple concepts, a ranked order with rationale for why participants preferred one over others.
  • Qualitative feedback inventory: participant quotes organized by theme — what excited them, what confused them, what concerned them, and what they would change.
  • Kill/pivot/proceed recommendation: a research-backed decision about whether to proceed with the concept as-is, iterate on specific elements, or abandon it entirely.
  • Concept refinement brief: a list of specific changes participants suggested or that the data implies, prioritized by frequency and impact.

Participants and duration

Participants: 8-15 per concept for qualitative concept tests (interviews or moderated sessions); 30-100+ for quantitative concept tests (surveys with rating scales). When comparing multiple concepts, each participant either evaluates all concepts (within-subjects) or only one (between-subjects, which needs a larger total sample: three concepts at 10 participants each means 30 recruits between-subjects versus 10 within-subjects).

Session length: 30-45 minutes for moderated sessions (roughly 20 minutes of exposure and reaction, 10-15 minutes of probing and follow-up, with the remainder for context-setting). 5-10 minutes for unmoderated survey-based tests.

Materials preparation: 1-3 days to create concept stimuli (descriptions, mockups, or prototypes) depending on fidelity level.

Total timeline: 1-2 weeks (stimulus preparation: 1-3 days; recruitment: 2-3 days; data collection: 2-5 days; analysis and report: 2-3 days).

How to run a concept test (step-by-step)

1. Define what you are testing and what a “pass” looks like

Clarify the research question before preparing any materials. Are you testing whether users understand the concept, whether they find it desirable, whether they prefer it over alternatives, or all three? Define success criteria upfront: “The concept passes if 70% of participants correctly identify the value proposition and 60% express purchase intent.” Without predefined criteria, the team will rationalize any result as positive.
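
Where the criteria are numeric, it can help to write them down as data before the first session, so the pass/fail call is mechanical rather than negotiated after the fact. Below is a minimal sketch in Python using the 70%/60% thresholds from the example above; the field names and participant records are illustrative, not a prescribed schema.

    # Success criteria fixed before data collection (Step 1).
    CRITERIA = {
        "comprehension_rate": 0.70,    # share who correctly identify the value proposition
        "purchase_intent_rate": 0.60,  # share rating intent 4 or 5 on a 5-point scale
    }

    # One record per participant, coded after each session (illustrative data).
    participants = [
        {"understood": True,  "intent": 5},
        {"understood": True,  "intent": 3},
        {"understood": False, "intent": 4},
        {"understood": True,  "intent": 4},
    ]

    n = len(participants)
    results = {
        "comprehension_rate": sum(p["understood"] for p in participants) / n,
        "purchase_intent_rate": sum(p["intent"] >= 4 for p in participants) / n,
    }

    for name, threshold in CRITERIA.items():
        print(f"{name}: {results[name]:.0%} (threshold {threshold:.0%})")
    print("PASS" if all(results[k] >= v for k, v in CRITERIA.items()) else "FALLS SHORT: iterate or stop")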

2. Create the concept stimuli

Translate the concept into something participants can react to. This ranges from a written concept statement (one paragraph describing the product, its benefit, and who it is for) to visual mockups, storyboards, landing page prototypes, or video walkthroughs. Match the stimulus fidelity to the decision being made: a concept statement is enough to test whether the idea resonates; a mockup is needed to test whether the design direction works; a clickable prototype tests whether the interaction model makes sense. Avoid over-investing in stimulus fidelity — a polished prototype creates commitment bias and makes the team reluctant to abandon the concept even if user reactions are negative.

3. Write a discussion guide or survey

For moderated tests, write a guide that starts with context-setting questions (what the participant currently does in this domain), moves to concept exposure (showing the stimulus without explaining it), and then probes comprehension (“What do you think this is?”), desirability (“Would you use this?”), and concerns (“What would hold you back?”). Avoid leading questions that signal the expected answer. For unmoderated tests, design a survey that presents the concept, asks comprehension questions, collects ratings on desirability and intent scales, and includes open-ended questions for qualitative depth.
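
For teams that run concept tests repeatedly, keeping the guide's flow explicit as data makes it easier to reuse and adapt. The sketch below (Python, illustrative only) mirrors the moderated flow described above and the unmoderated survey variant; the wording is placeholder text to adapt per study, not a required script.

    # Moderated discussion-guide skeleton (illustrative; adapt wording per study).
    GUIDE = [
        {"phase": "context",       "prompt": "Walk me through how you handle this today."},
        {"phase": "exposure",      "prompt": "(Show the stimulus; stay quiet while they take it in.)"},
        {"phase": "comprehension", "prompt": "In your own words, what does this product do? Who is it for?"},
        {"phase": "desirability",  "prompt": "Would you use this? What would you use it for?"},
        {"phase": "concerns",      "prompt": "What would hold you back from trying it?"},
    ]

    # Unmoderated variant: the same flow as ratings plus open-ended questions.
    SURVEY = [
        {"type": "open",   "question": "In your own words, what does this product do?"},
        {"type": "likert", "question": "How appealing is this product to you?", "scale": (1, 5)},
        {"type": "likert", "question": "How likely would you be to try it in the next month?", "scale": (1, 5)},
        {"type": "open",   "question": "What, if anything, would concern you about using it?"},
    ]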

4. Recruit participants from the target audience

Recruit people who match the concept’s intended audience — not the general public, not colleagues, not users of a competing product unless that is the target segment. If the concept targets small business owners who currently manage invoices manually, recruit small business owners who manage invoices manually. Participants outside the target audience will react to the concept differently because the problem it solves is not their problem, producing misleading data.

5. Expose participants to the concept and capture reactions

Present the concept stimulus and let participants absorb it before asking questions. In moderated sessions, watch each participant’s face and body language during initial exposure — confusion, excitement, or indifference is visible before they speak. Ask comprehension questions first (“In your own words, what does this product do?”) before desirability questions (“Would you use this?”). This order ensures you know whether a negative reaction means the concept is weak or the participant misunderstood it.

6. Probe for depth: why, not just what

Surface-level reactions (“I like it” or “I don’t like it”) are not useful without reasoning. Probe every reaction: “What specifically about it appeals to you?” “What would you use it for?” “What is missing?” “How does this compare to what you do today?” “What would make you not want to use this?” The difference between a concept that fails and one that needs refinement is often found in the follow-up probes, not in the initial reaction.

7. Compare concepts (if testing multiple)

When testing multiple concepts, control for order effects. Rotate the presentation order so each concept appears first an equal number of times. After participants have seen all concepts, ask for a forced ranking and the reasoning behind it. Track not just which concept wins but why — the winning concept’s advantage may be one specific feature that could be integrated into a different concept’s stronger overall framework.
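
A simple way to balance presentation order is to rotate the concept list across participants so each concept leads an equal number of times. The Python sketch below is illustrative; it balances first-position exposure only, whereas full counterbalancing would cycle through every possible ordering.

    from itertools import cycle

    concepts = ["A", "B", "C"]

    def rotations(items):
        # All cyclic rotations, e.g. ABC, BCA, CAB.
        return [items[i:] + items[:i] for i in range(len(items))]

    # Assign each participant the next rotation so every concept
    # appears first an equal number of times across 12 sessions.
    participants = [f"P{i:02d}" for i in range(1, 13)]
    order_cycle = cycle(rotations(concepts))
    schedule = {p: next(order_cycle) for p in participants}

    for participant, order in schedule.items():
        print(participant, "->", " then ".join(order))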

8. Analyze patterns across participants

Look for patterns, not individual opinions. A single participant’s enthusiasm does not validate a concept; consistent patterns across 8-15 participants do. Organize findings by theme: comprehension (did they get it?), desirability (did they want it?), concerns (what worried them?), and comparisons (how does it compare to their current solution?). Calculate quantitative scores if using rating scales. Flag concepts where comprehension was low but desirability was high among those who understood — these concepts have a messaging problem, not a value problem.
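
Once reactions are coded per participant and per concept, the pattern-level summary (including the low-comprehension, high-desirability flag mentioned above) can be computed directly. A small Python sketch with made-up data; the coding scheme and threshold values are assumptions to adjust for your own study.

    from statistics import mean

    # One row per participant per concept, coded from session notes (made-up data).
    rows = [
        {"concept": "A", "understood": True,  "desirability": 4},
        {"concept": "A", "understood": True,  "desirability": 3},
        {"concept": "B", "understood": False, "desirability": 2},
        {"concept": "B", "understood": True,  "desirability": 5},
        {"concept": "B", "understood": False, "desirability": 3},
        {"concept": "C", "understood": True,  "desirability": 3},
    ]

    for concept in sorted({r["concept"] for r in rows}):
        subset = [r for r in rows if r["concept"] == concept]
        comprehension = mean(r["understood"] for r in subset)
        desirability = mean(r["desirability"] for r in subset)
        understood = [r["desirability"] for r in subset if r["understood"]]

        line = f"{concept}: comprehension {comprehension:.0%}, desirability {desirability:.1f}"
        if understood:
            line += f" ({mean(understood):.1f} among those who understood)"
            if comprehension < 0.5 and mean(understood) >= 4:
                line += "  <- likely a messaging problem rather than a value problem"
        print(line)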

9. Make a decision and document the rationale

Apply the success criteria defined in Step 1. If the concept meets the threshold, proceed to the next phase (design or prototyping). If it falls short, decide whether to iterate on specific elements (messaging, positioning, feature set) or abandon the concept. Document the decision and the data that supports it — this prevents the team from revisiting the same concept later without new evidence. A concept that fails in testing does not become viable because a stakeholder believes in it.

How AI changes this method

AI compatibility: partial — AI can generate concept stimuli (descriptions, landing page copy, visual mockups), analyze survey responses at scale, and synthesize qualitative feedback across participants. However, AI cannot replace the moderated session where a researcher watches a participant’s face as they first encounter the concept and probes the reasoning behind a hesitant “I guess I’d try it.” The nuances of desirability — the difference between polite interest and genuine excitement — require human observation.

What AI can do

  • Concept statement generation: An LLM can produce multiple variations of a concept description — different framings, benefit emphases, and audience angles — giving the team a range of stimuli to test rather than relying on a single internally written description.
  • Survey response analysis: For quantitative concept tests with open-ended questions, AI can code hundreds of text responses into themes, calculate sentiment, and identify the most common praise and criticism patterns in minutes rather than days.
  • Comparative analysis across concepts: When testing multiple concepts, AI can generate a structured comparison matrix showing each concept’s performance on comprehension, desirability, and concern dimensions, highlighting where concepts diverge.
  • Discussion guide drafting: An LLM can produce a first-draft discussion guide tailored to the concept, including comprehension checks, desirability probes, and comparison prompts, which the researcher refines based on the specific research context.
  • Stimulus creation: AI image generators (Midjourney, DALL-E) and prototyping tools (Figma AI) can produce visual mockups quickly, enabling the team to test visual concepts that would have required days of design work.

What requires a human researcher

  • Reading genuine reactions: In moderated sessions, the most valuable data comes from micro-expressions, pauses, and tone of voice during initial concept exposure. A participant who says “yeah, that’s interesting” while leaning back with arms crossed is communicating something different from one who says the same words while leaning forward. No AI can observe or interpret this.
  • Probing beneath the surface: When a participant says “I probably wouldn’t use this,” the researcher’s follow-up question — and the judgment about which thread to pull — determines whether the team gets actionable insight or a dead-end data point. This requires real-time empathy and domain knowledge.
  • Avoiding confirmation bias: Teams often want their concept to succeed and will unconsciously design stimuli, questions, or analysis to favor positive results. A trained researcher serves as a check on this bias, designing neutral stimuli, asking non-leading questions, and reporting results honestly even when they are unwelcome.
  • Making the go/no-go call: The decision to proceed, pivot, or kill a concept depends on factors beyond the data: market timing, competitive pressure, organizational capacity, and strategic fit. A researcher presents the evidence; a human team makes the decision.

AI-enhanced workflow

The biggest acceleration comes in stimulus preparation. Traditionally, creating concept stimuli — whether written descriptions, visual mockups, or landing page prototypes — requires collaboration between researchers, product managers, and designers, often consuming 3-5 days. With AI generating concept description variants and producing rough visual mockups, the team can prepare stimuli for multiple concepts in a single day and spend the saved time on more thorough testing with more participants.

Analysis speed also improves significantly for quantitative concept tests. A survey-based test with 100 participants and three open-ended questions produces 300 text responses that a researcher traditionally reads, codes, and synthesizes over 2-3 days. An LLM can produce a coded, themed summary in under an hour, which the researcher reviews and corrects — typically a half-day task. This means the team gets results faster, which matters because concept testing often sits on the critical path between discovery and design.
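
A minimal sketch of that first coding pass, assuming the OpenAI Python SDK (any comparable LLM API would work the same way); the model name, prompt wording, and responses are placeholders, and the output is a draft for the researcher to check against the raw responses, not a finished analysis.

    # First-pass thematic coding of open-ended concept-test responses with an LLM.
    # Assumes the `openai` package is installed and OPENAI_API_KEY is set in the environment.
    from openai import OpenAI

    client = OpenAI()

    responses = [
        "Love the idea, but I don't trust automatic categorization for taxes.",
        "Looks like more work than my current spreadsheet.",
        "I'd use it if I could review everything at month-end.",
    ]

    prompt = (
        "You are coding open-ended responses from a concept test.\n"
        "Group them into themes (praise, confusion, concern, suggestion), name each theme, "
        "and list the response numbers that support it.\n\n"
        + "\n".join(f"{i + 1}. {text}" for i, text in enumerate(responses))
    )

    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(completion.choices[0].message.content)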

Where AI cannot substitute is the moderated session. The 30-minute conversation where a researcher watches someone encounter the concept for the first time, notices their confusion or excitement, and asks the right follow-up questions is the core of qualitative concept testing. No amount of survey data or AI analysis replaces the depth of understanding that comes from watching eight people react to the same concept and noticing what makes the ninth person different.

Works well with

  • In-depth Interview (Di): Concept testing sessions are structured interviews focused on a specific stimulus. The interviewing skills transfer directly, and insights from earlier in-depth interviews about user needs inform what concepts to test.
  • Card Sorting (Cs): After card sorting produces a category structure, concept testing can validate whether users understand category labels and what they expect to find inside each category.
  • Participatory Design (Pd): Concepts generated through participatory design workshops need validation with a broader audience. Concept testing checks whether ideas that resonated in a workshop also resonate with users who were not in the room.
  • Journey Mapping (Jm): Journey maps identify pain points and opportunities; concept testing validates whether proposed solutions for those pain points actually resonate with users before the team builds anything.
  • JTBD Switch Interview (Js): JTBD interviews reveal the forces driving users to seek new solutions. Concept testing checks whether the proposed concept activates those same forces — does it match what would make someone switch?

Example from practice

The fintech case that opens this guide illustrates the full cycle. Three competing concepts for a freelancer finance app (AI auto-categorization, a dual wallet, and monthly reconciliation) were tested with 12 freelancers using one-page descriptions and simple wireframe mockups. Auto-categorization won on appeal but lost on trust, the dual wallet was understood but felt like extra work, and monthly reconciliation matched a task participants already spent three hours a month doing in spreadsheets. The team shipped a hybrid: automatic categorization framed as a “first draft,” verified and corrected on a monthly review screen, and reached 68% Day-30 retention against the 40% industry benchmark for personal finance apps.

Beginner mistakes

Testing the concept description rather than the concept

If participants do not understand the concept, the test result may reflect bad communication rather than a bad idea. Beginners often write concept descriptions using internal jargon, product names that mean nothing to users, or abstract benefit statements (“streamline your workflow”) instead of concrete ones (“spend 2 hours less on invoices each month”). When a concept fails, check comprehension data first — if participants misunderstood what the concept does, the problem may be the stimulus, not the concept.

Asking leading questions

Questions like “Don’t you think this would save you time?” or “Wouldn’t it be great to have something like this?” signal the expected answer. Participants will agree to be polite or because the question frames the concept positively. Ask neutral questions: “What is your first reaction?” “Would you use this? Tell me more.” “What would hold you back?” The goal is to hear what participants actually think, not to confirm what the team hopes they think.

Over-investing in stimulus fidelity

Building a polished, high-fidelity prototype for a concept test creates two problems: it costs too much time and money before validation, and it makes the team psychologically committed to the concept. If the team spent two weeks building a beautiful prototype, they are less likely to act on negative feedback because abandoning it feels like wasting the investment. Use the lowest fidelity stimulus that can communicate the concept clearly — often a written description and a simple sketch are enough.

Testing with the wrong participants

A concept test with friends, colleagues, or people outside the target audience produces unreliable data. Friends will be too positive to avoid hurting feelings. Colleagues already understand the domain and the internal reasoning behind the concept. General-public participants may not have the problem the concept solves and will react based on abstract preference rather than genuine need. Recruit people who match the concept’s intended audience and who currently experience the problem it addresses.

Ignoring negative feedback because some participants loved it

In a concept test with 12 participants, 3 enthusiastic responses and 9 lukewarm or negative responses is a failing result, not a sign that the concept appeals to a niche. Beginners sometimes anchor on the positive outliers and dismiss the majority as “not the right audience.” If the concept consistently fails with the recruited target audience, the problem is the concept — not the audience. Look for patterns in the negative feedback to understand what needs to change.

AI prompts for this method

4 ready-to-use AI prompts with placeholders — copy-paste and fill in with your context. See all prompts for concept testing →.