
How many names can one respondent evaluate: a methodology breakdown

The question comes up regularly in research chats: how many category, flavor, or product names can one respondent realistically evaluate without data quality collapsing? The shared intuition that fewer is better is correct in principle. But without a concrete number behind it, the conversation with a client or colleague quickly drifts into “maybe five or six, I guess.” This article fills the gap. It walks through where the thresholds come from, what methodology vendors and industry practice say, and how to choose the right format for your task.

Consider a typical brief. A brand has six candidate names for a strawberry ice cream — “Strawberry,” “Strawberry Rush,” “Strawberry Fresh,” “Strawberry Burst,” “Forest Tale,” “Berry Kiss” — and wants to know which one resonates best. The options are tightly clustered within one category, some descriptive, some evocative. This is not a hypothetical scenario but a real working brief that most teams encounter at the naming stage.

There is no universal standard

The first thing worth saying plainly: the literature does not provide a single methodological number for “maximum N names per respondent.” The threshold depends on three variables — the test format, the complexity of the stimulus (a bare word versus a word plus product context), and the length of the evaluation scale applied to each option.

What does exist are empirical corridors that appear consistently across sources. Below are the three main formats with concrete numbers and sources, ordered by increasing permissible names per respondent.

Format 1. Monadic test: one name, one respondent

Each subgroup of the audience sees exactly one name. The respondent evaluates it on scales (appeal, product fit, memorability, purchase intent) without comparing anything. The final comparison happens between sample cells rather than inside one person’s head.
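
As a sketch of the mechanics, here is a minimal cell-assignment routine in Python, using the six names from the brief above. The function name and the balanced round-robin logic are illustrative assumptions, not any panel provider's implementation.

```python
import random

NAMES = ["Strawberry", "Strawberry Rush", "Strawberry Fresh",
         "Strawberry Burst", "Forest Tale", "Berry Kiss"]

def assign_cells(respondent_ids, names=NAMES, seed=42):
    """Shuffle respondents, then deal them round-robin into cells,
    so each name gets an (almost) equal-sized monadic cell."""
    rng = random.Random(seed)
    ids = list(respondent_ids)
    rng.shuffle(ids)
    cells = {name: [] for name in names}
    for i, rid in enumerate(ids):
        cells[names[i % len(names)]].append(rid)
    return cells

cells = assign_cells(range(900))  # 6 cells x 150 respondents each
print({name: len(ids) for name, ids in cells.items()})
```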

This design is considered the gold standard specifically for naming, because it most closely reproduces real-world conditions: on the shelf, the consumer sees a package with one name, not a list of six. Northbound, in their guide to designing naming research, argues explicitly for monadic designs in quantitative name validation — their reasoning is that any comparison with neighbors forces analytical thinking, which does not happen in a real brand encounter.

Practical reference points for name count and sample size:

  • Qualtrics Product Naming recommends testing from 3 to 15 names at a typical sample of around 300 completed interviews.
  • Fastuna, an agile online name-testing platform, runs as a monadic test with a maximum of 10 names — each respondent evaluates only one.

The limitation of the format is straightforward: cost. Six names at 100–150 respondents per cell works out to 600–900 interviews. For a pilot-testing budget, that volume is often unrealistic.

Format 2. Sequential monadic test: several names in sequence

The respondent sees several names one after another in randomized order and evaluates each on the same scale. This is a compromise between the purity of monadic testing and sample size economy.

This is where the “how many is too many” question becomes substantive. Sources give different numbers depending on what counts as a stimulus.

For full concepts (name plus description plus illustration plus pricing), conservative recommendations are stricter. The UserIntuition guide to concept testing explicitly limits sequential designs to three or four concepts per respondent, noting that “evaluation quality degrades after the third concept and fatigue becomes a meaningful factor.”

For bare names — where the stimulus is just a word or phrase plus one line of product context — substantially more is allowed. SurveyMonkey gives an operational formula: the number of concepts multiplied by the number of metrics should not exceed 30 questions. With five metrics per name (appeal, category fit, uniqueness, memorability, purchase intent), the ceiling works out to six names. With three metrics, up to ten.
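
The rule of thumb is easy to encode. A minimal sketch, assuming the 30-rated-question budget quoted above; the helper name is mine:

```python
def max_names(n_metrics, question_budget=30):
    """SurveyMonkey's rule of thumb: names x metrics <= question budget."""
    return question_budget // n_metrics

for metrics in (3, 4, 5):
    print(f"{metrics} metrics per name -> up to {max_names(metrics)} names")
# 3 metrics -> 10 names, 5 metrics -> 6 names, matching the text above
```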

Zappi warns about the exposure effect: in a sequential test, an average option rates higher against two weak ones and lower against two strong ones. The effect intensifies with more stimuli, which is why the vendor recommends sequential testing as a screening tool for early-stage ideas rather than a replacement for monadic testing in final validation.

The practical ceiling for names: six to eight per respondent, with mandatory order rotation and the inclusion of a “warm-up” dummy name. The warm-up stimulus is presented first and its scores are excluded from analysis — it absorbs the score inflation characteristic of the first position. The UserIntuition guide describes this technique as the standard primacy-effect correction.
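
A minimal sketch of that presentation logic, assuming full per-respondent randomization as the rotation scheme; the warm-up name "Vanilla Cloud" is an invented placeholder, not part of the brief:

```python
import random

TEST_NAMES = ["Strawberry", "Strawberry Rush", "Strawberry Fresh",
              "Strawberry Burst", "Forest Tale", "Berry Kiss"]
WARM_UP = "Vanilla Cloud"  # invented dummy; its scores are dropped

def presentation_order(respondent_id):
    """Warm-up pinned to the first position to absorb first-position
    score inflation; the six real names follow in random order."""
    rng = random.Random(respondent_id)  # reproducible per respondent
    rotated = list(TEST_NAMES)
    rng.shuffle(rotated)
    return [WARM_UP] + rotated

print(presentation_order(7))
```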

Format 3. MaxDiff: names in paired duels

MaxDiff, or best-worst scaling, solves a different problem. Respondents are shown a subset of four or five names and asked to pick the best and worst. This repeats eight to fifteen times across different combinations, so each name appears in multiple sets. A statistical model (typically hierarchical Bayes) then reconstructs relative preferences and assigns each name a score on a unified 0–100 scale.
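
For intuition on the scoring step, here is a deliberately simplified counting version: the share of "best" picks minus "worst" picks per exposure, rescaled to 0–100. Real tools fit a hierarchical Bayes model instead; this is only a directional sketch with invented responses.

```python
from collections import Counter

def maxdiff_scores(tasks):
    """tasks: iterable of (names_shown, best_pick, worst_pick)."""
    best, worst, shown = Counter(), Counter(), Counter()
    for names, b, w in tasks:
        shown.update(names)
        best[b] += 1
        worst[w] += 1
    raw = {n: (best[n] - worst[n]) / shown[n] for n in shown}  # in [-1, 1]
    return {n: round(50 * (v + 1), 1) for n, v in raw.items()}  # 0-100

tasks = [  # two invented screens of four names each
    (("Strawberry", "Berry Kiss", "Forest Tale", "Strawberry Burst"),
     "Berry Kiss", "Strawberry"),
    (("Strawberry Rush", "Berry Kiss", "Strawberry Fresh", "Strawberry"),
     "Berry Kiss", "Strawberry Rush"),
]
print(maxdiff_scores(tasks))
```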

This approach was proposed by Finn and Louviere in 1992 and has become an industry standard for tasks where many close variants need comparison without exhausting the respondent. Sawtooth Software, which effectively sets standards in the discrete-choice modeling field, describes MaxDiff as a tool for evaluating lists of 15–40 items, scaling to hundreds in advanced designs.

The key empirical work is Chrzan’s 2006 study, published by Sawtooth, which compared three, four, five, and seven items per screen with real respondents. The conclusion is clear: four to five items work best, seven drives up dropout rates and extends task length nonlinearly. The resulting time formula: survey length in seconds equals 9.4 × number of questions plus 17.5 × number of items per question.
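
Encoded directly, the formula makes design trade-offs easy to eyeball. A small sketch using the coefficients quoted above, run for the three screen sizes Chrzan compared:

```python
def survey_seconds(n_questions, items_per_question):
    """Chrzan's timing formula: 9.4 s per question plus 17.5 s
    per item shown on each screen."""
    return 9.4 * n_questions + 17.5 * items_per_question

for items in (4, 5, 7):
    secs = survey_seconds(12, items)
    print(f"12 questions x {items} items: ~{secs:.0f} s ({secs/60:.1f} min)")
```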

A commercial variant deserves separate mention: Ipsos Duel, a specialized product for name testing. It works neither as a monadic test nor as classic MaxDiff, but as a series of paired duels with response-time measurement. Direct stated preference is combined with response time, which serves as a proxy for unconscious engagement — the layer that Likert scales cannot see.

For six strawberry-ice-cream names, MaxDiff is methodologically the strongest choice. It forces the respondent into a real trade-off, which rating scales do not: with similar variants, Likert scales compress, and all six end up at roughly 4 out of 5. The SurveyLab overview of MaxDiff puts it directly: “MaxDiff handles long lists of 20–30 items better than traditional methods, which tend to run into respondent fatigue.”

The specific problem of close variants in one category

A set like “Strawberry” — “Strawberry Fresh” — “Strawberry Burst” is not a group of competing brands on different shelves but a cluster of adjacent positions on one branch of an FMCG matrix. Their semantic proximity creates a specific risk that general name-testing guides rarely discuss.

In a sequential monadic format with a five-point scale, six close variants collapse together. The respondent cannot discriminate finely between “pleasant” and “slightly more pleasant,” especially when both stimuli are about strawberry ice cream. Five out of six end up with a Top-2 Box score around 60 percent, and that says nothing about real ranking.
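
Top-2 Box itself is just the share of 4s and 5s on a five-point scale. A small sketch with invented ratings shows how two visibly different distributions can land within ten points of each other:

```python
def top2box(ratings):
    """Share of respondents rating 4 or 5 on a 5-point scale."""
    return sum(r >= 4 for r in ratings) / len(ratings)

# Invented ratings for illustration only: a steady mid-high name
# versus a polarizing one with almost the same mean.
strawberry_fresh = [4, 4, 5, 3, 4, 5, 3, 4, 2, 4]  # mean 3.8
berry_kiss       = [5, 5, 4, 2, 5, 2, 3, 5, 4, 2]  # mean 3.7, polarizing
print(f"Strawberry Fresh: {top2box(strawberry_fresh):.0%}")  # 70%
print(f"Berry Kiss:       {top2box(berry_kiss):.0%}")        # 60%
```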

An additional risk in this set: evocative names (“Forest Tale,” “Berry Kiss”) sit in a different cognitive slot when placed next to descriptive ones, and they pick up either strongly inflated or strongly deflated ratings. The contrast effect in sequential tests grows with stimulus heterogeneity.

For sets like this, monadic testing or MaxDiff is the methodologically correct choice. If the budget cannot sustain a monadic design across six cells, MaxDiff with 200–300 respondents solves the task with more sensitivity than an economy-sized sequential test.

A systemic caveat for any quantitative name test

There is a deeper limitation worth keeping in mind when designing any name test.

Kahneman describes two systems of thought: System 1 — fast, intuitive, emotional — and System 2 — slow and analytical. In a real brand encounter, the consumer operates in System 1. The name either lands or it does not, in a fraction of a second. A survey form with scales and ratings forces System 2 engagement — and in that system, the most descriptive name almost always wins, because it is easiest to rationalize analytically.

A quote from Northbound: “Our System 2 loves descriptive names that say exactly what something is. Yet, our System 1 — the one judging all the high-profile names in the real world — often prefers something else.” The implication: if your set were tested on a full scale, “Strawberry Fresh” or “Strawberry Burst” would almost certainly beat “Forest Tale” — not because they are stronger, but because they are easier to legitimize in a survey form.

A partial counterweight to this systemic bias is combining standard scales with response-time measurement as a System 1 proxy and open-ended association questions. That combination is what Ipsos Duel uses, along with the NameStormers methodology, where rating scales are supplemented with reaction-time measurement.

Concrete recommendation for six names in one category

Three options, from methodologically strongest to most economical:

  1. Monadic test across six cells. Each cell — 100–150 respondents, 600–900 interviews total. Seven metrics per name: appeal, category fit, flavor fit, memorability, uniqueness, purchase intent, plus one open associative question. Eliminates the comparison effect, delivers absolute scores that can be benchmarked against category norms.
  2. MaxDiff across all six names. Four names per screen, ten to twelve rounds, 200–300 respondents. Returns a percentage-scale measure of relative appeal with far better sensitivity between close variants like “Strawberry Rush” and “Strawberry Burst” than rating scales provide. This is the recommended primary instrument for your specific case.
  3. Sequential monadic test. All six names plus one warm-up, order rotation, metrics reduced to three or four. 200–300 respondents. The format tops out at exactly six names; going beyond that is a deliberate gamble. Keep the exposure effect in mind: the results will tell you which of the six is relatively stronger, but will not deliver absolute appeal levels comparable to norms. A quick feasibility check of all three options follows this list.
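
As that feasibility check, the sketch below reuses the numbers from this list and the formulas quoted earlier (Chrzan's timing formula, the 30-question rule); the dictionary layout itself is mine:

```python
def maxdiff_seconds(rounds, items):
    return 9.4 * rounds + 17.5 * items  # Chrzan's timing formula

designs = {
    "1. monadic, 6 cells":      {"respondents": "600-900",
                                 "rated_questions": 6},  # + 1 open-end
    "2. maxdiff, 12 x 4":       {"respondents": "200-300",
                                 "task_seconds": maxdiff_seconds(12, 4)},
    "3. sequential, 7 stimuli": {"respondents": "200-300",
                                 "rated_questions": 7 * 4},  # 28 <= 30
}
for name, spec in designs.items():
    print(name, spec)
# Option 3 stays under the 30-question ceiling only because metrics
# were cut to four; at five metrics, 7 x 5 = 35 would break the rule.
```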

In all three options, the name must be shown not in isolation but together with a short product description and, ideally, a package mock-up. Lab42’s name-testing guide is direct: “An important part of a name test is to provide respondents with context — a short description of the product or service that the name is meant to represent.” Without it, you are testing a reaction to a word in a vacuum, not to a brand.

Related material on this site: the MaxDiff guide and the concept testing guide — with more detail on samples, use cases, and the limits of each method.