
How to run a MaxDiff study: prioritize features with best-worst scaling

What is MaxDiff?

MaxDiff (Maximum Difference Scaling, also known as Best-Worst Scaling) is a quantitative survey method for ranking a list of items by relative importance, preference, or appeal. Instead of asking respondents to rate each item on a 5-point scale — which produces flat data where everything looks “important” — MaxDiff shows small sets of 3 to 5 items and forces the respondent to pick the best and the worst from each set. By repeating this comparison many times across different combinations, the method produces a fully ranked list with numerical scores that reflect how much more important each item is than the next. MaxDiff is the standard tool for feature prioritization, message testing, value-driver discovery, and any situation where the team needs to know what users care about most when forced to choose.

What question does it answer?

  • Of these features, claims, or value propositions, which ones do users care about most when they cannot pick everything?
  • How much more important is item A than item B — not just whether one ranks higher, but the size of the gap?
  • Which features should make it into the MVP, and which can be cut without significant pain?
  • Which marketing message resonates the most with the target audience and which falls flat?
  • Do different customer segments prioritize differently, and where are the biggest divergences?
  • Which pain points have the largest negative impact on customers, and which are background noise?

When to use MaxDiff

  • When the team needs to prioritize a list of 10–30 features, value drivers, messages, or pain points but cannot test them all in a usability study or A/B test.
  • When earlier rating-scale surveys produced flat results — every item rated 4 or 5 out of 5 — and the team still cannot tell what matters most.
  • When stakeholders disagree about feature priority and the decision needs quantitative evidence rather than opinion.
  • When you need to compare priorities across customer segments and want a single instrument that produces stable, comparable scores per segment.
  • When pricing or packaging decisions depend on knowing which features deliver the most perceived value.
  • When the design team needs to choose which 3–5 messages to test in copy or ads from a list of 15 candidates.

Not the right method when the list is shorter than 8 items — a simple rank-order question is faster. MaxDiff also does not explain why users prefer one item over another, so it should be paired with qualitative research to understand the reasons behind the rankings. Finally, MaxDiff scores are relative, not absolute — the method tells you which item is most preferred from the list you tested, but not whether the list itself contains good options.

What you get (deliverables)

  • Ranked list of all tested items with numerical scores on a -100 to +100 scale.
  • Visual chart showing each item’s score and the gaps between them.
  • Segmented score tables: the same ranking calculated separately for each user segment.
  • Preference share simulation: a model showing what percentage of users would pick each item.
  • Top-3 and Top-5 reach metrics: the share of respondents who place each item among their personal favorites, complementing the average score.
  • Written report tying scores to product decisions: which features to build, cut, or postpone.
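The preference share simulation mentioned above is commonly computed as a multinomial-logit transform of the utility scores: each item's share is its exponentiated utility divided by the sum over all items. A minimal sketch, with made-up utility values for illustration:

```python
import math

# Hypothetical HB utilities for four features (illustrative values only)
utilities = {"offline mode": 1.8, "dark theme": 0.4,
             "API access": 1.1, "SSO": -0.6}

def preference_shares(utils):
    """First-choice shares under a multinomial logit model:
    share_i = exp(u_i) / sum_j exp(u_j)."""
    expu = {k: math.exp(v) for k, v in utils.items()}
    total = sum(expu.values())
    return {k: v / total for k, v in expu.items()}

shares = preference_shares(utilities)
# Shares sum to 1.0; the highest-utility item captures the largest share.
```

Commercial tools wrap the same transform in a simulator UI; the sketch just makes the arithmetic visible.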

Participants and duration

  • Participants: 100 minimum for one segment, 200+ for stable rankings, 100–200 per segment for subgroup comparisons.
  • Survey length: 10–20 sets per respondent, 3–5 items per set. A respondent completes a MaxDiff block in 5–10 minutes.
  • Item list size: 8–30 items per study.
  • Setup: 1–3 days to design the list and configure the tool.
  • Field time: 1–2 weeks for data collection.
  • Analysis and reporting: 2–5 days.

How to run a MaxDiff study (step-by-step)

1. Define the decision the study will inform

MaxDiff is expensive to run if the result will not change a decision. Write down what the team will do differently depending on the outcome — “if features A and B rank in the top 5, we will build them in Q2; if they rank in the bottom 10, we will cut them from the roadmap.” This forces the study to be useful and prevents the common trap of running MaxDiff “to see what’s interesting.”

2. Build the item list

Draft 8–30 items to test — features, messages, pain points, value props, or claims. Each item must be mutually exclusive (no overlap), self-contained (understandable without context), and written in similar length and style so respondents are not biased by format. Avoid items that are exact opposites of each other. Pilot the list with 5 internal users and remove anything that confuses them.
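The "similar length and style" check can be partly automated. A rough, illustrative heuristic (not a standard tool) that flags items whose word count deviates sharply from the median:

```python
def flag_length_outliers(items, tolerance=0.5):
    """Flag items whose word count deviates from the median word
    count by more than `tolerance` (default 50%) -- a crude proxy
    for the "similar length and style" rule."""
    counts = sorted(len(i.split()) for i in items)
    median = counts[len(counts) // 2]
    return [i for i in items
            if abs(len(i.split()) - median) > tolerance * median]

candidates = [
    "Export project data to CSV",
    "Offline mode",
    "A long, winding description of a feature that mixes three ideas and runs far past the others",
]
flagged = flag_length_outliers(candidates)
# Flags both the two-word item and the 17-word item as format outliers.
```

Treat flagged items as candidates for rewriting, not automatic removal; the human pilot with 5 users remains the real quality gate.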

3. Configure the survey

Use a tool that supports MaxDiff (Conjointly, Sawtooth Lighthouse Studio, Qualtrics, Displayr, OpinionX, SurveyMonkey, Pollfish). Set the items per set (typically 4) and the number of sets per respondent using the formula s = (r · x) / (n · p), where s is the number of sets per respondent, r is the robustness target of roughly 200 appearances per item across the whole sample, x is the number of items, n is items per set, and p is the expected sample size. Round up. For most studies this lands at 10–20 sets per respondent.
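The sizing formula reduces to a two-line helper; the example numbers below are illustrative, not prescriptive:

```python
import math

def sets_per_respondent(items, items_per_set, sample_size,
                        target_appearances=200):
    """s = (r * x) / (n * p), rounded up: the number of sets each
    respondent answers so that every item appears roughly
    `target_appearances` times across the whole sample."""
    s = (target_appearances * items) / (items_per_set * sample_size)
    return math.ceil(s)

# 25 items, 4 per set, 100 respondents -> 13 sets per respondent
sets_needed = sets_per_respondent(25, 4, 100)
```

If the result exceeds roughly 20 sets, respondents fatigue; increase the sample size or trim the item list instead.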

4. Write a clear introduction for respondents

Explain that the same items will appear repeatedly in different combinations and that this is intentional. Without this warning, respondents see an item three times and assume the survey is broken, then either drop out or click randomly. A 2–3 sentence intro before the first set fixes this and protects data quality.

5. Recruit and field

Recruit through your usual channel (panel provider, customer email, in-app intercept, Prolific, UserInterviews). Match the sample to the population the decision concerns — recruiting paying customers when the question is about new-user activation gives misleading results. Field for 1–2 weeks and monitor completion rates daily.

6. Run the analysis

Most modern tools calculate scores automatically, either using the simple aggregate formula (best − worst) / appearances or the more sophisticated Hierarchical Bayes (HB) model. HB produces individual-level scores that allow for segmentation and choice simulation; the simple formula gives population-level averages that are easier to explain. For studies under 200 respondents, the simple formula is usually enough. For 500+ with segmentation, use HB.
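The simple aggregate formula can be computed directly from raw tallies. A sketch with made-up counts, rescaled to the -100 to +100 range used in the deliverables:

```python
# Per-item tallies from a fielded study (illustrative numbers):
# (times chosen as best, times chosen as worst, times shown).
tallies = {
    "offline mode": (340, 40, 800),
    "dark theme":   (120, 210, 800),
    "API access":   (260, 90, 800),
}

def aggregate_scores(counts):
    """(best - worst) / appearances, scaled to -100..+100."""
    return {item: 100 * (best - worst) / shown
            for item, (best, worst, shown) in counts.items()}

scores = aggregate_scores(tallies)
# offline mode: 100 * (340 - 40) / 800 = 37.5
```

A score of 0 means an item was picked best and worst equally often; +100 means it was chosen best every time it appeared.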

7. Read scores three ways

Look at the average score (overall ranking), the top-3 reach (percentage of respondents who put each item in their personal top 3), and the segment differences. The average score alone hides important nuance — an item with a moderate average might be loved by one segment and ignored by everyone else.
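With individual-level (HB) scores, top-3 reach is just the fraction of respondents whose personal ranking places an item in the first three positions. A sketch with hypothetical per-respondent utilities:

```python
# Each row: one respondent's item utilities (illustrative values).
respondents = [
    {"A": 2.1, "B": 0.3, "C": 1.5, "D": -0.2, "E": 0.9},
    {"A": 0.1, "B": 1.9, "C": 0.4, "D": 1.2, "E": -0.5},
    {"A": 1.3, "B": 0.8, "C": 2.2, "D": 0.1, "E": 0.6},
]

def top_n_reach(rows, n=3):
    """Fraction of respondents who rank each item in their top n."""
    reach = {item: 0 for item in rows[0]}
    for utils in rows:
        top = sorted(utils, key=utils.get, reverse=True)[:n]
        for item in top:
            reach[item] += 1
    return {item: count / len(rows) for item, count in reach.items()}

reach = top_n_reach(respondents)
# Item C reaches every respondent's top 3 despite mixed averages.
```

This is the metric that exposes polarizing items: a feature with middling average utility but high reach in one segment is a segment-specific winner, not a loser.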

8. Compare against the randomness threshold

Calculate the randomness threshold by dividing 100% by the number of items. Items scoring well above this threshold are clear winners; items scoring well below are clear losers; items hovering around it are not statistically distinguishable from chance and should be reported as “no clear preference” rather than ranked.
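When scores are expressed as preference shares summing to 100%, the chance level is simply 100 divided by the item count:

```python
def randomness_threshold(num_items):
    """Chance-level preference share: if respondents clicked at
    random, each item would capture roughly 100/x percent."""
    return 100 / num_items

# 20 items -> a 5% chance level; items well above it are winners.
threshold = randomness_threshold(20)
```

For example, a 20-item study has a 5% threshold: an item holding a 12% share is a clear winner, while one hovering at 4–6% should be reported as "no clear preference."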

9. Report and decide

Write a short report tying each finding to the decision the team committed to in step 1. Lead with the items the team should build, cut, or postpone. Include the score chart, segment differences, and any items in the “no clear signal” zone. Stakeholders read the first page only — keep it tight.

How AI changes this method

AI compatibility: partial — AI is fully capable of designing the item list, calculating scores, segmenting respondents, and writing the report. It cannot replace the human respondents whose preferences are the entire point of the method. Synthetic respondents (LLM-generated answers) have repeatedly been shown to fail at predicting real human choice in MaxDiff studies. Use AI to accelerate the workflow around the human data, not to replace the data itself.

What AI can do

  • Generate the initial item list: An LLM takes a product description and produces 30 candidate features, value props, or pain points to test. The researcher then prunes and edits the list rather than starting from a blank page.
  • Pilot the item list: A model can read the candidate items and flag overlaps, opposites, format inconsistencies, and ambiguous wording before any human respondent sees the survey.
  • Calculate scores and run Hierarchical Bayes: Open-source libraries (R choicetools, Python pymc) and commercial tools (Sawtooth, Displayr, Q, Conjointly) automate the math.
  • Segment analysis at scale: AI can run the same model across dozens of segments and surface only the segments that diverge meaningfully from the average.
  • Draft the findings report: An LLM given the score table, segment cuts, and decision context can produce a first-draft report grouped into “build,” “cut,” and “investigate further.”

What requires a human researcher

  • Defining the decision: Choosing which decision the study will inform depends on roadmap context, resource constraints, and stakeholder politics — outside the data.
  • Real human respondents: Synthetic respondents systematically miss real human trade-offs. Ipsos and others have published evidence that LLM-generated MaxDiff data does not match human data.
  • Interpreting “why”: MaxDiff scores tell you what users prefer, not why. Pairing the scores with qualitative interviews is human work.
  • Choosing the right tool and statistical method: Deciding between aggregate scoring and Hierarchical Bayes depends on the study size and segmentation needs.
  • Defending the result to stakeholders: Translating scores into roadmap decisions requires reading the room and tying the data to business goals.

AI-enhanced workflow

Before AI, a MaxDiff study was a multi-week effort: stakeholder interviews to draft the item list, manual review for overlaps, survey configuration, fielding, statistical analysis (often outsourced to a quant specialist), and report writing. The analyst’s time was spent mostly on assembly, not on insight.

With AI in the loop, the analyst drops a one-page product description into ChatGPT and gets back 30 candidate items in minutes; pastes them back to ask for overlap and ambiguity flags; and after fielding, exports the response data and runs Hierarchical Bayes estimation in Sawtooth or Displayr. The first-draft findings report is then generated by an LLM, which the analyst edits for tone and verifies for accuracy. The whole workflow can compress from 4 weeks to 5–7 days, freeing the analyst to spend time on interpretation and stakeholder conversations that genuinely move decisions.

The unchanged part is the data itself. Synthetic respondents — asking an LLM to “answer as a 35-year-old SaaS power user” — produce data that looks plausible but does not match real human MaxDiff results. The respondents must be real humans, recruited and incentivized like any other survey. AI accelerates everything around them, but does not replace them.

Tools

Survey platforms with MaxDiff: Sawtooth Lighthouse Studio, Qualtrics CoreXM, Displayr, Q research software, Conjointly, OpinionX (free tier), SurveyMonkey MaxDiff, Pollfish, SurveyKing.

Statistical libraries: R package choicetools by Chris Chapman, R bayesm, Python pymc for custom Bayesian models.

Recruitment: Prolific, UserInterviews, Respondent.io, dscout, traditional panels (Cint, Dynata, Toluna), in-product intercepts (Sprig, Maze).

Analysis and visualization: Sawtooth Lighthouse reports, Displayr dashboards, Tableau or Looker, Excel for the simple aggregate formula.

AI assistance: ChatGPT or Claude for item list generation, overlap detection, results interpretation, and report drafting.

Works well with

  • Survey (Sv): MaxDiff is a survey method, but the rankings often raise follow-up questions that require open-ended items immediately after the MaxDiff block.
  • In-depth Interview (Di): MaxDiff tells you what users prefer; interviews tell you why. Running 5–8 interviews after the survey turns scores into actionable insight.
  • Concept Testing (Ct): When MaxDiff identifies the top 5 features, concept testing builds quick mockups and validates that the predicted preference shows up in real interaction.
  • Kano Model (Ka): Kano classifies features as basic, performance, or delight; MaxDiff ranks them by relative importance. Together they answer both “is this expected?” and “is this what users want most?”
  • Persona Building (Ps): Personas describe segments qualitatively; MaxDiff segmented by persona shows quantitatively which priorities differ.

Example from practice

A B2B project management SaaS had a backlog of 47 candidate features and a roadmap budget for 8. The product team had been arguing for two months about which features to build, with each PM championing a different subset. The head of product decided to run a MaxDiff study to settle the question with data.

The team narrowed the list to 28 features through internal review, then ran a MaxDiff survey with 4 features per set, 13 sets per respondent, and 320 customers from the active user base — split across three segments: solo users, team leads, and admins of 10+ user accounts. Hierarchical Bayes scoring revealed that the top 5 features had utility scores between 42 and 68, while the bottom 8 features all clustered near zero with no statistically meaningful difference from random.

The biggest finding was the segment divergence. Solo users overwhelmingly wanted personal task management improvements (score 71); team leads wanted approval workflows (score 64); admins wanted user permission controls (score 58). The top-line ranking blended these into a misleading “everyone wants X” narrative. The team chose to build the top item from each segment (3 features), plus the 2 features that scored in the top 7 across all three segments (a unified focus), and dropped the 8 lowest-scoring features from the roadmap entirely. Six months later, NPS for the three segments rose by 12, 9, and 14 points respectively, confirming that the segmented approach had picked the right work.

AI prompts for this method

4 ready-to-use AI prompts with placeholders — copy-paste and fill in with your context. See all prompts for MaxDiff →.