
How to conduct content analysis: a practical guide with AI prompts

What is content analysis?

Content analysis is a systematic method for organizing and interpreting textual or visual material — interview transcripts, open-ended survey responses, support tickets, app store reviews, social media posts, policy documents, marketing copy — by breaking each source into small units of meaning and tagging those units against a structured codebook. The method works through three core decisions: what counts as one unit of data (a word, a sentence, a paragraph, a post), what categories the codebook contains, and whether categories come from theory before coding (deductive) or emerge from the data during coding (inductive). Content analysis produces both qualitative description (“here is what users actually say about X”) and quantitative summary (“category Y appears in 34% of tickets”) from the same dataset, which is why product teams reach for it whenever they need to turn a pile of unstructured text into a defensible answer about what their users are talking about.

What question does it answer?

  • What topics, problems, or feature requests are users actually raising in their own words across our support tickets, reviews, or survey comments?
  • Which categories of feedback appear most often, and how does that distribution shift between releases, segments, or time windows?
  • How do users describe a specific concept (a feature, a brand, a competitor, a pain point), and which words and framings should we adopt in our own copy?
  • Which categories of issues or sentiments dominate one channel (App Store) versus another (support email, Reddit, NPS comments)?
  • How does the proportion of positive, negative, and neutral mentions of a feature change over time as we ship updates?
  • Which evidence in a body of qualitative material supports or contradicts a specific hypothesis or framework the team already holds?

When to use content analysis

  • After collecting a large body of open-ended responses (200–10,000+) where reading every entry by hand is impractical and the team needs both a structured summary and the ability to drill back into raw quotes.
  • When the team needs to prioritize a backlog by frequency — which complaints appear most often in support tickets, which feature requests recur in feedback forms — so that engineering effort tracks user voice rather than internal opinion.
  • When tracking a metric over time on top of qualitative data — for example, measuring how the share of “performance complaints” in app reviews shifts month over month after a release.
  • When building or validating a taxonomy of user problems, feature requests, or jobs-to-be-done from real evidence rather than from the team’s mental model.
  • When comparing how two groups talk about the same thing — competitor users vs. own users, free vs. paid, segment A vs. segment B — and the team needs structured categories to make the comparison defensible.
  • When auditing internal documents (interview notes, sales call transcripts, complaint logs) for evidence of recurring themes that no one has formally analyzed yet.

Not the right method when the team needs to understand the deeper why behind user behavior — content analysis can tell you that 34% of users mention “navigation problems,” but it stops at the count and the surface label. Pair it with thematic analysis or interview synthesis when the question is about meaning, motivation, or emotional context. It also struggles with sarcasm, cultural nuance, and statements whose meaning depends on a larger conversation, because the unit-by-unit logic strips the surrounding context. Finally, content analysis is overkill for very small datasets (under 30–50 units) where simple reading and note-taking will produce the same insight in a fraction of the time.

What you get (deliverables)

  • Codebook: a structured document listing each category, its definition, inclusion and exclusion criteria, and 2–3 example excerpts per category, used by every coder on the project.
  • Coded dataset: every unit of analysis tagged against the codebook, exportable as a spreadsheet, CSV, or report from a dedicated tool.
  • Frequency table: counts and percentages per category, often broken down by segment, channel, time window, or product version.
  • Category summaries: a paragraph per category describing what the code captures, the most common subpatterns within it, and 3–5 illustrative quotes from the data.
  • Cross-tabulations: matrices showing how categories overlap or how they distribute across groups (segment × category, channel × sentiment, version × issue type).
  • Reliability report: when multiple coders are involved, a measure of intercoder agreement (Cohen’s Kappa, Krippendorff’s alpha, or negotiated agreement notes) showing how consistently the codebook was applied.
  • Insight brief: 3–8 pages presenting the top categories, the patterns the analyst noticed, the surprises, and concrete recommendations tied to the research question.

Participants and duration

  • Participants: none directly — content analysis is applied to material that already exists. The “subjects” are the documents, posts, transcripts, or reviews being coded.
  • Dataset size: 100–500 units for an exploratory inductive study, 500–5,000 for a typical product feedback analysis, 10,000+ for large-scale review or social media studies.
  • Coders: 1 analyst is enough for an exploratory pass; 2–3 coders are recommended for any project where reliability matters or the findings will support a high-stakes decision.
  • Setup time: 0.5–2 days to define the research question, build the first draft of the codebook, and pilot it on a small sample.
  • Coding time: strongly dependent on dataset size and tool — manual coding runs at 50–150 units per hour for an experienced analyst; AI-assisted coding can compress a 5-day pass to a few hours, but adds a verification step.
  • Synthesis and writing: 1–3 days to summarize categories, build cross-tabulations, write the insight brief, and prepare illustrative quotes.

How to conduct content analysis

1. Define the research question and pick the analytic logic

Write down the specific question content analysis should answer, in concrete terms — “Which categories of complaints appear in our App Store reviews from the last 90 days, and how has their distribution shifted since the v4.2 release?” rather than “Look at our reviews.” Then decide between three logics: inductive (let categories emerge from the data, best when the topic is unfamiliar or exploratory), deductive (apply a predefined framework from theory or prior research, best when testing a hypothesis or measuring against a known taxonomy), or summative (start with specific terms or themes and interpret their frequency in context). The choice shapes the codebook, the timeline, and how you defend the findings later.

2. Choose and prepare the dataset

Pull every relevant source into one place — export support tickets, scrape App Store reviews, gather transcripts, copy survey comments. Clean the data: remove duplicates, anonymize names and identifiers, fix obvious encoding errors. Decide what falls inside the scope (date range, channel, language, product area) and what is excluded, and document those decisions because they will be challenged later. Read 20–30 randomly sampled items end to end without coding, just to feel the tone, vocabulary, and format of the material.
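
As a concrete illustration, a minimal cleaning pass with pandas might look like the sketch below. The file name, column names, and regex patterns are assumptions for illustration, not a prescribed schema, and the masking step catches only obvious identifiers; a real project needs a proper PII-scrubbing pass.

```python
# Minimal dataset preparation sketch (hypothetical file and column names).
import pandas as pd

df = pd.read_csv("app_store_reviews.csv", parse_dates=["date"])

# Scope: keep only the agreed date range; document anything excluded.
df = df[df["date"] >= "2024-01-01"]

# Clean: drop exact duplicate texts and empty responses.
df = df.drop_duplicates(subset="text").dropna(subset=["text"])

# Anonymize: crude masking of emails and long digit runs. Not a
# substitute for a real PII pass.
df["text"] = df["text"].str.replace(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", regex=True)
df["text"] = df["text"].str.replace(r"\d{6,}", "[NUMBER]", regex=True)

df.to_csv("reviews_clean.csv", index=False)
```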

3. Define the unit of analysis

Decide what counts as one codable item: a single word, a sentence, a paragraph, a full post, or a whole transcript. The unit should be small enough to hold one main idea but large enough to carry context. App Store reviews are usually one unit per review; long support tickets may be one unit per paragraph; interview transcripts are usually one unit per turn or speaker block. State the choice explicitly and apply it consistently — switching unit size mid-project breaks every count and comparison the analysis produces.
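
For the paragraph-level case, the sketch below shows one way to explode long tickets into one row per paragraph while keeping the source id, so every count can be traced back to its ticket. The data and column names are illustrative.

```python
# Enforcing "one unit per paragraph" for long tickets (illustrative data).
import pandas as pd

tickets = pd.DataFrame({
    "ticket_id": [1, 2],
    "text": [
        "Login fails on iOS.\n\nAlso, where is the export button?",
        "Billing page times out.",
    ],
})

# Split each ticket body on blank lines and keep one row per paragraph.
units = (
    tickets.assign(unit=tickets["text"].str.split(r"\n\s*\n", regex=True))
           .explode("unit")
)
units["unit"] = units["unit"].str.strip()
units = units[units["unit"] != ""].reset_index(drop=True)
print(units[["ticket_id", "unit"]])  # 3 units from 2 tickets
```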

4. Build the first draft of the codebook

For inductive work, open-code 10–20% of the dataset by hand: read each unit and write a short label that captures its main idea. After the first pass, group similar labels into broader categories, give each category a name and a one-sentence definition, and write down what does and does not belong in it. For deductive work, take the categories from the framework or prior literature and write the same definitions. Aim for mutual exclusivity at the lowest level — a single unit should fit one category cleanly, with overlap allowed only where the analyst can justify it. The codebook is the contract between the analyst and the data; vague definitions are the single biggest source of bad findings.
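
One way to keep that contract honest is to store the codebook as structured records rather than free prose, so every category is forced to carry a definition, an inclusion rule, an exclusion rule, and examples. The sketch below is an illustrative schema, not a standard format; the example category is borrowed from the case study later in this guide.

```python
# A structured codebook record per category (illustrative schema).
from dataclasses import dataclass, field

@dataclass
class Code:
    name: str
    definition: str              # one-sentence definition
    include: str                 # explicit inclusion rule
    exclude: str                 # explicit exclusion rule
    examples: list[str] = field(default_factory=list)  # 2-3 excerpts

codebook = [
    Code(
        name="navigation_regression",
        definition="User cannot find a screen or control that existed before an update.",
        include="Mentions of a feature being moved, hidden, or removed after a release.",
        exclude="Requests for features that never existed (code as feature_request).",
        examples=["Where did the recurring transfer screen go?"],
    ),
    # ... one Code per category
]
```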

5. Pilot test and refine the codebook

Apply the draft codebook to a fresh 5–10% slice of the data. If the project has more than one coder, each coder should code the same slice independently, then compare every disagreement. Where coders disagree, the codebook is unclear — sharpen the definition, add an exclusion rule, split a code that hides too many things, or merge two codes that turned out to overlap. Repeat the pilot until the coders agree on most decisions and the remaining disagreements feel like edge cases rather than confusion. Skip this step and the full coding pass produces noise instead of patterns.
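
If both coders' pilot assignments live in one table, surfacing the disagreements is a one-liner, as in this sketch (illustrative data; assumes one primary code per unit):

```python
# Surfacing pilot disagreements between two coders (illustrative data).
import pandas as pd

pilot = pd.DataFrame({
    "unit_id": [101, 102, 103, 104],
    "coder_a": ["billing", "navigation", "performance", "billing"],
    "coder_b": ["billing", "ux_issue", "performance", "pricing"],
})

# Every row printed here is a codebook definition that needs sharpening.
print(pilot[pilot["coder_a"] != pilot["coder_b"]])
```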

6. Code the full dataset

Work through every unit systematically, in the order the data was collected or in random order if order might bias the coder. Apply the codebook strictly. Keep a running memo file open: any unit that does not fit, any new pattern that suggests a missing category, any quote that captures a category especially well — write it down rather than trying to remember. If the codebook starts changing during the full pass, stop and re-pilot the new version on a small slice; do not retroactively recode without tracking what changed and when. Finish coding when every unit has at least one code (or a documented “out of scope” tag) and the memo file shows no unresolved patterns.

7. Check intercoder reliability or do a self-audit

For team projects, calculate intercoder agreement on a double-coded sample (10–20% of the dataset is typical). Cohen’s Kappa above 0.7 is the usual floor for “acceptable”; below that, the codebook needs more work or the coders need more training. For solo projects, do a self-audit: re-code a random 5% sample a week after the first pass, and check whether you assign the same codes the second time. Document the reliability score in the brief — readers and peer reviewers will ask for it.
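
Cohen's Kappa for the two-coder case takes a few lines with scikit-learn, assuming one primary code per unit and the two coders' labels in aligned order:

```python
# Cohen's Kappa on a double-coded sample (illustrative labels).
from sklearn.metrics import cohen_kappa_score

coder_a = ["billing", "navigation", "performance", "billing", "navigation"]
coder_b = ["billing", "navigation", "performance", "pricing", "navigation"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's Kappa: {kappa:.2f}")  # below ~0.7, rework the codebook
```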

8. Summarize categories and build cross-tabulations

Export the coded dataset into a spreadsheet or analytics tool. For each category, count how many units it appears in, what percentage of the total that represents, and how that count breaks down across the segments that matter (channel, version, segment, sentiment). Build cross-tabulations where two dimensions intersect — category × segment, category × time, sentiment × feature. Look for the strongest patterns, but also for surprises: a category that is much rarer or more common than expected often points to something the team did not know.
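
With the coded dataset in spreadsheet form, the frequency table and cross-tabs fall out of a few pandas calls; the file and column names below are assumptions:

```python
# Frequency table and category x version cross-tab (hypothetical columns).
import pandas as pd

coded = pd.read_csv("coded_units.csv")  # one row per coded unit

# Counts and percentages per category.
freq = coded["code"].value_counts().to_frame("count")
freq["share_%"] = (freq["count"] / len(coded) * 100).round(1)
print(freq)

# Category x version, normalized within each version so columns compare.
xtab = pd.crosstab(coded["code"], coded["version"], normalize="columns") * 100
print(xtab.round(1))
```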

9. Pull illustrative quotes and write category summaries

For each major category, write a one-paragraph summary that defines the code, names the most common subpatterns inside it, and reports the count and share. Pull 3–5 direct quotes per category that capture the range — the most representative, the most extreme, the most surprising. Quotes are what make content analysis findings stick in stakeholder memory; counts alone do not.

10. Write the insight brief and present the findings

Produce a 3–8 page brief that opens with the research question, summarizes the top findings in the first page, and then walks through each major category with its count, summary, and quotes. Add cross-tabulations as charts where they help, and end with concrete recommendations tied back to the question. Reserve a section for the limitations of the method on this dataset — sample bias, language coverage, time window — so reviewers know where the findings stop generalizing. Present to stakeholders with the brief in hand, not a deck of bullets.

How AI changes content analysis

AI compatibility: partial — AI dramatically accelerates the mechanical parts of content analysis (initial codebook drafting, bulk coding, frequency counts, summary generation) while leaving the high-judgment parts in human hands (defining the research question, validating coding decisions, interpreting context, catching the categories that matter even when they are rare). Realistically, about 70% of the time is saved, nearly all of it on coding and counting, with the remaining human time concentrated on framing, verification, and interpretation.

What AI can do

  • Draft a starting codebook from a sample: Given 50–100 sample units, an LLM can suggest 8–15 candidate categories with definitions and example excerpts in a few minutes — work that would take a human analyst a half day. The analyst then revises, merges, splits, and renames before piloting.
  • Apply the codebook at scale: Tools like ATLAS.ti’s Intentional AI Coding, Dovetail’s AI categorization, NVivo’s AI Assistant, and Insight7 can apply a finished codebook to thousands of units in minutes instead of days; a minimal custom-pipeline sketch follows this list. The analyst spot-checks a sample to verify consistency.
  • Run sentiment, named entity, and topic analysis: Off-the-shelf NLP turns raw text into sentiment scores, brand mentions, feature mentions, and topic clusters — useful as inputs to a deeper coding pass or as a quick first cut before a real analysis.
  • Cluster similar units: Embedding-based clustering (in ATLAS.ti, Dovetail, Looppanel, and most modern feedback analytics platforms) groups semantically similar quotes automatically, surfacing candidate categories the analyst might miss.
  • Generate category summaries and pull representative quotes: Once coding is done, an LLM can read all units in a category and write a one-paragraph summary with three illustrative quotes — cutting the synthesis time per category from 30 minutes to 5.
  • Translate multilingual datasets: First-draft translation of foreign-language reviews and posts is good enough for coding, with a native speaker validating edge cases. This unlocks multi-market analyses that used to require local researchers.
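
The custom-pipeline version of codebook application is a short loop over units. The sketch below uses the Anthropic Python SDK; the model name, prompt wording, and codebook format are assumptions chosen to show the shape of the pipeline, not a reference implementation, and its output still needs the human spot-check described below.

```python
# Applying a finished codebook with an LLM (assumed model id and prompt).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CODEBOOK = """\
billing: charges, refunds, payment failures. Exclude pricing opinions.
navigation_regression: a screen or control the user cannot find after an update.
performance: crashes, freezes, slow loading.
out_of_scope: none of the above.
"""

def code_unit(text: str) -> str:
    """Ask the model for exactly one code name from the codebook."""
    message = client.messages.create(
        model="claude-sonnet-4-5",  # assumed model id; pin whatever you validate
        max_tokens=20,
        messages=[{
            "role": "user",
            "content": (
                f"Codebook:\n{CODEBOOK}\n"
                "Assign exactly one code from the codebook to the unit below. "
                "Reply with the code name only.\n\n"
                f"Unit: {text}"
            ),
        }],
    )
    return message.content[0].text.strip()

print(code_unit("After the update I can't find the transfer screen anywhere."))
```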

What requires a human researcher

  • Defining the research question and the unit of analysis: AI can suggest categories endlessly, but it cannot decide which question is actually worth answering or which unit size will produce comparable counts. Both choices are upstream and shape every later step.
  • Validating coding decisions on edge cases: AI is reliably wrong about sarcasm, mixed sentiment, and statements whose meaning depends on context outside the unit. The analyst has to spot-check the AI’s assignments and correct the systematic errors.
  • Interpreting the patterns against business context: Knowing that “billing complaints” jumped 18% after a release is data; knowing that the change matches a pricing experiment in one segment is insight. The bridge between the count and the action requires a human who knows the product and the business.
  • Catching the rare-but-important category: AI tools optimize for frequency and similarity, which means a category that appears in 2% of the data but represents a critical safety, legal, or churn signal is the one most likely to be missed by the auto-coder. Human analysts notice these because they are paying attention to consequence, not count.

AI-enhanced workflow

Before AI, a content analysis project on a 5,000-review dataset took an experienced analyst 2–3 weeks: draft the codebook, pilot on a sample, refine, code every review by hand, calculate reliability, build the summary, write the brief. The analyst spent the majority of the time on mechanical coding and almost none on synthesis, which is the part that actually moves decisions.

With AI in the workflow, the same project compresses to 3–5 days. The analyst spends a half-day framing the question and pulling the dataset, then feeds 100 sample units to a model and gets a 12-category draft codebook back in minutes. They revise the codebook by hand, pilot it on 200 units (still by hand, because the pilot is where the rules get sharpened), then hand the full dataset to a tool like ATLAS.ti, Dovetail, or a custom GPT that applies the codes at scale. The analyst spot-checks 5–10% of the auto-coded units, finds the systematic errors (sarcasm, mixed messages, context-dependent meanings), and either corrects them by hand or rewrites the codebook definitions and re-runs. Once coding is locked, the model generates first-draft category summaries and pulls quotes; the analyst rewrites them in the team’s voice. The brief itself stays human work because stakeholders read voice, not text.

The catch is the same as for every AI-assisted research workflow: the time saved depends entirely on the analyst running a real verification pass. Skipping the spot-check produces a polished brief built on quietly miscoded data, which is worse than no analysis at all because it borrows the credibility of structure without the substance. The discipline that used to go into manual coding now goes into auditing the auto-coder, and analysts who treat the AI output as final ship findings that fall apart under the first stakeholder question.

Tools

General-purpose qualitative analysis (CAQDAS): ATLAS.ti, NVivo, MAXQDA, Dedoose, QualCoder, Quirkos, Taguette — all support manual coding, hierarchical codebooks, intercoder reliability statistics, and exportable reports. Modern versions of ATLAS.ti, NVivo, and MAXQDA include AI-assisted coding modules.

AI-first feedback analytics platforms: Dovetail, Insight7, Looppanel, Marvin, EnjoyHQ, Condens, Reduct.video — specialized for product teams analyzing user feedback, with AI tagging, sentiment analysis, and shareable repositories.

Survey and review-specific tools: Thematic, Chattermill, Wonderflow, MonkeyLearn, Kapiche — built for analyzing large volumes of open-ended survey responses, support tickets, and review-site comments at scale.

Lightweight and free options: Taguette (open source), QualCoder (open source), Google Sheets with manual coding, Notion or Obsidian databases for small projects.

LLM-based coding (custom workflows): Claude, ChatGPT, Gemini for codebook drafting, batch coding, and category summaries via custom prompts; Python notebooks with the OpenAI or Anthropic API for repeatable pipelines on larger datasets.

Reliability and statistics: ReCal (online intercoder reliability calculator), built-in modules in ATLAS.ti / NVivo / MAXQDA, Python libraries like krippendorff and statsmodels.
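
For Krippendorff’s alpha, the krippendorff package mentioned above takes a coders-by-units matrix and handles missing ratings; the sketch below maps code names to integers first, following the library’s documented examples:

```python
# Krippendorff's alpha on a double-coded sample (codes mapped to ints).
import numpy as np
import krippendorff

# Rows = coders, columns = units; np.nan marks a unit a coder skipped.
reliability_data = np.array([
    [0, 1, 2, 0, np.nan],  # coder A
    [0, 1, 2, 1, 2],       # coder B
])

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```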

Works well with

  • Survey (Sv): Open-ended survey responses are one of the most common inputs to content analysis. The survey produces the data, content analysis turns the open-ended block into a structured frequency table that complements the closed-ended quant.
  • In-depth Interview (Di): Transcribed interviews are coded with content analysis when the team needs structured comparison across many sessions, especially when a project ran 20+ interviews and reading them by memory no longer scales.
  • Desk Research (Dr): Desk research surfaces the source material (academic papers, competitor blogs, regulatory documents); content analysis is the disciplined way to extract structured findings from those sources rather than reading them ad hoc.
  • Diary Study (Ds): Diary entries arrive as long open text and need to be coded across days, participants, and topics. Content analysis is the standard way to summarize what diaries actually contain.
  • NPS / CSAT / SUS (Np): The verbatim follow-up question on every NPS or CSAT survey is the canonical content-analysis input. Frequency-coded NPS comments turn a single number into a roadmap of what to fix.

Example from practice

A consumer fintech app shipped a major redesign in v5.0 and saw their App Store rating drop from 4.6 to 4.1 over three weeks. Product leadership knew something was wrong but did not know what — the engineering team blamed a bug, support thought it was the new onboarding, and the design team was sure the colors were the issue. The head of research had 5,800 reviews from the 90 days surrounding the release and three days to deliver an answer before the steering meeting.

She defined the research question as “What categories of complaints appear in App Store reviews after v5.0, how do they compare to the 30 days before the release, and which categories are new or significantly more frequent?” She pulled the reviews into a spreadsheet, sampled 100 randomly, and asked Claude to suggest a starting codebook. The model returned 14 candidate categories. She merged two redundant ones, split one that was hiding two issues, and added a “navigation regression” category the model missed. She piloted the revised codebook on 200 reviews by hand, sharpened three definitions, then handed the full 5,800 to ATLAS.ti’s auto-coder. She spot-checked a 10% sample, found that the auto-coder was systematically miscoding sarcastic 5-star reviews as positive, fixed the codebook, re-ran, and locked the coding.

The cross-tab by version showed the answer no one had guessed: the dominant new category was “I cannot find the recurring transfer screen” (28% of negative reviews after v5.0, 0% before), which mapped to a button that had been moved into a submenu in the redesign. Bug reports were stable, onboarding complaints were unchanged, and color complaints were a noisy 4%. Engineering shipped a fix in the next sprint that surfaced the moved button on the home screen, and the rating recovered to 4.5 over the following month. The total project took 22 hours of researcher time across three days; the same analysis manually coded would have taken two full weeks and missed the steering meeting.

Beginner mistakes

Vague codebook definitions

A codebook with one-line names like “Performance” or “UX issues” but no inclusion rules or examples produces inconsistent coding even with one analyst, and outright contradictions with two. Every category needs a one-sentence definition, an explicit inclusion rule, an explicit exclusion rule, and 2–3 example excerpts before the full coding pass starts. Without that contract, the counts mean nothing because two coders applying the same name will be coding different things.

Skipping the pilot

The temptation to skip piloting and start coding the full dataset is strong when the deadline is tight, but it always backfires. The pilot is where the codebook gets sharpened against real data — every disagreement during the pilot is a definition that needs fixing, and fixing it after the full pass means re-coding everything. Budget half a day for the pilot regardless of dataset size; it pays for itself by the second hour of the full pass.

Treating the count as the answer

The frequency table is a midpoint, not the deliverable. “29% of tickets are about login problems” is data, not insight — the team needs to know which kinds of login problems, in which segments, with which workarounds, and why now. Always pair the counts with category summaries and direct quotes, and always interpret the patterns against the business context. A brief that ships frequencies with no synthesis is harder for stakeholders to act on than no brief at all.

Trusting AI auto-coding without spot-checking

Modern AI-coding tools look confident even when they are wrong. Sarcasm gets coded as positive sentiment, mixed messages get assigned the wrong primary code, and rare-but-critical categories get folded into the most similar large bucket. Every AI-coded dataset needs a 5–10% sample audit before the counts go into a brief — without it, the analyst is shipping the auto-coder’s blind spots as findings.
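
Drawing the audit sample takes one line, and fixing the random seed makes the audit reproducible; the file and column names here are illustrative:

```python
# Drawing a reproducible 10% audit sample from an AI-coded dataset.
import pandas as pd

coded = pd.read_csv("coded_units.csv")
audit = coded.sample(frac=0.10, random_state=42)
audit.to_csv("audit_sample.csv", index=False)  # hand-recode, then compare
```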

Letting the codebook drift mid-coding

When new patterns appear during the full coding pass, the right response is to stop, document the change, and re-pilot the new version on a small slice. The wrong response is to keep coding while quietly adding new codes or redefining old ones — that produces a dataset where the first half and the second half are coded against different rules, and no count is comparable across the whole project. Track every codebook change with a timestamp and version it like code.

AI prompts for this method

4 ready-to-use AI prompts with placeholders — copy-paste and fill in with your context.