Automating Evaluative Research
A longish post I didn't want to write since everyone and their dog is writing about Claude
How I Got Started
Over the past 2 years I have been running RITE studies on a voice agent. Each study follows the same shape. We recruit 8 to 15 participants via an unmoderated testing platform, ask them to call a staging version of the agent and place a service appointment, then grade them on a 13-question rubric and ask them to describe their experience. The result is a stack of artifacts: an Excel scorecard, transcripts of every call, machine-scored CX dimensions on every call, and a written research report that goes to product and engineering.
Somewhere in the second year, I noticed the work had stratified into two kinds. There was the mechanical work of pulling responses out of testing platform exports, computing scores, copy-pasting transcripts into a doc, and tracking which participants flagged which issue. And there was the judgment work of deciding which findings deserved to be elevated, which quotes carried the most weight, which observations were prompt-fixable and which were tenant-policy, what to lead with and what to bury.
The mechanical work was eating most of my time. I was running studies once or twice per sprint, and each one required roughly two days of data wrangling before I could even start thinking about what the findings meant. I wanted to push that ratio the other way: less time on the wrangling, more time on the thinking.
I had been using Claude conversationally for analysis work, but I had not tried to build it into the workflow as a tool. The opening of this project was a single message: I uploaded a topline summary draft and asked Claude to make it more readable. What I got back was a complete rewrite, and not the rewrite I wanted.
The eight rounds of edits
Claude’s first draft of the topline summary was confident, structured, and wrong in ways that taught me something. It led with the technical score and walked into a section on PM rule coverage. It used phrases like “the prompt worked” as the opening sentence. It treated the negatives I had collected as carrying roughly equal weight to the positives, which made the document read as more critical than the data supported.
I had not realized, until I saw that first draft, how much editorial judgment I had been quietly applying every time I wrote a report from scratch. The act of correcting Claude’s draft made all of those judgment calls explicit, one at a time.
I had not realized how much editorial judgment I had been applying every time I wrote a report from scratch.
Over the next several hours we went through roughly eight rounds of editing. Each round surfaced a different layer of tacit knowledge.
Round 1: “Lead with positives.” The draft buried what worked under a long preamble about score breakdowns. I asked for the wins to come first. Claude rewrote it. The wins came first, but were now overstated. The framing tilted promotional.
Round 2: “Trim the extravagant claims.” I told Claude to dial back the marketing voice. Specific phrases like “Prompt X worked” and “a huge round of applause” needed to come out unless they were direct quotes. Claude trimmed them. The voice shifted from celebratory to declarative.
Round 3: “The proportions are wrong.” Even with positives leading, the document spent more space on the negative patterns than on what worked. The data supported the opposite balance, with the majority of participants rating the experience as excellent. I asked Claude to recalibrate so the visible weight matched the actual data. It worked, and forced me to articulate what “proportional weight” even means in a research write-up.
Round 4: “Use the right dimensions.” Claude had picked four behavioral strengths to anchor the “what worked” section: pacing, diagnostic depth, closure flow, warmth. They were all findings supported by the data, but they were not the dimensions that mattered to my stakeholders. The dimensions stakeholders had been asking about were speed to usefulness, trust, naturalness, empathy, humanness. I pointed Claude at the right five. It rebuilt the table.
Round 5: “Surface the comparative testers.” Two of the participants in this run had tested earlier prompt versions in prior rounds and explicitly called this version the best yet. That comparison was the strongest signal in the entire study, and Claude had buried it in a paragraph deep in the analysis. I asked for it to be elevated to the opening. The Summary now leads with that comparison.
Round 6: “PM rules pass at 80%+.” The PM rule coverage section was framed as if 5 PARTIAL line-items represented 5 separate failures. They didn’t. They cascaded from two incidents. I told Claude to surface the actual pass rates instead of just the PARTIAL label. The framing shifted from “5 things broken” to “2 incidents, 0 systemic failures.”
Round 7: “This isn’t the right model.” Halfway through, I realized Claude was building a prompt version specific report, but future studies would be feature tests, not prompt iterations. The PM rule coverage section was the wrong section for what came next. I pointed at a feature report instead. Claude pivoted the structure.
Round 8: “Convert the summary to bullets.” The final shape adjustments were settled: section ordering, the boxed callouts, the dimensions table. The Summary went from two paragraphs to bullets so it could be skimmed in 30 seconds.
By the end I had a document I would have spent two days writing manually. But the outcome that mattered was the corrections themselves. Each round had produced a piece of explicit guidance that any future report would need: lead with positives, calibrate proportions to the data, name the five dimensions, foreground comparative testers, frame PM rules by pass rate, model on a prior research report format for feature tests, use bullets for the summary.
Those corrections looked like the contents of a style guide.
The pivot, from one report to a pipeline
At that point I made a decision that turned out to be the most important one in the whole project. I stopped trying to perfect this one report and started trying to capture the workflow.
The pipeline for a RITE study was always the same shape. Build a scorecard from the testing platform export. Write a per-participant analysis from the transcripts. Score every call on the CX rubric from the voice API logs. Then synthesize all of it into a final report. Each step took 1 to 4 hours of mechanical work plus an hour of judgment. If I could turn each step into a Claude skill, with the judgment encoded as instructions rather than re-derived every time, I would have something that could outlive any single study and could be handed off to other researchers.
The decision to package the work as skills had a side effect I hadn’t expected. It forced me to be explicit about the parts of my judgment that I’d never had to articulate before. Writing the SKILL.md for the per-participant analysis meant writing down the difference between a CRITICAL flag and an observation, the difference between a quote worth featuring and one worth dropping, the conditions under which a single-instance issue gets elevated to a cross-participant pattern. None of that had been written down anywhere. It had lived in my head as something I would have called “good judgment.”
The decision to package the work as skills forced me to be explicit about the parts of my judgment I’d never had to articulate before.
Building the skills
Over the next morning we built four skills.
Skill 1: the RITE Scorecard Builder. Takes a test metrics export and produces the participant-facing scorecard with formulas, conditional formatting, and iteration score. Pure mechanical work, fully automated. The most-used skill in the pipeline and the one with the least judgment in it.
Skill 2: the Per-Participant Analysis. Reads transcripts and writes a per-participant narrative for each caller. This one had judgment baked into the SKILL.md: a writing guide for voice, a flag taxonomy that distinguishes CRITICAL from observational findings, and rules for handling returning participants and thin transcripts. The extraction script does the mechanical work, and the writing step (where Claude follows the guide) is where the judgment lives.
Skill 3: the CX Scorecard. Reads voice API log exports and scores every call against a 5-dimension rubric. This skill predated the others. I’d built it weeks earlier as a separate project. It slotted into the pipeline cleanly.
Skill 4, the Final Report. Reads the outputs of the prior three skills and produces a research report modeled on the format I developed and refined over 2 years. This is the most judgment-heavy of the four. The SKILL.md contains a writing guide, section templates, and structural notes on why the report format reads the way it does.
The whole build took about eight hours over two work sessions. Maybe 30% of the time was on the scripts themselves: reading file formats, extracting data, generating Excel and Word output. The other 70% was on the documentation: writing the SKILL.md files, the writing guides, the flag taxonomies. The documentation was the product.
The handoff
The pipeline is able to be used by people who are not me. A UX researcher with no engineering background can drop a unmoderated test result export and a Voice API log into a folder, open Claude, run four prompts, and produce a complete research report within two hours. I tested this end-to-end against the same prompt version study, and the skill-driven output reproduced the manual output cleanly: same scorecard, same per-participant analysis, same final report structure.
I wrote a separate instruction document for the researcher who will run this next. The technical content of it could be summarized in a paragraph. The rest is what to look for, how to verify each step, what to do when something goes wrong, and which judgment calls remain the researcher’s responsibility even after the skills do their work.
That last category, the judgment calls that remain the researcher’s responsibility, is the part I most want other researchers to think about, which is what the rest of this document covers.
Lessons I Learned
Lesson 1: The first draft is a probe
When Claude produces a first draft of a research deliverable, the draft is not an attempt at the final document. It’s a probe that surfaces where your judgment differs from the model’s defaults. Treating it as a probe rather than as a starting point changes how you respond to it.
The wrong response to a bad first draft is to fix it and move on. The right response is to ask: what made it wrong? What unspoken rule did I violate by writing the manual version differently? The corrections you make in that moment are the most valuable artifacts the conversation produces, not because the corrected draft is better, but because the corrections themselves are the style guide you didn’t know you had.
On the topline summary, I made eight rounds of corrections. Each one taught me something about how I write research deliverables that I had never had to articulate before. By the end, the corrections were more useful than the document.
Lesson 2: Judgment doesn’t transfer; structure does
Claude is fast at structure. It is bad at judgment. By “structure” I mean: building a 17-row dimension table, extracting names and phone numbers from transcripts via regex, applying a rubric mapping to a column of testing responses, generating consistent docx output with formatted quotes. By “judgment” I mean: deciding which participants’ quotes are most representative, deciding when a single-instance issue is a pattern, deciding which of three valid framings best serves the stakeholders’ actual question.
The first time I tried to get Claude to do the judgment work directly, I got generic answers. The pattern that worked was different. I described the judgment in concrete terms (with examples, with anti-examples, with edge cases), wrote it into the SKILL.md, and then let Claude follow the guide. The judgment didn’t get outsourced. It got externalized into a document that I could iterate on and reuse.
This is, I think, the more honest framing of “AI in research.” The AI doesn’t replace the researcher’s judgment. It gives the researcher a forcing function to write the judgment down.
The AI doesn’t replace the researcher’s judgment. It gives the researcher a forcing function to write the judgment down.
Lesson 3: Proportionality is harder than direction
The single biggest editorial battle on this project was about proportions. Not whether the findings were correct (they were), but whether the visible space given to each finding matched the strength of the evidence.
Surfacing negatives is easy. Surfacing positives is easy. Getting the ratio right is the hard part. A study where the majority of participants gave high satisfaction ratings should not read as a study about the minority who didn’t, but my first drafts kept landing there. That’s a calibration question, not a content question, and it took several rounds of explicit instruction to fix.
My current rule of thumb: if a draft reads like a list of problems, ask whether the data supports that framing. Usually it doesn’t. Usually the negatives are loud and the positives are quiet, and a faithful summary requires actively amplifying the quiet parts.
Lesson 4: The audience for a finding shapes the framing of it
Mid-project, I had a Slack exchange with a stakeholder asking us to spend a week improving the agent’s “responsiveness, naturalness, human-likeness.” I’d been measuring all of those things in every study, but the reports weren’t landing because my framing didn’t match his vocabulary.
The fix was to reframe the dimensions in the report using the language stakeholders were already using. Instead of my own internal categories (pacing, diagnostic depth, closure flow), I used the five-dimension framework: speed to usefulness, trust, naturalness, empathy, humanness. I showed which participant quotes anchored each one. The data didn’t change. The framing did. And the same data, framed in stakeholder-native vocabulary, became readable in a way it had not been before.
Claude was the right collaborator for this because it could rewrite the entire framing in seconds. The hard part was knowing what to ask for. Recognizing that the framing was the problem, not the data or the structure, required research training, not AI tooling.
Lesson 5: Iterate on the system, not the output
Most of the time I spend with Claude is now on the skills, not on individual deliverables. When a report needs a change (different section ordering, different tone in the verdicts, different inclusion criteria for flags), I update the SKILL.md once, and the next study benefits. When a deliverable has a one-off problem, I fix the deliverable. But I try to ask, every time: is this a one-off, or is this telling me something the skill should know?
The result is that the system gets better at every iteration. The first study took two days of editing. The fifth study, using the same skills, should take a couple of hours. The corrections from each round flow back into the skills as updates to the writing guide, the flag taxonomy, or the section templates. REMEMBER to update those skills!
This is the part I most want other folks to take from this account. The point is not to produce one great deliverable with AI. It is to build a system that produces consistent deliverables, and then to spend your judgment on improving the system.
Lesson 6: Some things still don’t transfer
Deciding whether a finding matters is still mine. The skills can surface every flag and observation in the data, but the choice of which finding to elevate to a TL;DR verdict, which to demote to an Opportunity, and which to drop entirely sits with me. The skills produce drafts. The drafts are starting points. The researcher is still the editor.
The skills don’t replace the conversation with stakeholders. The reframing I did mid-project, after the Slack exchange about “responsiveness,” required understanding what was being asked and translating it back through the data. A skill can apply that translation once it has been articulated. It cannot have the conversation that surfaced the need for the translation in the first place.
And no skill can tell me when a study design is wrong. Several times during this build, the right answer was “the question this study is trying to answer is the wrong question.” That kind of redirect comes from being in the work, talking to people, and noticing what the data isn’t telling you. A skill can score what was collected. It can’t tell you that you should have collected something else.
These limits aren’t problems with the tool. They’re the part of research work that stays with the researcher, and I don’t think I would want it any other way.
What I’d tell another researcher
If you’re considering using Claude or another model to handle parts of your research workflow, here’s the short version of what I’ve learned:
Start with the mechanical work. The places where you copy-paste, reformat, compute, or shuffle data are the places where the gain is immediate and the risk is low.
Treat the first draft of any analysis as a probe. The corrections you make are more valuable than the corrected draft.
Externalize your judgment. Every time you find yourself making the same correction twice, write it down. The writing is the system.
Build for the second study, not the first. The first study with new tooling is always slower; the payoff is in the fifth study.
Calibrate proportions explicitly. AI is good at finding both positives and negatives; it is bad at weighting them by evidence strength. That calibration is your job.
Speak in your stakeholders’ vocabulary. The same data, framed in the language the stakeholder uses, is a different document. The AI can make the translation; you have to know it’s needed.
Think about how you are going to break up your skills! Four skills covered the entire RITE pipeline and followed my natural flow as I had worked.
The collaboration I described here is closer to pair programming than to delegation. Claude can build any structure I describe and follow any rule I write down, but it cannot tell me what to build or which rules matter. That division of labor is what made this project work.
What this changes for me is concrete. I can automate the evaluative work. I can hand the process off to teammates so they can keep up the longitudinal regression tracking that prompt engineering requires. And I can spend my own time on the generative and strategic architectural research that gets us to the next phase on our roadmap.
The views expressed here are my own. Examples have been anonymized or generalized to respect confidentiality, and may represent a composite of multiple research insights. No specific company, product, or tool should be assumed unless explicitly stated.
This perspective comes from design iteration and from deliberate, ethically grounded research. Every participant whose feedback appears here gave informed consent. All identifying details were anonymized. The analysis was done with transparent frameworks. In every study, whether moderated interviews, feedback roundtables, or synthetic evaluations, human impact came first.
The technology underneath is not the point. Talking systems are built on human expectations, needs, and vulnerabilities. As researchers and designers, our responsibility is to treat that data with care and to advocate for systems that respect and empower the people who use them.



