Making an AI conduct user interviews (and why getting it to actually dig deep was a nightmare)

Hey guys,

We all know the drill: "Talk to your users." But when you're bootstrapping, blocking out 15 hours a week for Zoom calls across 5 time zones is brutal. On the flip side, sending out a Google Form feels like a massive cop-out. You get the "what" but completely miss the "why."

So I spent the last few weeks building an AI that conducts qualitative user interviews via chat. Not to pitch anything here, just wanted to share some of the technical rabbit holes I fell into. Because as it turns out, getting an LLM to actually act like a competent human researcher is much harder than just throwing an API key at it.

The "Polite Robot" problem

Initially, I just fed the LLM a list of questions and told it to interview the user. The result was completely useless. It acted like a polite surveyor. The user would say something incredibly vague like "I didn't like the UI," and the AI would literally reply, "Thanks for the feedback! Moving on to question 3..."

Real insights come from probing. To fix this, I had to stop treating the LLM like a chatbot and wrap it in a state machine. Now, the system evaluates every user response against the core research goal before deciding its next move. It scores whether the user gave a surface-level answer or a deep one, and forces a follow-up (e.g., "What specifically about the UI felt off?") until it hits a depth threshold.
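
For anyone who wants to picture that loop, here is a minimal sketch of what it might look like. Everything in it is an assumption made for illustration: the names (`InterviewState`, `next_move`), the 0.7 threshold, and the cap on follow-ups are all made up, and the toy scorer just counts words where a real system would ask the LLM to grade the answer against the research goal.

```python
from dataclasses import dataclass
from typing import Callable, List

DEPTH_THRESHOLD = 0.7  # assumed cutoff; a real study would tune this
MAX_FOLLOW_UPS = 3     # assumed cap so the user never gets stuck on one question


@dataclass
class InterviewState:
    questions: List[str]
    index: int = 0        # which scripted question we are on
    follow_ups: int = 0   # probes already asked for that question

    @property
    def done(self) -> bool:
        return self.index >= len(self.questions)


def next_move(
    state: InterviewState,
    answer: str,
    score_depth: Callable[[str, str], float],
    make_follow_up: Callable[[str, str], str],
) -> str:
    """Decide what the interviewer says next, given the user's latest answer."""
    question = state.questions[state.index]
    depth = score_depth(question, answer)

    # Shallow answer: stay on the same question and probe.
    if depth < DEPTH_THRESHOLD and state.follow_ups < MAX_FOLLOW_UPS:
        state.follow_ups += 1
        return make_follow_up(question, answer)

    # Deep enough (or out of probes): advance to the next scripted question.
    state.index += 1
    state.follow_ups = 0
    if state.done:
        return "Thanks, that's everything I wanted to ask."
    return state.questions[state.index]


def toy_score_depth(question: str, answer: str) -> float:
    # Hypothetical stand-in for an LLM judgment: longer answers count as deeper.
    return min(len(answer.split()) / 20, 1.0)


def toy_follow_up(question: str, answer: str) -> str:
    # Hypothetical stand-in for an LLM-generated probe.
    return "What specifically made you feel that way?"


state = InterviewState(questions=["What did you think of the UI?", "How do you export data?"])
print(next_move(state, "I didn't like the UI", toy_score_depth, toy_follow_up))
```

The point of keeping the decision in plain code is that the model never gets to choose to move on by itself; it only supplies the score and the wording of the probe.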

Keeping it on the rails

Users go off on tangents. A human researcher knows how to gently steer the conversation back. Left to its own devices, an LLM will happily spend 10 minutes discussing the user's dog.

Building dynamic guardrails was a headache. I had to tweak the prompt architecture heavily so the AI knows how to be empathetic ("Oh, sorry to hear your dog is sick") while immediately pivoting back to the actual goal ("...but regarding how you handled that data export last Tuesday...").
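
One way to picture the "acknowledge, then pivot" idea is a small function that decides which instruction gets injected into the prompt for the next turn. This is only a rough sketch under heavy assumptions: the function names are hypothetical, and the keyword-overlap check is a toy stand-in for whatever relevance check a real system would use (most likely another LLM call).

```python
import re
from typing import Set


def _words(text: str) -> Set[str]:
    # Lowercased word set, used for a crude overlap check.
    return set(re.findall(r"[a-z']+", text.lower()))


def is_on_topic(reply: str, goal_keywords: Set[str]) -> bool:
    # Toy heuristic: on topic if the reply mentions any goal keyword.
    return bool(_words(reply) & goal_keywords)


def steering_instruction(reply: str, current_question: str, goal_keywords: Set[str]) -> str:
    """Build the per-turn instruction that is added to the interviewer prompt."""
    if is_on_topic(reply, goal_keywords):
        return "Continue the interview. Ask about: " + current_question
    return (
        "The user has drifted off topic. Acknowledge what they said in one short, "
        "empathetic sentence, then return to this question in the same message: "
        + current_question
    )


keywords = {"export", "data", "csv"}
question = "How did you handle that data export last Tuesday?"
print(steering_instruction("My dog has been sick all week", question, keywords))
```

Generating the instruction per turn, instead of relying on one static system prompt, means the redirect only shows up when the user has actually wandered.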

Context window bloat

Once an interview goes on for a while, feeding the entire raw transcript back into the LLM starts confusing it. It loses track of the immediate conversational thread and starts bringing up things from 10 minutes ago out of context.

My workaround was implementing a rolling summary pipeline. It compresses older parts of the conversation into a dense "what we've learned so far" block, keeping only the last few exchanges as raw text. It drastically reduced hallucinations and kept the AI hyper-focused.
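
A rolling summary like that could be sketched roughly as follows. The class name, the `keep_last` default, and the toy summarizer are assumptions for illustration; in practice `summarize` would be an LLM call that folds the evicted turns into the existing summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)


@dataclass
class RollingContext:
    # summarize(previous_summary, evicted_turns) -> new_summary
    summarize: Callable[[str, List[Turn]], str]
    keep_last: int = 4  # assumed number of raw turns to keep verbatim
    summary: str = ""
    recent: List[Turn] = field(default_factory=list)

    def add(self, speaker: str, text: str) -> None:
        self.recent.append((speaker, text))
        overflow = len(self.recent) - self.keep_last
        if overflow > 0:
            # Evict the oldest turns and fold them into the running summary.
            old, self.recent = self.recent[:overflow], self.recent[overflow:]
            self.summary = self.summarize(self.summary, old)

    def render(self) -> str:
        """Text to put in the prompt: the summary block plus the last raw turns."""
        header = f"What we've learned so far:\n{self.summary}\n\n" if self.summary else ""
        lines = "\n".join(f"{speaker}: {text}" for speaker, text in self.recent)
        return header + "Recent exchanges:\n" + lines


def toy_summarize(previous: str, old: List[Turn]) -> str:
    # Hypothetical stand-in for an LLM summarizer: just joins the evicted text.
    return (previous + " " + " | ".join(text for _, text in old)).strip()


ctx = RollingContext(summarize=toy_summarize, keep_last=2)
for speaker, text in [("ai", "What felt off?"), ("user", "The export."), ("ai", "Which part?")]:
    ctx.add(speaker, text)
print(ctx.render())
```

The prompt size stays roughly constant no matter how long the interview runs, since only `keep_last` raw turns are ever included.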

It’s been a crazy challenging build. The baseline tech is finally good enough to handle the nuance of qualitative research, but I realized the UX layer and the architecture around the LLM are where 90% of the actual work lives.

Curious if anyone else here has tried automating their qualitative research? Did you hit the same walls?