Lead Qualification with Claude: What We Tested Before Recommending It

Every agency demo shows you the happy path. The AI reads the submission, scores the lead perfectly, pings the right person. We wanted to see what happened on the messy ones before we shipped this to clients. Here's what six weeks of in-house testing actually looked like.

If you've looked at AI lead qualification tools, you've probably been shown a polished demo: a clear, well-written inquiry lands in the inbox, the model assigns a score of 8 out of 10, and the calendar invite fires automatically. That's not a lie. It's just that only about 30% of your actual submissions look like that.

Before we built this for anyone else, we built it for ourselves — and we tested it against the full picture.

Why we ran this in-house first

We were spending 3–4 hours a week on manual lead qualification. The intake form asked the right questions. The problem was translating the answers into a clear call — "worth a 30-minute discovery call" vs. "send the pricing guide and follow up in 90 days" — fast enough that good leads didn't wait two days for a response.

The goal wasn't to automate the decision entirely. It was to handle the clear cases so the genuinely ambiguous ones could get real attention. Anyone who tells you AI handles the ambiguous 20% well right now is selling something.

We pulled roughly 200 anonymized form submissions from the past year, each already tagged with the actual outcome — booked call, sent to nurture, or not a fit. That became the test set. Three models went in. One came out as our recommendation.

What the qualification rubric looks like

Before any model touched a single submission, we defined the criteria. Seven binary questions — all yes or no:

  1. Is the stated budget within range for our minimum engagement?
  2. Is the project scope specific enough to actually scope?
  3. Is a decision-maker filling out the form, or is someone "gathering info for a boss"?
  4. Does the stated timeline fit our current availability?
  5. Is the vertical one we have direct experience in?
  6. Is the request within our service footprint — AI, web, or video?
  7. Does anything in the message suggest a commodity-only buyer?

The model's job was to answer each question and return a score from 0 to 7, plus a flag for anything it wasn't confident about. It wasn't writing a recommendation paragraph. It was filling out a rubric.
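To make "filling out a rubric" concrete, here's a minimal Python sketch of what one scored submission could look like. The criterion keys, the `RubricResult` class, and the use of `None` for a low-confidence answer are names made up for this example, not our production schema. It also assumes criterion 7 is stored inverted, so that `True` always means "good sign."

```python
from dataclasses import dataclass
from typing import Optional

# Short keys for the seven criteria above (illustrative names only).
CRITERIA = (
    "budget_in_range",
    "scope_specific",
    "decision_maker",
    "timeline_fits",
    "vertical_experience",
    "in_service_footprint",
    "not_commodity_buyer",  # criterion 7, inverted so True is always favorable
)


@dataclass
class RubricResult:
    # True = favorable, False = unfavorable, None = model was not confident.
    answers: dict[str, Optional[bool]]

    @property
    def score(self) -> int:
        """Count of favorable answers (0 to 7). Unsure answers add nothing."""
        return sum(1 for key in CRITERIA if self.answers.get(key) is True)

    @property
    def flagged(self) -> list[str]:
        """Criteria answered with None (or missing), i.e. needing a human look."""
        return [key for key in CRITERIA if self.answers.get(key) is None]


# Example: a lead with vague budget language and an unfamiliar vertical.
result = RubricResult({
    "budget_in_range": None,
    "scope_specific": True,
    "decision_maker": True,
    "timeline_fits": True,
    "vertical_experience": False,
    "in_service_footprint": True,
    "not_commodity_buyer": True,
})
print(result.score, result.flagged)
```

Keeping "unsure" as its own value, rather than forcing a yes or no, is what lets the score and the flag travel together.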

Getting the rubric right before you open a single API call is the most important step in this whole process. The model can only score what you've defined.

Three models, one test set

We tested GPT-4 Turbo, Llama 3, and Claude Haiku.

On the clean submissions — clear budget, clear ask, obvious fit or obvious miss — all three performed identically. They agreed on the score for about 73% of the test set. That's the baseline we expected, and it's not particularly interesting.

The other 27% is where it got worth writing about.
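If you want to find your own version of that split, the arithmetic is simple. Below is a rough Python sketch that takes each model's scores for the same submissions, in the same order, and reports how often all of them agree and which submissions they split on. The function names and the toy scores are invented for illustration; they are not our test data.

```python
def agreement_rate(scores_by_model: dict[str, list[int]]) -> float:
    """Fraction of submissions where every model returned the same score."""
    # zip() pairs up scores by position; it stops at the shortest list.
    columns = list(zip(*scores_by_model.values()))
    if not columns:
        return 0.0
    agreed = sum(1 for scores in columns if len(set(scores)) == 1)
    return agreed / len(columns)


def disagreement_indices(scores_by_model: dict[str, list[int]]) -> list[int]:
    """Positions of the submissions where at least one model differs."""
    columns = zip(*scores_by_model.values())
    return [i for i, scores in enumerate(columns) if len(set(scores)) > 1]


# Toy scores for four submissions (made up for this example).
toy_scores = {
    "model_a": [6, 2, 4, 7],
    "model_b": [6, 2, 5, 7],
    "model_c": [6, 2, 4, 7],
}
print(agreement_rate(toy_scores))        # 0.75
print(disagreement_indices(toy_scores))  # [2]
```

The disagreement list is the useful part: those are the submissions worth reading by hand.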

The model that handles the clean prompts perfectly usually falls apart on the ambiguous ones.

What actually changed our mind about Claude

Two patterns broke the other models.

The first was ambiguous budget language. Submissions like "we have some budget but it depends on what the project actually needs" are common — probably a third of our real inquiries phrase it that way. GPT-4 Turbo scored criterion 1 as a yes roughly half the time and a no the other half, with no consistent logic we could identify. Llama 3 just picked one and committed. Claude Haiku flagged it as low-confidence and kicked it to human review — which is exactly the right call. That's not a disqualification; that's "I need a human to assess this one."

The second was indirect intent signals. A handful of submissions came from people who said they were "just researching" but whose questions were suspiciously specific: a detailed scope description, questions about timelines, a mention of a competitor they'd already gotten a quote from. GPT-4 Turbo and Llama 3 took the stated framing at face value and scored them low. Claude picked up on the behavioral signals embedded in the language and scored them higher, which matched what we already knew about how those leads had actually converted.

Neither of these is magic. Claude's training appears better calibrated for reading business intent in informal, imprecise language — which is most of what a real form submission looks like. People don't write discovery form answers the way they'd write a job spec. They're vague, they hedge, they ask questions by implication.

The production setup

We run Claude Haiku on intake form submissions via n8n. The prompt includes the 7 criteria, a one-paragraph description of our ideal client, and an explicit instruction to flag low-confidence answers rather than guess. Output lands in our CRM as a score tag plus a one-line summary of the model's reasoning.
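For a sense of what that prompt and its output could look like, here's a hedged Python sketch. It only builds the prompt text and parses a reply; the actual API call and the n8n wiring are left out. The prompt wording, the yes/no/unsure vocabulary, the JSON reply shape, and the names `build_prompt` and `parse_reply` are all assumptions made for this example, not a copy of our production prompt.

```python
import json

# The seven rubric questions, in order.
CRITERIA_QUESTIONS = [
    "Is the stated budget within range for our minimum engagement?",
    "Is the project scope specific enough to actually scope?",
    "Is a decision-maker filling out the form?",
    "Does the stated timeline fit our current availability?",
    "Is the vertical one we have direct experience in?",
    "Is the request within our service footprint (AI, web, or video)?",
    "Does anything in the message suggest a commodity-only buyer?",
]
ALLOWED = {"yes", "no", "unsure"}


def build_prompt(ideal_client: str, submission: str) -> str:
    """Assemble criteria, ideal-client description, and the flag instruction."""
    numbered = "\n".join(
        f"{i}. {q}" for i, q in enumerate(CRITERIA_QUESTIONS, start=1)
    )
    return (
        "You are scoring an inbound lead against a fixed rubric.\n\n"
        f"Ideal client:\n{ideal_client}\n\n"
        f"Criteria:\n{numbered}\n\n"
        'Answer each criterion with "yes", "no", or "unsure". '
        'Use "unsure" rather than guessing.\n'
        'Reply with JSON only: {"answers": [7 values], "summary": "one line"}\n\n'
        f"Submission:\n{submission}"
    )


def parse_reply(reply_text: str) -> dict:
    """Validate the model's JSON reply and list the low-confidence criteria."""
    data = json.loads(reply_text)
    answers = data["answers"]
    if len(answers) != len(CRITERIA_QUESTIONS):
        raise ValueError("expected one answer per criterion")
    if any(a not in ALLOWED for a in answers):
        raise ValueError("answers must be yes, no, or unsure")
    return {
        "answers": answers,
        "summary": str(data.get("summary", "")).strip(),
        # 1-based criterion numbers the model was not confident about.
        "low_confidence": [i for i, a in enumerate(answers, start=1) if a == "unsure"],
    }
```

Validating the reply before it reaches the CRM matters here: a malformed answer should fail loudly instead of turning into a quiet zero.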

Anything that scores 5 or higher automatically triggers a booking link. Scores of 3 or 4 go into my review queue for a 2-minute human look. Anything below 3 triggers a nurture sequence and a 90-day follow-up.
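Those thresholds fit in a few lines. Here's a minimal Python sketch of the routing step. The function name and the route labels are hypothetical, and one detail is an assumption on my part for this example: any low-confidence flag sends the lead to human review regardless of its score.

```python
from typing import Sequence


def route_lead(score: int, low_confidence: Sequence[int] = ()) -> str:
    """Map a rubric score (0 to 7) to the next step for the lead."""
    if not 0 <= score <= 7:
        raise ValueError("score must be between 0 and 7")
    # Assumption: a flagged criterion always wins over the numeric score.
    if low_confidence:
        return "human_review"
    if score >= 5:
        return "booking_link"   # automatic booking link
    if score >= 3:
        return "human_review"   # 2-minute human look
    return "nurture"            # nurture sequence plus 90-day follow-up


print(route_lead(6))       # booking_link
print(route_lead(4))       # human_review
print(route_lead(2))       # nurture
print(route_lead(6, [1]))  # human_review
```

Checking the flag first is deliberate. A 6 with an unresolved budget question is not really a 6 yet.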

For the API cost math on running Claude Haiku at this kind of volume, I covered the per-token arithmetic in detail in "what an AI agent actually costs to run." At our current volume it runs under $8 a month, meaningfully cheaper than a single hour of manual review time.

A lead qualifier that escalates to a human is more valuable than one that confidently scores everything.

What to build before you build anything

The rubric matters more than the model. If you can't write down 5–7 specific criteria for a good lead — criteria with yes/no answers — no AI is going to save you. Get that right first. This is usually a 45-minute conversation, not a product purchase.

Test on real submissions, not demo data. Demo data is too clean. Your actual form submissions have typos, passive-aggressive phrasing, and people who respond to "what's your budget" with "what do you charge." Build your test set from your real history.

Build in the escalation path before launch. We almost shipped without a low-confidence flag. The model scoring everything with false confidence would have been worse than not having the system at all. The flag is what makes this trustworthy enough to run automatically.

If you want to set up something like this — or want an honest opinion on whether your current qualification process is even agent-shaped versus a well-configured Zapier filter — start a conversation. We'll tell you when the simpler tool is the right answer. And if you're wondering whether this kind of work fits under our AI agent services, it does.

— Cole