How to Vet AI Vendors and Consultants (Technical Due Diligence)
The AI consulting market is long on confident claims and short on operational proof.
If you're a founder, procurement lead, or tech leader, the cost of hiring the wrong vendor isn't just money. It's the lost quarter, the security rework, and the internal trust you burn when the “AI initiative” turns into a fragile demo.
So vetting AI consultants comes down to one question: can they show evidence of shipping and operating AI systems with clear boundaries and measurable outcomes?
This post gives you a technical due diligence process you can run in days, not weeks.
What you'll learn
- The 5 buckets of due diligence that matter for AI work
- The questions that separate “demo builders” from delivery teams
- The artifacts you should request (and why they matter)
- A copy/paste checklist you can use in vendor selection and interviews
TL;DR
To vet AI vendors and consultants, evaluate delivery evidence, not buzzwords. Ask for a data boundary plan, an evaluation approach (golden set + thresholds), an operational ownership model (runbooks, logging, incident handling), and clear scope boundaries in the SOW. Request artifacts, run a short paid pilot, and watch for red flags like “no evaluation needed” or “we'll figure security out later.”
The 5 buckets (a simple due diligence framework)
Use these buckets to structure your evaluation:
- Outcome and workflow clarity (what are we changing, and how do we measure it?)
- Data boundary and permissions (what data is allowed, and who can see what?)
- Evaluation and quality (how do we detect regressions and “silent failures”?)
- Operational reality (logging, monitoring, runbooks, incident response)
- Delivery governance (scope control, handoff, and who owns maintenance)
If a vendor is weak in any one bucket, the project will get expensive later.
Questions to ask (and what good answers sound like)
You do not need to be an ML researcher to ask high-signal questions.
Ask these and listen for specificity:
- “What is your evaluation plan?” Good answer: golden set, rubric, thresholds, regression checks (a minimal sketch follows this list). Bad answer: “we'll know when it's good.”
- “How do you handle data permissions in retrieval?” Good answer: permissioning is part of retrieval itself (also sketched below). Bad answer: “we'll hide it in the UI.”
- “What does your runbook look like?” Good answer: incident steps, rollback, owner. Bad answer: “we haven't needed one.”
- “How do you control scope?” Good answer: backlog + change triggers. Bad answer: “just message us anytime.”
- “What happens after launch?” Good answer: maintenance plan, SLA, ownership. Bad answer: “handoff to your team” with no detail.
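To make the evaluation question concrete, here is a minimal sketch of what “golden set + thresholds” can look like. Everything in it is an illustrative assumption: the dataset, the crude keyword rubric, the threshold value, and the `generate` callable standing in for the system under test. A real vendor's evaluation pack will be richer, but it should contain these same parts.

```python
# Minimal sketch of a golden-set evaluation gate (illustrative only).
# The golden set, rubric, and threshold below are invented assumptions,
# not any specific vendor's method.

GOLDEN_SET = [
    {"input": "Summarize the refund policy", "must_include": ["30 days", "receipt"]},
    {"input": "What is our SLA for P1 incidents?", "must_include": ["1 hour"]},
]

PASS_THRESHOLD = 0.90  # regression gate: block release below this score

def score(output: str, must_include: list[str]) -> float:
    """Crude rubric: fraction of required facts present in the output."""
    hits = sum(1 for fact in must_include if fact.lower() in output.lower())
    return hits / len(must_include)

def run_regression(generate) -> bool:
    """Run every golden-set case; fail if the average score dips below threshold."""
    scores = [score(generate(case["input"]), case["must_include"])
              for case in GOLDEN_SET]
    avg = sum(scores) / len(scores)
    print(f"golden-set average: {avg:.2f} (threshold {PASS_THRESHOLD})")
    return avg >= PASS_THRESHOLD
```

The point of the gate is the regression check: the same set runs on every change, so quality drift shows up as a failing number instead of a surprised user.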
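And here is a minimal sketch of what “permissioning is part of retrieval” means, as opposed to hiding results in the UI. The document shape, ACL model, and toy relevance ranking are all assumptions for illustration:

```python
# Sketch: permission filtering happens inside retrieval, not in the UI.
# Document shape and ACL model are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    allowed_groups: set[str]  # ACL attached to the document itself

@dataclass
class User:
    groups: set[str]

def retrieve(query: str, docs: list[Doc], user: User, k: int = 3) -> list[Doc]:
    # 1) Hard permission filter BEFORE any ranking or generation.
    visible = [d for d in docs if d.allowed_groups & user.groups]
    # 2) Rank only what the user is allowed to see (toy keyword relevance).
    ranked = sorted(
        visible,
        key=lambda d: -sum(w in d.text.lower() for w in query.lower().split()),
    )
    return ranked[:k]
```

Filtering before ranking matters because anything that reaches the ranking step can reach the model's context, and anything in the context can leak into an answer. A UI-level filter does not stop that.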
Red flags (treat these as a “no”)
You don't need to be cynical, but you should be alert. These are common warning signs:
- “Evaluation isn't necessary.” It always is.
- “Security can be handled later.” It will be handled later, by forcing rework.
- “We need production data to start.” You can start with a sanitized subset and a boundary.
- “Unlimited scope” language with no intake path or change control.
- A refusal to describe how data is stored, logged, and retained.
One red flag doesn't always mean “walk away,” but it should trigger deeper questions and a smaller pilot.
Artifacts to request (proof beats slides)
If the vendor is real, they have templates and artifacts because they've done this before.
Ask for:
- an example evaluation report (even anonymized)
- an architecture diagram (data flow + boundaries)
- a sample risk register
- a handoff checklist or runbook outline
- a sample SOW with scope boundaries (what is excluded matters)
If they can't share anything (even sanitized), you're buying faith.
Turn answers into a vendor score (so you can compare fairly)
Due diligence breaks down when the loudest vendor wins the room. A simple scoring sheet keeps you honest.
Use a 0/1/2 system:
- 2: specific, shows artifacts, owns tradeoffs
- 1: reasonable answer, but mostly verbal, light on proof
- 0: vague, dismissive, or pushes risk onto you
Example scoring categories:
| Category | 0 looks like | 2 looks like |
|---|---|---|
| Workflow clarity | “We’ll figure it out” | Clear scope + acceptance criteria examples |
| Data boundary | “We just need access” | Written allowed/prohibited data + permission model |
| Evaluation | “We’ll test manually” | Golden set + rubric + thresholds + regression plan |
| Security | “Handled later” | Threat model basics + logging/retention rules |
| Operations | “We’ll hand it off” | Runbook + monitoring + rollback mindset |
| Scope control | “Unlimited changes” | Intake + change triggers + decision checkpoints |
This also makes internal alignment easier: you can show stakeholders why a vendor scored well (or didn’t) without making it personal.
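If it helps to see the tally, here is a sketch of turning the 0/1/2 answers into a comparable total. The category names mirror the table above; the two example vendors and their scores are invented for illustration:

```python
# Sketch: tallying 0/1/2 answers into a comparable vendor score.
# Example vendors and scores are invented for illustration.

CATEGORIES = ["workflow", "data_boundary", "evaluation",
              "security", "operations", "scope_control"]

def total(scores: dict[str, int]) -> int:
    assert all(0 <= scores[c] <= 2 for c in CATEGORIES)
    return sum(scores[c] for c in CATEGORIES)

vendor_a = {"workflow": 2, "data_boundary": 2, "evaluation": 2,
            "security": 1, "operations": 2, "scope_control": 1}
vendor_b = {"workflow": 2, "data_boundary": 0, "evaluation": 1,
            "security": 0, "operations": 1, "scope_control": 2}

print("Vendor A:", total(vendor_a), "/ 12")  # 10 / 12
print("Vendor B:", total(vendor_b), "/ 12")  # 6 / 12
```

A zero in data boundary or security deserves more weight than the arithmetic gives it; per the red flags above, treat it as a trigger for deeper questions or a smaller pilot, not just a lower total.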
The fastest safe test: a paid pilot sprint
For high-uncertainty AI work, a short paid pilot is usually better than weeks of unpaid pre-sales.
The pilot should have:
- a narrow workflow,
- a defined data boundary,
- an evaluation baseline,
- and an end-of-sprint decision (scale, iterate, stop).
What to ask for in a pilot (deliverables that prove competence)
A pilot is not “build a demo.” A pilot is “prove you can ship safely.”
Deliverables that signal a real delivery team:
- Workflow brief: inputs, outputs, owners, and definition of done
- Evaluation pack: a small dataset, rubric, and baseline results
- Architecture notes: data flow, boundaries, where logs live, rollback plan
- Change process: what counts as a material change and who approves it
- End-of-sprint decision memo: what to scale, what to stop, what the risks are
If a vendor can’t produce these in a narrow pilot, they won’t magically produce them in a larger engagement.
Reference checks: ask for the failure story
Vendors love success stories. You want the story where something went wrong.
If the vendor can provide references, ask questions like:
- “Tell me about a time quality regressed after launch. How did they catch it, and what did they change?”
- “Did they respect data boundaries, or did security have to intervene later?”
- “How did they handle scope creep when stakeholders asked for ‘just one more’ feature?”
- “Were they easy to work with when requirements changed?”
Good vendors answer with specifics: what broke, what they measured, what the runbook said, and what they changed to prevent recurrence.
Make responsibilities explicit (so risk doesn’t get silently transferred)
Due diligence isn’t just technical. It’s also “who owns what.”
Before you sign anything, write down:
- who owns the evaluation set and thresholds
- who owns incident response (and when)
- who owns security review of logging/retention and vendor usage
- who owns maintenance after handoff
If every answer is “the vendor,” you’re buying dependency. If every answer is “your team,” you’re buying a demo. The right answer is usually shared, but explicit.
Copy/paste: technical due diligence checklist
Use this in interviews, RFPs, or vendor scoring.
Due diligence checklist (AI vendor/consultant)
Outcomes
- Target workflow defined:
- Success metric defined:
Data boundary
- Allowed/prohibited data written:
- Permissions model described:
- Logging and retention defined:
Quality
- Golden set and scoring plan:
- Regression testing plan:
- Rollback plan:
Operations
- Monitoring (quality/latency/cost):
- Runbooks + incident response:
- Maintenance ownership/SLA:
Delivery
- Scope boundaries + change control:
- Handoff artifacts:
- Clear timeline and dependencies:
Insist on operational reality
Vetting AI consultants is mostly about insisting on operational reality: data boundaries, evaluation, ownership, and clear scope. If a vendor can show artifacts and speak concretely about tradeoffs, you'll feel it. If they can't, you will pay to discover it later. Need a second opinion on a vendor? Let's talk.
Thinking about AI for your team?
We help companies move from prototype to production — with architecture that lasts and costs that make sense.