January 2026

Thin wrapper or true AI? Technical due diligence for AI investments

Despite widespread AI claims in company pitch decks, 95% of generative AI pilots are failing. Closing the gap between marketing promises and reality requires rigorous technical due diligence that can distinguish genuine AI capabilities from superficial implementations.

Every company is an AI company now. At least, that’s what the pitch decks claim.

The reality is more sobering. According to MIT’s NANDA initiative, 95% of generative AI pilots at companies are failing. Yet valuations continue to assume these technologies will deliver. The gap between AI marketing and AI reality has never been wider—and investors are increasingly left holding the risk.

Traditional due diligence wasn’t built for this. Legal review can tell you about IP ownership. Financial review can validate revenue claims. But neither can answer the question that actually matters: Is the AI real?

In our experience assessing AI companies across multiple sectors, we’ve found that technical claims often fall into predictable patterns: some genuine, many less so. This framework is designed to help investors, board members, and M&A teams ask better questions before committing capital.

The AI Implementation Spectrum

Not all “AI companies” are created equal. Before assessing red flags, it helps to understand where a company sits on the implementation spectrum.

Tier 1: Thin Wrappers

At the lowest end are companies whose entire “proprietary AI” consists of API calls to third-party models—OpenAI, Anthropic, or similar—wrapped in a custom interface. The technical differentiation is essentially a React frontend and some prompt engineering.
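To make that concrete, a thin wrapper often amounts to little more than the sketch below (a hypothetical product feature we've invented, written against the OpenAI Python SDK): one product-specific prompt around someone else's model.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarise_contract(document_text: str) -> str:
    """Hypothetical feature: the entire 'proprietary AI' is one prompt plus one API call."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a contract analysis assistant."},
            {"role": "user", "content": f"Summarise the key obligations in this contract:\n\n{document_text}"},
        ],
    )
    return response.choices[0].message.content
```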

This isn’t inherently problematic. Valuable businesses can be built on third-party infrastructure. But these companies should be valued as application businesses, not AI businesses. The moat is in distribution, user experience, or domain expertise, not technology.

What to watch for: No ML engineers on staff. “Prompt engineering” described as core IP. Reluctance to discuss what happens if API pricing changes or access is revoked.

Tier 2: Augmented Applications

The middle tier comprises companies using third-party models enhanced with proprietary data, fine-tuning, or retrieval-augmented generation (RAG). There’s genuine technical work here, but the defensibility depends entirely on the data advantage.
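For illustration only, the core of a RAG pipeline can be sketched in a few lines. The corpus, the TF-IDF retriever, and the prompt below are stand-ins we've invented (production systems typically use learned embeddings and a vector store), but the shape is the same: retrieve from proprietary data, then hand the context to a third-party model.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical proprietary corpus: the claimed data advantage in a Tier 2 company.
corpus = [
    "Clause 14.2 caps liability at twelve months of fees...",
    "Termination for convenience requires ninety days' notice...",
    "Indemnities survive expiry of the agreement for six years...",
]

vectoriser = TfidfVectorizer().fit(corpus)
doc_vectors = vectoriser.transform(corpus)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus documents most similar to the query (TF-IDF as a stand-in for embeddings)."""
    scores = (vectoriser.transform([query]) @ doc_vectors.T).toarray()[0]
    return [corpus[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(query: str) -> str:
    """Retrieved context is injected into the prompt sent to a third-party model, as in the Tier 1 sketch."""
    context = "\n\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```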

The critical question: Is the proprietary data truly unique, or could a competitor license or generate equivalent training data within 18 months?

Tier 3: Proprietary AI

At the top end are companies with custom models trained on proprietary datasets, often with novel architectures or training approaches. These carry the highest potential defensibility, but also the highest execution risk.

The critical question: Does the team have the depth to maintain and improve these systems, or is the capability concentrated in one or two individuals who could leave?

Most AI companies we assess fall into Tier 1 or Tier 2. There's nothing wrong with that, but investors should price accordingly.

Five Technical Red Flags

These patterns appear repeatedly in AI companies that don’t survive technical scrutiny. None is automatically disqualifying, but each warrants deeper investigation.

1. The Buzzword Gap

Certain terms have become markers for technical imprecision. When we hear invented terminology that doesn’t map to standard ML concepts, or established terms like “reasoning” applied loosely to any multi-step process, it often signals that marketing has outpaced engineering.

This doesn’t mean the founders are being deliberately misleading. Often, they’re translating genuine technical work into language they believe investors want to hear. But the translation reveals a gap between what the technology actually does and how it’s being positioned.

What to ask: “Can you describe the model architecture without using analogies? What specific machine learning techniques are you using, and why those over alternatives?”

2. The Validation Vacuum

One of the clearest signals of technical maturity is how a company measures and reports model performance. Genuine AI teams obsess over metrics. They can tell you their precision, recall, F1 scores, or task-specific benchmarks without hesitation. They know where their models fail and have plans to address it.
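By way of a baseline, the sketch below shows the minimum you should expect to see demonstrated, using standard scikit-learn metrics on a made-up evaluation set rather than any real data.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Illustrative held-out evaluation set: ground-truth labels vs. model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Of everything the model flagged, how much was right?
print(f"precision: {precision_score(y_true, y_pred):.2f}")
# Of everything it should have flagged, how much did it find?
print(f"recall:    {recall_score(y_true, y_pred):.2f}")
# Harmonic mean of the two: a single headline number.
print(f"f1:        {f1_score(y_true, y_pred):.2f}")
```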

Companies with superficial AI implementations often struggle here. We’ve seen Series B companies claim to have “solved” entity resolution across unstructured datasets without any rigorous validation methodology. When pressed on accuracy, the answers become vague: “customers are happy” or “it works well in practice.”

What to ask: “Show me your evaluation framework. What benchmarks do you use? What’s your current performance, and what are the known failure modes?”

If the answer is unclear or deflects to anecdotal customer feedback, treat every accuracy claim with scepticism.

3. The Demo-to-Production Gap

Impressive demonstrations are easy. Production systems that work reliably at scale are hard.

We regularly encounter companies with compelling demos that have never processed real customer data at volume. The prototype works beautifully on curated examples; the production system struggles with edge cases, latency requirements, and the messy reality of real-world data.

Warning signs:

  • No production monitoring dashboards to show
  • Metrics reported only on test datasets, not live data
  • “We’re focused on R&D” after two or more years of operation
  • Customer count that hasn’t grown despite claimed product-market fit

What to ask: “Can you walk me through your production monitoring? What does your error rate look like over the past 90 days? How do you handle cases where the model fails?”
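A credible answer usually has something like the following behind it: every model call logged with an outcome, and the error rate computed over a trailing window. The records and status labels here are invented for illustration.

```python
import datetime as dt
from collections import Counter

# Hypothetical production log: one record per model call, written by the serving layer.
calls = [
    {"ts": dt.datetime(2026, 1, 5, 9, 14), "status": "ok"},
    {"ts": dt.datetime(2026, 1, 5, 9, 15), "status": "model_error"},
    {"ts": dt.datetime(2026, 1, 6, 11, 2), "status": "ok"},
    # ...thousands more in a real system...
]

def error_rate(records, now: dt.datetime, days: int = 90) -> float:
    """Share of calls in the trailing window that did not return a usable answer."""
    cutoff = now - dt.timedelta(days=days)
    recent = [r for r in records if r["ts"] >= cutoff]
    failures = Counter(r["status"] for r in recent)["model_error"]
    return failures / max(len(recent), 1)

print(f"90-day error rate: {error_rate(calls, now=dt.datetime(2026, 1, 31)):.1%}")
```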

4. The Disappearing Data Moat

“Proprietary data” is perhaps the most overclaimed moat in AI. Companies assert data advantages that don’t survive scrutiny.

Common patterns we’ve observed:

  • Licensable data: The “proprietary” dataset is actually available for purchase or licensing from data providers
  • Replicable data: The data could be generated by a well-funded competitor within 12–18 months
  • Stale data: The data advantage existed historically but the collection mechanism is no longer unique
  • User-generated data without lock-in: The data comes from users who could equally contribute to a competitor’s platform

True data moats are rare. They require data that is simultaneously valuable for model training, expensive or impossible to replicate, and continuously refreshed through mechanisms competitors can’t easily copy.

What to ask: “If a competitor raised £20m specifically to replicate your data advantage, how long would it take them? What would prevent them?”

5. The Key Person Illusion

AI systems are complex, and the knowledge required to maintain them often concentrates in a small number of individuals. We’ve assessed companies where the entire model architecture existed primarily in one engineer’s head, with minimal documentation and no realistic succession plan.

This creates acute risk: the departure of a single technical leader can leave a company unable to maintain, debug, or improve its core technology.

Warning signs:

  • Technical documentation described as “in progress” or “on the roadmap”
  • Model training processes that only one person has successfully executed
  • Inability to answer technical questions without deferring to a specific individual
  • Recent departure of founding technical team members

What to ask: “If your lead ML engineer left tomorrow, how long before someone else could retrain your models? Who else has successfully done it?”

What Good Looks Like

Red flags are useful, but investors also need to recognise genuine technical capability. Here’s what we look for in companies that survive rigorous technical due diligence.

Metrics Obsession

Strong AI teams measure everything. They have dashboards showing model performance over time, can segment accuracy by use case or customer type, and actively track where their systems fail. They’re often more eager to discuss their weaknesses than their strengths—because they’re genuinely working to fix them.

Honest Capability Boundaries

Mature technical teams are precise about what their technology can and cannot do. They’ll say “we handle X well, but Y is still a challenge” rather than claiming universal capability. This precision signals genuine understanding rather than marketing optimism.

Documented, Reproducible Systems

In well-run AI teams, model training is a documented process that multiple team members have executed. There’s version control for models, not just code. Experiments are logged. The system could survive the departure of any individual. Uncomfortably, perhaps, but it would survive.
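What this looks like in practice varies, but a minimal sketch using MLflow, one common experiment tracker (the parameters, paths, and metric values below are invented), gives the flavour: every training run records its configuration, data snapshot, and results, so any team member can reproduce or compare runs.

```python
import mlflow

# Each training run is logged with everything needed to reproduce it.
with mlflow.start_run(run_name="entity-matcher-v14"):
    mlflow.log_params({
        "base_model": "distilbert-base-uncased",
        "learning_rate": 3e-5,
        "training_data_snapshot": "s3://example-corpus/2026-01-10",  # hypothetical path
    })

    # ...training happens here...

    # Results live with the run, not in one engineer's notebook.
    mlflow.log_metrics({"precision": 0.91, "recall": 0.87, "f1": 0.89})  # illustrative values
```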

Scalability by Design

Companies with genuine technical depth have thought about scale from the beginning. They can articulate their cost structure at 10x current volume. They’ve made architectural decisions that support growth, not just current operations. They know what breaks next and have plans to address it.
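The underlying exercise can be as simple as the back-of-envelope sketch below. Every figure in it is an illustrative assumption, but a team with genuine depth can produce its own version on request and defend each line.

```python
# Back-of-envelope unit economics at current volume and at 10x.
# Every figure here is an illustrative assumption, not a benchmark.
requests_per_month = 2_000_000
tokens_per_request = 1_500
cost_per_1k_tokens = 0.002      # blended inference cost, USD
fixed_infra_per_month = 15_000  # serving, vector store, monitoring

def monthly_cost(scale: float) -> float:
    variable = requests_per_month * scale * (tokens_per_request / 1_000) * cost_per_1k_tokens
    return variable + fixed_infra_per_month

print(f"current volume: ${monthly_cost(1):,.0f} per month")
print(f"10x volume:     ${monthly_cost(10):,.0f} per month")
```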

Data Strategy, Not Just Data

Rather than claiming a static data moat, strong companies articulate an ongoing data strategy: how they continue to acquire valuable training data, how that data improves their models over time, and why their data acquisition mechanism is defensible.

Responsible AI Integration

We increasingly view responsible AI practices as a signal of technical maturity. Companies that have thought seriously about bias, fairness, and failure modes tend to have more robust systems overall. It’s not just ethics—it’s engineering discipline.

Building Technical Due Diligence Into Your Process

Traditional due diligence frameworks need adaptation for AI-native companies. Based on our experience, we recommend:

1. Technical review before deep financial analysis. There’s limited value in detailed revenue modelling if the core technology doesn’t survive scrutiny. A focused technical assessment early in the process can save significant time and cost.

2. Direct technical access. Demos and pitch decks are insufficient. Request documentation, architecture diagrams, and access to technical leadership. Reluctance to share technical details (even under NDA) is itself a finding.

3. Independent technical perspective. Internal technical teams often lack specific ML/AI expertise, and may be too polite to challenge founders directly. External technical due diligence provides both expertise and independence.

4. Questions designed to reveal depth. The questions throughout this article are designed to distinguish genuine capability from confident presentation. The goal isn't to catch founders lying; it's to understand what you're actually buying.

Conclusion

The AI investment landscape rewards those who can distinguish signal from noise. Every company claims differentiation; few can demonstrate it under technical scrutiny.

This isn't about being cynical toward AI. Genuine AI capabilities create enormous value, and that's precisely why rigorous assessment matters. The companies with real technology benefit from due diligence that separates them from the crowd. The investors who develop technical evaluation capabilities gain an edge in a market where most rely on demos and pitch decks.

The cost of getting AI assessment wrong is significant: overvalued acquisitions, technology that doesn’t scale, and moats that evaporate when foundation models improve. The cost of getting it right is a few weeks of focused technical review.

In a market full of AI claims, that seems like a reasonable trade.


Agathon provides independent technical due diligence for investors evaluating AI companies. Founded by Dr Colin Kelly (PhD in Natural Language Processing, Cambridge), we combine academic rigour with practical experience building and assessing AI systems across multiple sectors.

Ready to evaluate an AI investment?
Download our AI Due Diligence Checklist: 12 technical questions to ask before any AI investment.

Or book a confidential discussion about a specific opportunity.