I need to be blunt about something. Most of what passes for "AI product strategy" advice is written by people who have never shipped an AI product.
I don't mean that as a throwaway provocation. I mean it literally. The strategy decks, the thought leadership pieces, the conference talks: the vast majority are produced by consultants who advise on AI from a comfortable distance, by analysts who study it from the outside, or by founders who shipped one product and are now generalising from a sample size of one. The advice sounds reasonable. It's often directionally correct. And it falls apart the moment it meets the reality of building and shipping AI products at a level that actually matters.
I've spent fifteen years building AI systems commercially. Not advising on them from the outside but building them, shipping them, maintaining them, watching them fail and figuring out why. Financial services, telecoms, automotive, government. From early NLP systems that would look primitive by today's standards through to current-generation compound AI architectures. Every one of those engagements taught me something that contradicted the conventional wisdom of its era.
This piece is the distillation of those lessons. Not research findings. Not best practices extracted from case studies. What I've personally seen go wrong, what actually works, and why the gap between AI strategy and AI reality is wider than anyone wants to admit.
The Gap Between AI Strategy and Reality
Let me start with the three things that surprise almost every first-time AI product builder. They are so consistent that I now flag them in the first week of any engagement.
Surprise 1: Your First Architecture Will Be Wrong
Not "suboptimal." Wrong.
Every AI product I've seen ship has undergone at least one fundamental architectural revision within six months of launch. Not a tweak. A rethinking of how the core components fit together.
This happens because AI products reveal their requirements in production, not in planning. You can design the most elegant architecture on a whiteboard, informed by every best practice document ever written, and production traffic will invalidate your assumptions within weeks.
The user queries you anticipated account for maybe 40% of what people actually ask. The retrieval strategy that worked on your curated test set breaks on real-world data with its inconsistencies, ambiguities, and sheer volume. The latency budget you allocated turns out to be incompatible with the quality users expect.
The conventional advice -- "plan carefully, get the architecture right, then build" -- is backwards for AI products. What I've learned is that you should plan to be wrong. Design for replaceability. Build your system so that swapping out the retrieval layer, changing the model, or restructuring the orchestration flow is a measured effort, not a rewrite.
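To make "design for replaceability" concrete, here is a minimal sketch in Python, with hypothetical names: the orchestration code depends only on narrow interfaces, so the retrieval layer or the model can be swapped without rewriting anything else.

```python
from typing import Protocol


class Retriever(Protocol):
    """Any retrieval backend: vector search, BM25, hybrid, ..."""
    def retrieve(self, query: str, k: int) -> list[str]: ...


class Generator(Protocol):
    """Any model backend: hosted API, local model, ..."""
    def generate(self, prompt: str) -> str: ...


class Pipeline:
    """Orchestration depends only on the interfaces above, so either
    component can be replaced as a measured effort, not a rewrite."""

    def __init__(self, retriever: Retriever, generator: Generator):
        self.retriever = retriever
        self.generator = generator

    def answer(self, query: str) -> str:
        docs = self.retriever.retrieve(query, k=5)
        context = "\n".join(docs)
        return self.generator.generate(
            f"Context:\n{context}\n\nQuestion: {query}"
        )
```

Nothing here is clever; the point is that the seam exists before production traffic forces you to use it.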
The teams that suffer most are the ones who invested heavily in a "perfect" architecture before they had production data. They're emotionally and financially committed to decisions made with inadequate information, and the sunk cost makes them resistant to the changes that production reality demands.
Surprise 2: Evaluation Is Your Actual Product
I can almost hear the objection: "No, the user-facing capability is the product." Technically true. Practically misleading.
Here's what I mean. In traditional software, the product works or it doesn't. A button either submits the form or it doesn't. An API either returns the correct data or it throws an error. Testing is important, but the gap between "works in testing" and "works in production" is manageable.
In AI products, the output is probabilistic. The same input can produce different outputs. "Correct" is often subjective and context-dependent. The system can fail in ways that look like success -- a confident, fluent, completely wrong answer is far more dangerous than a crash.
This means your ability to evaluate your system's output is, in a very real sense, your ability to build the product at all. Without robust evaluation, you cannot:
- Know whether a change improved the system or degraded it
- Identify failure modes before users discover them
- Prioritise what to work on next
- Make any claim about quality with confidence
I've worked with teams that spent months building features and days building evaluation. Every single one of them regretted it. The features they built might have been brilliant -- they had no way to know, because they couldn't measure the impact.
The teams that ship great AI products invest disproportionately in evaluation infrastructure. They build it first, not as an afterthought. They treat it as a core product capability, not a testing concern.
What does good evaluation infrastructure look like in practice?
- Layered evaluation: Unit tests for individual components (does the retrieval return relevant documents?), integration tests for the full pipeline (does the end-to-end response answer the question?), and behavioural tests for emergent properties (does the system handle adversarial inputs gracefully?).
- Human-in-the-loop calibration: Automated metrics are necessary but insufficient. You need regular human evaluation to calibrate your automated metrics against actual quality. This doesn't need to be expensive -- even a few hours per week of structured human review dramatically improves your understanding of system performance.
- Regression detection: Every change to the system -- model update, prompt modification, retrieval parameter adjustment -- should be evaluated against a held-out test set before reaching production. Not a full test suite for every change, but a calibrated sample that catches regressions in critical capabilities.
- Business-connected metrics: Ultimately, your evaluation needs to connect to business outcomes. Response quality that users don't notice or don't value is irrelevant, regardless of how well it scores on benchmarks. The best evaluation frameworks I've seen track the chain from technical metrics through user behaviour to commercial outcomes.
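As one concrete illustration of the regression-detection layer, here is a minimal sketch (hypothetical names and tolerance) of a gate that compares a candidate change against the current baseline on the same held-out test set:

```python
import statistics


def regression_gate(baseline_scores, candidate_scores, tolerance=0.02):
    """Block a change if mean quality drops by more than `tolerance`.

    Scores are per-example quality judgments in [0, 1] on the same
    held-out test set, produced by whatever grader you trust
    (exact match, an LLM judge, periodic human labels).
    """
    base = statistics.mean(baseline_scores)
    cand = statistics.mean(candidate_scores)
    return {
        "baseline": round(base, 4),
        "candidate": round(cand, 4),
        "delta": round(cand - base, 4),
        "pass": cand >= base - tolerance,
    }
```

In practice you would run this per capability category, not just on the aggregate, so a regression in a critical capability can't hide behind improvements elsewhere.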
Surprise 3: The Feedback Problem Is Harder Than the Model Problem
Every AI strategy deck includes a slide about "continuous improvement" and "learning from data." None of them adequately describe how difficult this is in practice.
The challenge is this: to improve an AI system systematically, you need signal about what's working and what isn't. In theory, user interactions provide this signal. In practice, extracting useful signal from user behaviour is one of the hardest problems in AI product development.
Consider: a user asks your AI product a question, receives a response, and moves on without providing explicit feedback. Was the response good? You don't know. They might have been satisfied. They might have been dissatisfied but too busy to complain. They might have reformulated their question elsewhere. They might have used the response despite it being wrong, because they couldn't evaluate its accuracy.
Explicit feedback mechanisms (thumbs up/down, ratings) help but are biased -- users who provide feedback are not representative of users who don't. Implicit signals (time spent reading, follow-up actions, return visits) are noisy and ambiguous.
Building feedback loops that actually work requires:
- Careful signal design: What user behaviours genuinely indicate quality? This varies enormously by product and use case. I've seen teams build feedback systems around signals that, upon investigation, correlated more with user mood than with response quality.
- Bias-aware aggregation: Explicit feedback over-represents power users and under-represents the silent majority. Implicit feedback over-represents easily measured actions and under-represents actual value delivered. You need to account for these biases, not just average the numbers.
- Closing the loop: Having feedback data is necessary but not sufficient. You need processes that convert feedback signal into system improvements -- whether that's prompt refinements, retrieval tuning, model fine-tuning, or architectural changes. This requires infrastructure, tooling, and disciplined prioritisation.
- Latency tolerance: AI system improvements often take time to manifest and measure. Unlike a UI change where you can A/B test in days, AI improvements may require weeks of accumulated usage data to evaluate reliably. Teams accustomed to rapid iteration cycles in traditional software find this deeply uncomfortable.
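A minimal sketch of what bias-aware aggregation can look like in code, with hypothetical event shapes and weights: explicit votes are deduplicated per user so power users don't dominate, then blended with implicit signal.

```python
from collections import defaultdict


def aggregate_feedback(events, explicit_weight=0.7):
    """Combine explicit and implicit signals into a per-response score.

    `events` is a list of dicts: {"response_id", "user_id", "kind",
    "value"}, where kind is "explicit" (thumbs: +1/-1) or "implicit"
    (e.g. a follow-up action normalised to [0, 1]). Each user
    contributes at most one explicit vote per response.
    """
    explicit = defaultdict(dict)   # response_id -> {user_id: vote}
    implicit = defaultdict(list)   # response_id -> [signals]
    for e in events:
        if e["kind"] == "explicit":
            explicit[e["response_id"]][e["user_id"]] = e["value"]
        else:
            implicit[e["response_id"]].append(e["value"])

    scores = {}
    for rid in set(explicit) | set(implicit):
        parts = []
        if explicit[rid]:
            votes = list(explicit[rid].values())
            # map mean vote from [-1, 1] to [0, 1]
            parts.append((explicit_weight, (sum(votes) / len(votes) + 1) / 2))
        if implicit[rid]:
            parts.append((1 - explicit_weight,
                          sum(implicit[rid]) / len(implicit[rid])))
        total_w = sum(w for w, _ in parts)
        scores[rid] = sum(w * v for w, v in parts) / total_w
    return scores
```

The weights themselves need calibrating against human judgment; the structural point is that raw averaging is never the right answer.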
The Hardest Part Isn't the Model
Let me say this as plainly as I can: the model is the easiest part of an AI product.
I realise that's counterintuitive. The model is the headline. It's what investors ask about. It's what the press covers. It's the thing that feels most like "real AI." But in terms of the effort required to build a production AI product, the model -- whether you're training it, fine-tuning it, or calling an API -- accounts for perhaps 15-20% of the total work.
The other 80-85% is everything else. Data pipelines that ingest, clean, chunk, and index your domain data reliably and at scale. Retrieval systems that find the right information for each query, not just the most semantically similar text. Orchestration logic that routes requests through the right sequence of components. Evaluation frameworks that tell you whether the system is actually working. Monitoring infrastructure that catches degradations before users do. Feedback pipelines that convert usage data into improvements.
None of this is glamorous. None of it makes for good demo material. All of it is essential, and all of it is harder than most teams expect.
The most common failure pattern I see is teams that allocate 80% of their effort to model selection and prompt engineering, and 20% to everything else. They get the ratio exactly backwards.
AI Product Timelines Are Not Software Timelines
This is perhaps the lesson I've paid the most to learn, and the one that most consistently trips up experienced software leaders entering AI product development.
Traditional software development is approximately linear. Twice the effort yields roughly twice the progress. Experienced teams can estimate timelines with reasonable accuracy. A feature that took three weeks last quarter provides a useful reference point for a similar feature next quarter.
AI product development is fundamentally nonlinear. Progress comes in discontinuous jumps separated by plateaus. The first 70% of capability arrives quickly -- often encouragingly quickly. The next 20% takes as long as the first 70%. The final 10% takes longer than everything before it combined. And you often can't predict which 10% will be hardest until you're deep into it.
I've seen this pattern so consistently that I now build it explicitly into project planning:
Phase 1: Rapid progress (weeks 1-6). Everything works better than expected. Demo quality is impressive. Stakeholders are excited. The temptation to set aggressive launch dates is overwhelming.
Phase 2: The plateau (weeks 7-16). Progress slows dramatically. Edge cases multiply. The gap between "works on the demo" and "works reliably in production" becomes apparent. Quality improvements require disproportionate effort. This is where projects stall, where morale drops, and where inexperienced teams make their worst decisions.
Phase 3: Hard-won gains (weeks 17+). With discipline and the right approach, the system begins to improve again -- but through systematic engineering (evaluation frameworks, data pipeline improvements, architectural refinements) rather than the quick wins that characterised Phase 1.
The teams that ship successfully are those that expect Phase 2 and plan for it. The teams that fail are those that mistake Phase 1's progress for a sustainable trajectory and make commitments accordingly.
What This Means for Planning
Traditional estimation techniques don't work for AI products. I've tried story points, t-shirt sizing, evidence-based scheduling -- none of them produce reliable forecasts for AI development work.
What works instead:
- Milestone-based rather than time-based planning. Define what "good enough" looks like for each capability, and track progress toward that milestone rather than estimating when you'll arrive.
- Explicit uncertainty budgets. For any AI capability, allocate at least 50% of your estimated time as uncertainty buffer. This isn't padding -- it's acknowledging the nonlinear nature of the work. I've been doing this for fifteen years and still regularly underestimate.
- Progressive commitment. Don't commit to a launch date until you're through Phase 2 for your core capabilities. Promise a demo by a date, if you must. Promise progress reviews at regular intervals. Do not promise a production-quality product by a specific date until you have production-quality evidence.
The Demo Trap
This is the single most common pattern I see leading to bad decisions.
AI demos are seductive. A well-crafted demonstration can make an early prototype look like a nearly-finished product. The model produces fluent, confident output. The carefully chosen example queries showcase the system's strengths. The audience -- typically investors or executive stakeholders -- comes away believing the product is 80% complete when it's closer to 30%.
I've watched this play out dozens of times. The demo goes well. Expectations set accordingly. Timelines committed. And then reality: the system handles the demo queries beautifully because those queries were used to calibrate the system. Real users, with their unpredictable queries, edge cases, and adversarial inputs, expose every limitation the demo concealed.
The demo trap creates three specific problems:
Premature timeline commitments. Stakeholders who've seen an impressive demo expect a short path to production. When Phase 2 arrives and progress plateaus, they interpret it as a team performance problem rather than an intrinsic characteristic of AI development.
Architecture lock-in. The demo often enshrines specific architectural decisions that were optimised for the demo scenario rather than for production reality. Teams resist changing an architecture that "already works" -- even though it works only for a narrow set of carefully selected inputs.
Resource misallocation. Based on the demo, leadership allocates resources for a "polish and ship" phase when what's actually needed is a "solve the hard problems" phase. The latter requires different skills, different timelines, and different expectations.
How to Demo Honestly
I'm not suggesting you avoid demos. They're necessary for fundraising, stakeholder alignment, and team morale. But there are ways to demo that don't create the trap.
- Include failure cases. Show queries where the system struggles, alongside your plan for addressing them. This builds credibility and sets realistic expectations.
- Quantify performance. "Here's our evaluation score across 500 test queries, broken down by category" is more useful than "watch it nail this one example."
- Separate demo quality from production quality. Be explicit: "This demo represents our best-case performance. Production quality across all user queries is currently lower, and here's our plan to close the gap."
- Demo the evaluation, not just the output. Showing your evaluation infrastructure demonstrates sophistication and builds confidence that you know what "done" looks like -- even if you're not there yet.
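The "quantify performance" point can be as simple as a per-category summary over your test set. A sketch, with hypothetical inputs:

```python
from collections import defaultdict


def scores_by_category(results):
    """Summarise evaluation results per query category, so a demo can
    say 'here is performance across N test queries, by category'
    instead of showing one hand-picked example.

    `results` is a list of (category, score) pairs, score in [0, 1].
    """
    buckets = defaultdict(list)
    for category, score in results:
        buckets[category].append(score)
    return {
        cat: {"n": len(s), "mean": round(sum(s) / len(s), 3)}
        for cat, s in sorted(buckets.items())
    }
```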
Practical Patterns for Managing AI Product Uncertainty
After fifteen years of navigating this uncertainty, I've converged on a set of patterns that work. None of them are revolutionary. All of them are underused.
Pattern 1: Parallel Experimentation with Shared Evaluation
Rather than pursuing a single approach and hoping it works, run two or three approaches in parallel with a shared evaluation framework. This sounds expensive, and it is -- in the short term. In the long term, it's dramatically cheaper than pursuing one approach for months, discovering it doesn't work, and starting over.
The key is the shared evaluation framework. If each experiment has its own definition of "success," you can't compare them meaningfully. Define your evaluation criteria upfront, build the infrastructure to measure consistently, and let the data tell you which approach wins.
In practice, I structure this as a time-boxed spike: two weeks to implement each approach at a basic level, followed by evaluation against the shared framework. The winner gets full investment. The losers provide information that's often as valuable as the winner's success.
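The shared-framework idea fits in a few lines (names are illustrative): every candidate approach runs over the same test set and is scored by the same function, so the comparison is apples to apples.

```python
def compare_approaches(approaches, test_set, score_fn):
    """Run each candidate approach over the same test set and score it
    with the same function, so results are directly comparable.

    `approaches` maps name -> callable(query) -> answer;
    `test_set` is a list of (query, reference) pairs;
    `score_fn(answer, reference)` returns a quality score in [0, 1].
    """
    results = {}
    for name, run in approaches.items():
        scores = [score_fn(run(q), ref) for q, ref in test_set]
        results[name] = sum(scores) / len(scores)
    winner = max(results, key=results.get)
    return winner, results
```

The hard work is in `score_fn` and the test set, not this loop; that is exactly the point of building them once and sharing them.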
Pattern 2: Progressive Rollout with Instrumentation
Don't launch to everyone at once. Roll out to a small cohort with comprehensive instrumentation, learn from their usage, adjust, expand. This is standard practice in software, but AI products need more aggressive instrumentation than most teams implement.
At minimum, log:
- Every input query and the system's response
- Retrieval results and ranking scores
- Model confidence signals (where available)
- Latency at each pipeline stage
- Any explicit user feedback
This data is simultaneously your evaluation set, your debugging tool, and your training data for future improvements. Teams that instrument aggressively in early rollout build an enormous advantage over those that ship and hope.
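The logging list above can be captured in a single structured record per request. A minimal sketch (field names are illustrative, not a standard):

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class RequestTrace:
    """One structured record per request: query, response, retrieval
    results, confidence, per-stage latency, and explicit feedback."""
    query: str
    response: str = ""
    retrieved: list = field(default_factory=list)  # (doc_id, score) pairs
    confidence: Optional[float] = None
    stage_latency_ms: dict = field(default_factory=dict)
    feedback: Optional[str] = None
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self))
```

Emitting one JSON line per request into whatever log store you already run is enough to start; the schema matters far more than the storage.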
Pattern 3: The "What Would We Need to Believe?" Test
Before committing to any significant architectural decision, I force the team to articulate what assumptions need to be true for this decision to be correct. Not "why is this a good idea?"; the team will always have reasons. But "what specific, testable assumptions are we making, and how will we know if they're wrong?"
For example: "We're choosing to use RAG rather than fine-tuning. This assumes that our domain knowledge can be effectively encoded in retrievable documents rather than model weights. We'll know this assumption is wrong if retrieval quality on our production query distribution falls below X, or if users consistently report that the system lacks domain expertise despite having access to the relevant documents."
This practice doesn't prevent wrong decisions. It prevents wrong decisions from persisting long past the point where the evidence should have triggered a course correction.
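One lightweight way to keep these assumptions visible is to record them as data, each with an explicit invalidation condition that can be checked against observed metrics. A sketch with hypothetical fields and thresholds:

```python
from dataclasses import dataclass


@dataclass
class Assumption:
    """A 'what would we need to believe?' record: the claim behind a
    decision, plus a concrete, testable invalidation condition."""
    decision: str
    claim: str
    metric: str
    invalidated_below: float

    def holds(self, observed: float) -> bool:
        """True while the assumption still holds."""
        return observed >= self.invalidated_below


rag = Assumption(
    decision="Use RAG rather than fine-tuning",
    claim="Domain knowledge can be encoded in retrievable documents",
    metric="retrieval hit rate on production queries",
    invalidated_below=0.85,
)
```

Reviewing these records against live metrics at each checkpoint is what turns "we had reasons" into an actual course-correction trigger.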
Pattern 4: Dedicated "Red Team" Time
Every sprint -- or every two weeks at minimum -- dedicate time to actively trying to break your system. Not automated testing (though that's necessary too). Dedicated human effort to find the queries, edge cases, and interaction patterns that expose weaknesses.
This is psychologically difficult. The team has spent the sprint building capabilities and wants to celebrate progress. Spending time finding failures feels demoralising. But the failures you find internally are infinitely preferable to the ones your users find. And every failure you discover is a test case that strengthens your evaluation framework going forward.
I've found that rotating this responsibility across the team works best. Different people have different intuitions about where systems break, and the diversity of approaches yields better coverage than any single person's red-teaming instincts.
Pattern 5: Ship the Guardrails Before the Features
This is counterintuitive, and it's the pattern I have to argue for most strenuously. Before shipping a new AI capability, ship the guardrails that will constrain it.
What does the system do when it doesn't know the answer? When the user's query is ambiguous? When the retrieval returns irrelevant results? When the model's confidence is low? When the generated output contradicts known facts in your knowledge base?
These guardrails are not afterthoughts. They're the difference between an AI product that degrades gracefully and one that fails catastrophically. And they're far easier to build before the feature ships than after -- when you're scrambling to patch failures in production while users are actively encountering them.
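A minimal sketch of what guardrails-first can look like in code, with hypothetical function signatures and thresholds: the wrapper refuses gracefully on weak retrieval or low model confidence instead of generating anyway.

```python
def guarded_answer(query, retrieve, generate,
                   min_retrieval_score=0.3, min_confidence=0.5):
    """Apply guardrails around a generation call.

    `retrieve(query)` returns (docs, top_score);
    `generate(query, docs)` returns (answer, confidence).
    Names and thresholds are illustrative, not prescriptive.
    """
    docs, top_score = retrieve(query)
    if not docs or top_score < min_retrieval_score:
        # Weak retrieval: refuse rather than answer from thin air.
        return ("I couldn't find reliable information on that. "
                "Could you rephrase or narrow the question?")
    answer, confidence = generate(query, docs)
    if confidence < min_confidence:
        # Low confidence: degrade gracefully rather than guess.
        return ("I'm not confident in my answer here, so I'd rather "
                "not guess. A human colleague can help.")
    return answer
```

The refusal paths ship first; the happy path is only ever as trustworthy as the guardrails around it.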
In my experience, the quality of an AI product's guardrails is a better predictor of its long-term success than the quality of its primary capability. A moderately capable system with excellent guardrails consistently outperforms a highly capable system with poor guardrails, because the latter's failures destroy user trust in ways that are extraordinarily difficult to recover from.
When This Advice Doesn't Apply
I want to be careful not to overstate. There are contexts where the patterns I've described are less relevant.
If you're building internal tools where the user base is small, expert, and tolerant of imperfection, you can move faster and with less instrumentation. The cost of failure is lower, and the feedback loop is tighter because you can talk directly to your users.
If you're in a research context rather than a product context, the emphasis on evaluation infrastructure and guardrails may be premature. Explore first, productionise later.
If your AI component is genuinely simple -- a single-turn classification, a straightforward summarisation -- the complexity I've described may not apply. Not every AI feature requires compound architecture and sophisticated evaluation. Sometimes a well-engineered API call genuinely solves the problem, and over-engineering it is its own form of failure.
But if you're building a product where AI capability is central to the value proposition, where users will interact with the AI in complex and unpredictable ways, and where the quality of the AI's output directly impacts your business -- then these patterns apply. I've learned them through years of getting things wrong before getting them right, and they've proven consistent across every domain and company stage I've worked in.
The Real Competitive Advantage
Here's what I've come to believe after fifteen years of this work: the competitive advantage in AI products isn't the model, the architecture, or even the data. It's the operational maturity to ship, evaluate, learn, and improve faster than your competitors.
The team that ships an imperfect product with excellent evaluation and rapid learning loops will outperform the team that spends a year building a "perfect" system every single time. Because the first team is accumulating real-world signal -- the only kind of signal that matters -- while the second team is optimising against assumptions that may or may not hold.
This is the operational knowledge that separates teams that ship successful AI products from teams that produce impressive demos and never quite get to production. It's not glamorous. It's not the kind of insight that makes for a good conference talk. But it's what actually determines outcomes.
These are the challenges I work through with founders and technical leaders every day. Not strategy at arm's length, but the operational reality of building AI products that work -- for real users, at production scale, with all the mess and uncertainty that entails.
When you engage Agathon, you work directly with me. I've built these systems. I've made the mistakes. I've developed the patterns that prevent them. My background -- from mathematics at Oxford through NLP research at Cambridge to fifteen years of commercial AI systems -- gives me the technical depth to work at the level where these problems actually live, not the level where they get discussed in strategy decks.
I work with founders to build their teams' capability to navigate this uncertainty independently. Because the reality of shipping AI products is that the challenges don't stop -- they evolve. And the most valuable thing I can leave you with isn't a strategy document or an architecture diagram. It's the operational judgment to make good decisions when the next unexpected challenge arrives.
That's what building capability rather than dependency means in practice. And it's the only approach I've seen that works over the long term.