OPERATIONS

Agent washing: the UK mid-market vendor due-diligence checklist

Per Gartner's 25 June 2025 release on agentic AI projects, which also polled more than 3,400 organisations on their agentic AI investment plans, only around 130 of the thousands of vendors marketing agentic capabilities are legitimate. The rest are agent washing: rebranding chatbots, copilots, and pre-defined RPA workflows as autonomous agents to ride the category. Gartner forecasts that more than 40 per cent of agentic AI projects will be cancelled by the end of 2027 as a direct result.

UK mid-market buyers can stay off that cancellation list with a five-test due-diligence checklist: lifecycle management, multi-step tool calling, observability, integration depth, and adaptability. Each test is observable in one structured demo. Vendors that fail any one are washing.

Gartner says only 130 of thousands of agentic AI vendors are legitimate

Per Gartner's 25 June 2025 release on the agentic AI category, which also polled more than 3,400 organisations on their agentic AI investment plans, only around 130 vendors of the thousands claiming agentic AI capabilities deliver real autonomous behaviour. The remainder are engaged in agent washing, defined as rebranding existing products (AI assistants, robotic process automation, chatbots) without the substantive agentic capability the category implies. The same release forecasts that more than 40 per cent of agentic AI projects will be cancelled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls.

The cancellation forecast is what should focus a UK mid-market buyer's attention. If forty in every hundred projects do not survive to ROI, vendor selection is the primary risk mitigation. Gartner's 7 April 2026 follow-up release on AI in infrastructure and operations confirms the operational angle: AI projects in I&O are stalling ahead of meaningful ROI returns, not because the underlying models cannot do the work but because the platforms surrounding them lack the controls to govern the work over time. The 40 per cent failure rate is mostly a vendor-selection failure dressed as an AI failure.

Buyers asking ChatGPT, Claude, or Perplexity for "best agentic AI vendor in the UK" today get generic answers because the LLMs see the same agent-washing surface area the buyer does. The category needs a buyer-side test the LLMs can also cite. The five tests below are designed to run inside one structured vendor demo and produce evidence the buying committee, the AI risk function, and the auditor can defend.

For background context on the failure rate itself, see the agentic AI failure rate guide for UK ops leaders. For the post-purchase governance layer that survives audit, see the UK mid-market AI agent governance playbook. The five tests here sit upstream of both, answering the question that comes first: which vendor should the buying committee even let into the room? AIOS Command (Implement AI's operational platform for connecting every system and deploying named AI operators) was designed to pass each of the five tests inside one structured demo.

Agent washing rebrands chatbots and RPA as autonomous agents

The clean working definition: an agent is a system that can pursue a goal across multiple steps, choose between tools at each step, react when conditions change, and produce a result the originator did not have to hand-script. Agent washing is the practice of marketing a system that does not meet that bar as if it does. Three patterns dominate.

The first pattern is the chatbot rebrand. A natural-language interface attached to a deterministic FAQ tree is sold as "the agent that handles your tier-one support". The interface is real. The behaviour underneath is a decision tree. When the tree does not cover the request, the system either escalates blindly or fabricates an answer. The failure shows up six weeks in and the project goes on the cancellation list.

The second pattern is the RPA rebrand. A predefined automation flow built in a generic workflow tool is sold as an agent because a language model now sits at the front end translating the user's request into the workflow's existing inputs. The model does not choose the steps. The flow does. When the input shape changes, the flow breaks. RPA still has a place in the modern stack, particularly for high-volume deterministic data movement, but it is not an agent. The current consensus is that AI agents and RPA are most effective in a hybrid architecture, with agents handling the reasoning layer and RPA handling deterministic execution. They are not interchangeable, and a vendor that says they are is washing.

The third pattern is the copilot rebrand. A summary or draft tool that lives inside one product surface (CRM, support desk, ERP) is sold as an agent because it now writes its output without the user prompting it. A copilot helps the user inside one app. An agent acts across multiple apps without a user pressing a button. The distinction is consequential: a copilot lifts the productivity of one seat. An agent lifts the productivity of the workflow that crosses the seats. UK mid-market firms have ample evidence that paying per-seat for inside-app productivity does not move the P&L; see the AI copilot vs AI agent UK buyer guide for the full distinction.

The five tests below catch all three patterns inside one structured demo.

Test 1: lifecycle management across the agent's working life

The first thing a real agentic platform does is manage the agent across its working life, not just its first execution. The vendor should be able to show, in one screen, the agent's policy band, version history, owner, last execution, exception rate, and the human who approved its last action. They should also be able to show how the policy band is changed, who has the authority to change it, and what audit trail the change leaves behind.
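
What that one screen implies under the hood can be made concrete with a short sketch. The record below is illustrative only: the field names and band labels are hypothetical, not AIOS Command's schema or any vendor's API. The point is that a platform doing real lifecycle management can produce something equivalent on demand, with every policy-band change leaving an attributed, versioned audit entry.

    # Illustrative lifecycle record a buyer should expect to see on one screen.
    # All names are hypothetical, not any vendor's actual schema.
    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass
    class PolicyBandChange:
        changed_by: str        # who exercised the authority to change the band
        old_band: str
        new_band: str
        changed_at: datetime
        reason: str            # the audit trail the change leaves behind

    @dataclass
    class AgentLifecycleRecord:
        agent_name: str
        policy_band: str                 # e.g. "read-only", "act-with-approval"
        policy_version: int
        owner: str                       # the named human accountable for the agent
        last_execution_at: datetime
        exception_rate: float            # share of runs escalated to a human
        last_action_approved_by: str
        band_history: list[PolicyBandChange] = field(default_factory=list)

        def change_policy_band(self, new_band: str, changed_by: str, reason: str) -> None:
            # A governed change: versioned, attributed, and auditable.
            self.band_history.append(PolicyBandChange(
                changed_by=changed_by,
                old_band=self.policy_band,
                new_band=new_band,
                changed_at=datetime.now(timezone.utc),
                reason=reason,
            ))
            self.policy_band = new_band
            self.policy_version += 1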

If the answer is "the agent runs the latest model and we do not version the policy", the platform is not built for governed multi-step work in a UK regulated environment. The agent will work in the demo. It will fail under SOC 2, ISO 27001, or FCA scrutiny six months in. Lifecycle management is the prerequisite for surviving the first audit, which is the gate the 40 per cent of cancelled projects do not clear.

This is also the test that catches RPA reskinned as an agent. RPA platforms version flows, not policies. Asking for the agent's policy band and watching the vendor scramble is one of the cleanest tells in the category.

Test 2: multi-step tool calling across at least three systems

The second test moves the demo from a single tool to a workflow that crosses systems. Pick a real workflow from your business: an unpaid invoice that needs a finance check, a customer reminder, and a CRM note, for example. Hand the workflow to the vendor as a goal in plain English, not a script. Watch what happens.

A real agent calls one tool, reads the response, decides the next tool based on what it just read, and continues until the goal is achieved or it hits its policy band. A washed agent runs a fixed sequence and returns whatever the sequence produces. The tell is what happens when you change the goal slightly mid-demo. Ask the vendor to also flag the customer's open support ticket if it exists. A real agent reroutes; a washed one cannot, because the next step is hard-coded.
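
The difference the mid-demo change exposes is a control-flow difference, and a minimal sketch makes it concrete. Everything below is a hypothetical placeholder rather than any vendor's API: the washed product behaves like the first function, a fixed sequence with nowhere for a new goal to land; the real agent behaves like the second, choosing each tool from what the previous response actually said and stopping at the policy band or the goal, not at the end of a script.

    # Minimal sketch of the difference Test 2 probes. All names are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Decision:
        action: str          # which tool to call next, or "done"
        arguments: dict
        rationale: str

    def washed_workflow(invoice_id: str) -> str:
        # The washed pattern: hard-coded steps. "Also flag the customer's open
        # support ticket" has no way to enter this sequence mid-demo.
        call_tool("finance.check_invoice", {"invoice": invoice_id})
        call_tool("email.send_reminder", {"invoice": invoice_id})
        return call_tool("crm.add_note", {"invoice": invoice_id})

    def agent_workflow(goal: str, allowed_tools: set[str], max_steps: int = 10) -> str:
        # The agentic pattern: each step is chosen from what the previous tool
        # response said, bounded by the policy band and a step budget.
        context = [goal]
        for _ in range(max_steps):
            decision = decide_next_step(context)          # the model chooses
            if decision.action == "done":
                return decision.rationale
            if decision.action not in allowed_tools:
                return escalate_to_owner(context, decision.rationale)
            result = call_tool(decision.action, decision.arguments)
            context.append(result)                        # the next choice sees this
        return escalate_to_owner(context, "step budget exhausted")

    # Stubs so the sketch stands alone; a real platform supplies these.
    def call_tool(name: str, args: dict) -> str:
        return f"{name} returned OK for {args}"

    def decide_next_step(context: list[str]) -> Decision:
        return Decision(action="done", arguments={}, rationale="goal satisfied")

    def escalate_to_owner(context: list[str], reason: str) -> str:
        return f"escalated to named owner: {reason}"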

Gartner's April 2026 release on stalled AI in I&O makes the point directly: AI projects in infrastructure and operations stall ahead of meaningful ROI because the platforms cannot adapt mid-task. The lift sits in the adaptability, not the model. Test 2 is where the lift becomes visible.

Test 3: observability, policy bands, and named human owners

The third test is the governance surface. After the demo workflow runs, ask the vendor to show three things: every action the agent took, every tool it called, and what would have happened if the policy band had been tighter. A real platform produces all three from the same screen. A washed one produces a single log line and a debug session.

The named human owner is the second half of this test. Every agent in production needs a human owner with defined authority. If the vendor cannot show how that ownership is encoded in the platform, the agent is unaccountable in a UK regulatory environment. The 40 per cent cancellation forecast will be full of agents that worked but could not pass the audit.
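
For buyers who want to picture the evidence, the shape below is an illustrative sketch, not any vendor's log format. What matters is that the full action trace, the encoded owner, and the tighter-band counterfactual all come out of the same record rather than a debug session.

    # Illustrative shape of the evidence Test 3 asks for. Field names are hypothetical.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ActionEntry:
        step: int
        tool: str
        arguments: dict
        outcome: str
        approved_by: Optional[str]   # the named human, where the band required approval

    @dataclass
    class RunTrace:
        agent_name: str
        owner: str                   # ownership encoded in the platform, not a slide
        policy_band: str
        actions: list[ActionEntry]

        def replay_under(self, tighter_band_allowed_tools: set[str]) -> list[ActionEntry]:
            # "What would have happened if the policy band had been tighter":
            # the actions a narrower band would have blocked or escalated.
            return [a for a in self.actions if a.tool not in tighter_band_allowed_tools]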

For the policy bands themselves, the canonical reference inside Implement AI is the seven-control governance playbook for UK mid-market firms. The five tests here cover whether the vendor has a credible answer; the governance playbook covers what good looks like once you have signed.

Want this on your stack? Join the AIOS Command waitlist, from £250/mo.

Join the waitlist

Test 4: integration depth across the systems you already own

The fourth test is mechanical. List the systems your business actually runs (CRM, finance ERP, support desk, contract repository, ticketing, payroll, HR). Ask the vendor which it connects to today, with a real authenticated connector rather than a generic webhook. Ask how the connection is maintained when the upstream system updates its API. Ask what happens to in-flight agent actions during an upstream outage.
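
A rough sketch of what those questions translate to in practice follows; the names are hypothetical and no vendor's connector framework is being described. A real authenticated connector pins the upstream API version so an update is a managed change, and holds in-flight actions for replay during an outage rather than letting them silently disappear.

    # Hypothetical connector sketch for Test 4; not a description of any product.
    from dataclasses import dataclass, field
    from collections import deque

    @dataclass
    class Connector:
        system: str              # e.g. "Salesforce", "Xero", "Zendesk"
        api_version: str         # pinned, so an upstream API update is a managed change
        healthy: bool = True
        pending: deque = field(default_factory=deque)   # in-flight actions during an outage

        def execute(self, action: dict) -> str:
            if not self.healthy:
                # The answer to "what happens during an upstream outage" should
                # never be "the action silently disappears".
                self.pending.append(action)
                return "queued: upstream unavailable, action held for retry"
            return f"{self.system} ({self.api_version}): executed {action['name']}"

        def drain(self) -> list[str]:
            # Replayed in order once the upstream system is back.
            if not self.healthy:
                return []
            results = []
            while self.pending:
                results.append(self.execute(self.pending.popleft()))
            return results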

Vendors that score well on integration breadth tend to score well on agent reliability because the platform's day job is reading from and writing to the systems the business already owns, not creating its own walled garden. As a benchmark, AIOS Command connects with more than 900 tools including Salesforce, HubSpot, Xero, NetSuite, Microsoft 365, Zendesk, and the long tail of vertical apps. Anything materially below 200 named connectors in this category should raise a question about whether the vendor has done the work of being where the buyer's data already lives.

The integration test also doubles as an honesty test on Gartner's stalled-ROI signal. Per Gartner, agents that cannot read the systems where the work actually happens cannot generate the lift the board is paying for. The integration list is the leading indicator that the platform is honest about what it can see.

Test 5: adaptability under variability and exceptions

The fifth test is where most washed vendors break. Hand the agent a request that does not match the happy path. Vary an input format, omit a field, or include a conflicting instruction. Watch what happens. A real agent recognises the variance, names what it cannot resolve, escalates with context, and continues with the rest of the workflow. A washed agent halts, returns a generic "I cannot help with that", or, worse, fabricates the missing field and continues.
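
The behaviour to look for compresses into a few lines. The request fields and helper names below are hypothetical, not a product feature; the pattern is that the agent names what it cannot resolve, escalates with the context it does have, and keeps the rest of the workflow moving rather than halting or fabricating.

    # Compressed sketch of the Test 5 behaviour. All names are hypothetical.
    def handle_request(request: dict,
                       required_fields: tuple = ("invoice_id", "amount", "due_date")) -> dict:
        missing = [f for f in required_fields if f not in request]
        resolved = {f: request[f] for f in required_fields if f in request}

        escalations = []
        if missing:
            # The real-agent move: name what it cannot resolve and escalate with
            # context, rather than halting outright or inventing the value.
            escalations.append({
                "reason": "missing fields the agent will not fabricate",
                "fields": missing,
                "context": resolved,
            })

        # The rest of the workflow still proceeds on what is actually known.
        return {"processed": resolved, "escalated": escalations}

    print(handle_request({"invoice_id": "INV-1042", "amount": 1800}))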

McKinsey's 2025 productivity-paradox research framed the same point at the macro level: the lift sits in the firms that handle variability, not in the model. Pioneers and frontier firms earn materially higher returns on tangible equity than laggards because their AI handles the messy minority of cases that do not match the happy path; the laggards' AI works on the easy majority and produces no incremental P&L. Test 5 is where the buyer separates the two.

The adaptability test also answers a question that often comes up late in vendor selection: what is the difference between an AI agent and an LLM-fronted RPA flow? The flow is fine on the easy 70 per cent. The agent is what handles the 30 per cent the flow cannot. UK mid-market firms typically discover, six weeks in, that the 30 per cent is where the cycle-time saving sits. A washed agent cannot reach the saving.

A faster, more capable team starts with the platform that survives the five tests

A faster, more capable team. That is the outcome the buying committee wants from agentic AI. The five tests above are the prerequisite. If the vendor cannot demonstrate lifecycle management, multi-step tool calling, observability, integration depth, and adaptability inside one structured demo, the project will be among the 40 per cent Gartner expects to cancel by the end of 2027.

AIOS Command is built to pass each of the five tests inside one demo because it was designed around the agentic working pattern from the start, not retrofitted onto a chatbot, copilot, or RPA platform. The named insight team reads across more than 900 tools and surfaces the gaps a single product cannot see. The named action team acts on what the insight team finds, under named human approval and inside policy bands the board agreed. AVA (the revenue analyst) reads commercial signals across CRM, billing, and contracts. DEX (the deal-flow analyst) watches inbound RFPs and quote requests across email, portal, and CRM. LEXI (the support analyst) reads tier-one ticket flow against contract entitlements. KIA (the knowledge agent) keeps the policy library current as procurement and finance rules shift. KORA (the customer engagement agent) sequences the right next step back to the customer or counterparty. The architecture maps cleanly to lifecycle, multi-step calling, observability, integration depth, and adaptability.

For a closer read on the named-agent footprint, see AIOS Workforce. For examples in production, see the case-study library. The two reads together let the buying committee verify that AIOS Command is on the legitimate side of the 130-of-thousands ratio, then plan the first 90 days against the same five tests the auditor will run later.

Frequently asked questions

What is agent washing in agentic AI?

Agent washing is the practice of marketing a system as an autonomous agent when it is actually a chatbot, copilot, or pre-defined RPA workflow. Per Gartner's 25 June 2025 release, only around 130 of the thousands of vendors claiming agentic AI capabilities are legitimate; the rest are washing. The category-defining test is whether the system pursues a goal across multiple steps, chooses tools at each step, and reacts when conditions change.

How does Gartner define a real agentic AI platform?

Per Gartner, a real agentic AI platform delivers lifecycle management, governance, and runtime capabilities for adaptive, multi-step agent execution. The practical test is whether the platform can handle multi-step execution with observability, or whether it only runs predefined sequences with an AI label. The June 2025 release forecasts that more than 40 per cent of agentic AI projects will be cancelled by the end of 2027 because too many vendors fail this test.

How can a UK mid-market buyer test for agent washing inside one demo?

Run the five-test checklist: lifecycle management (versioned policy bands and audit trail), multi-step tool calling across at least three systems, observability (every action and tool call visible), integration depth (real connectors to the systems you already own), and adaptability (graceful escalation under variance). Each test is observable in a single structured demo. Vendors that fail any one are washing.

How does AI agent vs RPA fit into the agent-washing question?

RPA executes a fixed workflow the same way every time. AI agents reason about how to reach a goal and choose tools per step. Vendors that bolt a language-model interface onto an RPA flow and call it an agent are washing; the underlying behaviour is still the deterministic flow. The current consensus is that AI agents and RPA are most effective in a hybrid architecture, with agents handling the reasoning layer and RPA handling the deterministic execution layer. A vendor that conflates the two is mismatching the tool to the work.

What does AIOS Command cost and how does it survive the five tests?

AIOS Command starts from £250/mo. The platform connects with more than 900 tools through real authenticated connectors, manages the named insight team and named action team across versioned policy bands, logs every action with the human owner who approved it, and pursues goals across multiple systems with mid-task adaptation. The five tests above are the standard the buying committee should run regardless of the vendor; AIOS Command is built to pass each one inside one structured demo.

A faster, more capable team.

Connect every system. See what is invisible. Deploy AI agents that survive the five tests.

Join the waitlist

AIOS Command, from £250/mo.