“The era of AI agents,” as Microsoft proclaimed at its recent Build 2025 conference, was meant to usher in a wave of self-driven digital assistants. However, as the year draws to a close, Microsoft has significantly scaled back its growth forecasts for Copilot and similar AI products, struggling to persuade businesses that these tools are worth the additional cost.

Internally, quotas for its AI Foundry platform, a kit for developing and managing AI agents, were reduced across several U.S. sales divisions. In one unit, a 50% growth target was lowered after fewer than 20% of salespeople met it. In another, a target to double Foundry sales was cut to 50% growth. Microsoft denies that “aggregate sales quotas for AI products have been lowered,” but it is nonetheless contending with a larger problem: adoption in the business sector is slowing, and the return on investment for AI agents remains uncertain.
The underlying trouble is as much technical as commercial, the author argues. According to research by METR, current state-of-the-art AI agents, including those used by Copilot, succeed on nearly 100% of tasks that a human can solve in under four minutes, but their success rate plunges below 10% once a task takes a human more than four hours. “Task length,” the metric METR uses to track this, has doubled roughly every seven months since 2018, yet even an advanced model such as Claude 3.7 Sonnet completes a 59-minute task only 50% of the time.
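The doubling trend lends itself to a rough back-of-the-envelope projection. The sketch below assumes a clean exponential with a seven-month doubling time and uses the 59-minute, 50%-success figure as its starting point. The function names are hypothetical, and the extrapolation is illustrative only, not a forecast published by METR.

```python
import math

DOUBLING_MONTHS = 7.0     # doubling time cited in the text
BASE_HORIZON_MIN = 59.0   # 50%-success task length cited for Claude 3.7 Sonnet


def horizon_after(months: float) -> float:
    """Projected 50%-success task length in minutes, assuming pure exponential growth."""
    return BASE_HORIZON_MIN * 2 ** (months / DOUBLING_MONTHS)


def months_until(target_minutes: float) -> float:
    """Months until the 50%-success horizon reaches target_minutes, same assumption."""
    return DOUBLING_MONTHS * math.log2(target_minutes / BASE_HORIZON_MIN)


if __name__ == "__main__":
    print(f"Horizon after 14 months: {horizon_after(14):.0f} minutes")
    print(f"Months until a 4-hour horizon: {months_until(240):.1f}")
```

Under these assumptions, two doublings (14 months) would bring the 50% horizon to roughly four hours. That is still a coin-flip success rate on such tasks, not the near-100% reliability agents show on four-minute tasks.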
Copilot’s design combines OpenAI’s GPT-4o with Microsoft Graph, which pulls information from email messages, documents, and calendars. On paper, that should make it competitive in productivity tasks: summarizing meetings in Teams, building slide decks in PowerPoint, or analyzing data in Excel. But real-world testing has exposed its limits. In side-by-side comparisons with Google’s Gemini, Copilot trailed in organizing itineraries, producing creative assets, and fact-finding, and its strengths were confined to Microsoft-integrated tasks and to coding with GitHub Copilot.
The difference is reflected in market share. The U.S. generative chatbot market is led by ChatGPT at over 61%, with Gemini far behind at slightly under 15% and Copilot at around 14%, where Gemini’s double-digit quarterly growth rate is outpacing it. Microsoft’s head start in incorporating OpenAI’s models has not helped its penetration: in business deployments such as the rollout to 20,000 employees at Amgen, employees have dropped Copilot in favor of ChatGPT.
From an engineering standpoint, the limitations stem from the current capabilities of agentic AI itself. An agentic system coordinates “worker tasks” under the supervision of a directing model, using tool calls and self-correcting loops. This improves on the single-pass output of a plain generative model, but it inherits the risk of confabulation from the LLM on which it is built.
In messy, unstructured settings, with inconsistent data inputs and a need for active information retrieval, these agents can fail by misunderstanding the task, repeating actions that have already failed, or cutting the process short. METR’s failure analysis of GPT-4-based agents found that repeating failed actions accounted for more than a third of failures, while newer models on more difficult tasks tended instead to abandon the task prematurely.
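The supervising loop and two of the failure modes described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how Copilot or any METR test harness is implemented: `run_agent`, `plan_next`, `call_tool`, and `RunLog` are hypothetical names, the directing model is reduced to a callback that returns the next action (or `None` when it considers the task done), and a tool call returns `None` on failure.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical interfaces: the planner stands in for the directing model,
# the tool stands in for a worker task. Both are assumptions for illustration.
Planner = Callable[[str, list[str]], Optional[str]]
Tool = Callable[[str], Optional[str]]


@dataclass
class RunLog:
    results: list[str] = field(default_factory=list)
    failed_actions: set[str] = field(default_factory=set)
    stop_reason: str = ""


def run_agent(task: str, plan_next: Planner, call_tool: Tool, max_steps: int = 10) -> RunLog:
    """Run a supervised tool-calling loop with a guard against repeating failed actions."""
    log = RunLog()
    for _ in range(max_steps):
        action = plan_next(task, log.results)
        if action is None:
            # The planner declared the task finished (possibly prematurely).
            log.stop_reason = "planner_finished"
            return log
        if action in log.failed_actions:
            # Guard: stop instead of retrying an action that already failed.
            log.stop_reason = "repeated_failed_action"
            return log
        result = call_tool(action)
        if result is None:
            log.failed_actions.add(action)
        else:
            log.results.append(result)
    log.stop_reason = "step_budget_exhausted"
    return log


if __name__ == "__main__":
    # Stub planner that keeps proposing the same action; stub tool that always fails.
    stuck = run_agent("demo", lambda task, results: "search", lambda action: None)
    print(stuck.stop_reason)
```

Recording a `stop_reason` gives a human reviewer a cheap signal for when to step in. A run that ends with `planner_finished` after very few results may also deserve a second look, since the loop itself cannot tell a genuine finish from a premature one.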
The competitive field is changing rapidly. With a 1,000,000-token context window, Gemini 2.5 Pro can ingest full codebases and multi-document environments, and it adds real-time Google Search functionality and built-in image and video output. ChatGPT, for its part, has GPT-4.1, which offers accurate coding, multimodal input, and a plugin architecture that lets it be applied in specific domains. Copilot’s strength, its deep integration into the Microsoft ecosystem, is certainly useful for companies heavily invested in Office 365 and Windows, but it is capped by the reliability ceiling of the underlying AI.
For businesses trying to make sense of their AI ROI, the math now comes down to matching tool capabilities to task profiles. Short, structured workflows, such as summarizing a document or generating boilerplate code, play directly to Copilot’s strengths. Longer-running projects will continue to call for a safety net of some sort, whether human judgment or an automated check that has yet to appear, as the exponential growth trend measured by METR illustrates. “Microsoft’s sales growth has slowed, of course, but this reflects, I believe, not an end to innovation, but simply

