AI agents in the workplace? A new benchmark raises doubts

Nearly two years have passed since Microsoft CEO Satya Nadella forecast that artificial intelligence would fundamentally transform knowledge work—the domain of lawyers, investment bankers, accountants, and countless other professionals. While AI agents built on foundation models have since achieved remarkable feats in research and planning, the anticipated disruption of white-collar professions has progressed more slowly than many predicted.

This disconnect is one of the most significant puzzles in AI today. New research from training-data company Mercor, however, provides critical insight into why that is, and into how far current technology still has to go.

Introducing APEX-Agents: A Benchmark for Real Professional Work

Mercor’s recent study moves beyond theoretical knowledge tests to evaluate how leading AI models perform actual tasks drawn from consulting, investment banking, and law. The result is a novel benchmark called APEX-Agents. The initial findings are sobering: every major AI model tested received a failing grade.

When presented with queries formulated by real professionals, even the most advanced models struggled to exceed 25% accuracy. In the vast majority of cases, models either returned incorrect answers or failed to provide any answer at all.

According to Mercor CEO Brendan Foody, a core challenge lies in a capability essential to human professionals: synthesizing information across multiple domains and tools.

“One of the big changes in this benchmark is that we built out the entire environment, modeled after real professional services,” Foody explained to TechCrunch. “In real life, you’re operating across Slack, Google Drive, and numerous other platforms. For many agentic AI models, that kind of multi-domain reasoning is still hit or miss.”
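To make that concrete, the sketch below shows, in Python, the kind of tool-use loop such an environment demands: the agent has to query one simulated source (a Slack channel), pull a policy from another (a Drive-style document store), and only then answer. It is purely illustrative and not Mercor's actual harness; the tool names, the simulated contents, and the call_model stub are hypothetical stand-ins.

# Illustrative multi-tool agent loop (not Mercor's harness). The simulated
# Slack and Drive contents, the tool names, and the call_model stub are
# hypothetical stand-ins for the kind of environment the benchmark models.

from typing import Callable

SLACK_MESSAGES = {
    "#eng-incidents": "09:02 EU production outage began; bundled event logs "
                      "were exported to the U.S. analytics vendor at 09:35.",
}
DRIVE_DOCS = {
    "data-transfer-policy.md": "Exports of personal data outside the EU "
                               "require a documented Article 49 basis.",
}

def search_slack(query: str) -> str:
    """Toy keyword search over the simulated Slack workspace."""
    hits = [m for m in SLACK_MESSAGES.values() if query.lower() in m.lower()]
    return "\n".join(hits) if hits else "no results"

def read_doc(name: str) -> str:
    """Fetch a document from the simulated Drive folder."""
    return DRIVE_DOCS.get(name, "document not found")

TOOLS: dict[str, Callable[[str], str]] = {
    "search_slack": search_slack,
    "read_doc": read_doc,
}

def call_model(task: str, scratchpad: list) -> tuple:
    """Stand-in for an LLM call that chooses the next action.

    A real agent would send the task plus the scratchpad to a model API and
    parse the tool call it returns; this stub hard-codes a plausible path.
    """
    if not scratchpad:
        return ("search_slack", "exported")
    if len(scratchpad) == 1:
        return ("read_doc", "data-transfer-policy.md")
    return ("answer", "Compare the export timeline against the transfer "
                      "policy before concluding on Article 49.")

def run_agent(task: str, max_steps: int = 5) -> str:
    """Gather evidence from each tool until the model decides to answer."""
    scratchpad = []
    for _ in range(max_steps):
        action, arg = call_model(task, scratchpad)
        if action == "answer":
            return arg
        scratchpad.append(f"{action}({arg!r}) -> {TOOLS[action](arg)}")
    return "no answer within the step budget"

print(run_agent("Can the log exports be treated as consistent with Article 49?"))

Even in this toy version, a wrong tool choice or a missed document derails the answer, which is the failure mode Foody describes.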

The Complexity of Real-World Tasks

The benchmark’s scenarios, developed by experts on Mercor’s marketplace and publicly available on Hugging Face, illustrate the nuanced demands of professional work. Consider this example from the legal section:

“During the first 48 minutes of the EU production outage, Northstar’s engineering team exported one or two bundled sets of EU production event logs containing personal data to a U.S. analytics vendor … Under Northstar’s own policies, can it reasonably treat the log exports as consistent with Article 49?”

Arriving at the correct answer (“yes”) requires an intricate analysis of both internal corporate policies and complex EU privacy regulations—a task that would challenge many humans. This is precisely the point, Foody notes: “The benchmark is very reflective of the real work that these people do.”
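Because the task suite is public, readers can pull the prompts down and inspect them directly. The snippet below is a minimal sketch using the Hugging Face datasets library; the repository id and split name are placeholders, since the exact dataset path is not specified here.

# Hedged sketch for inspecting the public task suite. The repository id is a
# placeholder, not a confirmed path -- substitute the actual APEX-Agents
# dataset listing on Hugging Face.

from datasets import load_dataset

apex = load_dataset("mercor/apex-agents")   # hypothetical repo id
split = next(iter(apex.values()))           # take whichever split exists
for task in split.select(range(3)):         # peek at the first few tasks
    print(task)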

A Different Kind of Test

This approach distinguishes APEX-Agents from other evaluations, such as OpenAI’s GDPval benchmark. While GDPval assesses broad general knowledge across professions, APEX-Agents measures a system’s ability to execute sustained, deep tasks within specific high-value fields. It is a more difficult—and more directly relevant—test of automation potential.

The Current Standings and a Path Forward

In the initial assessments, no model proved ready to assume the role of an investment banker or lawyer. However, some performed notably better than others:

  • Gemini 3 Flash led with 24% one-shot accuracy.
  • GPT-5.2 followed closely at 23%.
  • Models like Opus 4.5, Gemini 3 Pro, and GPT-5 clustered around 18%.
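The "one-shot accuracy" quoted above is simply the share of tasks a model answers correctly in a single attempt, with no retries. A toy illustration, using made-up entries rather than real benchmark results:

# Toy illustration of the one-shot accuracy metric: the fraction of tasks
# answered correctly on the single allowed attempt. The entries below are
# invented for illustration and are not benchmark data.

results = [
    {"task_id": "legal-001", "correct": True},
    {"task_id": "banking-014", "correct": False},
    {"task_id": "consulting-007", "correct": False},
    {"task_id": "legal-023", "correct": False},
]

one_shot_accuracy = sum(r["correct"] for r in results) / len(results)
print(f"one-shot accuracy: {one_shot_accuracy:.0%}")   # -> 25%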

Despite these modest results, the AI field has a history of rapidly overcoming challenging benchmarks. Foody expects the public release of APEX-Agents to catalyze further innovation and improvement.

“It’s improving really quickly,” he observed. “Right now, it’s fair to say the technology is like an intern that gets it right a quarter of the time. Last year, it was the intern that got it right five or ten percent of the time. That kind of year-over-year progress can have an impact very rapidly.”

The journey to automating complex knowledge work is clearly ongoing. While the revolution Nadella envisioned has not yet arrived, benchmarks like APEX-Agents provide the necessary roadmap, revealing both the significant hurdles that remain and the accelerating pace at which they are being addressed.

FAQ Section

Q1: What is the main finding of the new Mercor research?
A: The research found that even the most advanced AI models currently fail at performing realistic, multi-step white-collar work. When tested using the new APEX-Agents benchmark—which simulates tasks from consulting, banking, and law—the top model achieved only 24% accuracy.

Q2: What is the APEX-Agents benchmark?
A: APEX-Agents is a new benchmark designed to test AI on actual, complex tasks performed by professionals. Unlike tests of general knowledge, it models a complete work environment where AI must find and synthesize information across multiple tools and domains, mimicking real-world workflows on platforms like Slack and Google Drive.

Q3: Why are models struggling with this type of work?
A: According to the researchers, the primary stumbling block is “multi-domain reasoning.” Real knowledge work requires pulling together context and data from various, disparate sources. Current agentic AI models are inconsistent at tracking down and connecting information across these different domains.

Q4: How is APEX-Agents different from other AI benchmarks like OpenAI’s GDPval?
A: While benchmarks like GDPval test broad general knowledge across many fields, APEX-Agents focuses on measuring an AI’s ability to perform sustained, deep work within a few specific, high-value professions. This makes it a more direct and difficult test of true automation potential for jobs like law or investment banking.

Q5: Does this mean AI won’t replace knowledge workers?
A: Not necessarily. The results show AI is not yet ready, but the pace of improvement is rapid. As Mercor’s CEO notes, accuracy has jumped from 5-10% to about 25% in a year. This suggests that while the revolution is delayed, significant advancements that could automate core professional tasks are likely on the horizon.

Q6: Who created the APEX-Agents benchmark, and how was it designed?
A: The benchmark was developed by researchers from Mercor, a training-data company. To ensure real-world relevance, the tasks and evaluation standards were designed in collaboration with actual professionals—consultants, bankers, and lawyers from Mercor’s expert marketplace. These experts provided the complex queries and defined what constituted a correct, professional-grade response.

Q7: What is the significance of testing models on “multi-domain” tasks?
A: It reflects how professionals actually work. A lawyer doesn’t have all answers in one document; they review case files, statutes, client communications, and internal notes. This benchmark forces AI to navigate a simulated ecosystem of separate information sources (like a Drive folder, a Slack channel, and a policy document), testing its ability to “connect the dots”—a critical and currently deficient skill for automation.

Q8: What happens next now that this benchmark is public?
A: The APEX-Agents benchmark serves as an open challenge to AI labs. Public availability allows researchers worldwide to test their models, identify specific failure modes, and work on improving agentic reasoning capabilities. This will likely accelerate targeted development in this crucial area, just as past public benchmarks have done for other AI capabilities.

Q9: Were any models close to being useful, despite the low scores?
A: Yes, the results showed a notable performance range. Gemini 3 Flash (24%) and GPT-5.2 (23%) were at the top, significantly outperforming others. This suggests these models have a stronger, though still insufficient, grasp on the required reasoning. They might function as highly error-prone assistants rather than replacements, highlighting the gap between partial aid and full automation.

Q10: Where can I learn more about the technical details or see the test questions?
A: The research paper and the full suite of benchmark tasks are publicly available on the Hugging Face platform. This transparency allows other researchers, professionals, and interested parties to examine the complexity of the queries and understand the evaluation methodology firsthand.
