Understanding Agentic AI: Insights from Two Years of Research

Two years ago I wrote my first article about agentic AI†. I had collected 43 platforms that could build agents, I was running Crew AI experiments in a setup where multiple AI instances talked to each other to get things done, and I was so friggin’ excited about where this was going in the way you get excited when you can finally see the shape of something important. That was April 2022. I was also wrong about the timeline, right about the direction, and completely unprepared for how much of the next two years would be spent explaining to enterprise stakeholders why their agent pilot was not, in fact, an agent in any meaningful sense.

Let me be precise about what I mean by agent, because I still find myself explaining it roughly once a week and the confusion is not getting better. And no, I’m not talking about the Copilot rebrand and certainly not the Teams integration with the chat window and the “Ask Copilot” button that 97% of the people who paid for it have never pressed. I am in the 3% who actually use it. I checked. It is not running an autonomous loop. What practitioners mean by agent is this. An autonomous agent perceives its environment, reasons about a goal, takes a sequence of actions, and does not pause to ask a human what to do next. It runs without a human command to trigger each step. A chatbot is human-bound by design and it sits and waits. An agent does not wait. The minimum viable definition of an agent I use is “a system that can run inside a workflow without a human at the ignition switch”, and that distinction is doing more work than most vendor marketing wants you to notice.

A comic-style image depicting two characters in a digital, green matrix setting. The first character, wearing sunglasses and a black coat, stands confidently with one hand raised, illustrating the concept of an 'agent' that perceives and acts. The second character, also in sunglasses, is shown with arms crossed, engaging in conversation, emphasizing the contrast between a chatbot that waits and an agent that does not.

From those 43 platforms, a research program eventually took shape. Eigenvector Research, in collaboration with Inholland University of Applied Sciences, tracked 177 agentic deployments‡ across 20 sectors over the course of two years. Those were real production systems, running in companies that faced real-life consequences when they went wrong. The findings were not flattering to the narrative that Big Tech is trying to spin. Only 35% of processes can be automated with the current state of AI. I am talking about Zone I and II, where agentic automation is genuinely reliable, governance is manageable, and the ROI survives contact with reality. The rest hit a wall.

Data quality is a big issue, but so is the cost of post-deployment governance exceeding the savings, and then there’s the fact that our current line-up of agents cannot sustain long-horizon multi-step workflows without breaking down. Adding insult to injury, vendor ROI claims, when measured against observed outcomes, are overstated by roughly a factor of two. Governance was the primary bottleneck in the majority of failures, not so much model capability or hallucinations. The organisations that failed had the right models, but they simply chose the wrong processes to automate. That body of evidence is what the Zone framework is built on (see paper in comments), and it is what the ATLAS Book of Knowledge synthesizes alongside the academic literature.

Yes, I built a website called ATLAS (link in comments), which stands for Autonomous Task and Long-Horizon Agentic Systems. It runs at multi-step-agents .com and pulls in new agentic research every other day, classifies it for relevance to the long-horizon problem, and extracts patterns that practitioners can actually use. I’ve released it to the public so everyone can get access to it, and it’s also a community where we can share our findings. The reason it exists is that the research on what makes agents break, and what might fix them, is moving faster than any practitioner can track alone, and the consequences of not tracking it show up in production systems, not in benchmark scores.

This blog post is about what ATLAS found, and a few days ago, based on the research collected on the website, we created “The ATLAS Book of Knowledge” which is a booklet of a few hundred pages of exegesis across 305 papers and 773 patterns, and it’s available for download in the comments. That is the longer version. What follows next is the argument.

Comic panel depicting characters dressed in suits with sunglasses extending red and blue pills, referencing choice and knowledge.

† 43 Autonomous AI platforms | LinkedIn

‡ Paper in the comments


Most of what vendors call “Agentic” isn’t ready

The ATLAS corpus contains 773 documented architectural patterns for agentic AI systems. Of those, 569 are classified as experimental. That is 73.6% of the pattern library that has not been validated in production, has not been stress-tested against enterprise failure modes, and is not ready for deployment without significant risk management on top.

Take a moment with that number. The industry is selling Zone III agentic capability as a near-term enterprise deliverable. The actual research community, I mean the people doing the work at Princeton, Stanford, UC Berkeley, CMU, Microsoft Research, they’re all saying that the overwhelming majority of the architectural approaches for building those systems are still in the experimental category. They are not hiding this. It is in the papers. However, nobody in the vendor deck is mentioning it.

The 96 mature patterns – the ones that have been validated, replicated, and tested at production scale – are not glamorous. They cover multi-agent topology, orchestration basics, governance mechanisms, and failure mitigation. These patterns are all about the plumbing, the part of the system that does not make it into the product demo because you cannot demonstrate a checkpoint-and-resume pattern to an executive audience and get a standing ovation.

The ATLAS Book of Knowledge organizes all 773 patterns by zone, by category, and by maturity level. The reason this matters practically is that when you are selecting architectural components for a Zone III deployment, knowing whether you are pulling from the 96 or the 569 determines your risk exposure more than the model you choose or the orchestration framework you build on top. Most teams do not make this distinction. They pick whatever was in the conference talk last month and use that pattern.

A stylized illustration depicting a warehouse filled with crates, some marked 'Experimental - Do Not Deploy' and others 'Production Ready'. Several figures in suits discuss logistics, with one character smiling confidently. The scene conveys a theme of oversight and potential risks in technology deployment.

Six ways your agent will fail you

The CMU Systematic Failure Analysis paper from 2024† is one of the more useful things produced by academic AI research in recent memory, not because it contains surprises, but because it names things that we practitioners have been describing informally for two years without a shared vocabulary. There are six primary failure categories, and all six are independently validated across multiple research groups. And the thing is that all six were present in the 177 deployments we tracked.

An infographic illustrating six potential failures of an agent: 1. Planning Failure - depicted with a character showing a map; 2. Execution Failure - a character struggling with a plan; 3. Memory Failure - displaying a progress screen; 4. Tool Use Failure - a character mishandling a tool; 5. Coordination Failure - characters arguing over directions; 6. Goal Drift - a character moving towards a misleading goal.

The first one we came across is about “planning failures”, and that, my smart friends, is what happens when an agent generates a plan that references tools it does not have, capabilities it cannot exercise, or a sequence of steps that falls apart three moves in because it optimised locally without modelling what comes next. The Princeton ReAct paper called these hallucinated planning and myopic planning. My favourite example from our deployment data involves a procurement agent that planned to “verify approval status” as step four of a twelve-step workflow, despite having no access to the approval system and no mechanism to request it. Upon review, the plan looked reasonable, but it was not possible.
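A plan like that can be caught before execution with a simple pre-flight check. What follows is a minimal sketch, not something taken from the deployment data: the plan format (`PlanStep` with a `required_tool` field) and the tool names are hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanStep:
    index: int
    description: str
    required_tool: str  # hypothetical field: the tool this step depends on


def find_unexecutable_steps(plan, available_tools):
    """Return the steps whose required tool the agent does not actually have."""
    return [step for step in plan if step.required_tool not in available_tools]


# Illustrative data only; these tool names are assumptions for the example.
available_tools = {"erp_read", "email_send", "pdf_parse"}
plan = [
    PlanStep(3, "extract invoice lines", "pdf_parse"),
    PlanStep(4, "verify approval status", "approval_system_read"),
    PlanStep(5, "notify requester", "email_send"),
]

blocked = find_unexecutable_steps(plan, available_tools)
for step in blocked:
    print(f"step {step.index} ({step.description}) needs missing tool: {step.required_tool}")
```

The point of the check is that it runs against what the agent has been granted, not against what the plan text sounds like. A plan that reads fine to a reviewer still fails it.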

Execution failures are different. The agent has a correct plan and cannot execute it reliably. Temporal’s durable execution research is the foundational work here. They found that workflow state must be externally persisted, because a Zone III agent that fails at step 97 of a 100-step workflow and has to restart from step one is bloody inefficient and also operationally broken, because the first 96 steps may have had side effects that cannot be undone. And so, we came to the conclusion that most enterprise agentic implementations do not have durable execution. What they do have is a retry button and an optimistic agentic process operator (a human).
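To make the checkpoint-and-resume idea concrete, here is a minimal sketch that persists progress to a JSON file after every step. It is not Temporal’s API and not how any particular product does it; `run_workflow`, the checkpoint format, and the simulated crash are all assumptions made for the example.

```python
import json
import os
import tempfile


def run_workflow(steps, checkpoint_path):
    """Run steps in order, persisting progress so a crash resumes instead of restarting."""
    completed = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            completed = json.load(f)["completed"]

    for i in range(completed, len(steps)):
        steps[i]()  # may raise; the checkpoint still reflects the last finished step
        with open(checkpoint_path, "w") as f:
            json.dump({"completed": i + 1}, f)
    return len(steps) - completed  # number of steps executed in this call


executed = []
fail_once = {"armed": True}


def make_step(n):
    def step():
        if n == 2 and fail_once["armed"]:
            fail_once["armed"] = False
            raise RuntimeError("simulated crash")
        executed.append(n)
    return step


steps = [make_step(n) for n in range(4)]
checkpoint = os.path.join(tempfile.mkdtemp(), "workflow.json")

try:
    run_workflow(steps, checkpoint)
except RuntimeError:
    pass  # the first attempt dies at step 2
run_workflow(steps, checkpoint)  # the second attempt resumes at step 2
print(executed)
```

Each step runs exactly once across both attempts. The sketch still has a gap: a crash between finishing a step and writing the checkpoint would re-run that step, which is why real durable execution also needs idempotent steps or transactional state.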

Then there are memory failures, which are unique to agentic systems and have no clean analogue in conventional software. I’m referring to context overflow, context contamination, and memory inconsistency, the three subtypes documented in the MemGPT paper from UC Berkeley, plus agent handoff degradation, and each of them produces a different failure signature. Context overflow occurs when the agent loses track of what happened early in the workflow because the context window filled up. Context contamination is subtler: the agent continues to function, but the quality of its decisions degrades because there is irrelevant or misleading information sitting in its working memory. Memory inconsistency is when the agent holds contradictory beliefs simultaneously and cannot resolve them. Agent handoff degradation occurs when one agent hands a task over to another and the quality of the context (or a document) diminishes along the way. On that last issue, Microsoft found in April 2026 that LLMs corrupt a substantial share of your documents and context when you delegate, and that the problem gets worse further down the line, across multiple steps.

Tool use failures were the primary failure cause in 68% of WebArena task failures, more than planning and execution failures combined. I’m talking about an agent using the wrong tool, or the right tool with the wrong parameters, or the right tool executed correctly but with its output misinterpreted. The Model Context Protocol paper from Anthropic in 2024 addresses this at the infrastructure level by standardizing the interface between agents and tools. And in a paper currently discussed on ATLAS, “Large Language Models Can Self-Correct with Tool-Interactive Critiquing”, researchers from Tsinghua University added an evaluator agent to a tool-calling agent pattern, which revised the action where needed. That led to an improvement of up to 40% compared to a single agent with self-reflection.
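The first two failure shapes, wrong tool and wrong parameters, can be caught by a check that sits between the agent and the tool. The sketch below is a deterministic stand-in for an evaluator step, not the method from the paper and not part of MCP; `TOOL_SPECS` and the tool names are hypothetical.

```python
# Hypothetical registry: tool name -> required parameter names and their types.
TOOL_SPECS = {
    "get_invoice": {"invoice_id": str},
    "convert_currency": {"amount": float, "to": str},
}


def critique_tool_call(call):
    """Return a list of problems with a proposed call; an empty list means it may run."""
    spec = TOOL_SPECS.get(call["tool"])
    if spec is None:
        return [f"unknown tool: {call['tool']}"]
    problems = []
    for name, expected_type in spec.items():
        if name not in call["args"]:
            problems.append(f"missing parameter: {name}")
        elif not isinstance(call["args"][name], expected_type):
            problems.append(f"parameter {name} should be {expected_type.__name__}")
    for name in call["args"]:
        if name not in spec:
            problems.append(f"unexpected parameter: {name}")
    return problems


proposed = {"tool": "convert_currency", "args": {"amount": "12.50", "to": "EUR"}}
for problem in critique_tool_call(proposed):
    print(problem)  # the critique would be fed back to the agent for a revised call
```

This does nothing for the third shape, a correct call whose output is misread. That one needs something that understands the result, which is where an evaluator agent earns its keep.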

By the way, the ATLAS Book of Knowledge (the booklet) documents 23 Zone III patterns specifically for tool use failure mitigation.

Coordination failures emerge in multi-agent systems and have no single-agent equivalent. I’m talking about role confusion, communication failures between agents, and conflicting world models, where multiple agents develop incompatible understandings of what the shared task actually is. The MetaGPT paper documented these systematically in software development contexts. The fix, as both MetaGPT and AutoGen found independently, is structured coordination protocols, not free-form agent communication. Agents talking to each other in natural language without a formal protocol produce spectacular misalignments.
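As an illustration of what a structured protocol buys you over free-form chat, here is a minimal sketch of a typed message with role-based validation. It is not the MetaGPT or AutoGen API; the `Intent` values, the `ROLES` table, and the agent names are assumptions made up for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Intent(Enum):
    REQUEST = "request"
    RESULT = "result"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class AgentMessage:
    sender: str
    recipient: str
    intent: Intent
    task_id: str
    payload: dict


# Hypothetical role table: which intents each agent role is allowed to send.
ROLES = {
    "planner": {Intent.REQUEST},
    "coder": {Intent.RESULT, Intent.ESCALATE},
}


def validate(message):
    """Reject messages from unknown roles or with intents outside the sender's role."""
    allowed = ROLES.get(message.sender)
    if allowed is None:
        raise ValueError(f"unknown sender role: {message.sender}")
    if message.intent not in allowed:
        raise ValueError(f"{message.sender} may not send {message.intent.value}")
    return message


ok = validate(AgentMessage("planner", "coder", Intent.REQUEST, "T-1", {"spec": "add login"}))
print(ok.intent.value)
```

Role confusion now fails loudly at the message boundary instead of surfacing twenty steps later as two agents working on different tasks.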

And then there’s goal drift. And that’s the one that keeps me up. It is the failure mode where the agent gradually pursues a slightly different goal than the one it was given. The Microsoft Research Agent Drift paper from 2024 documented the mechanism behind how that happens. First you see context contamination building up, then proxy goals displace the original goals, and tool call patterns become increasingly random. Also, semantic coherence degrades exponentially with task length. An agent at 95% coherence at step 10 may be at 60% coherence by step 60. And when that happens, the agent will still be working, but it’s working on the wrong thing.
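To see what exponential degradation implies beyond step 60, you can fit a decay curve through the two points above. This is a back-of-the-envelope sketch: it assumes a pure exponential model c(t) = c0 · exp(−λ·(t − t0)), and it treats the 95% and 60% figures as exact, which the “may be” above does not promise.

```python
import math


def decay_rate(c_start, step_start, c_end, step_end):
    """Per-step decay constant of an exponential passing through both points."""
    return math.log(c_start / c_end) / (step_end - step_start)


def coherence_at(step, c_start, step_start, rate):
    """Coherence predicted by the fitted exponential at a given step."""
    return c_start * math.exp(-rate * (step - step_start))


rate = decay_rate(0.95, 10, 0.60, 60)
for step in (10, 60, 100, 200):
    print(step, round(coherence_at(step, 0.95, 10, rate), 2))
```

Under that assumption the agent loses a bit less than 1% of its remaining coherence per step, which sounds harmless until you extrapolate it to a workflow of a hundred steps or more.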

A digital graphic featuring two characters in a futuristic, green-tinted environment. The top section shows the characters discussing coherence in goals, labeled 'Original Goal' and 'Proxy Goal', with text about an agent still working. The bottom section features a character providing statistics about deployment failures and a comment on semantic coherence degrading over time, with the phrase 'Goal Drift' prominently displayed.

† Just google it 😉

What “working” actually means in Zone III

34% of Zone III deployments that passed their acceptance criteria showed significant goal drift within 30 days of going live. This is from the Eigenvector Research 2025 paper in which we tracked the 177 deployments I mentioned in the introduction. Again, that number is not from a benchmark, but from production systems, with real consequences.

The Microsoft Research paper from April 2026 (“LLMs Corrupt Your Documents When You Delegate”) provides the clearest mechanistic explanation for what that drift looks like at the document level. The researchers ran 19 different LLMs through a benchmark called DELEGATE-52 that simulates long delegated workflows across 52 professional domains including coding, crystallography, and music notation.

The finding is that even frontier models like Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 corrupt an average of 25% of document content by the end of long workflows. Other models fail more severely.

The word “corrupt” here needs a bit of unpacking. I’m not talking about hallucinations in the sense of the model inventing facts from nothing. No, the corruption here is subtler. Information gets modified without announcement, structural elements get dropped without a flag, domain-specific notation gets approximated badly, and then the agent introduces changes that are locally plausible at each step but globally wrong across the full document, and because each individual step looks reasonable, no single point of intervention catches it. The errors are sparse, but they are severe, and they compound over time. And the thing is that they are silent. You won’t see your observability platform announce them as failures.
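One cheap way to make silent edits visible is to diff the document after every delegated step and flag any change outside the region that step was supposed to touch. The sketch below uses Python’s standard `difflib`. It is my own illustration, not something from the DELEGATE-52 paper, and the document and the `allowed` set are made-up examples.

```python
import difflib


def unexpected_changes(before, after, allowed):
    """Flag edits that touch lines of `before` (0-indexed) outside the `allowed` set."""
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        touched = set(range(i1, i2)) or {i1}  # a pure insertion is attributed to its position
        if not touched <= allowed:
            flagged.append((tag, before[i1:i2], after[j1:j2]))
    return flagged


before = ["Title: Q3 report", "Revenue: 4.2M EUR", "Churn: 3.1%", "Notes: draft"]
after = ["Title: Q3 report", "Revenue: 4.2M", "Churn: 3.1%", "Notes: final"]

# The step was only asked to update the notes line (index 3).
flagged = unexpected_changes(before, after, allowed={3})
for tag, old, new in flagged:
    print(tag, old, "->", new)
```

The dropped currency unit is exactly the kind of locally plausible change nobody notices in review. A line-level diff will not catch a wrong edit inside the allowed region, so treat it as a tripwire, not as verification.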

Adding agentic tool use to the setup did not help either. The researchers tested this explicitly, and the degradation rate was not meaningfully lower with tool access than without it. This is important because the standard response to document quality problems in agentic workflows is to add more tools. More retrieval, more verification steps, more structured outputs. The DELEGATE-52 results indicate that at sufficient workflow length, the fundamental problem is the agent’s inability to maintain coherent intent across a long context.

The ATLAS Book of Knowledge addresses this through the Semantic Coherence Monitor pattern and the Goal Anchoring pattern. Both are Zone III patterns dedicated to solving this problem. Both require architectural decisions made before deployment, not patches applied after the 30-day cliff‡ becomes visible.

A dystopian scene depicting a character standing on the edge of a cliff labeled 'Day 30', looking over a ruined cityscape with digital data streams in the background. Two characters engage in conversation, one expressing confidence in passing all tests, while the other warns about instability. Visual elements include green checkmarks, warnings about system failures, and a chart showing deployment goal drift.

‡ The 30-day cliff is a term coined here for the finding that 34% of Zone III deployments showed significant goal drift within 30 days of passing acceptance criteria. The system looked fine at launch. A month later, a third of them were pursuing a different goal than the one they were given.

Memory is the architecture

Five independent research groups arrived at the same conclusion without coordinating on it. This is worth taking seriously, because in a field prone to hype cycles and institutional groupthink, convergence across groups with different methodologies is one of the few signals that something structurally true is being discovered.

A graphic illustrating collaboration between five research labs on memory types, featuring charts on working, episodic, semantic, and procedural memory, with a central figure asserting a unified conclusion among diverse methodologies.

UC Berkeley’s MemGPT paper proposed the memory hierarchy framework, drawing the explicit analogy to operating system memory management. A Zone III agent needs what an operating system needs: different memory stores at different latency and capacity tradeoffs, with an explicit management layer that decides what to keep in working memory and what to offload. Stanford’s Generative Agents paper arrived at a similar architecture independently, with memory streams for episodic memory, reflection mechanisms for semantic memory consolidation, and planning for procedural memory. Princeton’s CoALA framework formalised all four components – working, episodic, semantic, procedural – as architectural requirements. Microsoft Research’s Agent Drift paper demonstrated empirically that memory failures are the primary cause of semantic drift, and Eigenvector Research’s GPR (Governed Process Runtime) framework implemented a four-level memory hierarchy in production architecture.

For readers who have not encountered this framing, working memory is what the agent can see right now, the current context window. It is fast and zero-latency, but it is also reasonably small. When it fills up, old information gets pushed out. Episodic memory, on the other hand, is a record of recent experience: what the agent did in the last 50 steps, what it learned, what failed. It lives in an external store and requires retrieval, but it allows the agent to maintain awareness of its own history. This type of memory has been made popular by OpenClaw, Hermes and AgentZero, and now you see ChatGPT, Claude and Gemini implementing an improved version of it, because it is simply a necessity. Then there’s semantic memory, which is general knowledge about the domain, with facts, relationships and patterns that are relevant across multiple tasks. Procedural memory is knowledge about how to do things, like reusable routines, tool invocation patterns, workflow templates and so on.
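The four stores can be sketched as a single data structure where working memory is bounded and overflows into the episodic store instead of being lost. This is an illustration of the framing only, not the MemGPT, CoALA, or GPR implementation; the `AgentMemory` class, its capacity, and the keyword-based `recall` are simplifying assumptions.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    working_capacity: int = 4
    working: deque = field(default_factory=deque)    # current context: small and fast
    episodic: list = field(default_factory=list)     # external record of what happened
    semantic: dict = field(default_factory=dict)     # domain facts and relationships
    procedural: dict = field(default_factory=dict)   # reusable routines by name

    def observe(self, event):
        """Add an event to working memory, offloading the oldest one when full."""
        self.working.append(event)
        while len(self.working) > self.working_capacity:
            self.episodic.append(self.working.popleft())

    def recall(self, keyword):
        """Naive retrieval: episodic events that contain the keyword."""
        return [event for event in self.episodic if keyword in event]


memory = AgentMemory(working_capacity=2)
for event in ["goal: reconcile invoices", "opened ledger", "matched 40 rows", "found mismatch"]:
    memory.observe(event)

print(list(memory.working))
print(memory.recall("goal"))
```

Notice that the original goal has already left working memory after four events. Without the episodic store and a retrieval step, the agent would simply no longer know what it was asked to do.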

The reason this matters to Zone III agentic systems is that without hierarchical memory, agents fail in a specific way: they produce good outputs early in a workflow but progressively worse outputs later, because they are operating on a degrading picture of what they are doing and why. The MemGPT paper demonstrated that agents with hierarchical memory significantly outperform agents with flat memory on long-horizon tasks.

In the ATLAS Book of Knowledge, we documented 38 memory-related patterns across Zone II and Zone III – more than any other architectural category – and the Pattern-Paper Correspondence Matrix in the book maps each pattern back to its primary research foundation.

An infographic illustrating different types of memory: working memory, episodic memory, semantic memory, and procedural memory, with a central character resembling a hacker surrounded by digital elements and green matrix code.

Self-improving scaffolding – the Princeton signal

The most interesting paper in the ATLAS corpus for May 2026 is “Continual Harness”, from Princeton University and Google DeepMind, published on May 12, 2026. It describes what happens when you let an agent improve its own scaffolding during a run, and the fun thing is that it does this using Pokémon, which is either the most academic use of a Game Boy game ever or the most entertaining enterprise AI research paper, depending on your perspective.

A harness, for readers encountering the term for the first time, is the scaffolding layer between a foundation model and its environment. It contains the system prompt that tells the model how to behave, the sub-agents it can invoke for specific tasks, the skills it has available as reusable routines, and its memory store. A hand-engineered harness is built by humans who know the domain and have encoded good strategies into all four components before the agent runs.

A futuristic illustration featuring a character in a long coat standing in a digital environment filled with green binary code and holographic interfaces. The top section includes various elements like system prompts, skills, sub-agents, and a memory store. A second character, holding tools, appears to adjust the system, expressing concern about too many variables. The bottom section emphasizes continuous improvement with the statement 'The harness improves. I keep going.' and mentions Princeton University and Google DeepMind.

The standard approach is to hand-engineer the harness and call it done, but Continual Harness replaces the human engineer with an automated refiner that reads the agent’s recent trajectory, identifies failure signatures like navigation loops, tool call failures, and stalled objectives, and rewrites the harness components in place, without resetting the run. The agent keeps going, and the harness simply evolves around it. I described this process some time ago in the Governed Process Runtime paper, published on TechRxiv and eigenvector/research (in comments).
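The shape of that refiner loop can be sketched in a few lines. To be clear about what is assumed here: this is not the implementation from the Continual Harness paper or from GPR, it uses a simple counting rule as a stand-in for the refiner, and the `Harness` fields, the window, and the threshold are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Harness:
    system_prompt: str
    skills: dict = field(default_factory=dict)
    sub_agents: dict = field(default_factory=dict)
    memory: list = field(default_factory=list)


def detect_loop(trajectory, window=6, threshold=3):
    """Return a state visited at least `threshold` times in the last `window` steps, if any."""
    recent = trajectory[-window:]
    if not recent:
        return None
    state, count = Counter(recent).most_common(1)[0]
    return state if count >= threshold else None


def refine(harness, trajectory):
    """Patch the harness in place when a loop is detected; the run itself is not reset."""
    looping_state = detect_loop(trajectory)
    if looping_state is None:
        return False
    harness.memory.append(
        f"You have revisited '{looping_state}' repeatedly; try a different route."
    )
    return True


harness = Harness(system_prompt="Reach the next milestone.")
trajectory = ["route_101", "cave", "route_101", "cave", "route_101", "town"]
changed = refine(harness, trajectory)
print(changed, harness.memory)
```

The part worth copying is the contract, not the rule: the refiner only reads the trajectory and only writes to harness components, so the agent’s run continues untouched while its scaffolding changes underneath it.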

The results on Pokémon Emerald with Gemini Pro are something to hold onto for your next board presentation. Continual Harness reached 100% of milestones at a median cost of $130, against the minimal baseline achieving 98% of milestones at $215. That is roughly a 40% cost reduction with no performance loss, starting from the same raw interface with no curated knowledge, no hand-crafted tools, and no domain scaffolding.

It is clear to me that the future of agentic AI in the enterprise will not come from smarter or bigger models, but from runtimes that prevent otherwise capable agents from behaving like a wiener dog with access to production systems and a commitment to doing whatever feels right in the moment.

That is why I am an absolute fan of the current batch of open-source models. They are capable enough for the task at hand. Even for the Zone III tasks. And when you implement something like my GPR or the Continual Harness, you will be able to tackle the most difficult tasks. Especially when you combine them with the latest model from Alibaba, Qwen 3.7-Max, which is especially designed for long-duration autonomous agent execution, multi-agent workflows, MCP/tool orchestration – even thousand-step workflows, and a sustained operation for up to 35 hours with more than 1000 tool calls.

So yeah, I am a huge fan of combining Qwen 3.7-Max with Eigenvector’s GPR / Princeton’s Continual Harness + our neurosymbolic AI that injects governance right into the agent instead of a post-factum check-box.

Talking about model capability, on Flash-Lite, every Continual Harness variant performed worse than the minimal baseline. The harness improvement requires a model good enough to make use of the improved harness. This maps directly to the PASF capability-floor concept where there is a minimum capability threshold below which architectural sophistication does not help, and above which it compounds. The ATLAS Book of Knowledge connects this finding to the self-evolving agent scaffolding research and the estimated 25 percentage point automation ceiling uplift that self-evolving architectures can provide over the base PASF ceiling.

The co-learning loop they presented in the paper – where an open-source model’s weights and the harness state update jointly in the same reset-free training loop – is the piece that points furthest toward what Zone III architecture might look like in 2027. The models improve (like Qwen), harnesses get better, and when combined, they improve each other. Without resetting. That is a qualitative shift in how we think about agent reliability, and it has not made it into any vendor deck yet. Another reason why I started ATLAS.

Comic-style illustration comparing two agents in a futuristic setting, highlighting the advantages of a 'Continual Harness' over a 'Standard Agent'. The left side shows a stressed agent with a price tag of $215, while the right side features a confident agent labeled $130. Key phrases like '100% milestones completed' and '40% cost reduction' are prominently displayed.

Governance is the building

The third major convergence in the ATLAS corpus is also the one most systematically ignored in enterprise deployments. In UC Berkeley’s Agent Safety paper, they argued that governance constraints must be built into the agent’s decision-making architecture, not bolted on afterward. Neurosymbolic AI, here we come!‡

Oxford’s AI Governance paper provided empirical evidence that procedural compliance checking — the if-then compliance rules that most enterprise AI governance frameworks use — fails to generalize to novel compliance scenarios. MIT CSAIL found that governance mechanisms integrated into the agent’s workflow are significantly more effective at maintaining human oversight than external monitoring. Tsinghua’s AI Governance Survey reached the same conclusion from a policy direction. Eigenvector Research’s OCG framework operationalizes all of it into one paper (see eigenvector/research).

The numbers behind the governance lag are stark. 67% of Zone III deployments in the State of AI Agents 2025 report went live without complete governance infrastructure. 73% of enterprise AI failures were attributable to inadequate human oversight infrastructure. 78% of Zone III failures could have been predicted by a proper PASF assessment before deployment began.

A comic panel featuring two characters, Mr. Smith and Atlas, discussing governance and compliance in a digital environment. Charts and statistics about governance issues are displayed prominently, emphasizing the need for oversight. The text highlights the importance of governance in organization and technology.

The Ontological Compliance Gateway – the OCG – is the neurosymbolic framework we use at Eigenvector for embedded agentic governance, and it turns governance into harness architecture rather than procedural compliance. The core distinction is that procedural compliance checking fires rules at specific points in a workflow. If the agent does X, check Y. This works for the cases you anticipated when you wrote the rules, but it does not work for the cases you did not anticipate, which is precisely the category Zone III agents encounter most often, because they are operating in long-horizon, low-human-oversight environments where novel situations are the norm.

The OCG represents compliance requirements as formal semantic constraints over a knowledge graph of the agent’s domain. Actions get evaluated against this semantic layer before execution. The compliance reasoning is explicit and traceable, producing audit artifacts that satisfy regulatory requirements for explainability. Changes to regulatory requirements update the ontology, not the procedural code. The practical implication for organisations deploying under GDPR, the EU AI Act, SOX and the like is that the OCG architecture produces the documentation these frameworks require without requiring a separate documentation effort. I see the audit trail as a byproduct of the agent architecture, not a manual addition.
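To show the difference between a rule on a specific action and a constraint over a class of things, here is a toy version of the idea. It is emphatically not the OCG implementation: the mini-ontology, the `FORBIDDEN` pairs, and the destination names are all hypothetical, and a real system would use a proper knowledge graph and reasoner.

```python
# Hypothetical mini-ontology: child -> parent ("is a") edges.
IS_A = {
    "iban": "financial_identifier",
    "financial_identifier": "personal_data",
    "email_address": "personal_data",
    "order_total": "business_data",
}

# Hypothetical constraint: (data class, destination class) pairs that are forbidden.
FORBIDDEN = {("personal_data", "external_system")}
DESTINATION_CLASS = {"crm_internal": "internal_system", "vendor_api": "external_system"}


def ancestors(concept):
    """Yield the concept and every class it inherits from."""
    while concept is not None:
        yield concept
        concept = IS_A.get(concept)


def evaluate(action):
    """Check an action before execution and return (allowed, audit_record)."""
    dest_class = DESTINATION_CLASS[action["destination"]]
    for data_class in ancestors(action["data"]):
        if (data_class, dest_class) in FORBIDDEN:
            reason = f"{action['data']} is a {data_class}; {data_class} may not go to {dest_class}"
            return False, {"action": action, "allowed": False, "reason": reason}
    return True, {"action": action, "allowed": True, "reason": "no constraint violated"}


allowed, record = evaluate({"data": "iban", "destination": "vendor_api"})
print(allowed, record["reason"])
```

Nobody wrote a rule about IBANs. The action is blocked because the ontology says an IBAN is personal data, and the audit record explains that chain. Adding a new data type means adding one edge to the graph, not another if-statement.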

The ATLAS Book of Knowledge documents 26 governance-related patterns across Zone II and Zone III. The patterns range from simple approval gates for high-stakes low-reversibility decisions, through exception-based review for routine decisions with escalation on confidence thresholds, to full OCG implementation for regulated industry deployments. The governance architecture chapter of the book includes a compliance coverage matrix template that maps regulatory requirements to architectural components, which is useful if you want to explain to a regulator how your agent satisfies the EU AI Act’s high-risk system requirements without spending three weeks writing documentation by hand.
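The two simpler ends of that range fit in one routing function. This is a minimal sketch under my own assumptions, not a pattern copied from the book: the three outcomes, the parameter names, and the 0.8 threshold are illustrative only.

```python
def route_decision(high_stakes, reversible, confidence, threshold=0.8):
    """Decide how much human involvement a proposed agent action needs."""
    if high_stakes and not reversible:
        return "approval_gate"        # a human signs off before anything happens
    if confidence < threshold:
        return "escalate_for_review"  # exception-based review on low confidence
    return "auto_execute"             # routine decision, logged but not blocked


print(route_decision(high_stakes=True, reversible=False, confidence=0.99))
print(route_decision(high_stakes=False, reversible=True, confidence=0.55))
print(route_decision(high_stakes=False, reversible=True, confidence=0.93))
```

The ordering matters: a high-stakes, irreversible action goes to a human even when the agent is 99% sure, because the agent’s confidence is not the thing you are protecting against.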

Comic-style illustration depicting the Ontological Compliance Gateway (OCG) in action, featuring characters resembling data enforcement agents. The top panel shows a character preparing to initiate an action within a network of compliance variables. The middle panel depicts an attempt to bypass a semantic barrier, with an 'invisible wall' representing compliance enforcement. The bottom panel includes a character discussing the audit trail as a part of the architecture, emphasizing the evaluation of actions against semantic constraints.

‡ Read: The boring AI that keeps planes in the sky | LinkedIn

Judging the judge

The LLM-as-a-Judge paper from UC Berkeley is one of the most widely cited papers in the agentic AI evaluation literature, and deservedly so. In it they report that strong LLM judges like GPT-4 (back in the day) match human preferences at over 80% agreement, which is the same agreement level you get between two humans evaluating the same outputs. For a scalable alternative to expensive human evaluation, that is a compelling number.

The problems, however, are in the fine print. Position bias means the judge systematically favours whichever response appears first in a pairwise comparison (sic!). Verbosity bias means longer answers get rated higher regardless of quality. Self-enhancement bias means a model judging outputs tends to give higher scores to outputs that resemble its own style. Limited reasoning ability means the judge struggles with the exact category of outputs — complex, multi-step, domain-specific reasoning — that Zone III systems produce most often.

A humorous illustration depicting a character called LLM Judge seated at a courtroom with goblin-like creatures voicing biases. The scene features digital elements inspired by 'The Matrix', highlighting themes of AI evaluation and judgment.

In my mind’s eye, I see you just finishing up on your agentic business case, taking a short break and reading my post and then slowly sinking into a depression. I know, it’s a daunting task to keep up with developments, and there are so many things that can go wrong, but don’t worry, all these issues are manageable with good experimental design. The LLM-as-a-Judge paper itself proposes mitigations. The problem is that most Zone III evaluation in enterprise deployments does not use good experimental design, because enterprise teams do not have time to run double-blind pairwise evaluation with carefully randomised position orderings. They use the LLM judge as a quick pass/fail mechanism, inherit all the biases, and report the results as if the benchmark were neutral.
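The cheapest of those mitigations, for position bias at least, is to ask the judge twice with the answers swapped and only accept a winner when both orderings agree. Here is a minimal sketch with stub judges standing in for a real LLM call. The `judge` signature and its return values are assumptions for the example.

```python
def debiased_verdict(judge, prompt, answer_a, answer_b):
    """Ask the judge twice with the answers swapped; keep the verdict only if it is consistent."""
    first = judge(prompt, answer_a, answer_b)   # returns "first", "second", or "tie"
    second = judge(prompt, answer_b, answer_a)
    if first == "first" and second == "second":
        return "a"
    if first == "second" and second == "first":
        return "b"
    return "tie"  # inconsistent or undecided: do not declare a winner


def position_biased_judge(prompt, first, second):
    """Stub: always prefers whatever is shown first."""
    return "first"


def length_judge(prompt, first, second):
    """Stub: prefers the longer answer regardless of position."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


print(debiased_verdict(position_biased_judge, "q", "short", "a much longer answer"))
print(debiased_verdict(length_judge, "q", "short", "a much longer answer"))
```

The purely position-biased judge collapses to a tie, which is what you want. The second stub shows the limit of the trick: a judge with verbosity bias is perfectly consistent across both orderings, so swapping positions does nothing for it.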

The deeper problem is that even a perfectly calibrated LLM judge is not the right instrument for Zone III evaluation. The ATLAS corpus documents what it calls the evaluation crisis: existing benchmarks were designed for single-turn or short-sequence performance. OdysseyBench, which tested tasks requiring 50-200 sequential actions, found that no existing model achieved better than an 8% success rate on tasks requiring more than 100 sequential actions. SWE-bench found that state-of-the-art models resolved only 12.5% of real-world GitHub issues, because execution in a real environment with ambiguity and unexpected tool behavior is a categorically harder problem than anything the single-turn benchmarks measure.

The PADE testing methodology in our PASF/PADE paper (eigenvector/research) is the most complete operationalization of Zone III evaluation that is currently available. It combines functional testing, failure mode testing against the six-category taxonomy from chapter two, adversarial testing designed to expose Zone III failure modes specifically, and long-horizon testing on tasks significantly longer than those in the standard test suite. It is expensive, and that’s why I’ve paused the online tool that I had created to help you guys with your projects‡. It is also the only approach that actually predicts real-world performance rather than benchmark performance, and those two things diverge significantly enough that the distinction matters.

A comic-style illustration titled 'The Evaluation Crisis' featuring two scenarios: on the left, a character celebrating a successful benchmark with a 'PASS! PRODUCTION READY!' message; on the right, a chaotic scene depicting various obstacles in a construction-themed environment, including unexpected failures and branching paths. There are success rates from two benchmark tests displayed at the bottom, highlighting a significant gap between benchmarks and real-world performance.

‡ PASF PADE process analyzer at ai-automation .my (currently unplugged, being rebuilt to run on my own server)

What the frontier actually looks like right now

On the ATLAS website we classify papers on two scores. The LH Score, or Long-Horizon Score, rates a paper’s relevance to Zone III operation on a scale of 0 to 10, and the ENT Score rates enterprise relevance. Papers above 8 on the LH Score are considered highly relevant to Zone III deployment. The corpus currently contains 305 relevant papers, and the platform adds new research every other day.‡
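If you want to picture how that classification works as a filter, here is a minimal sketch. The `Paper` record, its field names, and the placeholder entries are all invented for the example and are not the actual ATLAS data model. The only thing taken from the text is the rule that an LH Score above 8 counts as highly relevant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    title: str
    lh_score: float   # Long-Horizon Score, 0 to 10
    ent_score: float  # enterprise relevance (scale assumed, not stated)


def highly_relevant(papers: list[Paper], threshold: float = 8.0) -> list[Paper]:
    """Papers strictly above the LH threshold, highest LH Score first."""
    hits = [p for p in papers if p.lh_score > threshold]
    return sorted(hits, key=lambda p: p.lh_score, reverse=True)


# Invented placeholder entries, not real ATLAS records.
corpus = [
    Paper("placeholder A", lh_score=9.1, ent_score=6.0),
    Paper("placeholder B", lh_score=8.0, ent_score=9.0),
    Paper("placeholder C", lh_score=8.4, ent_score=7.5),
]
print([p.title for p in highly_relevant(corpus)])
```

Note that placeholder B scores high on enterprise relevance but sits exactly at 8 on the LH Score, so it does not pass a strict "above 8" filter.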

The four open problems that the corpus has not resolved, and that the research community is actively working on, are worth naming clearly because they define where the field actually is in May 2026, as opposed to where the conference talks say it is.

A graphic illustrating four open problems in research: Reliability Ceiling, Metacognition Deficit, Governance Scalability, and Evaluation Problem, with three figures in dark clothing discussing them.

The reliability ceiling is at approximately 50% for complex Zone III tasks. The best current architectures, with hierarchical memory, durable execution, governance middleware and multi-agent coordination, fail on roughly half of the complex tasks they attempt. The exponential semantic drift curve documented by Microsoft Research and the metacognition deficit documented by the “Do AI Know What They Know” paper from 2026 together indicate that the ceiling is structural at the current level of model capability and architectural development. The neuro-symbolic integration work, in which I’ve combined neural language models with symbolic reasoning engines for formal compliance checking and deterministic policy enforcement, is the most promising direction for pushing through this ceiling, and the Friedmann-Gleichung Machine architecture presented in the ATLAS framework is what we position as a solution to this problem.

The metacognition deficit is the finding that current language models systematically overestimate their own accuracy and are most overconfident precisely in the situations where they are most likely to be wrong. The CMU Systematic Failure Analysis paper stated that metacognitive failures (the agent not knowing it is failing) were present in 78% of documented Zone III failures. An agent that cannot recognise failure cannot self-correct, cannot escalate, and cannot trigger the human-in-the-loop mechanisms that the governance architecture provides for exactly this situation.

Until metacognition improves at the model level, Zone III architectures must compensate externally through semantic coherence monitoring and goal anchoring.

The governance scalability problem has no current solution in the research literature. Individual Zone III deployments can be governed adequately using the patterns documented in the ATLAS Book of Knowledge. A portfolio of forty Zone III deployments across an enterprise runs into a problem the research does not address:

The governance infrastructure required grows proportionally with the number of systems, but the organisational capacity to manage that infrastructure does not.

Nobody has published a framework for governing AI factories at scale. This is either an opportunity for research or evidence that most organisations running “AI factories” have not yet encountered the problem at the scale where it becomes acute.

A visually intriguing depiction illustrating the concepts of neural and symbolic intelligence integration, featuring a central figure resembling a character in a futuristic setting.

The evaluation problem remains the most practically consequential open question. Until there is an adequate evaluation methodology for Zone III systems, one that measures semantic coherence over long execution chains, tests all six failure modes, and assesses the sociotechnical system including human oversight effectiveness, enterprises cannot reliably know whether their Zone III deployments are working correctly or are a week away from the 30-day cliff. The ATLAS platform tracks evaluation methodology papers as a specific research stream. The frontier is moving, but it certainly has not arrived yet, no matter what Microsoft, Salesforce, ServiceNow or SAP want you to believe.

The ATLAS Book of Knowledge is the synthesis of where the research stands. I didn’t create it to be a textbook but a living document that gets updated as the corpus grows. The 773 patterns it documents are the current best understanding of what works, what does not work, and what we are not yet sure about. The 569 experimental patterns are the frontier, marked as such with appropriate candour, which is more than most of the industry manages.

The unsexy work is reading the papers, extracting the patterns, building the architecture, and not shipping until the governance is in place. Nobody gets a keynote slot for that. But the 30-day cliff does not care about those keynotes.

A stylized comic panel featuring three characters discussing research documents on a table, with a backdrop of digital code and cityscape. One character takes notes, another examines paper, and the third looks contemplative, highlighting the theme of complex analytical work.

‡ I will add an AI to the platform as soon as I have my own server up and connected, to keep inferencing costs down . . . Research should not require me to sell a kidney.

Signing off,

Marco

Eigenvector builds Agentification factories at scale, for production environments that actually have to pay off, and Eigenvector Research occasionally publishes papers about why this is harder than the demos suggest.

👉 Think a friend would enjoy this too? Share the newsletter and let them join the conversation. LinkedIn, Google and the AI engines appreciate your likes by making my articles available to more readers.

Discover more from TechTonic Shifts