The industry’s obsession with static AI benchmarks is a form of collective delusion. We treat Large Language Models like legacy software—writing unit tests, running regression suites, and hoping for the best. But as we transition into an era of agentic AI, where systems are increasingly autonomous and self-optimizing, these static evaluations are becoming a liability.
If your evaluation strategy is a fixed dataset, you aren’t testing an agent; you’re testing a snapshot of a moving target.
The Failure of Static Evals
In traditional software engineering, we rely on CI/CD pipelines and chaos engineering to understand how systems behave under stress. We break things on purpose to find the edge cases. In the AI space, however, we’ve been stuck in a loop of “prompt engineering”—essentially word-smithing until the model stops hallucinating—and static benchmarks that measure narrow, often irrelevant capabilities.
These benchmarks are “calcified.” They provide a false sense of security because they measure the model’s performance on a set of questions defined today, ignoring the reality that the agent’s environment, user base, and intent will shift tomorrow. When the agent inevitably fails in production, we are left scrambling to patch a system we never truly understood.
From Intent to Outcome
The shift toward “malleable evals” requires a fundamental change in mindset: moving from checking fixed input-output pairs to measuring outcomes.
In an agentic architecture, we shouldn’t be testing if the agent answers “X” when asked “Y.” We should be defining the desired end state and allowing the system to verify its own progress toward that goal. This is “intent engineering.” By breaking down complex agents into modular components—like tool-calling functions or specific MCP (Model Context Protocol) tools—we can isolate failure points and apply observability at the layer where it actually matters.
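To make the idea concrete, here is a minimal sketch of an outcome-based check in Python. Everything in it is an illustrative assumption rather than an established API: the `Intent` class, the `progress` helper, the refund scenario, and the state keys (`order_total`, `refund_amount`, `tools_called`) are all hypothetical. The point is only that the check inspects the end state, one named condition per component, instead of comparing the agent's wording to an expected string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Intent:
    """A desired end state, expressed as named predicates over the final state."""
    name: str
    checks: Dict[str, Callable[[dict], bool]] = field(default_factory=dict)

    def evaluate(self, state: dict) -> Dict[str, bool]:
        # Run every predicate against the end state. Results are reported per
        # condition so a failure can be traced to a single component or tool.
        return {label: bool(check(state)) for label, check in self.checks.items()}


def progress(results: Dict[str, bool]) -> float:
    """Fraction of end-state conditions currently satisfied."""
    return sum(results.values()) / len(results) if results else 0.0


# Hypothetical refund scenario: the wording of the agent's reply is ignored;
# only the resulting state and the tools that were called are inspected.
refund_intent = Intent(
    name="refund_issued",
    checks={
        "refund_recorded": lambda s: s.get("refund_amount") == s.get("order_total"),
        "customer_notified": lambda s: "send_email" in s.get("tools_called", []),
        "no_duplicate_refund": lambda s: s.get("tools_called", []).count("issue_refund") == 1,
    },
)

final_state = {
    "order_total": 40.0,
    "refund_amount": 40.0,
    "tools_called": ["lookup_order", "issue_refund"],
}

results = refund_intent.evaluate(final_state)
print(results)
print(f"progress toward intent: {progress(results):.2f}")
```

In this run the refund was recorded exactly once but no email tool was called, so the failure is isolated to the notification step rather than reported as one opaque "wrong answer."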
The Case for Always-On Evaluation
If code is cheap and tokens are fast, our testing infrastructure must be equally fluid. We need to move toward:
- Self-Curating Suites: Instead of manually crafting test cases, we should feed production traces back into the system. If 80% of your agent’s interactions are standard, the remaining 20%—the “weird” queries that break your business—should be automatically captured and used to evolve your test suite.
- Telemetry-in-the-Loop: The harness itself should consume the agent’s telemetry. If the agent can see its own error rates, latency, and costs, it can be programmed to self-correct or to trigger a re-evaluation of its own logic.
- Always-On Optimization: Evaluation should not be a pre-deployment gate; it should be a continuous, background process. By using the agent to evaluate itself against an intent-based rubric, we create a feedback loop that evolves alongside the application.
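The first two ideas in this list can be sketched together in a few lines of Python. This is a rough illustration under stated assumptions, not a reference design: `Trace`, `SelfCuratingSuite`, the rolling window size, and the error-rate and latency thresholds are all hypothetical names and values chosen for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """One production interaction (hypothetical, simplified shape)."""
    query: str
    succeeded: bool
    latency_ms: float


class SelfCuratingSuite:
    """Grows a test suite from production traces and watches its own telemetry."""

    def __init__(self, window=100, max_error_rate=0.2, slow_ms=5000.0):
        self.cases = {}                     # evolving suite, keyed by query text
        self.recent = deque(maxlen=window)  # rolling record of success/failure
        self.max_error_rate = max_error_rate
        self.slow_ms = slow_ms

    def observe(self, trace):
        # Telemetry-in-the-loop: every trace updates the rolling window.
        self.recent.append(trace.succeeded)
        # Self-curation: failed or unusually slow interactions become test cases.
        if not trace.succeeded or trace.latency_ms > self.slow_ms:
            self.cases.setdefault(trace.query, trace)

    def error_rate(self):
        if not self.recent:
            return 0.0
        return 1 - sum(self.recent) / len(self.recent)

    def needs_reevaluation(self):
        # Signal that the agent's logic should be re-evaluated.
        return self.error_rate() > self.max_error_rate


suite = SelfCuratingSuite(window=5, max_error_rate=0.3)
for t in [
    Trace("reset my password", True, 800),
    Trace("reset my password", True, 750),
    Trace("refund order placed in two currencies", False, 1200),
    Trace("export invoices as a haiku", False, 900),
    Trace("reset my password", True, 6200),
]:
    suite.observe(t)

print(sorted(suite.cases))
print(suite.needs_reevaluation())
```

Routine successful traces are dropped, while the two failures and the one slow response are kept as new test cases. A real system would also need deduplication beyond exact string matching and some way to label the expected outcome for each captured case.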
The Reality of Agentic Drift
The most dangerous assumption in modern AI development is that an agent will behave the same way next month as it does today. As models get better at pattern recognition, evidenced by their increasing ability to solve complex, non-linear puzzles like those in ARC-AGI-2, their behavior will inevitably drift.
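One simple way to notice drift is to score the same rubric in two time windows and ask whether the pass rate has moved by more than chance would explain. The sketch below uses a standard two-proportion z-test as a stand-in; the function names, the sample counts, and the 1.96 threshold (roughly a 5% two-sided significance level) are assumptions for illustration, and a production setup would likely track more than a single pass rate.

```python
from math import sqrt


def drift_z_score(base_pass, base_total, cur_pass, cur_total):
    """Two-proportion z-score for the change in pass rate between two windows."""
    p_base = base_pass / base_total
    p_cur = cur_pass / cur_total
    # Pooled pass rate under the assumption that nothing has changed.
    pooled = (base_pass + cur_pass) / (base_total + cur_total)
    se = sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / cur_total))
    return 0.0 if se == 0 else (p_cur - p_base) / se


def has_drifted(base_pass, base_total, cur_pass, cur_total, threshold=1.96):
    """True when the pass rate moved more than the threshold allows."""
    return abs(drift_z_score(base_pass, base_total, cur_pass, cur_total)) > threshold


# Hypothetical counts: last month's window versus this month's, same rubric.
print(has_drifted(180, 200, 176, 200))  # small wobble in pass rate
print(has_drifted(180, 200, 150, 200))  # pass rate fell from 90% to 75%
```

With these counts the first comparison stays under the threshold and the second exceeds it, which is the kind of signal that could trigger the re-evaluation loop described above.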
If your evaluation framework isn’t as malleable as the agent it’s measuring, you are effectively flying blind. We are moving toward a world where the “test” is no longer a static document, but a living, self-optimizing component of the software itself. Stop treating your evals like a final exam and start treating them like a heartbeat: if they aren’t constantly adapting, they’re already dead.