This week’s AI agent story is the rise of operating stacks for building, running, governing, and evaluating agents

AI Agents May 7, 2026

AI Agent, Governance, Runtime, Evaluation, Enterprise AI, MCP

Briefing summary

What this briefing helps you get to quickly

AI Agents Sources: 31

A concise read on why AI agents are now compared as platforms that combine build, governance, and evaluation, not just as models.

Snapshot

This week’s key question is no longer which model is smartest, but which stack can build, run, govern, and evaluate agents as one system

The official material that Google, Microsoft, AWS, Anthropic, and OpenAI had published by May 7, 2026 shows a clear market shift: the unit of comparison for AI agents is changing. The main story used to be model quality or raw tool connectivity. That is no longer enough to explain real deployment. Teams now need to know where agents are designed, which runtime they use, who can authorize access, which policy boundary can stop them, and how failures are observed and corrected over time.

This week is therefore better read as a convergence week than as a single launch. Google now presents Gemini Enterprise Agent Platform through four public pillars: build, scale, govern, and optimize. Microsoft pairs Agent Framework 1.0 with the Agent Governance Toolkit. AWS bundles Registry, Policy, Observability, and Evaluations into AgentCore. Anthropic and OpenAI fill in the operating details through public material on agent SDKs, permissions, sessions, sandboxed workspaces, tracing, and evaluation harnesses.

Primary sources

Official docs, official announcements, and official SDK references alone are enough to support the full thesis this week.

5 stacks

Compared side by side

Google, Microsoft, AWS, Anthropic, and OpenAI now expose enough of the operating layer to make direct comparison possible.

4 layers

New comparison frame

Build, runtime, governance, and evaluation provide a cleaner frame than model quality alone.

1 shift

The product changed

The product is no longer just an agent or model. It is the operating stack that keeps agents safe, observable, and reusable.

Why this week

Signals that were separate in spring 2026 now read as one operating stack

03-31

AWS made AgentCore Evaluations generally available and tied evaluation to live operations

Evaluation was framed not as a research add-on, but as a production monitoring and change-validation layer.

04-02

Microsoft published the Agent Governance Toolkit and made governance a standalone layer

Policy, identity, runtime, compliance, and marketplace governance were exposed as explicit packages rather than hidden internal glue.

04-03

Microsoft Agent Framework 1.0 established agents and workflows as production SDK primitives

Sessions, middleware, checkpointing, and human-in-the-loop support became standard product features rather than optional scaffolding.

04-08

AWS Agent Registry preview productized discovery and approval outside the runtime

Agents, tools, skills, and MCP servers can now be published with approval workflows and searchable metadata.

05-05

Google’s public Agent Platform docs now line up build, scale, govern, and optimize as one stack

Agent Studio, Runtime, Identity, Gateway, Evaluation, and Observability now sit on one visible public surface.

Add Anthropic’s public guidance on eval harnesses and long-running harnesses, and OpenAI’s public SDK material on sandboxing, sessions, and tracing, and the shift becomes easier to read. The market is no longer competing only on better answers. It is competing on how well humans can set boundaries, run agents for longer periods, interrupt them safely, trace failures, and reuse successful patterns. That is why this article works better as a weekly roundup than as a focused single-launch piece.

Four layers

The current agent platform race becomes clearer when it is split into build, runtime, governance, and evaluation

1. Build

The first question is how agents are designed and how much of the flow can be made deterministic. Google exposes Agent Studio and ADK. Microsoft exposes workflows and type-safe routing. OpenAI keeps a small set of primitives such as agents and handoffs. Anthropic distinguishes clearly between workflows and agents. The main difference is no longer feature count alone, but the granularity at which design can be controlled.
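
The workflow-versus-agent distinction can be made concrete with a small sketch. It is not taken from any vendor SDK; `Ticket`, `Route`, `classify`, and `open_ended_agent` are hypothetical names, and the "agent" is a stub standing in for a model call. The point is only that routing stays rule-based and typed, while open-ended handling is delegated in one clearly marked place.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Route(Enum):
    REFUND = "refund"
    SHIPPING = "shipping"
    OTHER = "other"


@dataclass(frozen=True)
class Ticket:
    text: str


def classify(ticket: Ticket) -> Route:
    """Deterministic routing step: plain rules, no model call."""
    lowered = ticket.text.lower()
    if "refund" in lowered:
        return Route.REFUND
    if "delivery" in lowered or "shipping" in lowered:
        return Route.SHIPPING
    return Route.OTHER


def open_ended_agent(ticket: Ticket) -> str:
    """Stand-in for a model-driven agent (assumption: a real one calls an LLM)."""
    return f"agent-handled: {ticket.text}"


# Fixed handlers for the routes the workflow fully controls;
# only the OTHER route is handed to the agent.
HANDLERS: Dict[Route, Callable[[Ticket], str]] = {
    Route.REFUND: lambda t: "refund-form-sent",
    Route.SHIPPING: lambda t: "tracking-link-sent",
    Route.OTHER: open_ended_agent,
}


def run_workflow(ticket: Ticket) -> str:
    return HANDLERS[classify(ticket)](ticket)
```

Because `Route` is an enum, every branch of the flow is enumerable and testable, which is roughly what "granularity of control" means at the build layer.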

2. Runtime

The next question is where long-running work actually lives. OpenAI offers sandbox agents and sessions. Anthropic focuses on long-running harness design. Google exposes Agent Runtime and Memory Bank. AWS exposes AgentCore Runtime and Identity. The key comparison is no longer whether the agent can work once, but whether state, permissions, resumability, and isolation can hold over time.
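
Resumability is the easiest of these runtime properties to illustrate. The sketch below is a generic checkpointing pattern, not any vendor's session API; `run_steps`, the JSON checkpoint file, and the `fail_at` crash simulation are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def run_steps(steps, checkpoint: Path, fail_at=None):
    """Run steps in order, persisting progress after each one.

    `fail_at` simulates a crash just before the step at that index.
    """
    state = {"done": 0, "log": []}
    if checkpoint.exists():
        # Resume from the last persisted position instead of starting over.
        state = json.loads(checkpoint.read_text())
    for index in range(state["done"], len(steps)):
        if fail_at == index:
            raise RuntimeError(f"simulated crash before step {index}")
        state["log"].append(steps[index]())
        state["done"] = index + 1
        checkpoint.write_text(json.dumps(state))
    return state["log"]


steps = [lambda: "plan", lambda: "edit", lambda: "test"]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "session.json"
    try:
        run_steps(steps, path, fail_at=2)
    except RuntimeError:
        pass  # The first two steps are already persisted.
    # The second call skips "plan" and "edit" and runs only "test".
    resumed_log = run_steps(steps, path)
```

A production runtime would also need isolation and scoped credentials around each step; this sketch covers only the state-over-time part of the comparison.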

3. Governance

Production deployment requires explicit answers to what is allowed, who approves it, and where the system can be stopped. Google’s Agent Gateway and agent identity, Microsoft’s middleware and governance toolkit, AWS Policy and Registry, and Anthropic’s permissions and security guidance all move governance outside the prompt. Agents are now compared partly by how they can be constrained.
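
A minimal sketch of what "governance outside the prompt" can look like in code, assuming a simple in-memory policy table. `ToolCall`, `POLICY`, `authorize`, and `execute` are hypothetical names and do not correspond to any vendor's gateway or toolkit API. The check runs before the tool does, so it holds regardless of what the prompt says.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    agent_id: str
    tool: str
    action: str  # "read" or "write"


# Hypothetical policy table: (agent identity, tool) -> allowed actions.
POLICY = {
    ("research-agent", "crm"): {"read"},
    ("ops-agent", "crm"): {"read", "write"},
}


class PolicyDenied(Exception):
    pass


def authorize(call: ToolCall) -> None:
    """Deny by default: unknown agents or tools get an empty permission set."""
    allowed = POLICY.get((call.agent_id, call.tool), set())
    if call.action not in allowed:
        raise PolicyDenied(f"{call.agent_id} may not {call.action} {call.tool}")


def execute(call: ToolCall, tool_fn):
    """Check policy first; the tool only runs if the check passes."""
    authorize(call)
    return tool_fn()
```

Here `research-agent` can read the CRM but any write attempt raises `PolicyDenied`, which is the kind of stop point a prompt-only instruction cannot guarantee.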

4. Evaluation

The last layer is how improvement is measured. Anthropic separates tasks, trials, graders, transcripts, and outcomes. OpenAI standardizes tracing. Google combines eval cases, traces, metrics, and optimization. AWS ties Evaluations to Observability. The competitive question is no longer only how a model scores, but how a full agent run is judged, debugged, and improved.
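
The vocabulary of tasks, trials, graders, transcripts, and outcomes maps onto a small data model. The sketch below is one possible reading of those terms, not Anthropic's or any other vendor's harness; `Task`, `Trial`, `run_task`, and `flaky_agent` are hypothetical, and the agent is a stub that fails on a fixed schedule.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trial:
    transcript: List[str]  # what the agent did, step by step
    outcome: str           # final state the run produced


@dataclass
class Task:
    prompt: str
    grader: Callable[[Trial], bool]


def run_task(task: Task, agent: Callable[[str, int], Trial], trials: int) -> float:
    """Run the task several times and return the fraction of passing trials."""
    results = [task.grader(agent(task.prompt, seed)) for seed in range(trials)]
    return sum(results) / trials


def flaky_agent(prompt: str, seed: int) -> Trial:
    """Stand-in agent that fails on every third trial."""
    ok = seed % 3 != 2
    return Trial(
        transcript=[f"received: {prompt}", "called tool"],
        outcome="file written" if ok else "error",
    )


task = Task(
    prompt="write the report",
    grader=lambda trial: trial.outcome == "file written",
)
pass_rate = run_task(task, flaky_agent, trials=6)  # 4 of 6 trials pass
```

Grading the outcome rather than the transcript keeps the score tied to what the run produced, while the transcript stays available for debugging the failed trials.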

Observation

AI agents are now easier to understand as operating platforms that connect design, execution, governance, and evaluation than as standalone smart features.

Vendor view

The direction is shared, but each vendor emphasizes a different layer of the stack

OpenAI emphasizes a lightweight runtime plus visibility

The OpenAI Agents SDK keeps the abstraction set intentionally small while still exposing sandboxed workspaces, sessions, human-in-the-loop flows, and tracing. Its distinctive move is to frame the operating layer primarily as a developer runtime.

Anthropic centers evaluation and harness design

Anthropic’s public material pushes the harness and eval structure to the foreground. It distinguishes tasks, trials, graders, outcomes, and long-running handoff patterns. Its strongest signal is that operational quality should be designed from the start, not retrofitted after deployment.

Microsoft puts the agent-workflow split at the center

Agent Framework 1.0 separates LLM-driven agents from type-safe, checkpointed workflows. Middleware and the Agent Governance Toolkit then let teams add policy and compliance outside agent prompts. Microsoft’s framing is the closest to enterprise software architecture.

Google gives the clearest end-to-end product grammar

Build, Scale, Govern, and Optimize appear side by side in one public documentation surface, connecting Agent Studio, ADK, Runtime, Identity, Gateway, Evaluation, and Observability. Google’s strongest move is to give “agent platform” a concrete product boundary.

AWS extends the stack outward into catalog and policy

AgentCore combines Runtime, Gateway, Identity, Policy, Registry, Observability, and Evaluations, with unusually strong emphasis on approval workflows and publish-time governance. AWS makes it especially clear that agent platforms include what gets registered, approved, and audited, not just what gets executed.

Workflows

This matters most when agents are treated as systems an organization has to operate, not as prototypes

Internal research and writing agents

Platform teams can expose only approved tools or MCP servers in a registry.
Agent identity and policy can separate readable data from writable actions.
Without traces and evals, post-deployment accountability becomes harder.

Long-running coding and operations agents

Sandboxed or secure runtimes make resumability and permission separation more practical.
Sessions and progress artifacts reduce quality degradation on longer tasks.
The ability to insert approvals mid-run remains a key threshold for real deployment.
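
Mid-run approval can be sketched as a run that pauses on a risky action and continues once a human signs off. This is an illustrative pattern only; `Run`, `RISKY`, and `advance` are hypothetical names, and the set of risky actions is an assumption for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Run:
    pending: List[str]
    completed: List[str] = field(default_factory=list)
    waiting_on: Optional[str] = None


RISKY = {"deploy", "delete_branch"}  # hypothetical actions needing sign-off


def advance(run: Run, approved: Set[str]) -> Run:
    """Execute actions until one needs an approval that has not been granted."""
    run.waiting_on = None
    while run.pending:
        action = run.pending[0]
        if action in RISKY and action not in approved:
            run.waiting_on = action  # pause here; completed work is preserved
            return run
        run.completed.append(run.pending.pop(0))
    return run


run = advance(Run(pending=["run_tests", "deploy", "notify"]), approved=set())
paused_on = run.waiting_on            # "deploy"
done_before_pause = list(run.completed)

run = advance(run, approved={"deploy"})  # a human granted the approval
```

The useful property is that the pause does not discard progress: the second call picks up at `deploy` rather than rerunning `run_tests`.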

Cross-team reuse

Without a catalog, the same business logic is rebuilt across teams.
Approval, deprecation, and ownership metadata determine what is safe to reuse.
This week’s convergence strengthens the move to manage reuse outside the runtime itself.
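
As a rough illustration of how approval, deprecation, and ownership metadata can gate reuse, the sketch below filters a catalog before anything reaches a runtime. `CatalogEntry` and `reusable` are hypothetical and are not modeled on any specific registry product.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    owner: Optional[str]
    approved: bool
    deprecated: bool
    tags: Tuple[str, ...] = ()


def reusable(entries: List[CatalogEntry], tag: str) -> List[str]:
    """Return names of entries that match the tag and are approved,
    owned by someone, and not deprecated."""
    return [
        e.name
        for e in entries
        if tag in e.tags and e.approved and not e.deprecated and e.owner
    ]


catalog = [
    CatalogEntry("invoice-lookup", "finance-platform", True, False, ("billing",)),
    CatalogEntry("invoice-lookup-v0", "finance-platform", True, True, ("billing",)),
    CatalogEntry("refund-tool", None, True, False, ("billing",)),
    CatalogEntry("draft-helper", "docs-team", False, False, ("billing",)),
]
safe_to_reuse = reusable(catalog, "billing")
```

Only `invoice-lookup` survives the filter: the other three are deprecated, unowned, or unapproved, even though all four would show up in a plain search by tag.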

Customer-facing operations and continuous improvement

Safety, task success, and tool-use quality all need ongoing evaluation after launch.
Policies work better as external controls than as prompt-only instructions.
Without observability, failures remain expensive to explain and hard to improve.

What remains early

A cleaner public surface does not mean the industry has reached a stable operating answer

Protocol support is not governance

MCP and A2A can improve interoperability, but approval, revocation, and least-privilege design still live in separate control layers.

Low-code tooling does not guarantee production quality

Visual builders and studios still depend on strong task definitions, graders, and permission boundaries to produce reliable systems.

A catalog does not solve day-two operations

Registration and discovery help, but change management, memory updates, cost control, and failure analysis remain separate operating work.

Automated evaluation depends on task design

Most vendors now foreground automated evals, but weak graders or weak ground truth can still produce impressive numbers without trustworthy progress.
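
The gap between a weak grader and a stricter one is easy to show on toy data. Both graders and the three cases below are invented for illustration; the strict grader assumes that exact match against ground truth is the right criterion, which will not hold for every task.

```python
def lenient_grader(output: str, expected: str) -> bool:
    """Weak check: passes any answer that mentions the keyword."""
    return "invoice" in output.lower()


def strict_grader(output: str, expected: str) -> bool:
    """Stronger check: compares against ground truth after normalization."""
    return output.strip().lower() == expected.strip().lower()


# (agent output, ground truth) pairs; only the first output is correct.
cases = [
    ("Invoice total: 120", "invoice total: 120"),
    ("Invoice total: 999", "invoice total: 120"),
    ("I could not find the invoice", "invoice total: 120"),
]


def score(grader) -> float:
    """Fraction of cases the grader marks as passing."""
    return sum(grader(out, exp) for out, exp in cases) / len(cases)


lenient_score = score(lenient_grader)  # every case passes
strict_score = score(strict_grader)    # only the correct answer passes
```

The same agent outputs score 100% under the lenient grader and one in three under the strict one, so the headline number says more about the grader than about the agent.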

Takeaway

The real change this week is that AI agents are starting to be sold and judged as governed operating platforms

Across the official material available in early May 2026, the agent market is becoming harder to explain through model leaderboards alone. What matters in practice is who designs the system, where it runs, which permissions it receives, and how failures are measured and corrected. Implementations still differ, but the comparison unit is already moving from the individual agent app toward the governed operating stack.

This change matters most for organizations that need to distribute agents across multiple teams rather than trial them one by one. The next buying and deployment decisions will have to look past model quality and inspect the surrounding workflow, approval, policy, trace, and evaluation layers. This week is important because the major vendors have started to expose that full frame in public.

Published evidence

Public pages list only evidence that can be verified as official documentation or papers.