Snapshot

The real comparison axis for AI agents is now whether they can carry work forward

The official material available from OpenAI, Anthropic, Google, and AWS as of April 20, 2026 shows a subtle but important shift in how AI agents should be evaluated. The main question is no longer only how well the model answers the task in front of it, or how long it can keep going in a single pass. What matters more is whether the system can carry work across time: resume after pauses or failures, preserve useful context, keep intermediate outputs, reconnect tools only when needed, and expose enough operational visibility for humans to judge quality while the agent is still working. In other words, AI agents are starting to look less like one-shot responders and more like systems for carrying work over time.

24

Primary sources

Official docs and official announcements from OpenAI, Anthropic, Google, and AWS are enough to compare this week’s shift without relying on third-party coverage.

4 layers

Continuity stack

Project or session continuity, execution continuity, artifact continuity, and operational continuity now form a practical comparison frame.

4 vendors

Different emphases

OpenAI highlights user-facing surfaces, Anthropic explains the architecture, Google defines the state contract, and AWS emphasizes production operation.

1 question

The frame changed

The key question is shifting from “can it answer well once?” to “can it keep owning the work while the user is away?”

What Changed

By the third week of April 2026, the assumption that agents should own ongoing work had become visible on both the product and infrastructure sides

03-24

Anthropic turned cross-session handoffs into an explicit design topic

Its harness-design writeup made structured artifacts and session-to-session handoff part of the public implementation story for long-running work.

03-31

AWS framed agent quality as something that must be measured continuously in production

AgentCore Evaluations going GA moved the discussion from isolated demos toward ongoing scoring of live traffic and repeatable test workflows.

04-15

OpenAI made resumable sandboxes and model-native harnesses first-class SDK surfaces

The updated Agents SDK publicly exposed a more explicit contract for long-horizon work across files, commands, and controlled execution environments.

04-16

OpenAI expanded Codex into ongoing and repeatable work

Codex now schedules future work, wakes up to continue long-running tasks, uses memory, and draws context from projects and connected tools.

04-17

Google kept pause-and-resume sessions at the center of its public agent story

Vertex AI Agent Engine Sessions continues to present progress persistence and cross-session continuity as core functionality, not optional add-ons.

What makes this week notable is that the same direction is now visible both to end users and to developers. OpenAI links Projects, Tasks, Apps, Codex, and the Agents SDK into a coherent continuity story. Anthropic makes durable sessions and resumable harnesses explicit. Google separates sessions, state, and artifacts. AWS brings async runtime, observability, and continuous evaluation into the same frame. AI agents are becoming easier to understand as systems for carrying work over time, not just systems for producing an answer in a single sitting.

Four Layers

Work continuity becomes easier to analyze when it is split into four layers

1. Project and session continuity

The first requirement is a durable unit of work that is wider than a single chat turn. ChatGPT Projects are described as workspaces for long-running efforts and repeatable workflows. Vertex AI Agent Engine Sessions play a similar role, keeping the history and progress of an ongoing interaction. Without this layer, the agent has to restart from scratch each time.

2. Execution continuity

The next layer is the ability to carry execution forward, not just remember text. OpenAI’s SandboxAgent model supports start, stop, and resume patterns. Anthropic’s Managed Agents architecture moves the session log outside the harness and sandbox so work can continue even when a container fails. AWS AgentCore Runtime similarly treats asynchronous agents and true session isolation as core infrastructure.
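The idea of keeping the session log outside the sandbox can be illustrated with a minimal sketch. Everything here is an assumption for illustration: SessionLog, completed_steps, and run are hypothetical names, the log is a local JSON Lines file, and none of it reflects any vendor's actual API. The point is only that a replacement sandbox can read the log and skip what is already done.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class SessionLog:
    """Hypothetical append-only event log stored outside the sandbox."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.path.touch(exist_ok=True)

    def append(self, event: dict) -> None:
        # Events are only ever appended, never rewritten in place.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    def events(self) -> list:
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def completed_steps(log: SessionLog) -> set:
    return {e["step"] for e in log.events() if e["type"] == "step_done"}


def run(plan: list, log: SessionLog, fail_at: Optional[str] = None) -> None:
    """Run the plan, skipping steps the log already records as done."""
    done = completed_steps(log)
    for step in plan:
        if step in done:
            continue
        if step == fail_at:
            # Simulates losing the container in the middle of the run.
            raise RuntimeError(f"sandbox lost during {step}")
        log.append({"type": "step_done", "step": step})


plan = ["fetch", "analyze", "report"]
log = SessionLog(Path(tempfile.mkdtemp()) / "session.jsonl")

try:
    run(plan, log, fail_at="analyze")  # first sandbox dies after "fetch"
except RuntimeError:
    pass

run(plan, log)  # a fresh sandbox resumes from the log, not from scratch
steps_in_order = [e["step"] for e in log.events()]
```

Because the log lives apart from the execution environment, the second call does not repeat "fetch". A real system would also need to handle steps that were started but not confirmed, which this sketch ignores.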

3. Artifact continuity

Long-running work depends on files and intermediate outputs, not only on final responses. Google ADK separates Artifacts from session state so larger generated outputs can be persisted, versioned, and shared. Anthropic also makes handoff artifacts and file-based communication part of the story for multi-session work. Without an artifact layer, agents struggle to hand off unfinished work safely.
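A small sketch can make the split between session state and artifacts concrete. It assumes a hypothetical in-memory ArtifactStore and a plain dictionary for session state; it is not the Google ADK interface, only an illustration of keeping large outputs versioned in one place while the session holds a lightweight reference.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ArtifactStore:
    """Hypothetical store: every save of a name creates a new version."""

    _versions: dict = field(default_factory=dict)

    def save(self, name: str, data: bytes) -> int:
        versions = self._versions.setdefault(name, [])
        versions.append(data)
        return len(versions) - 1  # zero-based version number

    def load(self, name: str, version: Optional[int] = None) -> bytes:
        versions = self._versions[name]
        return versions[-1] if version is None else versions[version]


store = ArtifactStore()
session_state = {"task": "weekly-report"}  # small scratchpad only

v0 = store.save("report.md", b"# Draft\n")
v1 = store.save("report.md", b"# Draft\n\nFindings...\n")

# The session keeps a pointer to the artifact, not the artifact itself.
session_state["report_ref"] = {"name": "report.md", "version": v1}

ref = session_state["report_ref"]
latest = store.load(ref["name"], ref["version"])
first_draft = store.load("report.md", 0)
```

Keeping earlier versions retrievable is what lets the next run, or a human reviewer, compare drafts instead of trusting only the latest output.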

4. Operational continuity

Carrying work forward also means being able to observe and judge it while it is still operating. AWS AgentCore Observability and Evaluations make session count, latency, errors, traces, and continuous scoring of production traffic visible. That means “ongoing work” is no longer only a prompt-engineering issue. It is also an operational design issue.
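As a rough illustration of what such operational signals look like once they are aggregated, the sketch below summarizes a handful of run records. Span and summarize are hypothetical names and the data is made up; this is not the AgentCore API, only the arithmetic behind session count, error rate, and latency.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Span:
    """Hypothetical record of one agent invocation."""

    session_id: str
    latency_ms: float
    error: bool


def summarize(spans: list) -> dict:
    if not spans:
        raise ValueError("no spans to summarize")
    return {
        "session_count": len({s.session_id for s in spans}),
        "error_rate": sum(s.error for s in spans) / len(spans),
        "median_latency_ms": median(s.latency_ms for s in spans),
    }


spans = [
    Span("a", 100.0, False),
    Span("a", 300.0, True),
    Span("b", 200.0, False),
    Span("b", 400.0, False),
]
summary = summarize(spans)
```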

Observation

This frame shows why work continuity is bigger than memory alone. A durable agent needs the project or session, the execution environment, the artifacts, and the operational feedback loop together.

Vendor View

The direction is similar across vendors, but each one emphasizes a different surface

OpenAI ties together the user-facing continuity surfaces

Projects hold recurring work context. Tasks schedule work even when the user is offline. Apps add search, sync, and action surfaces across external tools. Codex now adds background computer use, ongoing and repeatable work, memory, and automations. OpenAI’s distinctive move is that the continuity story is no longer only in developer APIs. It is visible in end-user products too.

Anthropic makes the architectural principles explicit

Managed Agents defines the session as an append-only log, separates the brain from the hands, and treats the sandbox as an interchangeable execution surface. The Claude Agent SDK also exposes session resume directly, while the code execution tool makes pause-and-continue and isolated containers concrete. Anthropic’s strength is the clarity of its design language around long-running work.

Google separates the state contract cleanly

Vertex AI Agent Engine Sessions separates events, state, and memory, while ADK treats state as the session scratchpad and artifacts as the proper home for larger outputs. Session TTL and listing also appear in the public docs, which makes lifecycle control part of the documented contract for ongoing work.
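To show why TTL and listing amount to lifecycle control, here is a minimal sketch of a session store with expiry. SessionStore, its methods, and the explicit now argument are all assumptions made for the example; the real Vertex AI Agent Engine interface is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    session_id: str
    expires_at: float
    state: dict = field(default_factory=dict)  # small scratchpad values


class SessionStore:
    """Hypothetical store; callers pass the current time explicitly."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._sessions = {}

    def create(self, session_id: str, now: float) -> Session:
        session = Session(session_id, expires_at=now + self.ttl_seconds)
        self._sessions[session_id] = session
        return session

    def list_active(self, now: float) -> list:
        # A session is listed only while its TTL has not elapsed.
        return sorted(
            s.session_id for s in self._sessions.values() if s.expires_at > now
        )


store = SessionStore(ttl_seconds=60)
store.create("s1", now=0)
store.create("s2", now=50)

active_early = store.list_active(now=30)
active_later = store.list_active(now=70)
active_end = store.list_active(now=200)
```

Passing the clock in as an argument keeps the example deterministic. A production store would read the time itself and would usually purge expired sessions rather than only hiding them.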

AWS emphasizes production operation and governance

AgentCore Runtime centers asynchronous agents and session isolation, Observability exposes the operational path of each run, and Evaluations scores agents continuously against live and test traffic. AWS’s strongest message is that a durable agent is not only one that can continue working, but one that can be operated, measured, and governed while it continues working.

Separation

Continuity is not the same thing as autonomy. The real question is what gets preserved, and what can still be returned to a human

Scheduling is not full automation

Tasks and automations make it possible for an agent to keep moving while the user is away. But that does not mean every judgment should be delegated. The real value comes when the agent can restart on a schedule, move the work forward, and return the right decision points to a human at the right moment.

Memory is not a substitute for correctness

Codex memory and Google’s cross-session memory strengthen continuity, but they also preserve old assumptions. The important capability is not only that the system can remember, but that it can decide what to reuse and what to discard.
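One way to picture the reuse-or-discard decision is a simple triage pass over remembered items. The thresholds, the contradicted flag, and the names MemoryItem and triage are all hypothetical, chosen only to illustrate the idea; neither vendor documents its memory logic in this form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    key: str
    value: str
    age_days: int
    contradicted: bool = False  # set when fresh data disagrees with the memory


def triage(items, reuse_within_days=7, discard_after_days=30):
    """Sort memories into reuse, reverify, and discard buckets."""
    buckets = {"reuse": [], "reverify": [], "discard": []}
    for item in items:
        if item.contradicted or item.age_days > discard_after_days:
            buckets["discard"].append(item.key)
        elif item.age_days <= reuse_within_days:
            buckets["reuse"].append(item.key)
        else:
            # Neither fresh nor clearly dead: check before relying on it.
            buckets["reverify"].append(item.key)
    return buckets


memories = [
    MemoryItem("report_format", "weekly PDF", age_days=2),
    MemoryItem("api_owner", "team A", age_days=14),
    MemoryItem("deadline", "March 1", age_days=45),
    MemoryItem("budget", "10k", age_days=3, contradicted=True),
]
buckets = triage(memories)
```

The middle bucket matters most in practice: it is where the agent should spend a tool call or ask a human instead of silently carrying an old assumption forward.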

Resume is also a recoverability problem

OpenAI’s snapshotting and rehydration plus Anthropic’s durable session log show that continuity is not only about holding conversation context. It is also about whether the system can recover from container loss, session interruption, or partial failure without restarting the whole task.

Continuity matters most when it is paired with auditability

AWS puts traces, latency, error rates, and session count at the center for a reason. The longer an agent works, the more important it becomes to be able to reconstruct, after the fact, what happened.

Workflows

This comparison axis matters most when the same work repeats, or when the work does not finish in one sitting

Weekly research and reporting

  • Keep prior notes, files, and instructions inside a project.
  • Use tasks or automations to run on a schedule and pick up from prior context.
  • Pull fresh context from Slack, Drive, Notion, or other connected apps only when needed.

Multi-day implementation or validation work

  • Resume a sandbox or container with its working state intact.
  • Keep artifacts and session logs available for review in the middle of the run.
  • Avoid re-explaining the full task from scratch every time work pauses.

Recurring SaaS follow-up work

  • Use apps or tools only when external context or write actions are needed.
  • Combine schedules, memory, and external data to keep follow-up loops alive.
  • Treat confirmations and permission boundaries as part of the workflow, not as optional extras.

Internal operations with audit requirements

  • Track session count, latency, traces, and errors so that a team can take operating responsibility for the agent.
  • Run continuous evaluation to surface drift, not just headline success rates.
  • Treat the ability to audit work as being as important as the ability to resume it.

Adoption

The right product depends on which continuity failure hurts you most

For knowledge work, project surfaces and app connectivity matter first

In research, planning, document review, and follow-up work, durable compute is often less important than projects, apps, memory, and scheduled tasks. Quality depends on where prior context is stored and how the next run reconnects to fresh information.

For development work, sandboxes and artifacts matter more

Multi-day implementation, validation, and migration work requires more than session history. It needs persistent files, reproducible outputs, and resumable execution environments. In this segment, resumable sandboxes and reusable containers become a larger differentiator.

For cross-SaaS workflows, identity and approval become the bottleneck

Slack, Gmail, Notion, and internal SaaS follow-up work looks attractive on paper, but the real constraint is often not memory. It is write-path governance. Teams need to decide which actions can be automated and where confirmation gates must remain.
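Write-path governance can be sketched as a small policy table that sits between the agent and its tools. The verbs, the Decision values, and the gate function below are hypothetical and not tied to any vendor's permission model; the one deliberate design choice is that unknown actions fall back to human approval.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO = "auto"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


@dataclass(frozen=True)
class Action:
    tool: str
    verb: str  # e.g. "read", "comment", "send", "delete"


# Hypothetical policy: which verbs may run without a human in the loop.
POLICY = {
    "read": Decision.AUTO,
    "comment": Decision.AUTO,
    "send": Decision.NEEDS_APPROVAL,
    "delete": Decision.DENY,
}


def gate(action: Action) -> Decision:
    # Unknown verbs default to a confirmation gate, never to automatic execution.
    return POLICY.get(action.verb, Decision.NEEDS_APPROVAL)


read_decision = gate(Action("slack", "read"))
send_decision = gate(Action("gmail", "send"))
unknown_decision = gate(Action("notion", "archive"))
delete_decision = gate(Action("drive", "delete"))
```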

For internal operations, weak observability becomes the first limit

In ops and high-trust internal workflows, the problem is often not whether the agent can keep going. It is whether failures can be found quickly enough. Traces, error breakdowns, session counts, and token usage become part of product selection.

Evaluation

If the agent is meant to carry work over time, then evaluation has to measure resumption and handoff quality, not just answer quality

Shift in Measurement

For ongoing work, a single-run success rate is not enough. Teams need to measure interruptions, resumptions, context decay, artifact handoff quality, and the amount of human intervention required.

1. Resume success rate

How often the agent picks the prior task back up after a stop, container replacement, or session reload without meaningful quality loss.

2. Stale-context incident rate

How often the system continues working from outdated state, memory, or assumptions. This becomes a visible failure mode once work spans multiple days.

3. Artifact handoff quality

Whether intermediate files, generated outputs, and checkpoints remain readable, reusable, and reviewable by either the next run or the next human reviewer.

4. Intervention cost

How often humans need to stop the run, restate context, or redirect the work. Better continuity should let humans supervise at coarser intervals and step in less often.

  • Measure on fixed multi-day scenarios, not only on short benchmark tasks.
  • Store traces and diffs before and after session resume points.
  • Track memory or state update failures separately from ordinary task failures.
  • Measure the human reviewer’s rework time on artifacts and resumed runs.
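Three of these measurements reduce to simple ratios over run records, as the sketch below shows. The RunRecord fields and the sample data are hypothetical, and judging whether a resume was "ok" or whether context was stale still requires a reviewer or a separate check that is outside this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    resumed: bool        # run started from a prior stop or reload
    resume_ok: bool      # resumed without meaningful quality loss
    stale_context: bool  # run relied on outdated state or memory
    interventions: int   # times a human stopped or redirected the run


def continuity_metrics(runs):
    if not runs:
        raise ValueError("no runs to evaluate")
    resumed = [r for r in runs if r.resumed]
    return {
        # Only resumed runs count toward the resume success rate.
        "resume_success_rate": (
            sum(r.resume_ok for r in resumed) / len(resumed) if resumed else None
        ),
        "stale_context_incident_rate": sum(r.stale_context for r in runs) / len(runs),
        "interventions_per_run": sum(r.interventions for r in runs) / len(runs),
    }


runs = [
    RunRecord(resumed=False, resume_ok=False, stale_context=False, interventions=0),
    RunRecord(resumed=True, resume_ok=True, stale_context=False, interventions=1),
    RunRecord(resumed=True, resume_ok=False, stale_context=True, interventions=3),
    RunRecord(resumed=True, resume_ok=True, stale_context=False, interventions=0),
]
metrics = continuity_metrics(runs)
```

Artifact handoff quality is left out on purpose: it depends on reviewer judgment and rework time, which do not reduce to a flag on a run record.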

Org Impact

The adoption question is moving beyond model choice and toward operational ownership of each continuity layer

Product teams need to define the unit of work first

Whether the system is organized around a conversation, a project, or a recurring weekly task changes which continuity surfaces are actually required.

Platform teams should separate state from compute

Managed Agents and the Agents SDK both point toward the same design lesson: durability and recovery become easier when context storage and execution environments are not tightly coupled.

Operations teams should standardize traces and review loops early

An ongoing agent starts to resemble an operational workload more than a chatbot. Monitoring, failure classification, and rerun procedures should be standardized accordingly.

Business teams should start with work that benefits from being carried forward

Recurring research, PR follow-up, unresolved comment tracking, and report drafting are better entry points than trying to automate every high-risk workflow at once.

Still Early

The continuity stack has moved forward, but schedules and memory alone do not make a reliable agent

Scheduling alone does not improve quality

A recurring trigger is useful, but without a clear success definition and review loop, the agent can simply repeat stale work more efficiently.

Persistent context can also preserve bad assumptions

Project memory and session state are powerful, but they can carry old judgments forward too. Work continuity depends as much on forgetting and pruning as on remembering.

External actions still require identity and approval design

As apps and browser actions spread, write permissions, OAuth scopes, workspace controls, and audit trails become part of the continuity contract. Ongoing work cannot ignore access design.

Observability is not a substitute for a good success definition

Seeing traces and metrics is useful, but it does not tell you what “good” means. The longer the work horizon becomes, the more important it is to define task completion and failure boundaries in advance.

Conclusion

The next comparison axis for AI agents is not peak one-shot capability, but how long they can keep hold of the work

The public material this week makes a broader shift easier to see. AI agents are moving from the stage where the main question is whether they can produce an impressive answer, toward the stage where the main question is whether they can hold work across projects, sessions, sandboxes, artifacts, and evaluation loops. OpenAI pushes that story into user-facing surfaces such as Projects, Tasks, Apps, Codex, and the Agents SDK. Anthropic makes durable sessions and resumable harnesses a design principle. Google clarifies the contract around sessions, state, and artifacts. AWS brings observability and continuous evaluation into the heart of production operation. The practical comparison axis is therefore widening from “can it answer well once?” to “under what contract can it keep carrying the job after the user steps away?”