Platform Engineering

Context projection for SRE agents

The agent should not be the integration layer.

That is the design principle I keep coming back to when thinking about how large language model agents should participate in SRE workflows. Modern operations already span a wide surface area: Kubernetes, infrastructure state, observability platforms, logs, incidents, deployment history, service catalogs, runbooks, and internal APIs. It is tempting to give an agent direct access to every low-level tool and let it discover its way toward an answer.

I think that is usually the wrong shape.

A production health check should not require the model to pull raw responses from every vendor system, re-send similar metadata across multiple turns, infer ownership from naming conventions, and manually join identifiers across tools. That approach is expensive, inconsistent, and brittle. It also puts the model in charge of work that deterministic systems can do more safely.

A better pattern is a context projection layer: a deterministic tool layer that builds a compact, task-specific operational view before the model begins interpreting the situation.

Vendor APIs
  -> collectors and adapters
  -> normalized service graph
  -> context projection layer
  -> LLM agent
  -> diagnosis, next action, or workflow guidance

The projection should be scoped by service, environment, time window, and workflow mode. A health check, incident triage session, change review, dependency check, and compliance review do not need the same context. They need different signals, different limits, and different levels of detail.
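One way to make that scoping concrete is to treat the scope as a small value object and attach a per-mode profile to it. The sketch below is only an illustration: the names ProjectionScope, ModeProfile, MODE_PROFILES, and profile_for are hypothetical, and the signal lists and limits are placeholder values, not recommendations.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Dict, Tuple


class WorkflowMode(Enum):
    HEALTH_CHECK = "health_check"
    INCIDENT_TRIAGE = "incident_triage"
    CHANGE_REVIEW = "change_review"
    DEPENDENCY_CHECK = "dependency_check"
    COMPLIANCE_REVIEW = "compliance_review"


@dataclass(frozen=True)
class ProjectionScope:
    """The four dimensions a projection is scoped by."""
    service: str
    environment: str
    window: timedelta
    mode: WorkflowMode


@dataclass(frozen=True)
class ModeProfile:
    """What a given workflow mode needs: which signals, how many, how deep."""
    signals: Tuple[str, ...]
    max_findings: int
    detail: str  # "summary" or "detailed"


# Placeholder values for illustration; real limits would be tuned per team.
MODE_PROFILES: Dict[WorkflowMode, ModeProfile] = {
    WorkflowMode.HEALTH_CHECK: ModeProfile(
        ("monitors", "error_rate", "recent_changes"), 5, "summary"),
    WorkflowMode.INCIDENT_TRIAGE: ModeProfile(
        ("monitors", "error_rate", "log_patterns", "recent_changes", "dependencies"),
        10, "detailed"),
    WorkflowMode.CHANGE_REVIEW: ModeProfile(
        ("recent_changes", "deployment_diff"), 5, "detailed"),
    WorkflowMode.DEPENDENCY_CHECK: ModeProfile(
        ("dependencies", "monitors"), 5, "summary"),
    WorkflowMode.COMPLIANCE_REVIEW: ModeProfile(
        ("ownership", "runbooks", "infrastructure_state"), 20, "detailed"),
}


def profile_for(scope: ProjectionScope) -> ModeProfile:
    """Look up the signals and limits for the scope's workflow mode."""
    return MODE_PROFILES[scope.mode]


scope = ProjectionScope("checkout", "production", timedelta(hours=1),
                        WorkflowMode.HEALTH_CHECK)
profile = profile_for(scope)
print(profile.signals, profile.max_findings, profile.detail)
```

Keeping the profile in data rather than in the prompt means the limits are applied deterministically before the model sees anything.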

The useful default is summary first, evidence second, raw data by exception. For example, an agent should be able to start with a health summary, top findings, confidence, missing context, and stable evidence references. If the situation requires more detail, it can drill into a specific metric, log pattern, deployment diff, or raw vendor payload by reference.

This progressive disclosure matters. It keeps the model context small while preserving traceability. It also gives humans a path from conclusion back to evidence, which is essential during operational work. A finding without a basis is just a polished guess.
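A rough sketch of that shape, under stated assumptions: the summary carries findings with confidence and evidence references, and the raw payload stays in a store until someone asks for it by reference. Finding, ContextSummary, EvidenceStore, the reference format, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Finding:
    title: str
    confidence: float
    evidence_refs: List[str]

    def __post_init__(self) -> None:
        # A finding without a basis is rejected rather than passed along.
        if not self.evidence_refs:
            raise ValueError(f"finding {self.title!r} has no evidence reference")


@dataclass
class ContextSummary:
    health: str
    findings: List[Finding]
    missing_context: List[str] = field(default_factory=list)


class EvidenceStore:
    """Holds detailed payloads so the summary can point to them by reference."""

    def __init__(self) -> None:
        self._items: Dict[str, dict] = {}

    def put(self, ref: str, payload: dict) -> str:
        self._items[ref] = payload
        return ref

    def drill_down(self, ref: str) -> dict:
        if ref not in self._items:
            raise KeyError(f"unknown evidence reference: {ref}")
        return self._items[ref]


store = EvidenceStore()
# Made-up reference and data points, for illustration only.
ref = store.put("ev:metric:checkout:p99_latency",
                {"kind": "metric", "points": [212, 240, 910]})

summary = ContextSummary(
    health="degraded",
    findings=[Finding("p99 latency rose sharply", 0.7, [ref])],
    missing_context=["deployment history unavailable"],
)

# The agent starts from the summary and fetches detail only when needed.
for finding in summary.findings:
    if finding.confidence < 0.8:
        detail = store.drill_down(finding.evidence_refs[0])
        print(finding.title, detail["points"])
```

The model only ever handles the short reference string; the payload enters its context when a specific finding warrants the closer look.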

The service graph is the foundation. It should map a service to ownership, runtime resources, observability handles, infrastructure identifiers, dependencies, runbooks, and recent changes. Fuzzy search is useful for discovery, but inspection and action should move through exact identifiers returned by tools. A model should not synthesize a Pulumi URN, Kubernetes object reference, Datadog monitor ID, Splunk saved search name, or AWS ARN. It should pass through the exact identifier it was given.
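The pass-through rule can be enforced mechanically rather than by instruction. A minimal sketch, assuming a hypothetical ServiceNode record and an IdentifierRegistry that remembers which identifiers tools have actually returned; the identifier values are placeholders, not real vendor formats.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ServiceNode:
    """One service in the graph, keyed to exact identifiers from each system."""
    name: str
    owner: str
    identifiers: Dict[str, str]  # kind -> identifier exactly as a tool returned it
    dependencies: List[str] = field(default_factory=list)
    runbooks: List[str] = field(default_factory=list)


class IdentifierRegistry:
    """Tracks identifiers issued by tools so later calls can be checked."""

    def __init__(self) -> None:
        self._issued: Set[str] = set()

    def register(self, node: ServiceNode) -> None:
        self._issued.update(node.identifiers.values())

    def require_issued(self, identifier: str) -> str:
        # Reject anything the model may have guessed or reconstructed.
        if identifier not in self._issued:
            raise ValueError(f"identifier was not returned by any tool: {identifier}")
        return identifier


# Placeholder identifiers for illustration only.
checkout = ServiceNode(
    name="checkout",
    owner="payments-team",
    identifiers={
        "kubernetes": "example-namespace/deployment/checkout",
        "monitor": "example-monitor-1",
    },
    dependencies=["inventory"],
    runbooks=["runbook:checkout-latency"],
)

registry = IdentifierRegistry()
registry.register(checkout)

# An identifier passed through unchanged is accepted.
print(registry.require_issued(checkout.identifiers["monitor"]))
```

An inspection or action tool would call require_issued on every identifier argument, so a synthesized value fails loudly instead of silently targeting the wrong resource.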

This changes the role of the model in a useful way. The model explains, prioritizes, asks better follow-up questions, and guides the workflow. The surrounding system collects, normalizes, joins, filters, compresses, and preserves evidence.

The initial implementation does not need to be grand. A read-only get_operational_context tool for one service and one environment would be enough to prove the shape. It should return ownership, runtime status, a few health signals, recent changes, evidence references, and explicit missing context. From there, evidence-backed drilldowns and additional workflow modes can grow naturally.
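As a sketch of how small that first version could be: the function below takes one service, one environment, and a set of read-only collectors, then returns a compact view with explicit missing context. Only the tool name comes from the text above; the collector signature, section names, result shape, and stub data are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Optional

# Assumed collector contract: (service, environment) -> result dict, or None
# when the source has nothing. A result has "summary" and "evidence_refs".
Collector = Callable[[str, str], Optional[Dict[str, Any]]]

SECTIONS = ("ownership", "runtime_status", "health_signals", "recent_changes")


def get_operational_context(
    service: str, environment: str, collectors: Dict[str, Collector]
) -> Dict[str, Any]:
    """Build a compact, read-only operational view for one service and environment."""
    context: Dict[str, Any] = {
        "service": service,
        "environment": environment,
        "evidence_refs": [],
        "missing_context": [],
    }
    for section in SECTIONS:
        collector = collectors.get(section)
        result = collector(service, environment) if collector else None
        if result is None:
            # Say what is missing instead of leaving the model to guess.
            context["missing_context"].append(section)
            continue
        context[section] = result["summary"]
        context["evidence_refs"].extend(result.get("evidence_refs", []))
    return context


# Stub collectors standing in for real adapters; all values are made up.
def ownership_stub(service: str, environment: str) -> Dict[str, Any]:
    return {"summary": {"team": "payments-team"},
            "evidence_refs": ["ev:catalog:checkout"]}


def runtime_stub(service: str, environment: str) -> Dict[str, Any]:
    return {"summary": {"ready_replicas": 3, "desired_replicas": 3},
            "evidence_refs": ["ev:runtime:checkout"]}


def changes_stub(service: str, environment: str) -> Optional[Dict[str, Any]]:
    return None  # simulates a source that returned nothing


context = get_operational_context(
    "checkout",
    "production",
    {"ownership": ownership_stub,
     "runtime_status": runtime_stub,
     "recent_changes": changes_stub},
)
print(context["missing_context"])
```

In this run no health_signals collector is registered and the change source returns nothing, so both sections are reported as missing rather than silently omitted.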

The long-term goal is not to make an agent that can call everything. It is to make an operational system that gives the agent the right context, at the right level of detail, with enough evidence to be useful under pressure.

In SRE work, context quality is part of reliability.