Unified observability in Amazon OpenSearch Service: metrics, traces, and AI agent debugging in a single interface

Amazon OpenSearch Service now brings application monitoring, native Amazon Managed Service for Prometheus integration, and AI agent tracing together in OpenSearch UI’s observability workspace. You can query Prometheus metrics with PromQL alongside logs and traces stored in Amazon OpenSearch Service, trace an AI agent’s full reasoning chain down to the failing tool call, and drill from a service-level health view to the exact span that caused a checkout failure, all without leaving the interface.

In this post, we walk through two real-world scenarios using the OpenTelemetry sample app: a multi-agent travel planner suffering from slow processing, and a checkout flow quietly failing on one microservice. We trace each one to its root cause using these new capabilities.

Scenario 1: An underperforming AI agent

Your multi-agent travel planner is live and users start reporting slow responses. With the new AI agent tracing capability in Amazon OpenSearch Service, you can trace the agent’s full processing path to pinpoint exactly where things went wrong.

In any observability workspace in OpenSearch UI, navigate to Application Map in the left navigation pane.

You can see the full topology of your system, including the travel agent and the sub-agents it calls. The travel agent node shows elevated latency and occasional errors. Select it, and the side panel confirms that latency is up, but the latency chart shows intermittent spikes rather than consistent degradation.

The application map tells you something is wrong, but understanding why an AI agent is underperforming requires seeing its reasoning chain. Select Agent Traces in the left navigation pane, then filter by service name and time range.

Select one of the traces to see the trace tree. Unlike a traditional span waterfall, this view is organized around the agent’s reasoning chain: the root agent span, the LLM calls it made, the tools it invoked, and how they nested, with each step color-coded by type. The trace map provides a visual directed graph of the same execution. You can see which model was called, how many input and output tokens were consumed, and the exact messages sent to and received from the model.

A tool call inside the weather agent errored out. The agent then spent extra time reasoning about the failure before returning a partial response, which explains the intermittent latency spikes and occasional faults.
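To make the idea of a trace tree concrete, the following is a minimal sketch of how flat spans can be nested by parent ID and searched for the failing step. The span records, field names, and timings are hypothetical and only mirror the scenario above; this is not how OpenSearch UI stores or renders traces.

```python
from collections import defaultdict

# Hypothetical flattened spans; field names and values are illustrative only.
spans = [
    {"id": "1", "parent": None, "name": "invoke_agent travel-planner", "kind": "agent", "ms": 4200, "error": False},
    {"id": "2", "parent": "1", "name": "chat planner-model", "kind": "llm", "ms": 900, "error": False},
    {"id": "3", "parent": "1", "name": "invoke_agent weather-agent", "kind": "agent", "ms": 3100, "error": False},
    {"id": "4", "parent": "3", "name": "execute_tool get_forecast", "kind": "tool", "ms": 1500, "error": True},
    {"id": "5", "parent": "3", "name": "chat weather-model", "kind": "llm", "ms": 1400, "error": False},
]


def build_children(spans):
    """Index spans by their parent ID so the tree can be walked top-down."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent"]].append(span)
    return children


def render(children, parent=None, depth=0):
    """Return one indented line per span, depth-first from the root."""
    lines = []
    for span in children.get(parent, []):
        flag = " [ERROR]" if span["error"] else ""
        lines.append(f"{'  ' * depth}{span['kind']}: {span['name']} ({span['ms']} ms){flag}")
        lines.extend(render(children, span["id"], depth + 1))
    return lines


def failing_paths(children, parent=None, path=()):
    """Return the root-to-span name path for every span marked as an error."""
    found = []
    for span in children.get(parent, []):
        here = path + (span["name"],)
        if span["error"]:
            found.append(here)
        found.extend(failing_paths(children, span["id"], here))
    return found


children = build_children(spans)
tree_lines = render(children)
print("\n".join(tree_lines))
print(failing_paths(children))
```

The path returned by failing_paths reads the same way as the walkthrough: travel planner, then weather agent, then the tool call that failed.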

Why this matters for AI agents

Agents make autonomous decisions based on LLM responses, tool results, and chained reasoning. Unlike traditional microservices with deterministic code paths, agent behavior varies across executions. Without semantic tracing that captures these AI-specific signals, root-cause analysis is guesswork. The trace tree surfaced the model name, token counts, and failing tool call because the travel planner was instrumented with OpenTelemetry’s generative AI semantic conventions. The next section describes how.

Instrumenting AI agents

OpenTelemetry auto-instrumentation enriches spans with well-known attributes for HTTP, database, and gRPC calls. AI agents need a different set of attributes, such as which LLM was called, how many tokens were consumed, and which tools were invoked, that standard instrumentation doesn’t cover.

The OpenTelemetry gen_ai semantic conventions define standard attributes for these signals, including gen_ai.operation.name, gen_ai.usage.input_tokens, gen_ai.request.model, and gen_ai.tool.name. When Amazon OpenSearch Service receives spans with these attributes, it categorizes them by operation type (agent, LLM, tool, embeddings, retrieval) and renders the agent trace tree and trace map views.
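As an illustration of that categorization, here is a minimal sketch that groups span attribute dictionaries by their gen_ai.operation.name value. The mapping table is an assumption made for this example, not the service’s actual logic, and the sample spans and token counts are invented.

```python
# Assumed mapping from gen_ai.operation.name values to display categories.
# The grouping is illustrative; "retrieval" is omitted because this post does
# not say which operation name maps to it.
CATEGORY_BY_OPERATION = {
    "invoke_agent": "agent",
    "create_agent": "agent",
    "chat": "llm",
    "text_completion": "llm",
    "generate_content": "llm",
    "execute_tool": "tool",
    "embeddings": "embeddings",
}


def categorize(attributes: dict) -> str:
    """Return the display category for one span's attributes."""
    operation = attributes.get("gen_ai.operation.name")
    return CATEGORY_BY_OPERATION.get(operation, "other")


# Hypothetical span attributes for one agent execution.
spans = [
    {"gen_ai.operation.name": "invoke_agent", "gen_ai.agent.name": "travel-planner"},
    {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "gpt-4o",
        "gen_ai.usage.input_tokens": 812,
        "gen_ai.usage.output_tokens": 164,
    },
    {"gen_ai.operation.name": "execute_tool", "gen_ai.tool.name": "get_weather"},
    {"http.request.method": "GET"},  # no gen_ai attributes, so not an AI span
]

categories = [categorize(s) for s in spans]
total_input_tokens = sum(s.get("gen_ai.usage.input_tokens", 0) for s in spans)
print(categories, total_input_tokens)
```

Spans without a recognized operation name fall through to "other", which is how a plain HTTP span stays out of the agent view in this sketch.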

The Python SDK provides one way to generate these spans. To send traces to Amazon OpenSearch Ingestion, configure the SDK with AWS Signature Version 4 (SigV4) authentication. The AWSSigV4OTLPExporter cryptographically signs each HTTP request to help prevent unauthorized data ingestion. The calling identity needs an IAM policy that grants osis:Ingest on your pipeline’s ARN. Credentials are resolved through the standard AWS credential provider chain.

from opensearch_genai_observability_sdk_py import register, AWSSigV4OTLPExporter

exporter = AWSSigV4OTLPExporter(
    endpoint="https://pipeline.us-east-1.osis.amazonaws.com/v1/traces",
    service="osis",
    region="us-east-1",
)

register(service_name="my-agent", exporter=exporter)
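The osis:Ingest permission mentioned above can be written as an identity-based IAM policy. The following is a minimal sketch that builds such a policy document; the account ID and pipeline name in the ARN are placeholders, and your organization may require additional conditions or a narrower scope.

```python
import json


def ingest_policy(pipeline_arn: str) -> dict:
    """Build an IAM policy document that allows ingestion into one pipeline."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "osis:Ingest",
                "Resource": pipeline_arn,
            }
        ],
    }


# Placeholder ARN: replace the account ID and pipeline name with your own.
policy = ingest_policy("arn:aws:osis:us-east-1:123456789012:pipeline/my-pipeline")
print(json.dumps(policy, indent=2))
```

Scoping the Resource to a single pipeline ARN, rather than a wildcard, keeps the agent’s credentials from writing to other pipelines in the account.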

Use the @observe decorator to trace agent functions and enrich() to add model metadata:

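# Assumes observe, enrich, and Op are exported by the same package as register.
from opensearch_genai_observability_sdk_py import observe, enrich, Op

# Traced as a tool execution span.
@observe(op=Op.EXECUTE_TOOL)
def get_weather(city: str) -> dict:
    return {"city": city, "temp": 22, "condition": "sunny"}

# Traced as an agent invocation span; the tool call nests beneath it.
@observe(op=Op.INVOKE_AGENT)
def assistant(query: str) -> str:
    enrich(model="gpt-4o", provider="openai")
    data = get_weather("Paris")
    return f"{data['condition']}, {data['temp']}C"

result = assistant("What is the weather?")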

The SDK also supports auto-instrumentation for OpenAI, Anthropic, Amazon Bedrock, LangChain, LlamaIndex, and others. Because the instrumentation is built on OpenTelemetry standards, any agent framework that emits spans with gen_ai.* attributes is compatible with OpenSearch UI.

Scenario 2: Investigating a microservice issue

AI agents are only one part of most production environments. The same interface surfaces telemetry from conventional microservices, where the troubleshooting workflow follows a more familiar path.

Your ecommerce checkout starts paging during a busy traffic window. From OpenSearch UI, navigate to APM Services in the left navigation pane. Every instrumented service is listed alongside its health indicators. The checkout service shows an elevated error rate.

Select the affected service. The detail view shows Request, Error, and Duration (RED) metrics: request rate is climbing, fault rate has spiked in the last 15 minutes, and p99 duration has doubled. You can see exactly when the degradation started.
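For readers less familiar with RED metrics, the following minimal sketch shows how the three numbers can be derived from raw spans in a time window. The span records are invented for illustration, and the nearest-rank percentile used here is one common definition; the service may compute p99 differently.

```python
def red_metrics(spans, window_seconds):
    """Compute request rate, error rate, and p99 duration for one window."""
    if not spans:
        return {"request_rate": 0.0, "error_rate": 0.0, "p99_ms": None}
    durations = sorted(s["duration_ms"] for s in spans)
    errors = sum(1 for s in spans if s["error"])
    # Nearest-rank p99: smallest value with at least 99% of samples at or
    # below it. Integer arithmetic avoids floating-point rounding in the rank.
    rank = (99 * len(durations) + 99) // 100
    return {
        "request_rate": len(spans) / window_seconds,  # requests per second
        "error_rate": errors / len(spans),            # fraction of failures
        "p99_ms": durations[rank - 1],
    }


# Hypothetical checkout spans: 95 fast successes and 5 slow failures in 50 s.
spans = (
    [{"duration_ms": 120, "error": False}] * 95
    + [{"duration_ms": 2400, "error": True}] * 5
)
metrics = red_metrics(spans, window_seconds=50)
print(metrics)
```

In this example a 5% failure rate is enough to pull p99 up to the slow-failure duration, which is why the duration and error panels tend to move together during an incident like this one.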

Drill into the correlated spans for the affected time window. The span list shows multiple failed requests, all hitting the same endpoint. Select one to see the full trace waterfall. The checkout service called prepareOrder, which failed while trying to retrieve a product from the catalog. The error message in the span details tells you exactly what went wrong. That’s your root cause.

Checking the infrastructure with PromQL

In both scenarios, the natural next question is whether the problem originates in the application or in the infrastructure beneath it. With the new Amazon Managed Service for Prometheus integration, you can answer that question without leaving OpenSearch UI.

Prometheus metrics are now queryable directly from the same workspace using native PromQL syntax, alongside the logs and traces you’ve already been navigating.

For the database timeout in Scenario 2, run a PromQL query to check the database instance’s read/write throughput for the same time window. For the agent latency issue in Scenario 1, check the LLM endpoint’s response time metrics to see if the slowness originates from the model provider.
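As a rough sketch of what such queries might look like, the following builds a p99 latency expression and wraps it in a standard Prometheus query_range request. The metric name, the service_name label, the time range, and the local base URL are all assumptions for illustration; in OpenSearch UI you would enter the PromQL expression directly, and a request sent to Amazon Managed Service for Prometheus would also need SigV4 signing, which is omitted here.

```python
from urllib.parse import urlencode


def p99_query(bucket_metric: str, service: str, window: str = "5m") -> str:
    """Build a PromQL p99 expression over a histogram's bucket series."""
    return (
        f"histogram_quantile(0.99, sum by (le) "
        f'(rate({bucket_metric}{{service_name="{service}"}}[{window}])))'
    )


def query_range_url(base_url, promql, start, end, step="60s"):
    """Build a Prometheus HTTP API query_range URL for the given expression."""
    params = {"query": promql, "start": start, "end": end, "step": step}
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


# Hypothetical metric and label names; substitute the ones your services emit.
promql = p99_query("http_server_request_duration_seconds_bucket", "checkout")
url = query_range_url(
    "http://localhost:9090",  # placeholder for a local Prometheus
    promql,
    start="2025-01-01T10:00:00Z",
    end="2025-01-01T10:15:00Z",
)
print(promql)
print(url)
```

Using the same start and end as the failing spans lets you line the infrastructure metric up against the degradation window seen in the RED view.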

This is a key architectural decision: metrics continue to live in Amazon Managed Service for Prometheus, logs and traces continue to live in Amazon OpenSearch Service, and neither signal is copied or warehoused into a second store. Each backend remains the single store for the data type it’s purpose-built to handle, while OpenSearch UI federates queries across both at runtime. The cost, retention, and operational model of each store stay intact while the troubleshooting workflow collapses into a single interface.

To configure the OpenTelemetry Collector and OpenSearch Ingestion pipelines that route metrics into Amazon Managed Service for Prometheus, see Ingesting application telemetry.

How it’s wired together

The following diagram shows the end-to-end architecture. Applications instrumented with OpenTelemetry send traces, logs, and metrics over OTLP to Amazon OpenSearch Ingestion. OpenSearch Ingestion routes each signal to the appropriate store: traces and logs land in Amazon OpenSearch Service, while metrics flow into Amazon Managed Service for Prometheus. OpenSearch UI then queries both stores to render the Application Map, Services catalog, Agent Traces, and Metrics views.

The entire experience rests on open-source foundations: Prometheus for metrics, OpenSearch for logs and traces, and OpenTelemetry for instrumentation. Teams already running an OpenTelemetry collector can adopt it by updating the collector’s export configuration to point at Amazon OpenSearch Ingestion, with no proprietary agents or rewritten instrumentation required.

Getting began

To enable these capabilities, log in to OpenSearch UI’s observability workspace, select the Gear icon in the bottom left corner to open Settings and setup, and verify that the Observability:apmEnabled toggle is on under the Observability section. OpenSearch UI is available at no additional charge for Amazon OpenSearch Service customers.

Explore locally first. The OpenSearch Observability Stack gives you a fully configured environment, including application monitoring, agent tracing, and Prometheus integration, running on your machine with a single install command. It ships with sample instrumented services, including a multi-agent travel planner, so you can explore the full workflow with real telemetry data out of the box.

For AI agent development. Agent Health is an open-source, evaluation-driven observability tool designed for local development. It gives you execution flow graphs, token tracking, and tool invocation visibility right in your development loop, before you push to production.

For production. The Python SDK provides one-line setup and decorator-based tracing with gen_ai semantic conventions, with auto-instrumentation support for OpenAI, Anthropic, Amazon Bedrock, LangChain, LlamaIndex, and others. See the Amazon OpenSearch Service documentation and the Amazon Managed Service for Prometheus integration guide for the full managed experience.
