Amazon OpenSearch Ingestion 101: Set CloudWatch alarms for key metrics


Amazon OpenSearch Ingestion is a fully managed, serverless data pipeline that simplifies the process of ingesting data into Amazon OpenSearch Service and OpenSearch Serverless collections. Some key concepts include:

  • Source – Input component that specifies how the pipeline ingests the data. Each pipeline has a single source, which can be either push-based or pull-based.
  • Processors – Intermediate processing units that can filter, transform, and enrich records before delivery.
  • Sink – Output component that specifies the destination(s) to which the pipeline publishes data. It can publish records to multiple destinations.
  • Buffer – The layer between the source and the sink. It serves as temporary storage for events, decoupling the source from the downstream processors and sinks. Amazon OpenSearch Ingestion also offers a persistent buffer option for push-based sources.
  • Dead-letter queues (DLQs) – Configures Amazon Simple Storage Service (Amazon S3) to capture records that fail to write to the sink, enabling error handling and troubleshooting.

This end-to-end data ingestion service can help you collect, process, and deliver data to your OpenSearch environments without the need to manage underlying infrastructure.

This post provides an in-depth look at setting up Amazon CloudWatch alarms for OpenSearch Ingestion pipelines. It goes beyond our recommended alarms to help identify bottlenecks in the pipeline, whether that’s in the sink, the OpenSearch clusters data is being sent to, the processors, or the pipeline not pulling or accepting enough from the source. This post will help you proactively monitor and troubleshoot your OpenSearch Ingestion pipelines.

Overview

Monitoring your OpenSearch Ingestion pipelines is crucial for catching and addressing issues early. By understanding the key metrics and setting up the right alarms, you can proactively manage the health and performance of your data ingestion workflows. In the following sections, we provide details about alarm metrics for different sources, processors, and sinks. The specific values for the threshold, period, and datapoints to alarm can vary based on the individual use case and requirements.

Prerequisites

To create an OpenSearch Ingestion pipeline, refer to Creating Amazon OpenSearch Ingestion pipelines. For creating CloudWatch alarms, refer to Create a CloudWatch alarm based on a static threshold.

You can enable logging for an OpenSearch Ingestion pipeline, which captures various log messages during pipeline operations and ingestion activity, including errors, warnings, and informational messages. For details on enabling and monitoring pipeline logs, refer to Monitoring pipeline logs.

Sources

The entry point of your pipeline is often where monitoring should begin. By setting appropriate alarms for source components, you can quickly identify ingestion bottlenecks or connection issues. The following table summarizes key alarm metrics for different sources.

Source | Alarm | Description | Recommended Action
HTTP / OpenTelemetry – requestsTooLarge.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The request payload size of the client (data producer) is larger than the maximum request payload size, resulting in the status code HTTP 413. The default maximum request payload size is 10 MB for HTTP sources and 4 MB for OpenTelemetry sources. The limit for HTTP sources can be increased for pipelines with persistent buffer enabled.
  Recommended action: Reduce the chunk size for the client so that the request payload doesn’t exceed the maximum size. You can examine the distribution of payload sizes of incoming requests using the payloadSize.sum metric.
HTTP – requestsRejected.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The client (data producer) sent the request to the HTTP endpoint of the OpenSearch Ingestion pipeline, but the pipeline didn’t accept it and rejected it with the status code 429 in the response.
  Recommended action: For persistent issues, consider increasing the minimum OCUs for the pipeline to allocate more resources for request processing.
Amazon S3 – s3ObjectsFailed.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The pipeline is unable to read some objects from the Amazon S3 source.
  Recommended action: Refer to REF-003 in the Reference Guide below.
Amazon DynamoDB – Difference of totalOpenShards.max - activeShardsInProcessing.value
  Threshold: >0; Statistic: Maximum (totalOpenShards.max) and Sum (activeShardsInProcessing.value); Datapoints to alarm: 3 out of 3
  Additional note: Refer to REF-004 for more details on configuring this specific alarm.
  Description: Monitors alignment between the total open shards that should be processed by the pipeline and the active shards currently in processing. The activeShardsInProcessing.value metric goes down periodically as shards close but should never misalign from totalOpenShards.max for longer than a few minutes.
  Recommended action: If the alarm is triggered, consider stopping and starting the pipeline. This resets the pipeline’s state, and the pipeline restarts with a new full export. It’s non-destructive, so it doesn’t delete your index or any data in DynamoDB. If you don’t create a fresh index before you do this, you might see a high number of errors from version conflicts because the export tries to insert older documents than the current _version in the index. You can safely ignore these errors. For root cause analysis of the misalignment, you can reach out to AWS Support.
Amazon DynamoDB – dynamodb.changeEventsProcessingErrors.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The number of processing errors for change events for a pipeline with stream processing for DynamoDB.
  Recommended action: If the metric reports increasing values, refer to REF-002 in the Reference Guide below.
Amazon DocumentDB – documentdb.exportJobFailure.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The attempt to trigger an export to Amazon S3 failed.
  Recommended action: Review ERROR-level logs in the pipeline logs for entries beginning with “Received an exception during export from DocumentDB, backing off and retrying.” These logs contain the complete exception details indicating the root cause of the failure.
Amazon DocumentDB – documentdb.changeEventsProcessingErrors.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The number of processing errors for change events for a pipeline with stream processing for Amazon DocumentDB.
  Recommended action: Refer to REF-002 in the Reference Guide below.
Kafka – kafka.numberOfDeserializationErrors.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The OpenSearch Ingestion pipeline encountered deserialization errors while consuming a record from Kafka.
  Recommended action: Review WARN-level logs in the pipeline logs, and verify that serde_format is configured correctly in the pipeline configuration and that the pipeline role has access to the AWS Glue Schema Registry (if used).
OpenSearch – opensearch.processingErrors.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: Processing errors were encountered while reading from the index. Ideally, the OpenSearch Ingestion pipeline retries automatically, but for unknown exceptions, it might skip processing.
  Recommended action: Refer to REF-001 or REF-002 in the Reference Guide below to get the exception details that resulted in processing errors.
Amazon Kinesis Data Streams – kinesis_data_streams.recordProcessingErrors.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The OpenSearch Ingestion pipeline encountered an error while processing the records.
  Recommended action: If the metric reports increasing values, refer to REF-002 in the Reference Guide below, which can help in identifying the cause.
Amazon Kinesis Data Streams – kinesis_data_streams.acknowledgementSetFailures.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The pipeline encountered a negative acknowledgment while processing the streams, causing it to reprocess the stream.
  Recommended action: Refer to REF-001 or REF-002 in the Reference Guide below.
Confluence – confluence.searchRequestsFailed.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The pipeline encountered an exception while trying to fetch the content.
  Recommended action: Review ERROR-level logs in the pipeline logs for entries beginning with “Error while fetching content.” These logs contain the complete exception details indicating the root cause of the failure.
Confluence – confluence.authFailures.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The number of UNAUTHORIZED exceptions received while establishing the connection.
  Recommended action: Although the service should automatically renew tokens, if the metric shows an increasing value, review ERROR-level logs in the pipeline logs to identify why the token refresh is failing.
Jira – jira.ticketRequestsFailed.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The pipeline encountered an exception while trying to fetch the issue.
  Recommended action: Review ERROR-level logs in the pipeline logs for entries beginning with “Error while fetching issue.” These logs contain the complete exception details indicating the root cause of the failure.
Jira – jira.authFailures.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The number of UNAUTHORIZED exceptions received while establishing the connection.
  Recommended action: Although the service should automatically renew tokens, if the metric shows an increasing value, review ERROR-level logs in the pipeline logs to identify why the token refresh is failing.
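
Most source alarms above share the same settings (SUM over 5 minutes, threshold >0, 1 out of 1 datapoints). As a minimal sketch, the helper below builds the keyword arguments for CloudWatch's PutMetricAlarm API from those settings. The AWS/OSIS namespace and the PipelineName dimension are taken from the REF-004 JSON later in this post; the pipeline name, the sub-pipeline prefix in the metric name, the alarm naming scheme, and the TreatMissingData choice are assumptions for illustration only.

```python
from typing import Any, Dict


def build_static_alarm(pipeline_name: str, metric_name: str,
                       threshold: float = 0.0, statistic: str = "Sum",
                       period_seconds: int = 300,
                       datapoints_to_alarm: int = 1,
                       evaluation_periods: int = 1) -> Dict[str, Any]:
    """Return keyword arguments for a static-threshold CloudWatch alarm."""
    if datapoints_to_alarm > evaluation_periods:
        raise ValueError("datapoints_to_alarm cannot exceed evaluation_periods")
    return {
        "AlarmName": f"{pipeline_name}-{metric_name}",  # hypothetical naming scheme
        "Namespace": "AWS/OSIS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "PipelineName", "Value": pipeline_name}],
        "Statistic": statistic,
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "DatapointsToAlarm": datapoints_to_alarm,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption: no data means no errors
    }


# Hypothetical pipeline and metric names; replace them with your own.
params = build_static_alarm("my-pipeline",
                            "my-sub-pipeline.http.requestsRejected.count")

# With boto3 installed and credentials configured, the alarm could be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the settings in one helper makes it easier to adjust the threshold, period, or datapoints per use case, as noted in the overview.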

Processors

The following table provides details about alarm metrics for different processors.

Processor | Alarm | Description | Recommended Action
AWS Lambda – aws_lambda_processor.recordsFailedToSentLambda.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: Some of the records could not be sent to Lambda.
  Recommended action: In the case of high values for this metric, refer to REF-002 in the Reference Guide below.
AWS Lambda – aws_lambda_processor.numberOfRequestsFailed.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The pipeline was unable to invoke the Lambda function.
  Recommended action: Although this situation should not occur under normal conditions, if it does, review the Lambda logs and refer to REF-002 in the Reference Guide below.
AWS Lambda – aws_lambda_processor.requestPayloadSize.max
  Threshold: >= 6292536; Statistic: MAXIMUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The payload size exceeds the 6 MB limit, so the Lambda function can’t be invoked.
  Recommended action: Consider revisiting the batching thresholds in the pipeline configuration for the aws_lambda processor.
Grok – grok.grokProcessingMismatch.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The incoming data doesn’t match the Grok pattern defined in the pipeline configuration.
  Recommended action: In the case of high values for this metric, review the Grok processor configuration and make sure the defined pattern matches the incoming data.
Grok – grok.grokProcessingErrors.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The pipeline encountered an exception when extracting information from the incoming data according to the defined Grok pattern.
  Recommended action: In the case of high values for this metric, refer to REF-002 in the Reference Guide below.
Grok – grok.grokProcessingTime.max
  Threshold: >= 1000; Statistic: MAXIMUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The maximum amount of time that each individual record takes to match against patterns from the match configuration option.
  Recommended action: If the time taken is equal to or more than 1 second, check the incoming data and the Grok pattern. The maximum amount of time during which matching occurs is 30,000 milliseconds, which is controlled by the timeout_millis parameter.

Sinks and DLQs

The following table contains details about alarm metrics for different sinks and DLQs.

Sink | Alarm | Description | Recommended Action
OpenSearch – opensearch.bulkRequestErrors.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The number of errors encountered while sending a bulk request.
  Recommended action: Refer to REF-002 in the Reference Guide below, which can help to identify the exception details.
OpenSearch – opensearch.bulkRequestFailed.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The number of errors received after sending the bulk request to the OpenSearch domain.
  Recommended action: Refer to REF-001 in the Reference Guide below, which can help to identify the exception details.
Amazon S3 – s3.s3SinkObjectsFailed.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The OpenSearch Ingestion pipeline encountered a failure while writing the object to Amazon S3.
  Recommended action: Verify that the pipeline role has the necessary permissions to write objects to the specified S3 key. Review the pipeline logs to identify the specific keys where failures occurred. Monitor the s3.s3SinkObjectsEventsFailed.count metric for granular details on the number of failed write operations.
Amazon S3 DLQ – s3.dlqS3RecordsFailed.count
  Threshold: >0; Statistic: SUM; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: For a pipeline with DLQ enabled, records are sent either to the sink or to the DLQ (if they can’t be delivered to the sink). This alarm indicates that the pipeline was unable to send the records to the DLQ because of an error.
  Recommended action: Refer to REF-002 in the Reference Guide below, which can help to identify the exception details.

Buffer

The following table contains details about alarm metrics for buffers.

Buffer | Alarm | Description | Recommended Action
BlockingBuffer – BlockingBuffer.bufferUsage.value
  Threshold: >80; Statistic: AVERAGE; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The percent usage, based on the number of records in the buffer.
  Recommended action: To investigate further, check whether the pipeline is bottlenecked by the processors or the sink by comparing the timeElapsed.max metrics and analyzing bulkRequestLatency.max.
Persistent – persistentBufferRead.recordsLagMax.value
  Threshold: >5000; Statistic: AVERAGE; Period: 5 minutes; Datapoints to alarm: 1 out of 1
  Description: The maximum lag in terms of the number of records stored in the persistent buffer.
  Recommended action: If the value for bufferUsage is low, increase the maximum OCUs. If bufferUsage is also high (>80), investigate whether the pipeline is bottlenecked by the processors or the sink.
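
The persistent buffer guidance above combines two metrics, which can be expressed as a small decision helper. This is only a sketch of that rule of thumb: the function name and return strings are hypothetical, and the default thresholds (5000 records of lag, 80 percent buffer usage) are the ones listed in the table.

```python
def triage_persistent_buffer(records_lag_max: float,
                             buffer_usage_percent: float,
                             lag_threshold: float = 5000,
                             usage_threshold: float = 80) -> str:
    """Suggest a next step from recordsLagMax and bufferUsage readings."""
    if records_lag_max <= lag_threshold:
        # The lag alarm would not fire, so nothing to do.
        return "ok"
    if buffer_usage_percent > usage_threshold:
        # Lag is high and the in-memory buffer is full: downstream is slow.
        return "investigate processors or sink"
    # Lag is high but the buffer has room: the pipeline needs more capacity.
    return "increase maximum OCUs"


print(triage_persistent_buffer(12000, 35))
print(triage_persistent_buffer(12000, 92))
```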

Reference Guide

The following references provide guidance for resolving common pipeline issues.

REF-001: WARN-level Log Review

Review WARN-level logs in the pipeline logs to identify the exception details.

REF-002: ERROR-level Log Review

Review ERROR-level logs in the pipeline logs to identify the exception details.

REF-003: S3 Objects Failed

When troubleshooting increasing s3ObjectsFailed.count values, monitor these specific metrics to narrow down the root cause:

  • s3ObjectsAccessDenied.count – This metric increments when the pipeline encounters Access Denied or Forbidden errors while reading S3 objects. Common causes include:
      • Insufficient permissions in the pipeline role.
      • A restrictive S3 bucket policy that doesn’t allow the pipeline role access.
      • For cross-account S3 buckets, an incorrectly configured bucket_owners mapping.
  • s3ObjectsNotFound.count – This metric increments when the pipeline receives Not Found errors while attempting to read S3 objects.

For further assistance with the recommended actions, contact AWS Support.

REF-004: Configuring an alarm for the difference between totalOpenShards.max and activeShardsInProcessing.value for the Amazon DynamoDB source

  1. Open the CloudWatch console at https://console.aws.amazon.com/cloudwatch/.
  2. In the navigation pane, choose Alarms, All alarms.
  3. Choose Create alarm.
  4. Choose Select metric.
  5. Choose Source.
  6. In Source, you can use the following JSON after filling in the empty values: the sub-pipeline name that prefixes each metric name, the PipelineName dimension value, and the region.
    {
        "metrics": [
            [ { "expression": "m1-e1", "label": "Expression2", "id": "e2", "period": 900 } ],
            [ { "expression": "FLOOR((m2/15)+0.5)", "label": "Expression1", "id": "e1", "visible": false, "period": 900 } ],
            [ "AWS/OSIS", ".dynamodb.totalOpenShards.max", "PipelineName", "", { "stat": "Maximum", "id": "m1", "visible": false } ],
            [ ".", ".dynamodb.activeShardsInProcessing.value", ".", ".", { "stat": "Average", "id": "m2", "visible": false } ]
        ],
        "view": "timeSeries",
        "stacked": false,
        "period": 900,
        "region": ""
    }
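
To make the metric math in the JSON easier to follow, here is a minimal Python sketch that mirrors the two expressions as written (m1 is totalOpenShards.max, m2 is activeShardsInProcessing.value) and then applies the "3 out of 3" datapoints rule from the Sources table. The function names and sample numbers are hypothetical; CloudWatch performs the real evaluation.

```python
import math
from typing import List


def shard_gap(total_open_shards_max: float, active_shards_stat: float) -> int:
    """Mirror the JSON: m1 - FLOOR((m2/15)+0.5) for one 900-second period."""
    rounded_active = math.floor(active_shards_stat / 15 + 0.5)
    return int(total_open_shards_max - rounded_active)


def breaches(gaps: List[int], threshold: int = 0, datapoints: int = 3) -> bool:
    """True when the most recent `datapoints` periods all exceed the threshold."""
    recent = gaps[-datapoints:]
    return len(recent) == datapoints and all(gap > threshold for gap in recent)


# Hypothetical readings for four consecutive periods with 4 open shards.
gaps = [shard_gap(4, 60), shard_gap(4, 52), shard_gap(4, 45), shard_gap(4, 30)]
print(gaps, breaches(gaps))
```

A single period with a non-zero gap does not trip the alarm, which matches the note that activeShardsInProcessing.value dips periodically as shards close.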

Let’s review a couple of scenarios based on the preceding metrics.

Scenario 1 – Understand and Lower Pipeline Latency

Latency within a pipeline is made up of three main components:

  • The time it takes to send documents through bulk requests to OpenSearch,
  • the time it takes for data to go through the pipeline processors, and
  • the time that data sits in the pipeline buffer

Bulk requests and processors (the first two items in the previous list) are the root causes of the buffer building up, which leads to latency.

To see how much data is being stored in the buffer, monitor the bufferUsage.value metric. The only way to lower latency within the buffer is to optimize the pipeline processors and the sink bulk request latency, depending on which of those is the bottleneck.

The bulkRequestLatency metric measures the time taken to execute bulk requests, including retries, and can be used to monitor write performance to the OpenSearch sink. If this metric reports an unusually high value, it indicates that the OpenSearch sink may be overloaded, causing increased processing time. To troubleshoot further, review the bulkRequestNumberOfRetries.count metric to confirm whether the high latency is due to rejections from OpenSearch that are leading to retries, such as throttling (429 errors) or other reasons. If document errors are present, examine the configured DLQ to identify the failed document details. Additionally, the max_retries parameter can be configured in the pipeline configuration to limit the number of retries. However, if the documentErrors metric reports zero, bulkRequestNumberOfRetries.count is also zero, and bulkRequestLatency remains high, it’s likely an indicator that the OpenSearch sink is overloaded. In this case, review the destination metrics for more details.

If the bulkRequestLatency metric is low (for example, less than 1.5 seconds) and the bulkRequestNumberOfRetries metric is reported as 0, then the bottleneck is likely within the pipeline processors. To monitor the performance of the processors, review each processor’s timeElapsed.avg metric. This metric reports the time taken for the processor to complete processing of a batch of records. For example, if a grok processor is reporting a much higher value than other processors for timeElapsed, it may be due to a slow grok pattern that can be optimized or even replaced with a more performant processor, depending on the use case.
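
The reasoning in this scenario can be summarized as a small decision function. The sketch below is only an illustration of the flow described above: the function name, the return strings, and the sample readings are hypothetical, and the 1.5-second cutoff is the example value from the text rather than a fixed limit.

```python
from typing import Dict


def locate_bottleneck(bulk_latency_seconds: float, bulk_retries: int,
                      document_errors: int,
                      processor_time_elapsed: Dict[str, float],
                      high_latency_seconds: float = 1.5) -> str:
    """Point to the most likely source of pipeline latency."""
    if bulk_latency_seconds >= high_latency_seconds:
        if bulk_retries > 0:
            return "sink: retries (check for throttling such as 429s)"
        if document_errors > 0:
            return "sink: document errors (inspect the DLQ)"
        return "sink: likely overloaded (review destination metrics)"
    if bulk_retries == 0 and processor_time_elapsed:
        # Low sink latency and no retries: look at the slowest processor first.
        slowest = max(processor_time_elapsed, key=processor_time_elapsed.get)
        return f"processors: start with {slowest}"
    return "inconclusive"


# Hypothetical timeElapsed.avg readings per processor.
print(locate_bottleneck(0.4, 0, 0, {"grok": 900.0, "date": 12.0}))
print(locate_bottleneck(3.0, 0, 0, {}))
```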

Scenario 2 – Understanding and Resolving Document Errors to OpenSearch

The documentErrors.count metric tracks the number of documents that failed to be sent by bulk requests. Failures can happen for various reasons, such as mapping conflicts, invalid data formats, or schema mismatches. When this metric reports a non-zero value, it indicates that some documents are being rejected by OpenSearch. To identify the root cause, examine the configured dead-letter queue (DLQ), which captures the failed documents along with error details. The DLQ provides information about why specific documents failed, enabling you to identify patterns such as incorrect field types, missing required fields, or data that exceeds size limits. For example, the following sample DLQ objects show common issues:

Mapper parsing exception:

    {"dlqObjects": [{
        "pluginId": "opensearch",
        "pluginName": "opensearch",
        "pipelineName": "",
        "failedData": {
            "index": "",
            "indexId": null,
            "status": 400,
            "message": "failed to parse field [] of type [integer] in document with id ''. Preview of field's value: 'N/A' caused by For input string: \"N/A\"",
            "document": {}
        },
        "timestamp": "…"
    }]}

Here, OpenSearch cannot store the text string “N/A” in a field that accepts only numbers, so it rejects the document, and the pipeline stores it in the DLQ.

Limit of total fields exceeded:

    {"dlqObjects": [{
        "pluginId": "opensearch",
        "pluginName": "opensearch",
        "pipelineName": "",
        "failedData": {
            "index": "",
            "indexId": null,
            "status": 400,
            "message": "Limit of total fields [] has been exceeded",
            "document": {}
        },
        "timestamp": "…"
    }]}

The index.mapping.total_fields.limit setting controls the maximum number of fields allowed in an index mapping, and exceeding this limit causes indexing operations to fail. You can check whether all these fields are required, or use the various processors provided by OpenSearch Ingestion to transform the data.

Once these issues are identified, you can either correct the source data, adjust the pipeline configuration to transform the data appropriately, or modify the OpenSearch index mapping to accommodate the incoming data format.
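
When a DLQ holds many objects, grouping them by error type helps spot patterns quickly. The following is a minimal sketch that assumes the DLQ file has already been downloaded from Amazon S3 as a JSON string with the dlqObjects and failedData.message fields shown in the samples above. The category names, the field names price and qty, and the limit value 1000 in the sample data are made up for illustration.

```python
import json
from collections import Counter


def summarize_dlq(dlq_json: str) -> Counter:
    """Count failed documents per error category based on failedData.message."""
    counts: Counter = Counter()
    for obj in json.loads(dlq_json).get("dlqObjects", []):
        message = obj.get("failedData", {}).get("message", "")
        if "failed to parse field" in message:
            counts["mapper_parsing"] += 1
        elif "Limit of total fields" in message:
            counts["total_fields_limit"] += 1
        else:
            counts["other"] += 1
    return counts


# Hypothetical DLQ content shaped like the samples above.
sample = json.dumps({"dlqObjects": [
    {"failedData": {"status": 400,
                    "message": "failed to parse field [price] of type [integer]"}},
    {"failedData": {"status": 400,
                    "message": "Limit of total fields [1000] has been exceeded"}},
    {"failedData": {"status": 400,
                    "message": "failed to parse field [qty] of type [integer]"}},
]})
summary = summarize_dlq(sample)
print(summary.most_common())
```

The most frequent category is usually the best place to start, whether the fix is in the source data, the pipeline configuration, or the index mapping.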

Clean up

When setting up alarms for monitoring your OpenSearch Ingestion pipelines, it’s important to be mindful of the potential costs involved. Each alarm you configure incurs charges based on the CloudWatch pricing model.

To avoid unnecessary expenses, we recommend carefully evaluating your alarm requirements and configuring them accordingly. Only set up the alarms that are essential for your use case, and regularly review your alarm configurations to identify and remove unused or redundant alarms.

Conclusion

In this post, we explored the comprehensive monitoring capabilities for OpenSearch Ingestion pipelines through CloudWatch alarms, covering key metrics across various sources, processors, and sinks. Although this post highlights the most critical metrics, there’s more to discover. For a deeper dive, refer to the following resources:

Effective monitoring through CloudWatch alarms is crucial for maintaining healthy ingestion pipelines and optimal data flow.


About the authors

Utkarsh Agarwal

Utkarsh is a Cloud Support Engineer in the Support Engineering team at AWS. He provides guidance and technical assistance to customers, helping them build scalable, highly available, and secure solutions in the AWS Cloud. In his free time, he enjoys watching movies, TV series, and of course cricket! Lately, he’s also trying to master foosball.

Ramesh Chirumamilla

Ramesh is a Technical Manager with Amazon Web Services. In his role, Ramesh works proactively to help craft and execute strategies to drive customers’ adoption and use of AWS services. He uses his experience working with Amazon OpenSearch Service to help customers cost-optimize their OpenSearch domains by helping them right-size and implement best practices.

Taylor Gray

Taylor is a Software Engineer in the Amazon OpenSearch Ingestion team at Amazon Web Services. He has contributed many features within both Data Prepper and OpenSearch Ingestion to enable scalable solutions for customers. In his free time, he enjoys pickleball, reading, and playing Rocket League.
