Kaltura reduces observability operational costs by 60% with Amazon OpenSearch Service


This post is co-written with Ido Ziv from Kaltura.

As organizations grow, managing observability across multiple teams and applications becomes increasingly complex. Logs, metrics, and traces generate vast amounts of data, making it challenging to maintain performance, reliability, and cost-efficiency.

At Kaltura, an AI-infused video-first company serving millions of users across hundreds of applications, observability is mission-critical. Understanding system behavior at scale isn't just about troubleshooting; it's about providing seamless experiences for customers and employees alike. But achieving effective observability at this scale comes with challenges: managing spans; correlating logs, traces, and events across distributed systems; and maintaining visibility without overwhelming teams with noise. Balancing granularity, cost, and actionable insights requires constant tuning and thoughtful architecture.

In this post, we share how Kaltura transformed its observability strategy and technology stack by migrating from a software as a service (SaaS) logging solution to Amazon OpenSearch Service, achieving longer log retention, a 60% reduction in cost, and a centralized platform that empowers multiple teams with real-time insights.

Observability challenges at scale

Kaltura ingests over 8 TB of logs and traces daily, processing more than 20 billion events across 6 production AWS Regions and over 200 applications, with log spikes reaching up to 6 GB per second. This immense data volume, combined with a highly distributed architecture, created significant observability challenges. Historically, Kaltura relied on a SaaS-based observability solution that met initial requirements but became increasingly difficult to scale. As the platform evolved, teams generated disparate log formats, applied retention policies that no longer reflected data value, and operated more than 10 organically grown observability sources. The lack of standardization and visibility required extensive manual effort to correlate data, maintain pipelines, and troubleshoot issues, leading to growing operational complexity and fixed costs that didn't scale well with usage.

Kaltura's DevOps team recognized the need to reassess their observability solution and began exploring a variety of options, from self-managed platforms to fully managed SaaS offerings. After a comprehensive evaluation, they made the strategic decision to migrate to OpenSearch Service, using its advanced features such as Amazon OpenSearch Ingestion, the Observability plugin, UltraWarm storage, and Index State Management.

Solution overview

Kaltura created a new AWS account to serve as a dedicated observability account, where OpenSearch Service was deployed. Logs and traces were collected from different accounts and producers, such as microservices on Amazon Elastic Kubernetes Service (Amazon EKS) and services running on Amazon Elastic Compute Cloud (Amazon EC2).

By using AWS services such as AWS Identity and Access Management (IAM), AWS Key Management Service (AWS KMS), and Amazon CloudWatch, Kaltura was able to meet the standards for a production-grade system while keeping security and reliability in mind. The following figure shows a high-level design of the environment setup.

Ingestion

As seen in the following diagram, logs are shipped using log shippers, also known as collectors; in Kaltura's case, Fluent Bit. A log shipper is a tool designed to collect, process, and transport log data from various sources to a centralized location, such as a log analytics platform, a management system, or an aggregator. Fluent Bit was used for all sources and also provided light processing capabilities. It was deployed as a DaemonSet in Kubernetes. The application development teams didn't have to change their code, because the Fluent Bit pods read the stdout of the application pods.

The following code is an example of Fluent Bit configuration for Amazon EKS:

[INPUT]
   Name                tail
   Path                /var/log/containers/*.log
   Tag                 kube.*
   Skip_Long_Lines     On
   multiline.parser    docker, cri
[FILTER]
   alias               k8s
   # kubernetes filter to parse all logs
   Name                kubernetes
   Match               kube.*
   Kube_Tag_Prefix     kube.var.log.containers.
   Annotations         On
   Labels              Off
   Merge_Log           On
   Keep_Log            Off
   Kube_URL            https://kubernetes.default.svc.cluster.local:443
[FILTER]
   alias               apps
   Name                rewrite_tag
   Match               kube.*
   Rule                $kubernetes['annotations']['kaltura.com/observability'] ^apps$ apps.$TAG false
[OUTPUT]
   Name                http
   Match               apps.*
   Alias               apps
   Host                xxxxx.us-east-1.osis.amazonaws.com
   Port                443
   URI                 /log/apps
   Format              json
   aws_auth            true
   aws_region          us-east-1
   aws_service         osis
   aws_role_arn        arn:aws:iam::xxxxx:role/osis-ingestion-role
   Log_Level           trace
   tls                 On

Spans and traces were collected directly from the application layer using a seamless integration approach. To facilitate this, Kaltura deployed an OpenTelemetry (OTel) Collector using the OpenTelemetry Operator for Kubernetes. Additionally, the team developed a custom OTel code library, which was incorporated into the application code to efficiently capture and log traces and spans, providing comprehensive observability across their system.
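As a rough illustration, a Collector deployed this way might use a configuration along the following lines; the endpoint, region, and the use of the sigv4auth authenticator extension are assumptions for this sketch, not Kaltura's actual setup:

```yaml
# Minimal OTel Collector sketch: receive OTLP from application pods and
# forward traces to an OpenSearch Ingestion endpoint, signed with SigV4.
receivers:
  otlp:
    protocols:
      grpc:
      http:
extensions:
  sigv4auth:
    region: "us-east-1"
exporters:
  otlphttp:
    traces_endpoint: "https://xxxxx.us-east-1.osis.amazonaws.com/entry-pipeline/v1/traces"
    auth:
      authenticator: sigv4auth
service:
  extensions: [sigv4auth]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
```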

Data from Fluent Bit and the OpenTelemetry Collector was sent to OpenSearch Ingestion, a fully managed, serverless data collector that delivers real-time log, metric, and trace data to OpenSearch Service domains and Amazon OpenSearch Serverless collections. Each producer sent data to a specific pipeline, one for logs and one for traces, where data was transformed, aggregated, enriched, and normalized before being sent to OpenSearch Service. The trace pipeline used the otel_trace and service_map processors, following the OpenSearch Ingestion OpenTelemetry trace analytics blueprint.
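A simplified sketch of such a trace pipeline, modeled on the trace analytics blueprint, might look like the following; the path and parameter values are illustrative, not Kaltura's production configuration:

```yaml
version: "2"
otel-entry-pipeline:
  source:
    otel_trace_source:
      path: "/trace/ingest"
  sink:
    # Fan out to the two sub-pipelines below
    - pipeline:
        name: "raw-traces"
    - pipeline:
        name: "service-map"
raw-traces:
  source:
    pipeline:
      name: "otel-entry-pipeline"
  processor:
    - otel_traces:
  sink:
    - opensearch:
        hosts: ["${opensearch_host}"]
        index_type: trace-analytics-raw
        aws:
          sts_role_arn: "${sts_role_arn}"
          region: "${region}"
service-map:
  source:
    pipeline:
      name: "otel-entry-pipeline"
  processor:
    - service_map:
  sink:
    - opensearch:
        hosts: ["${opensearch_host}"]
        index_type: trace-analytics-service-map
        aws:
          sts_role_arn: "${sts_role_arn}"
          region: "${region}"
```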

The following code is an example of the OpenSearch Ingestion pipeline for logs:

version: "2"
entry-pipeline:
  source:
    http:
      path: "/log/apps"

  processor:
    - add_entries:
        entries:
          - key: "log_type"
            value: "default"
          - key: "log_type"
            value: "api"
            add_when: 'contains(/filename, "api.log")'
            overwrite_if_key_exists: true
          - key: "log_type"
            value: "stats"
            add_when: 'contains(/filename, "stats.log")'
            overwrite_if_key_exists: true
          - key: "log_type"
            value: "event"
            add_when: 'contains(/filename, "event.log")'
            overwrite_if_key_exists: true
          - key: "log_type"
            value: "login"
            add_when: 'contains(/filename, "login.log")'
            overwrite_if_key_exists: true

    - grok:
        grok_when: '/log_type == "api"'
        match:
          log: ['^\[%{DATA:timestamp}\] \[%{DATA:logIp}\] \[%{DATA:host}\] \[%{WORD:id}\] %{WORD:priorityName}\(%{NUMBER:priority}\): \[memory: %{DATA:memory} MB, real: %{DATA:real}MB\] %{GREEDYDATA:message}']

    - date:
        match:
          - key: timestamp
            patterns: ["dd-MMM-yyyy HH:mm:ss", "dd/MMM/yyyy:HH:mm:ss Z", "EEE MMM dd HH:mm:ss.SSSSSS yyyy"]
        destination: "@timestamp"
        output_format: "yyyy-MM-dd'T'HH:mm:ss"

    - rename_keys:
        entries:
          - from_key: "timestamp"
            to_key: "@timestamp"
            overwrite_if_to_key_exists: false
          - from_key: "date"
            to_key: "@timestamp"
            overwrite_if_to_key_exists: false

    - drop_events:
        drop_when: 'contains(/filename, "simplesamlphp.log")'

  sink:
    - opensearch:
        hosts: ["${opensearch_host}"]
        index: '$${/env}-api-$${/log_type}-app-logs'
        index_type: custom
        action: create
        bulk_size: 20
        aws:
          sts_role_arn: ${sts_role_arn}
          region: ${region}
        dlq:
          s3:
            bucket: "${bucket}"
            key_path_prefix: 'my-app-dlq-files'
            region: "${region}"
            sts_role_arn: "${sts_role_arn}"

The preceding example shows the use of processors such as grok, date, add_entries, rename_keys, and drop_events:

  • add_entries:
    • Adds a new field log_type based on the filename
    • Default: "default"
    • If the filename contains specific substrings (such as api.log or stats.log), it assigns a more specific type
  • grok:
    • Applies Grok parsing to logs of type "api"
    • Extracts fields like timestamp, logIp, host, priorityName, priority, memory, real, and message using a custom pattern
  • date:
    • Parses timestamp strings into a standard datetime format
    • Stores the result in a field called @timestamp in ISO 8601 format
    • Handles multiple timestamp patterns
  • rename_keys:
    • Renames timestamp or date to @timestamp
    • Doesn't overwrite @timestamp if it already exists
  • drop_events:
    • Drops logs where the filename contains simplesamlphp.log
    • This is a filtering rule to ignore noisy or irrelevant logs

The following is an example of an input log line:

   "log": "[25-Mar-2025 18:23:18] [127.0.0.1] [the-most-awesome-server-in-kaltura] [67e2f496cc321] INFO(6): [memory: 4.51 MB, real: 6MB] [request: 1] [time: 0.0263s / total: 0.0263s]",

After processing, we get the following output:

    "log_type": "api",
    "priorityName": "INFO",
    "memory": "4.51",
    "host": "the-most-awesome-server-in-kaltura",
    "real": "6",
    "priority": "6",
    "message": "[request: 1] [time: 0.0263s / total: 0.0263s]",
    "logIp": "127.0.0.1",
    "id": "67e2f496cc321",
    "@timestamp": "2025-03-25T18:23:18"
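To sanity-check the grok pattern outside the pipeline, a rough Python equivalent of the extraction can be useful. This is a local testing sketch, not part of the pipeline itself:

```python
import re

# Python regex roughly equivalent to the grok pattern used in the pipeline.
LOG_PATTERN = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] "
    r"\[(?P<logIp>[^\]]+)\] "
    r"\[(?P<host>[^\]]+)\] "
    r"\[(?P<id>\w+)\] "
    r"(?P<priorityName>\w+)\((?P<priority>\d+)\): "
    r"\[memory: (?P<memory>[^ ]+) MB, real: (?P<real>[^\]]+)MB\] "
    r"(?P<message>.*)$"
)

def parse_api_log(line: str) -> dict:
    """Return the extracted fields, or an empty dict when the line doesn't match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else {}

sample = ("[25-Mar-2025 18:23:18] [127.0.0.1] [the-most-awesome-server-in-kaltura] "
          "[67e2f496cc321] INFO(6): [memory: 4.51 MB, real: 6MB] "
          "[request: 1] [time: 0.0263s / total: 0.0263s]")
fields = parse_api_log(sample)
```

Running this against the sample line yields the same field values shown in the processed output above.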

Kaltura followed several OpenSearch Ingestion best practices, such as:

  • Including a dead-letter queue (DLQ) in the pipeline configuration, which significantly helps troubleshoot pipeline issues
  • Starting and stopping pipelines to optimize cost-efficiency, when possible
  • During the proof of concept stage:
    • Installing Data Prepper locally for faster development iterations
    • Disabling persistent buffering to expedite blue/green deployments
Achieving operational excellence with efficient log and trace management

Logs and traces play a vital role in identifying operational issues, but they come with unique challenges. First, they represent time series data, which inherently evolves over time. Second, their value typically diminishes as time passes, making efficient management crucial. Third, they're append-only in nature. With OpenSearch, Kaltura faced distinct trade-offs between cost, data retention, and latency. The goal was to make sure valuable data remained accessible to engineering teams with minimal latency, but the solution also needed to be cost-effective. Balancing these factors required thoughtful planning and optimization.

Data was ingested into OpenSearch data streams, which simplify the process of ingesting append-only time series data. Multiple Index State Management (ISM) policies were applied to different data streams, depending on log retention requirements. ISM policies handled moving indexes from hot storage to UltraWarm, and eventually deleting the indexes. This allowed a customizable and cost-effective solution, with low latency for querying new data and reasonable latency for querying historical data.

The following example ISM policy makes sure indexes are managed efficiently, rolled over, and moved to different storage tiers based on their age and size, and eventually deleted after 60 days. If an action fails, it's retried with an exponential backoff strategy. In case of failures, notifications are sent to relevant teams to keep them informed.

{
    "id": "retention",
    "policy": {
        "description": "production ISM",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [
                    {
                        "retry": {
                            "count": 5,
                            "backoff": "exponential",
                            "delay": "1h"
                        },
                        "rollover": {
                            "min_primary_shard_size": "30gb",
                            "copy_alias": false
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "warm",
                        "conditions": {
                            "min_index_age": "2d"
                        }
                    }
                ]
            },
            {
                "name": "warm",
                "actions": [
                    {
                        "retry": {
                            "count": 5,
                            "backoff": "exponential",
                            "delay": "1h"
                        },
                        "warm_migration": {}
                    }
                ],
                "transitions": [
                    {
                        "state_name": "cold",
                        "conditions": {
                            "min_index_age": "14d"
                        }
                    }
                ]
            },
            {
                "name": "cold",
                "actions": [
                    {
                        "retry": {
                            "count": 5,
                            "backoff": "exponential",
                            "delay": "1h"
                        },
                        "cold_migration": {
                            "start_time": null,
                            "end_time": null,
                            "timestamp_field": "@timestamp",
                            "ignore": "none"
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "60d"
                        }
                    }
                ]
            },
            {
                "name": "delete",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "cold_delete": {}
                    }
                ],
                "transitions": []
            }
        ],
        "ism_template": [
            {
                "index_patterns": [
                    "*-logs"
                ],
                "priority": 50
            }
        ]
    }
}

To create a data stream in OpenSearch, an index template definition is required, which configures how the data stream and its backing indexes behave. In the following example, the index template specifies key index settings such as the number of shards, replication, and refresh interval, controlling how data is distributed, replicated, and refreshed across the cluster. It also defines the mappings, which describe the structure of the data: what fields exist, their types, and how they should be indexed. These mappings make sure the data stream knows how to interpret and store incoming log data efficiently. Finally, the template enables the @timestamp field as the time-based field required for a data stream.

{
  "index_patterns": [
    "*my-app-logs"
  ],
  "template": {
    "settings": {
      "index.number_of_shards": "32",
      "index.number_of_replicas": "0",
      "index.refresh_interval": "60s"
    },
    "mappings": {
      "properties": {
        "priorityName": {
          "type": "keyword"
        },
        "log_type": {
          "type": "keyword"
        },
        "@timestamp": {
          "type": "date"
        },
        "memory": {
          "type": "float"
        },
        "host": {
          "type": "keyword"
        },
        "pid": {
          "type": "keyword"
        },
        "real": {
          "type": "float"
        },
        "env": {
          "type": "keyword"
        },
        "message": {
          "type": "text"
        },
        "priority": {
          "type": "integer"
        },
        "logIp": {
          "type": "ip"
        }
      }
    }
  },
  "composed_of": [],
  "priority": "100",
  "_meta": {
    "flow": "simple"
  },
  "data_stream": {
    "timestamp_field": {
      "name": "@timestamp"
    }
  },
  "name": "my-app-logs"
}

Implementing role-based access control and user access

The new observability platform is accessed by many types of users; internal users log in to OpenSearch Dashboards using SAML-based federation with Okta. The following diagram illustrates the user flow.

Each user accesses the dashboards to view observability objects relevant to their role. Fine-grained access control (FGAC) is enforced in OpenSearch using built-in IAM role and SAML group mappings to implement role-based access control (RBAC). When users log in to the OpenSearch domain, they're automatically routed to the appropriate tenant based on their assigned role. This setup makes sure developers can create dashboards tailored to debugging in development environments, and support teams can build dashboards focused on identifying and troubleshooting production issues. The SAML integration removes the need to manage internal OpenSearch users entirely.

For each role in Kaltura, a corresponding OpenSearch role was created with only the required permissions. For instance, support engineers are granted access to the monitoring plugin to create alerts based on logs, whereas QA engineers, who don't require this functionality, are not granted that access.

The following screenshot shows the role of the DevOps engineers, defined with cluster permissions.

These users are routed to their own dedicated DevOps tenant, to which only they have write access. This makes it possible for users in different roles at Kaltura to create the dashboard objects that focus on their priorities and needs. OpenSearch supports backend role mapping; Kaltura mapped each Okta group to the corresponding role, so when a user logs in from Okta, they're automatically assigned based on their role.
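As an illustration, a backend role mapping of this kind can be created with the OpenSearch security REST API by sending a request body like the following to PUT _plugins/_security/api/rolesmapping/devops_role; the role and group names here are hypothetical:

```json
{
  "backend_roles": [
    "okta-devops-group"
  ],
  "hosts": [],
  "users": []
}
```

Any user arriving through SAML with the okta-devops-group backend role is then mapped to the devops_role OpenSearch role without creating an internal user.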

This also works with IAM roles to facilitate automations in the cluster using external services, such as OpenSearch Ingestion pipelines, as can be seen in the following screenshot.

Using observability features and service mapping for enhanced trace and log correlation

After a user is logged in, they can use the Observability plugins, view surrounding events in logs, correlate logs and traces, and use the Trace Analytics plugin. Users can inspect traces and spans, and group traces with latency information using built-in dashboards. Users can also drill down to a specific trace or span and correlate it back to log events. The service_map processor used in OpenSearch Ingestion sends OpenTelemetry data to create a distributed service map for visualization in OpenSearch Dashboards.

Using the combined signals of traces and spans, OpenSearch discovers the application connectivity and maps it into a service map.

After OpenSearch ingests the traces and spans from OTel, they're aggregated into groups according to paths and trends. Durations are also calculated and presented to the user over time.

With a trace ID, it's possible to filter all the relevant spans by service and see how long each took, identifying issues with external services such as MongoDB and Redis.
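For example, a search along these lines against the span index retrieves the spans of a single trace; the trace ID below is a placeholder, and otel-v1-apm-span-* is the default index pattern used by the trace analytics pipeline:

```json
GET otel-v1-apm-span-*/_search
{
  "query": {
    "term": {
      "traceId": "0000000000000000abcdef1234567890"
    }
  },
  "sort": [
    { "startTime": { "order": "asc" } }
  ]
}
```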

From the spans, users can discover the associated logs.

Post-migration enhancements

After the migration, a strong developer community emerged within Kaltura that embraced the new observability solution. As adoption grew, so did requests for new features and enhancements aimed at improving the overall developer experience.

One key improvement was extending log retention. Kaltura achieved this by re-ingesting historical logs from Amazon Simple Storage Service (Amazon S3) using a dedicated OpenSearch Ingestion pipeline with Amazon S3 read permissions. With this enhancement, teams can access and analyze logs from up to a year ago using the same familiar dashboards and filters.
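A re-ingestion pipeline of this kind might be sketched as follows, using the OpenSearch Ingestion S3 scan source; the bucket, index, and role names are placeholders, not Kaltura's actual values:

```yaml
version: "2"
s3-replay-pipeline:
  source:
    s3:
      codec:
        newline: {}
      compression: "gzip"
      # Scan existing objects in the archive bucket instead of reading live events
      scan:
        buckets:
          - bucket:
              name: "archived-app-logs"
      aws:
        region: "${region}"
        sts_role_arn: "${s3_read_role_arn}"
  sink:
    - opensearch:
        hosts: ["${opensearch_host}"]
        index: "replayed-app-logs"
        aws:
          sts_role_arn: "${sts_role_arn}"
          region: "${region}"
```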

In addition to monitoring EKS clusters and EC2 instances, Kaltura expanded its observability stack by integrating additional AWS services. Amazon API Gateway and AWS Lambda were introduced to support log ingestion from external vendors, allowing for seamless correlation with existing data and broader visibility across systems.

Finally, to empower teams and promote autonomy, data stream templates and ISM policies are managed directly by developers within their own repositories. By using infrastructure as code tools like Terraform, developers can define index mappings, alerts, and dashboards as code, versioned in Git and deployed consistently across environments.
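As a sketch of this pattern, the community OpenSearch Terraform provider exposes resources for index templates and ISM policies; the resource and file names below are illustrative assumptions, not Kaltura's repository layout:

```hcl
# Index template and ISM policy managed as code, with the JSON bodies
# versioned alongside the application in Git.
resource "opensearch_index_template" "app_logs" {
  name = "my-app-logs"
  body = file("${path.module}/templates/my-app-logs.json")
}

resource "opensearch_ism_policy" "retention" {
  policy_id = "retention"
  body      = file("${path.module}/policies/retention.json")
}
```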

Conclusion

Kaltura successfully implemented a smart log retention strategy, extending real-time retention from 5 days for all log types to 30 days for critical logs, while maintaining cost-efficiency through the use of UltraWarm nodes. This approach led to a 60% reduction in costs compared to their previous solution. Additionally, Kaltura consolidated their observability platform, streamlining operations by merging 10 separate systems into a unified, all-in-one solution. This consolidation not only improved operational efficiency but also sparked increased engagement from developer teams, driving feature requests, fostering internal design collaborations, and attracting early adopters for new enhancements. If Kaltura's journey has inspired you and you're interested in implementing a similar solution in your organization, consider these steps:

  • Start by understanding the requirements and setting expectations with the engineering teams in your organization
  • Start with a quick proof of concept to get hands-on experience
  • Refer to the following resources to help you get started:


About the authors

Ido Ziv is a DevOps team leader at Kaltura with over 6 years of experience. His hobbies include sailing and Kubernetes (but not at the same time).

Roi Gamliel is a Senior Solutions Architect helping startups build on AWS. He is passionate about the OpenSearch Project, helping customers fine-tune their workloads and maximize results.

Yonatan Dolan is a Principal Analytics Specialist at Amazon Web Services. He is based in Israel and helps customers harness AWS analytics services to use data, gain insights, and derive value.