This guest post was co-authored with Kostas Diamantis from Skroutz.
At Skroutz, we’re passionate about our product, and it’s always our top priority. We’re constantly working to improve and evolve it, supported by a large and talented team of software engineers. Our product’s continuous innovation and evolution lead to frequent updates, often requiring changes and additions to the schemas of our operational databases.
When we decided to build our own data platform to meet our data needs, such as supporting reporting, business intelligence (BI), and decision-making, the first challenge, and also a strict requirement, was to make sure it wouldn’t block or delay our product development.
We chose Amazon Redshift to promote data democratization, empowering teams across the organization with seamless access to data and enabling faster insights and more informed decision-making. This choice supports a culture of transparency and collaboration, because data becomes readily available for analysis and innovation across all departments.
However, keeping up with schema changes from our operational databases while updating the data warehouse, without constantly coordinating with development teams, delaying releases, or risking data loss, became a new challenge for us.
In this post, we share how we handled real-time schema evolution in Amazon Redshift with Debezium.
Solution overview
Most of our data resides in our operational databases, such as MariaDB and MongoDB. Our approach relies on the change data capture (CDC) technique, which automatically handles the schema evolution of the data stores being captured. For this, we used Debezium together with a Kafka cluster. This solution enables schema changes to be propagated without disrupting the Kafka consumers.
However, handling schema evolution in Amazon Redshift became a bottleneck, prompting us to develop a strategy to address this challenge. It’s important to note that, in our case, changes in our operational databases primarily involve adding new columns rather than breaking changes like altering data types. Therefore, we implemented a semi-manual process to solve this issue, along with a mandatory alerting mechanism to notify us of any schema changes. This two-step process consists of handling schema evolution in real time and handling data updates in an asynchronous manual step. The following architectural diagram illustrates a hybrid deployment model, integrating both on-premises and cloud-based components.

The data flow begins with data from MariaDB and MongoDB, captured using Debezium for CDC in near real-time mode. The captured data is streamed to a Kafka cluster, where Kafka consumers (built on the Ruby Karafka framework) read it and write it to the staging area, either in Amazon Redshift or Amazon Simple Storage Service (Amazon S3). From the staging area, DataLoaders promote the data to production tables in Amazon Redshift. At this stage, we apply the slowly changing dimension (SCD) concept to these tables, using Type 7 for most of them.
In data warehousing, an SCD is a dimension that stores data that, although generally stable, might change over time. Various methodologies address the complexities of SCD management. SCD Type 7 places both the surrogate key and the natural key into the fact table. This allows the user to select the appropriate dimension records based on:
- The primary effective date on the fact record
- The most recent or current information
- Other dates associated with the fact record
Afterwards, analytical jobs run to create reporting tables, enabling BI and reporting processes. The following diagram provides an example of the data modeling process from a staging table to a production table.

The architecture depicted in the diagram shows only our CDC pipeline, which fetches data from our operational databases; it doesn’t include other pipelines, such as those for fetching data through APIs, scheduled batch processes, and many more. Also note that, by convention, dw_* columns are used to hold SCD metadata and other metadata in general. In the following sections, we discuss the key components of the solution in more detail.
Real-time workflow
For the schema evolution part, we focus on the dw_md_missing_data column, which captures, in near real time, schema changes that occur in the source databases. When a new change is produced to the Kafka cluster, the Kafka consumer is responsible for writing this change to the staging table in Amazon Redshift. For example, when a new shop entity is created, a message produced by Debezium to the Kafka cluster will have the following structure:
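The exact payload depends on the connector and its configuration; a simplified sketch of the message value, trimmed to the relevant fields and assuming a hypothetical shops table with just id and name columns, could look like this:

```json
{
  "before": null,
  "after": {
    "id": 1,
    "name": "My Shop"
  },
  "source": {
    "connector": "mysql",
    "db": "skroutz",
    "table": "shops"
  },
  "op": "c",
  "ts_ms": 1704067200000
}
```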
The Kafka consumer is responsible for preparing and executing the SQL INSERT statement:
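Under the same assumptions (a staging.shops table whose known columns are id and name, plus the dw_md_missing_data column), the generated statement might look like this:

```sql
-- Every field in the "after" image matches a known column, so
-- dw_md_missing_data stays empty
INSERT INTO staging.shops (id, name, dw_md_missing_data)
VALUES (1, 'My Shop', NULL);
```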
After that, let’s say a new column called new_column is added to the source table, with the value new_value.
The new message produced to the Kafka cluster will have the following format:
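Continuing the sketch, the "after" image now carries the extra field, which is not yet part of the consumer’s known schema:

```json
{
  "before": null,
  "after": {
    "id": 2,
    "name": "Another Shop",
    "new_column": "new_value"
  },
  "source": {
    "connector": "mysql",
    "db": "skroutz",
    "table": "shops"
  },
  "op": "c",
  "ts_ms": 1704067260000
}
```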
Now the SQL INSERT statement executed by the Kafka consumer will be as follows:
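Because new_column isn’t in the known schema, the consumer folds it into dw_md_missing_data as key-value JSON. A sketch under the same assumptions, using Redshift’s JSON_PARSE function to turn the JSON text into a SUPER value:

```sql
-- Unknown fields are collected into the dw_md_missing_data SUPER column
INSERT INTO staging.shops (id, name, dw_md_missing_data)
VALUES (2, 'Another Shop', JSON_PARSE('{"new_column": "new_value"}'));
```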
The consumer performs an INSERT as it would for the known schema, and anything new is added to the dw_md_missing_data column as key-value JSON. After the data is promoted from the staging table to the production table, it will have the following structure.
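The exact columns depend on each table’s design; an illustrative sketch of the production rows, with the dw_* SCD metadata columns abbreviated, might look like this:

```
 id | name         | dw_md_missing_data           | dw_* (SCD metadata)
----+--------------+------------------------------+---------------------
  1 | My Shop      |                              | ...
  2 | Another Shop | {"new_column":"new_value"}   | ...
```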

At this point, the data flow continues running without any data loss or the need to coordinate with the teams responsible for maintaining the schema in the operational databases. However, this data might not be easily accessible to data consumers, analysts, or other personas. It’s worth noting that dw_md_missing_data is defined as a column of the SUPER data type, which was introduced in Amazon Redshift to store semistructured data or documents as values.
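Until the schema is remediated, the captured values are still reachable through Redshift’s PartiQL navigation over SUPER, for example:

```sql
-- Dot notation navigates into the SUPER value; rows without the key
-- simply return NULL
SELECT id, dw_md_missing_data.new_column
FROM production.shops
WHERE dw_md_missing_data IS NOT NULL;
```

This works, but it requires consumers to know which keys to look for, which is why we treat it only as a stopgap until the column is added properly.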
Monitoring mechanism
To track new columns added to a table, we have a scheduled process that runs weekly. This process checks for tables in Amazon Redshift with values in the dw_md_missing_data column and generates a list of tables requiring manual action to make this data available through a structured schema. A notification is then sent to the team.
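The scheduling and notification tooling isn’t shown here, but the per-table check itself is a simple count; a sketch for a single table:

```sql
-- Tables with non-empty dw_md_missing_data need manual remediation
SELECT 'production.shops' AS table_name,
       COUNT(*)           AS pending_rows
FROM production.shops
WHERE dw_md_missing_data IS NOT NULL;
```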
Manual remediation steps
In the aforementioned example, the manual steps to make this column available would be:
- Add the new columns to both the staging and production tables (the ALTER TABLE statements are shown in the sketch after this list).
- Update the Kafka consumer’s known schema. In this step, we just need to add the new column name to a simple array list (noted as a comment in the sketch after this list).
- Update the DataLoader’s SQL logic for the new column. A DataLoader is responsible for promoting the data from the staging area to the production table.
- Transfer the data that has been loaded in the meantime from the dw_md_missing_data SUPER column to the newly added column, and then clean up. In this step, we just need to run a data migration like the one in the sketch after this list.
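Putting these steps together, a sketch of the remediation, assuming new_column holds short text and was the only key captured in dw_md_missing_data:

```sql
-- Step 1: add the new column to both staging and production tables
ALTER TABLE staging.shops ADD COLUMN new_column VARCHAR(255);
ALTER TABLE production.shops ADD COLUMN new_column VARCHAR(255);

-- Steps 2 and 3 happen in application code: add 'new_column' to the
-- Kafka consumer's known-schema array and to the DataLoader's
-- promotion SQL.

-- Step 4: backfill rows loaded in the meantime, then clear the SUPER
-- column (safe here because new_column was its only key)
UPDATE production.shops
SET new_column         = dw_md_missing_data.new_column::VARCHAR,
    dw_md_missing_data = NULL
WHERE dw_md_missing_data.new_column IS NOT NULL;
```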
While performing the preceding operations, we make sure that no one else makes changes to the production.shops table, because we don’t want new data to be added to the dw_md_missing_data column.
Conclusion
The solution discussed in this post enabled Skroutz to manage schema evolution in its operational databases while seamlessly updating the data warehouse. This alleviated the need for constant coordination with development teams and removed the risk of data loss during releases, ultimately fostering innovation rather than stifling it.
As Skroutz’s migration to the AWS Cloud approaches, discussions are underway on how the current architecture can be adapted to align more closely with AWS-centered principles. To that end, one of the changes being considered is Amazon Redshift streaming ingestion from Amazon Managed Streaming for Apache Kafka (Amazon MSK) or open source Kafka, which will make it possible for Skroutz to process large volumes of streaming data from multiple sources with low latency and high throughput to derive insights in seconds.
If you face similar challenges, talk to an AWS representative and work backward from your use case to find the most suitable solution.
About the authors
Konstantina Mavrodimitraki is a Senior Solutions Architect at Amazon Web Services, where she assists customers in designing scalable, robust, and secure systems in global markets. With deep expertise in data strategy, data warehousing, and big data systems, she helps organizations transform their data landscapes. A passionate technologist and people person, Konstantina loves exploring emerging technologies and supports local tech communities. She also enjoys reading books and playing with her dog.
Kostas Diamantis is the Head of the Data Warehouse at Skroutz. With a background in software engineering, he transitioned into data engineering, using his technical expertise to build scalable data solutions. Passionate about data-driven decision-making, he focuses on optimizing data pipelines, enhancing analytics capabilities, and driving business insights.