A scientific method to benchmarking SQL processing engines on AWS


Deciding on the precise SQL processing resolution for large-scale information analytics is a important resolution for organizations. As information volumes develop exponentially, the expertise panorama has advanced to supply various choices for processing and analyzing this data effectively. This publish presents a scientific framework for evaluating and benchmarking SQL processing engines on AWS, utilizing Apache JMeter to conduct sensible efficiency testing at scale.

The AWS analytics ecosystem

AWS presents a wealthy portfolio of SQL processing options to satisfy varied analytical wants:

  • Serverless question providersAmazon Athena is a serverless, interactive question service that makes use of commonplace SQL to research information in Amazon Easy Storage Service (Amazon S3), providing automated scaling, parallel question execution, and pay-per-query pricing with no infrastructure administration required
  • Information warehouse optionsAmazon Redshift presents scalable, high-performance cloud information warehousing with serverless choices, zero-ETL integrations, AI-powered question help, and seamless machine studying (ML) integration for contemporary analytics at scale
  • Managed open supply enginesAmazon EMR helps Apache Spark SQL, Apache Trino (previously PrestoSQL), and different distributed question frameworks
  • Self-managed choices – You possibly can deploy open supply engines like Apache Spark, Apache Flink, and Trino on Amazon Elastic Kubernetes Service (Amazon EKS) for better management
  • Associate options – You possibly can entry specialised massive information analytics instruments by AWS Market

These choices are additional enhanced by trendy open desk codecs equivalent to Apache Iceberg, Delta Lake, and Apache Hudi, which carry essential enterprise options like ACID (Atomicity, Consistency, Isolation, and Sturdiness) transactions, schema evolution, and time journey capabilities to information lakes. These SQL processing options function below the AWS Shared Duty Mannequin. AWS manages the safety of the underlying infrastructure and providers, and clients are liable for safe configuration, entry administration, and information safety inside their testing environments. This division of duty stays vital when evaluating and benchmarking completely different SQL engines. Correct safety configuration and implementation by clients is important for sustaining a safe analytics atmosphere.

Analysis challenges in SQL engine choice

The wealthy ecosystem of SQL processing choices creates important analysis challenges. Every SQL engine employs distinctive architectural approaches and optimization methods, making direct comparisons advanced. Organizations embarking on this analysis journey face a number of interconnected obstacles:

  • Creating environments that precisely mirror manufacturing eventualities
  • Creating take a look at datasets that mirror real-world information traits and volumes
  • Replicating real-world question patterns and concurrency ranges
  • Sustaining uniform testing situations throughout completely different engine architectures
  • Controlling infrastructure bills all through the analysis course of

Efficiency issues at petabyte scale

When evaluating options for petabyte-scale deployments, the complexity intensifies significantly. A number of important components come into play:

  • Useful resource administration – Distributed SQL engines require exact balancing of CPU, reminiscence, and storage assets. Suboptimal useful resource allocation can result in question failures and efficiency degradation, significantly as information volumes develop.
  • Information distribution patterns – How information is distributed throughout partitions or nodes considerably impacts question efficiency. Information skew can create processing bottlenecks, with some nodes dealing with disproportionate workloads whereas others stay underutilized.
  • Concurrency dealing with – Excessive-concurrency environments demand subtle workload scheduling and useful resource isolation mechanisms. The power to keep up constant efficiency below various concurrent masses turns into a important differentiator between options.
  • Significant metrics – Efficiency analysis at scale requires complete metrics evaluation:
    • Imply, median, and percentile response occasions (significantly p90 and p95)
    • Question throughput below various concurrency ranges
    • Scalability traits throughout various workload varieties
    • Useful resource utilization effectivity throughout peak masses

Limitations of conventional benchmarks

Though industry-standard benchmarks like TPC-DS and TPC-H present beneficial insights, our expertise with a number of buyer engagements has proven that tailor-made, workload-specific testing typically reveals efficiency traits not captured by these standardized exams. That is very true for advanced, multi-tenant environments with various question patterns. Organizations that complement commonplace benchmarks with workload-specific testing usually expertise shorter proof-of-concept cycles, optimized analysis prices, and extra environment friendly testing operations. This complete method helps scale back uncertainty within the closing resolution choice course of.

Conditions

Earlier than you dive into the analysis course of, ensure you have the next stipulations:

  • An AWS account with acceptable permissions to create and handle Amazon Elastic Compute Cloud (Amazon EC2) situations and entry the SQL engines you propose to benchmark.
  • Fundamental familiarity with AWS providers, significantly Amazon EC2 and the SQL engines you propose to guage (equivalent to Athena, Amazon Redshift, or Amazon EMR).
  • Expertise with SQL and information analytics ideas.
  • Entry to the SQL engines you select to benchmark. This publish assumes you’ve already arrange the engines you need to take a look at. For setup directions, consult with the AWS documentation for every service.
  • A dataset appropriate on your benchmarking wants. Dataset creation and loading usually are not lined on this publish. Construct petabyte-scale artificial take a look at information with Amazon EMR on EC2 supplies prescriptive steering to generate take a look at datasets at scale. Ensure your take a look at datasets are saved in S3 buckets with encryption enabled (utilizing SSE-KMS or SSE-S3) and that every one service connections use TLS for information in transit.

Advantages of Apache JMeter

As organizations scale their analytics workloads to petabyte ranges, there’s a rising want for a strong, structured method to SQL question efficiency testing. Though many organizations develop customized testing frameworks or use varied benchmarking instruments, these approaches typically lack standardization and might be troublesome to copy throughout completely different SQL engines. The complexity of contemporary information architectures, mixed with the number of obtainable SQL processing options, calls for a scientific analysis methodology. Apache JMeter emerges as a strong resolution to handle this problem. Although historically identified for internet software testing, JMeter’s extensible structure and strong characteristic set make it significantly well-suited for SQL efficiency testing at scale.JMeter presents a number of benefits for evaluating SQL engines:

  • Help for a number of protocols and connections
  • Means to simulate advanced concurrent workloads
  • Constructed-in efficiency metrics and reporting
  • Extensible structure for customized testing eventualities
  • Integration capabilities with steady integration and steady supply (CI/CD) pipelines

By this proposed framework, which has been validated throughout a number of buyer engagements at petabyte scale, we purpose to assist organizations make extra knowledgeable selections when deciding on a SQL processing resolution. Our expertise working with clients to evaluate varied AWS Analytics providers and open supply options has demonstrated {that a} systematic analysis method considerably reduces proof-of-concept cycles and optimizes useful resource investments. This framework has helped organizations successfully consider providers like Athena, Amazon Redshift, and Amazon EMR, alongside open supply options equivalent to Trino on Amazon EKS, based mostly on their particular workload profiles and efficiency necessities.With this system, organizations can accomplish the next:

  • Navigate the advanced panorama of large-scale information processing applied sciences
  • Scale back proof-of-concept cycles from months to weeks
  • Decrease infrastructure prices throughout analysis phases
  • Make data-driven selections about expertise choice
  • Higher align expertise selections with enterprise necessities
  • Set up repeatable testing patterns for future evaluations

Testing methodology in follow

A profitable SQL engine analysis requires understanding and replicating real-world workload patterns. Our methodology, refined by quite a few buyer engagements, focuses on complete testing throughout a number of dimensions whereas remaining adaptable to particular organizational wants.

Question sample choice

We start by deciding on consultant question patterns that mirror manufacturing workloads:

  • Aggregation queries that summarize massive datasets utilizing operations like SUM, AVG, and COUNT
  • Complicated be part of operations that take a look at the engine’s skill to mix information effectively throughout a number of tables
  • String operations that consider textual content processing capabilities
  • Nested queries that assess the engine’s optimization capabilities for advanced question constructions

A fastidiously chosen set of 8–10 queries usually supplies enough protection whereas maintaining the analysis manageable. These ought to mirror your precise workload traits and enterprise necessities.

Information quantity variations

Testing throughout completely different information volumes is vital for understanding scalability traits. We construction our exams round various information scan ranges:

  • Small-scale scans – Queries accessing 1–7 days of knowledge (megabytes to gigabytes)
  • Massive-scale scans – Queries spanning 14–30 days (terabytes to petabytes)

This method evaluates each I/O effectivity with massive datasets and metadata dealing with with smaller, frequent queries, serving to perceive how providers like Amazon EMR, Amazon Redshift, or Athena optimize question execution throughout completely different entry patterns.

Concurrency testing

Actual-world analytics environments not often course of single queries in isolation. Our methodology incorporates the next options:

  • Progressive concurrency testing beginning at decrease ranges (usually 16, 32, 64, and 128 parallel queries), although these numbers might be adjusted based mostly in your take a look at infrastructure capability and particular necessities. We advocate beginning with smaller concurrency ranges and progressively scaling as much as perceive efficiency traits
  • Diverse question complexity and frequency (known as question weights) to simulate real looking workload distributions. This implies some queries are run extra typically or are extra resource-intensive than others, mimicking real-world utilization patterns.
  • Blended question patterns operating concurrently to check useful resource administration.
  • Constant execution throughout completely different date ranges to guage scaling habits.

This method is especially vital when evaluating managed providers just like the workload administration capabilities of Amazon Redshift or the useful resource allocation methods of Amazon EMR.

Question weight distribution

Manufacturing environments usually see various frequencies of various question varieties. Our framework incorporates weighted question distribution to simulate real-world eventualities extra precisely. In a typical distribution, frequent light-weight queries would possibly signify 60% of the workload, advanced analytical queries would possibly comprise 30%, and resource-intensive information processing operations would possibly make up the remaining 10%.This weighted method makes certain efficiency testing displays precise utilization patterns fairly than synthetic benchmarking eventualities. The precise distribution ought to mirror your group’s particular workload patterns.

Sequential vs. concurrent testing

Our methodology implements two distinct testing phases:

  • Sequential testing – Establishes baseline efficiency metrics:
    • Runs every question sort independently throughout completely different date ranges
    • Runs a number of iterations to offer consistency and determine variability
    • Helps perceive particular person question efficiency traits
  • Concurrent testing – Simulates real-world multi-user eventualities:
    • Implements weighted question distributions
    • Assessments completely different concurrency ranges to determine scaling limitations
    • Evaluates useful resource administration capabilities of various engines

JMeter effectively implements each testing phases whereas sustaining constant take a look at situations throughout SQL engines. Its skill to deal with varied JDBC connections makes it significantly appropriate for testing AWS analytics providers.By this structured method, organizations can collect complete efficiency information reflecting their particular use circumstances, enabling knowledgeable SQL engine choice selections whereas sustaining core ideas of systematic analysis and real looking workload simulation.

Take a look at plans

To guage SQL engines’ efficiency below various workloads, we designed two take a look at eventualities: sequential and concurrent execution plans. Every state of affairs was executed throughout completely different information volumes by adjusting the question date vary filters to cowl 1, 7, 14, and 30 days. These variations simulate typical analytical workloads with progressively growing information sizes.For sequential runs, every take a look at was handled as a definite batch, grouping all queries (Question 1 to Question 9) below the identical date vary—every question will scan information for 1, 7, 14, and 30 days with acceptable date filtering within the question’s the place predicate. We used JMeter to seize common question response occasions for every batch. This configuration was run 3 times, and the ultimate metrics mirror the typical response time throughout these iterations to make sure reliability and account for environmental variance.Though three iterations present preliminary insights, should you observe important variations in outcomes (usually greater than 10% deviation between runs), think about increasing to 10 or extra iterations. This extra sampling helps set up statistical significance, determine true efficiency patterns, and distinguish outliers (past three commonplace deviations) from regular variations. Doc any constant anomalies, as a result of they might point out vital efficiency or safety issues on your particular atmosphere.The next desk reveals the pattern take a look at plans template for the sequential take a look at plan run.

 

Dataset Time Vary Run Question Weights
Question 1 Question 2 Question 3 Question 4 Question 5 Question 6 Question 7 Question 8 Question 9
1 day Run 1
Run 2
Run 3
Avg
7 days Run 1
Run 2
Run 3
Avg
14 days Run 1
Run 2
Run 3
Avg
30 days Run 1
Run 2
Run 3
Avg

For the concurrent take a look at plan, we launched a probabilistic weighted distribution to the queries (Question 1 to Question 9), simulating a extra real looking production-like atmosphere the place question frequency varies based mostly on enterprise relevance and utilization patterns. This added a layer of complexity to higher mirror how the SQL engine would carry out below real-world concurrent entry patterns.The next desk reveals the pattern take a look at plans template for the concurrent take a look at plan run.

Dataset Time Vary Concurrent Runs Question Weights
Question 1 Question 2 Question 3 Question 4 Question 5 Question 6 Question 7 Question 8 Question 9
1 days 8 11% 11% 11% 11% 11% 11% 11% 11% 11%
16 10% 5% 24% 5% 5% 5% 24% 14% 10%
32 8% 3% 24% 5% 5% 5% 24% 16% 8%
64 7% 3% 24% 6% 4% 6% 26% 16% 9%
128 1% 4% 19% 8% 5% 7% 14% 20% 22%
*7 days 8 11% 11% 11% 11% 11% 11% 11% 11% 11%
16 10% 5% 24% 5% 5% 5% 24% 14% 10%
32 8% 3% 24% 5% 5% 5% 24% 16% 8%
64 7% 3% 24% 6% 4% 6% 26% 16% 9%
**128 1% 4% 19% 8% 5% 7% 14% 20% 22%
14 days 8 11% 11% 11% 11% 11% 11% 11% 11% 11%
16 10% 5% 24% 5% 5% 5% 24% 14% 10%
32 8% 3% 24% 5% 5% 5% 24% 16% 8%
64 7% 3% 24% 6% 4% 6% 26% 16% 9%
128 1% 4% 19% 8% 5% 7% 14% 20% 22%
30 days 8 11% 11% 11% 11% 11% 11% 11% 11% 11%
16 10% 5% 24% 5% 5% 5% 24% 14% 10%
32 8% 3% 24% 5% 5% 5% 24% 16% 8%
64 7% 3% 24% 6% 4% 6% 26% 16% 9%
128 1% 4% 19% 8% 5% 7% 14% 20% 22%

For instance, for configuration of *7 days concurrent run with **128 concurrency, the proposed configuration distributes Question 1 to Question 9 with acceptable weighted submissions such that Question 9 is executed the best variety of occasions within the general 128 executions submitted throughout all 9 queries for this run.

JMeter setup

To start, you could arrange JMeter on a machine that may deal with the specified take a look at load. An EC2 occasion is a versatile and cost-effective choice. Select an occasion sort with enough vCPUs to assist your most deliberate concurrency. For instance, a c6i.4xlarge or increased is usually appropriate for average to excessive throughput testing eventualities. For the working system, you would possibly select Amazon Linux, which is optimized for AWS. For production-grade testing environments, deploy the JMeter EC2 occasion in a non-public subnet of a digital non-public cloud (VPC) with acceptable safety teams that enable solely required connections. This community isolation helps keep safety whereas executing efficiency exams. Think about using Amazon Digital Personal Cloud (Amazon VPC) endpoints for safe entry to AWS providers.

After the occasion is provisioned, set up Java (Java 17 LTS or Java 21 LTS) and obtain the newest model of JMeter. Make sure to configure the system with acceptable JVM choices to allocate enough heap reminiscence for large-scale take a look at executions. Consult with Getting Began to be taught extra.

# Set up Java
sudo yum replace -y # For Amazon Linux
sudo yum set up java-17-amazon-corretto -y

# Obtain JMeter and place the suitable jdbc driver for the engine of your choice below lib folder
wget https://downloads.apache.org//jmeter/binaries/apache-jmeter-5.6.3.tgz
tar -xvzf apache-jmeter-5.6.3.tgz
cd apache-jmeter-5.6.3/lib

# Launch JMeter in GUI mode (if utilizing a GUI-capable setup) or use CLI for distant testing
./bin/jmeter

JMeter ideas

Earlier than you create take a look at plans in JMeter, it’s vital to know a number of foundational ideas that affect how your take a look at plan behaves—equivalent to thread teams, user-defined variables, and JDBC connection. These elements allow the simulation of real-world question masses, together with concurrency and pacing.

Take a look at plans

The take a look at plan is the top-level container for a JMeter take a look at. It defines the general testing technique, together with the queries to execute, their parameters, and the concurrent consumer habits. These plans are represented as jmx recordsdata that may then be used for CLI-based execution. JMeter helps each GUI and CLI modes. It’s extremely beneficial that you just use the JMeter GUI primarily for creating take a look at plans as jmx, and use the CLI for giant load exams. You can even run thread teams consecutively for sequential execution. The default habits is to run all thread teams in parallel suited to concurrent execution. Consult with Constructing a Take a look at Plan to be taught extra about choices obtainable with take a look at plans.

Person-defined variables

Person-defined variables are world parameters which you could reuse all through the take a look at plan. They’re useful for outlining database credentials, server URLs, or question parameters. For instance:DB_URL=jdbc:trino://trino-cluster.instance.com:8889?SSL=true #Allow SSL/TLS

You possibly can configure authentication (consumer title and password) by your group’s permitted strategies, equivalent to AWS Secrets and techniques Supervisor (see Transfer hardcoded secrets and techniques to AWS Secrets and techniques Supervisor) AWS Id and Entry Administration (IAM) roles, or different safe credential administration methods.

Thread teams

A thread group represents a gaggle of digital customers (threads) executing take a look at actions. Every thread simulates a single consumer sending requests to the SQL engine. This can be utilized to simulate concurrent runs. For instance, within the previous template, Question 3 has 19% weightage throughout 128 runs. This implies .19*128=25 whole runs, so we set the thread group to 25.

JDBC connection configuration

JDBC connection configuration units up the database connection for the take a look at. It specifies the database URL, driver, and credentials required for executing SQL queries. Key fields to configure are database URL and JDBC driver class. The next desk summarizes the completely different configuration settings.

 

JDBC requests

The JDBC request executes SQL queries in opposition to the database utilizing the configuration outlined within the JDBC connection configuration.

For instance, following command runs the JMeter in CLI mode:

# Run benchmarks in CLI mode 
./jmeter -n -t .jmx -l .log -e -o /output/

The output folder will comprise an HTML report with completely different statistics. The next screenshot illustrates 128 concurrent runs.

Monitoring and logging

For complete visibility and audit necessities, allow AWS CloudTrail logging, VPC Circulation Logs, and service-specific logs (like Amazon S3 entry logs). These logs might be centralized in Amazon CloudWatch Logs for monitoring and evaluation. This supplies correct audit trails whereas evaluating completely different SQL engines and helps monitor entry patterns and potential safety occasions.

Submit-test steps

After operating your JMeter exams, proceed with the next steps:

  1. Evaluation the HTML report’s key metrics, together with response occasions, throughput, and error charges throughout completely different question varieties and concurrency ranges.
  2. Run an identical take a look at plans throughout your candidate SQL engines for direct efficiency comparability.
  3. Refine your take a look at plans based mostly on preliminary findings, specializing in areas the place efficiency variations are important.
  4. Think about the associated fee implications alongside efficiency metrics to make a balanced resolution.

These steps may help you systematically consider and choose probably the most appropriate SQL engine on your analytics workloads.

Sources

Within the previous steps, we walked by a UI-based setup for JMeter together with take a look at plans. We’ve got created a number of pattern JMeter take a look at plans for each sequential and concurrent runs together with pattern take a look at reviews. You possibly can modify the plans to suit your wants.

  1. JMeter pattern report
  2. JMeter take a look at plan for sequential run
  3. JMeter take a look at plan for concurrent run

Clear up

After you full your benchmarking course of, clear up the assets to keep away from pointless prices:

  1. Cease or delete the EC2 situations used for operating JMeter.
  2. Relying on which SQL engines you used for testing, clear up energetic assets.
  3. Evaluation your AWS Administration Console to substantiate no energetic assets stay.
  4. For those who created take a look at datasets in Amazon S3 or different storage providers particularly for this benchmarking, think about deleting them in the event that they’re now not wanted.
  5. Though JMeter take a look at plans and outcomes don’t incur AWS prices, set up or delete native recordsdata as wanted on your record-keeping.

Abstract

Deciding on the precise SQL processing resolution for large-scale analytics calls for a scientific, data-driven method. Our JMeter framework may help organizations successfully consider completely different SQL engines by simulating real-world workload patterns throughout varied question varieties, information volumes, and concurrency ranges. This technique reduces proof-of-concept cycles and supplies insights past conventional benchmarks, serving to you assess managed AWS providers like Athena and Amazon Redshift and open supply options on Amazon EKS.


In regards to the authors

Anubhav Awasthi

Anubhav Awasthi

Anubhav is a Senior Huge Information Specialist Options Architect at Amazon Net Companies (AWS). He collaborates with clients to offer skilled architectural steering for implementing and optimizing analytics options utilizing Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation.

Gagan Brahmi

Gagan Brahmi

Gagan is a Specialist Senior Options Architect at Amazon Net Companies (AWS), targeted on Information Analytics and AI/ML. With over 20 years in data expertise, he companions with clients to unravel advanced AI/ML challenges by leveraging information and AI/ML platforms. Gagan helps clients architect scalable, high-performance options that make the most of distributed information processing, real-time streaming applied sciences, and AI/ML providers to drive enterprise transformation by synthetic intelligence and data-driven insights. When not designing cloud-native information and AI options, Gagan enjoys exploring new locations along with his household.

Jayaprakash Boreddy

Jayaprakash Boreddy

Jayaprakash is a Senior Options Architect at AWS. He works with ISV clients in designing and constructing extremely scalable, versatile and resilient purposes on AWS Cloud.

Sahil Thapar

Sahil Thapar

Sahil is a Principal Options Architect. He works with ISV clients to assist them construct extremely obtainable, scalable, and resilient purposes on the AWS Cloud.

Deixe um comentário

O seu endereço de e-mail não será publicado. Campos obrigatórios são marcados com *