Apache Spark encryption performance improvement with Amazon EMR 7.9

The Amazon EMR runtime for Apache Spark is a performance-optimized runtime that’s 100% API compatible with open source Apache Spark. With Amazon EMR release 7.9.0, the EMR runtime for Apache Spark introduces significant performance improvements for encrypted workloads, supporting Spark version 3.5.5.

For compliance and security requirements, many customers need to enable Apache Spark’s local storage encryption (spark.io.encryption.enabled = true) in addition to Amazon Simple Storage Service (Amazon S3) encryption (such as server-side encryption (SSE) or AWS Key Management Service (AWS KMS)). This feature encrypts shuffle files, cached data, and other intermediate data written to local disk during Spark operations, protecting sensitive data at rest on Amazon EMR cluster instances.

Industries subject to regulations such as the Health Insurance Portability and Accountability Act (HIPAA) for healthcare, Payment Card Industry Data Security Standard (PCI-DSS) for financial services, General Data Protection Regulation (GDPR) for personal data, and Federal Risk and Authorization Management Program (FedRAMP) for government often require encryption of all data at rest, including temporary files on local storage. While Amazon S3 encryption protects data in object storage, Spark’s I/O encryption secures the intermediate shuffle and spill data that Spark writes to local disk during distributed processing. This data never reaches Amazon S3 but might contain sensitive information extracted from source datasets. Typically, encrypted operations require additional computational overhead that can impact overall job performance.

With the built-in encryption optimizations of Amazon EMR 7.9.0, customers might see significant performance improvements in their Apache Spark applications without requiring any application changes. In our performance benchmark tests, derived from TPC-DS performance tests at 3 TB scale, we observed up to 20% faster performance with the EMR 7.9 optimized Spark runtime compared to Spark without these optimizations. Individual results may vary depending on specific workloads and configurations.

In this post, we analyze the results from our benchmark tests comparing the Amazon EMR 7.9 optimized Spark runtime against Spark 3.5.5 without encryption optimizations. We walk through a detailed cost analysis and provide step-by-step instructions to reproduce the benchmark.

Results observed

To evaluate the performance improvements, we used an open source Spark performance test utility derived from the TPC-DS performance test toolkit. We ran the tests on two nine-node (eight core nodes and one primary node) r5d.4xlarge Amazon EMR 7.9.0 clusters, comparing two configurations:

Baseline: EMR 7.9.0 cluster with a bootstrap action installing Spark 3.5.5 without encryption optimizations
Optimized: EMR 7.9.0 cluster using the EMR Spark 3.5.5 runtime with encryption optimizations

Both tests used data stored in Amazon S3. All data processing was configured identically except for the Spark runtime version.

To ensure a consistent, equal comparison, we disabled Dynamic Resource Allocation (DRA) in both test configurations. This approach eliminates variability from dynamic scaling so that we can measure pure computational performance improvements.
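As a rough illustration of how such settings are usually passed to a Spark step, the following Python sketch flattens a dictionary of Spark properties into the alternating `--conf`, `key=value` items used in a step's argument list. The property spark.io.encryption.enabled comes from this post; spark.dynamicAllocation.enabled is the standard Spark property for dynamic allocation, and using it here is an assumption about how DRA might be disabled, not a record of the exact benchmark setup. The helper name `conf_args` is hypothetical.

```python
def conf_args(settings):
    """Flatten Spark properties into alternating "--conf", "key=value" items."""
    args = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


benchmark_conf = {
    "spark.io.encryption.enabled": "true",       # property named in this post
    "spark.dynamicAllocation.enabled": "false",  # assumed way to disable DRA
}

print(conf_args(benchmark_conf))
```

The resulting list can be placed ahead of the `--class` argument in a Spark step's Args.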

The following table shows the total job runtime for all queries (in seconds) in the 3 TB query dataset for the baseline and Amazon EMR 7.9 optimized configurations:

Configuration	Total runtime (seconds)	Geometric mean (seconds)	Performance improvement
Baseline (Spark 3.5.5 without optimization)	1,485	10.24
EMR 7.9 (with encryption optimization)	1,176	8.15	20% faster

We observed that our TPC-DS tests with the Amazon EMR 7.9 optimized Spark runtime completed about 20% faster based on both total runtime and geometric mean compared to the baseline configuration.
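The percentages follow directly from the figures in the table. As a quick check, this Python sketch recomputes the relative improvement from the published totals. The function name `percent_faster` is illustrative and not part of the benchmark tooling.

```python
def percent_faster(baseline, optimized):
    """Return the runtime reduction of optimized relative to baseline, in percent."""
    return (baseline - optimized) / baseline * 100.0


# Figures taken from the results table in this post.
total_runtime = percent_faster(1485, 1176)
geometric_mean = percent_faster(10.24, 8.15)

print(f"Total runtime: {total_runtime:.1f}% faster")    # 20.8% faster
print(f"Geometric mean: {geometric_mean:.1f}% faster")  # 20.4% faster
```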

The encryption optimizations in Amazon EMR 7.9 deliver performance benefits through:

Improved shuffle and decryption operations, reducing overhead during data exchange without compromising security
Better memory management for intermediate results

Cost analysis

The performance improvements of the Amazon EMR 7.9 optimized Spark runtime directly translate to lower costs. We realized approximately 20% cost savings running the benchmark application with encryption optimizations compared to the baseline configuration, because of reduced hours of Amazon EMR, Amazon Elastic Compute Cloud (Amazon EC2), and Amazon Elastic Block Store (Amazon EBS) using General Purpose SSD (gp2).

The following table summarizes the cost comparison in the us-east-1 AWS Region:

Configuration	Runtime (hours)	Estimated cost	Total EC2 instances	Total vCPU	Total memory (GiB)	Root device (EBS)
Baseline: Spark 3.5.5 without optimization, 1 primary and 8 core nodes	0.41	$5.28	9	144	1152	64 GiB gp2
Amazon EMR 7.9 with optimization, 1 primary and 8 core nodes	0.33	$4.25	9	144	1152	64 GiB gp2

Cost breakdown

Formulas used:

Amazon EMR cost – Number of instances × EMR hourly rate × Runtime hours
Amazon EC2 cost – Number of instances × EC2 hourly rate × Runtime hours
Amazon EBS cost – (EBS price per GB per month ÷ Hours in a month) × EBS volume size × Number of instances × Runtime hours

Note: EBS is priced monthly ($0.1 per GB per month), so we divide by 730 hours to convert to an hourly rate. EMR and EC2 are already priced hourly, so no conversion is needed.

Baseline configuration (0.41 hours):

Amazon EMR cost – 9 × $0.27 × 0.41 = $1.00
Amazon EC2 cost – 9 × $1.152 × 0.41 = $4.25
Amazon EBS cost – $0.1/730 × 64 × 9 × 0.41 = $0.032
Total cost – $5.28

EMR 7.9 optimized configuration (0.33 hours):

Amazon EMR cost – 9 × $0.27 × 0.33 = $0.80
Amazon EC2 cost – 9 × $1.152 × 0.33 = $3.42
Amazon EBS cost – $0.1/730 × 64 × 9 × 0.33 = $0.026
Total cost – $4.25

Total cost savings: approximately 20% per benchmark run, which scales linearly with your production workload frequency.
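The breakdown can be reproduced with a few lines of Python. This is a minimal sketch that uses only the rates and runtimes quoted in this post ($0.27 EMR and $1.152 EC2 per instance-hour, $0.1 per GB-month for EBS, 730 hours per month). The function and variable names are illustrative, and current pricing in your Region may differ.

```python
HOURS_PER_MONTH = 730  # conversion used in this post for monthly EBS pricing


def benchmark_cost(runtime_hours, instances=9, emr_rate=0.27, ec2_rate=1.152,
                   ebs_gb_month=0.10, ebs_gb=64):
    """Estimate the cost of one benchmark run with the formulas from this post."""
    emr = instances * emr_rate * runtime_hours
    ec2 = instances * ec2_rate * runtime_hours
    ebs = ebs_gb_month / HOURS_PER_MONTH * ebs_gb * instances * runtime_hours
    return {"emr": emr, "ec2": ec2, "ebs": ebs, "total": emr + ec2 + ebs}


baseline = benchmark_cost(0.41)
optimized = benchmark_cost(0.33)
savings_pct = (baseline["total"] - optimized["total"]) / baseline["total"] * 100

print(f"Baseline:  ${baseline['total']:.2f}")   # $5.28
print(f"Optimized: ${optimized['total']:.2f}")  # $4.25
print(f"Savings:   {savings_pct:.1f}%")         # 19.5%
```

Because every cost component is proportional to runtime, the savings percentage equals the runtime reduction, (0.41 − 0.33) / 0.41, regardless of the hourly rates used.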

Set up EMR benchmarking

For detailed instructions and scripts, see the companion GitHub repository.

Prerequisites

To set up Amazon EMR benchmarking, start by completing the following prerequisite steps:

Configure your AWS Command Line Interface (AWS CLI) by running aws configure to point to your benchmarking account.
Create an S3 bucket for test data and results.
Copy the TPC-DS 3TB source data from a publicly available dataset to your S3 bucket using the following command:
```
aws s3 cp s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned s3:///BLOG_TPCDS-TEST-3T-partitioned --recursive
```
Add the name of the S3 bucket you created in the previous step to the destination path, which is shown here without a bucket name.
Build or download the benchmark application JAR file (spark-benchmark-assembly-3.3.0.jar).
Ensure you have appropriate AWS Identity and Access Management (IAM) roles for EMR cluster creation and Amazon S3 access.

Deploy the baseline EMR cluster (without optimization)

Step 1: Launch EMR 7.9.0 cluster with bootstrap action

The baseline configuration uses a bootstrap action to install Spark 3.5.5 without encryption optimizations. We have made the bootstrap script publicly available in an S3 bucket for your convenience.

Create the default Amazon EMR roles:

aws emr create-default-roles

Now create the cluster:

# The default roles are passed explicitly; the AWS CLI does not accept
# --use-default-roles together with --service-role or InstanceProfile.
aws emr create-cluster \
  --name "EMR-7.9-Baseline-Spark-3.5.5" \
  --release-label emr-7.9.0 \
  --applications Name=Spark \
  --ec2-attributes SubnetId=,InstanceProfile=EMR_EC2_DefaultRole \
  --service-role EMR_DefaultRole \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r5d.4xlarge \
    InstanceGroupType=CORE,InstanceCount=8,InstanceType=r5d.4xlarge \
  --bootstrap-actions \
    Path=s3://spark-ba/install-spark-3-5-5-no-encryption.sh,Name="install spark 3.5.5 without encryption optimization" \
  --log-uri s3:///logs/baseline/

Note: The bootstrap script is available in a public S3 bucket at s3://spark-ba/install-spark-3-5-5-no-encryption.sh. This script installs Apache Spark 3.5.5 without the encryption optimizations present in the Amazon EMR runtime.

Step 2: Submit the benchmark job to the baseline cluster

Next, submit the Spark job using the following command:

aws emr add-steps \
  --cluster-id \
  --steps 'Type=Spark,Name="EMR-7.9-Baseline-Spark-3.5.5 Step",ActionOnFailure=CONTINUE,Args=["--deploy-mode","client","--conf","spark.io.encryption.enabled=false","--class","com.amazonaws.eks.tpcds.BenchmarkSQL","s3:///jar/spark-benchmark-assembly-3.3.0.jar","s3:///blog/BLOG_TPCDS-TEST-3T-partitioned","s3:///blog/BASELINE_TPCDS-TEST-3T-RESULT","/opt/tpcds-kit/tools","parquet","3000","3","false","q1-v2.4,q10-v2.4,q11-v2.4,q12-v2.4,q13-v2.4,q14a-v2.4,q14b-v2.4,q15-v2.4,q16-v2.4,q17-v2.4,q18-v2.4,q19-v2.4,q2-v2.4,q20-v2.4,q21-v2.4,q22-v2.4,q23a-v2.4,q23b-v2.4,q24a-v2.4,q24b-v2.4,q25-v2.4,q26-v2.4,q27-v2.4,q28-v2.4,q29-v2.4,q3-v2.4,q30-v2.4,q31-v2.4,q32-v2.4,q33-v2.4,q34-v2.4,q35-v2.4,q36-v2.4,q37-v2.4,q38-v2.4,q39a-v2.4,q39b-v2.4,q4-v2.4,q40-v2.4,q41-v2.4,q42-v2.4,q43-v2.4,q44-v2.4,q45-v2.4,q46-v2.4,q47-v2.4,q48-v2.4,q49-v2.4,q5-v2.4,q50-v2.4,q51-v2.4,q52-v2.4,q53-v2.4,q54-v2.4,q55-v2.4,q56-v2.4,q57-v2.4,q58-v2.4,q59-v2.4,q6-v2.4,q60-v2.4,q61-v2.4,q62-v2.4,q63-v2.4,q64-v2.4,q65-v2.4,q66-v2.4,q67-v2.4,q68-v2.4,q69-v2.4,q7-v2.4,q70-v2.4,q71-v2.4,q72-v2.4,q73-v2.4,q74-v2.4,q75-v2.4,q76-v2.4,q77-v2.4,q78-v2.4,q79-v2.4,q8-v2.4,q80-v2.4,q81-v2.4,q82-v2.4,q83-v2.4,q84-v2.4,q85-v2.4,q86-v2.4,q87-v2.4,q88-v2.4,q89-v2.4,q9-v2.4,q90-v2.4,q91-v2.4,q92-v2.4,q93-v2.4,q94-v2.4,q95-v2.4,q96-v2.4,q97-v2.4,q98-v2.4,q99-v2.4,ss_max-v2.4","true"]'

Deploy the optimized EMR cluster (with encryption optimization)

Step 1: Launch EMR 7.9.0 cluster with Spark runtime

The optimized configuration uses the EMR 7.9.0 Spark runtime without any bootstrap actions:

# The default roles are passed explicitly; the AWS CLI does not accept
# --use-default-roles together with --service-role or InstanceProfile.
aws emr create-cluster \
  --name "EMR-7.9-Optimized-Native-Spark" \
  --release-label emr-7.9.0 \
  --applications Name=Spark \
  --ec2-attributes SubnetId=,InstanceProfile=EMR_EC2_DefaultRole \
  --service-role EMR_DefaultRole \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r5d.4xlarge \
    InstanceGroupType=CORE,InstanceCount=8,InstanceType=r5d.4xlarge \
  --log-uri s3:///logs/optimized/

Example:

aws emr create-cluster \
  --name "EMR-7.9-Optimized-Native-Spark" \
  --release-label emr-7.9.0 \
  --applications Name=Spark \
  --ec2-attributes SubnetId=subnet-08a5f71f92bc8a801 \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r5d.4xlarge \
    InstanceGroupType=CORE,InstanceCount=8,InstanceType=r5d.4xlarge \
  --use-default-roles \
  --log-uri s3://aws-logs-123456789012-us-west-2/elasticmapreduce/

Step 2: Submit the benchmark job to the optimized cluster

Next, submit the Spark job using the following command:

aws emr add-steps \
  --cluster-id \
  --steps 'Type=Spark,Name="EMR-7.9-Optimized-Native-Spark Step",ActionOnFailure=CONTINUE,Args=["--deploy-mode","client","--conf","spark.io.encryption.enabled=true","--class","com.amazonaws.eks.tpcds.BenchmarkSQL","s3:///jar/spark-benchmark-assembly-3.3.0.jar","s3:///blog/BLOG_TPCDS-TEST-3T-partitioned","s3:///blog/OPTIMIZED_TPCDS-TEST-3T-RESULT","/opt/tpcds-kit/tools","parquet","3000","3","false","q1-v2.4,q10-v2.4,q11-v2.4,q12-v2.4,q13-v2.4,q14a-v2.4,q14b-v2.4,q15-v2.4,q16-v2.4,q17-v2.4,q18-v2.4,q19-v2.4,q2-v2.4,q20-v2.4,q21-v2.4,q22-v2.4,q23a-v2.4,q23b-v2.4,q24a-v2.4,q24b-v2.4,q25-v2.4,q26-v2.4,q27-v2.4,q28-v2.4,q29-v2.4,q3-v2.4,q30-v2.4,q31-v2.4,q32-v2.4,q33-v2.4,q34-v2.4,q35-v2.4,q36-v2.4,q37-v2.4,q38-v2.4,q39a-v2.4,q39b-v2.4,q4-v2.4,q40-v2.4,q41-v2.4,q42-v2.4,q43-v2.4,q44-v2.4,q45-v2.4,q46-v2.4,q47-v2.4,q48-v2.4,q49-v2.4,q5-v2.4,q50-v2.4,q51-v2.4,q52-v2.4,q53-v2.4,q54-v2.4,q55-v2.4,q56-v2.4,q57-v2.4,q58-v2.4,q59-v2.4,q6-v2.4,q60-v2.4,q61-v2.4,q62-v2.4,q63-v2.4,q64-v2.4,q65-v2.4,q66-v2.4,q67-v2.4,q68-v2.4,q69-v2.4,q7-v2.4,q70-v2.4,q71-v2.4,q72-v2.4,q73-v2.4,q74-v2.4,q75-v2.4,q76-v2.4,q77-v2.4,q78-v2.4,q79-v2.4,q8-v2.4,q80-v2.4,q81-v2.4,q82-v2.4,q83-v2.4,q84-v2.4,q85-v2.4,q86-v2.4,q87-v2.4,q88-v2.4,q89-v2.4,q9-v2.4,q90-v2.4,q91-v2.4,q92-v2.4,q93-v2.4,q94-v2.4,q95-v2.4,q96-v2.4,q97-v2.4,q98-v2.4,q99-v2.4,ss_max-v2.4","true"]'

Benchmark command parameters explained

The Amazon EMR Spark step uses the following parameters:

EMR step configuration:
- Type=Spark: Specifies that this is a Spark application step
- Name="EMR-7.9-Baseline-Spark-3.5.5 Step": Human-readable name for the step
- ActionOnFailure=CONTINUE: Continue with other steps if this one fails
Spark submit arguments:
- --deploy-mode client: Run the driver on the primary node (not cluster mode)
- --class com.amazonaws.eks.tpcds.BenchmarkSQL: Main class for the TPC-DS benchmark
Application parameters:
- JAR file: s3:///jar/spark-benchmark-assembly-3.3.0.jar
- Input data: s3:///blog/BLOG_TPCDS-TEST-3T-partitioned (3 TB TPC-DS dataset)
- Output location: s3:///blog/BASELINE_TPCDS-TEST-3T-RESULT (S3 path for results)
- TPC-DS tools path: /opt/tpcds-kit/tools (local path on EMR nodes)
- Format: parquet (output format)
- Scale factor: 3000 (3 TB dataset size)
- Iterations: 3 (run each query 3 times for averaging)
- Collect results: false (don’t collect results to driver)
- Query list: "q1-v2.4,q10-v2.4,...,ss_max-v2.4" (all 104 queries)
- Final parameter: true (enable detailed logging and metrics)
Query coverage:
- 103 TPC-DS benchmark queries (q1-v2.4 through q99-v2.4, with a and b variants of q14, q23, q24, and q39)
- Plus the ss_max-v2.4 query for additional testing, for 104 queries in total
- Each query runs 3 times to calculate average performance
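The query list is long enough that typing it by hand is error-prone. The following Python sketch generates it instead, assuming the naming pattern visible in the step command: q1 through q99 with a `-v2.4` suffix, a and b variants for q14, q23, q24, and q39, and a final ss_max entry. The function name `tpcds_query_names` is hypothetical, and sorting the names lexicographically appears to reproduce the ordering used in the command.

```python
def tpcds_query_names(version="v2.4"):
    """Build the query names used by the benchmark step, in numeric order."""
    split = {14, 23, 24, 39}  # queries that appear as "a" and "b" variants
    names = []
    for n in range(1, 100):
        suffixes = ("a", "b") if n in split else ("",)
        names.extend(f"q{n}{s}-{version}" for s in suffixes)
    names.append(f"ss_max-{version}")
    return names


queries = tpcds_query_names()

# Lexicographic order: q1, q10, q11, ..., q19, q2, q20, ..., q99, ss_max
query_arg = ",".join(sorted(queries))

print(len(queries))  # 104
print(query_arg[:35] + "...")
```

The resulting `query_arg` string can be pasted as the query list argument of the step, or trimmed to a subset of queries for a shorter test run.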

Summarize the results

Download the test result files from both output S3 locations:

# Baseline results
aws s3 cp s3:///blog/BASELINE_TPCDS-TEST-3T-RESULT/timestamp=xxxx/summary.csv/xxx.csv ./baseline-results.csv

# Optimized results
aws s3 cp s3:///blog/OPTIMIZED_TPCDS-TEST-3T-RESULT/timestamp=xxxx/summary.csv/xxx.csv ./optimized-results.csv

The CSV files contain four columns (without headers):
- Query name
- Median time (seconds)
- Minimum time (seconds)
- Maximum time (seconds)
Calculate performance metrics for comparison:
- Average time per query: AVERAGE(median, min, max) for each query
- Total runtime: Sum of all median times
- Geometric mean: GEOMEAN(average times) across all queries
- Speedup: Calculate the relative improvement between baseline and optimized for each query
Create comparison analysis: Speedup = (Baseline Time - Optimized Time) / Baseline Time × 100%
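The metrics above can be computed with the Python standard library. The following is a minimal sketch, assuming the four-column headerless layout described above. The two inline CSV strings are made-up sample rows for illustration only, not benchmark results, and the function names are hypothetical.

```python
import csv
import io
from statistics import geometric_mean


def load_results(csv_text):
    """Parse headerless rows of: query name, median, min, max (seconds)."""
    rows = {}
    for name, median, low, high in csv.reader(io.StringIO(csv_text)):
        rows[name] = (float(median), float(low), float(high))
    return rows


def summarize(rows):
    """Total runtime is the sum of medians; geomean is over per-query averages."""
    averages = [sum(times) / 3 for times in rows.values()]
    total = sum(median for median, _, _ in rows.values())
    return {"total_runtime": total, "geomean": geometric_mean(averages)}


def improvement_pct(baseline, optimized):
    """Speedup as defined above: (baseline - optimized) / baseline * 100."""
    return (baseline - optimized) / baseline * 100


# Made-up sample rows, for illustration only.
baseline_csv = "q1-v2.4,10.0,9.0,11.0\nq2-v2.4,20.0,18.0,22.0\n"
optimized_csv = "q1-v2.4,8.0,7.0,9.0\nq2-v2.4,16.0,15.0,17.0\n"

base = summarize(load_results(baseline_csv))
opt = summarize(load_results(optimized_csv))

for metric in ("total_runtime", "geomean"):
    print(metric, f"{improvement_pct(base[metric], opt[metric]):.1f}% faster")
```

To use it with the downloaded files, read their contents (for example, `open("baseline-results.csv").read()`) and pass the text to `load_results` in place of the sample strings.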

Testing configuration details

The following table summarizes the test environment used for this post:

Parameter	Value
EMR release	emr-7.9.0 (both configurations)
Baseline Spark version	3.5.5 (installed using bootstrap action)
Baseline bootstrap script	s3://spark-ba/install-spark-3-5-5-no-encryption.sh (public)
Optimized Spark version	Amazon EMR Spark runtime
Cluster size	9 nodes (1 primary and 8 core)
Instance type	r5d.4xlarge
vCPUs per node	16
Memory per node	128 GB
Instance storage	600 GB SSD
EBS volume	64 GB gp2 (2 volumes per instance)
Total vCPUs	144 (9 × 16)
Total memory	1152 GB (9 × 128)
Dataset	TPC-DS 3TB (Parquet format)
Queries	104 queries (TPC-DS v2.4)
Iterations	3 runs per query
DRA	Disabled for consistent benchmarking

Clean up

To avoid incurring future charges, delete the resources you created:

Terminate both EMR clusters:

aws emr terminate-clusters --cluster-ids

Delete S3 test results if no longer needed:

aws s3 rm s3:///blog/BASELINE_TPCDS-TEST-3T-RESULT/ --recursive
aws s3 rm s3:///blog/OPTIMIZED_TPCDS-TEST-3T-RESULT/ --recursive
aws s3 rm s3:///logs/ --recursive

Remove IAM roles if created specifically for testing

Key findings

Up to 20% performance improvement using the Amazon EMR 7.9 Spark runtime with no code changes required
Approximately 20% cost savings because of reduced runtime
Significant gains for shuffle-heavy, join-intensive workloads
100% API compatibility with open source Apache Spark
Simple migration from custom Spark builds to EMR runtime
Easy benchmarking using publicly available bootstrap scripts

Conclusion

You can run your Apache Spark workloads up to 20% faster and at lower cost without making any changes to your applications by using the Amazon EMR 7.9.0 optimized Spark runtime. This improvement is achieved through numerous optimizations in the EMR Spark runtime, including enhanced encryption handling, improved data serialization, and optimized shuffle operations.

To learn more about Amazon EMR 7.9 and best practices, see the EMR documentation. For configuration guidance and tuning advice, subscribe to the AWS Big Data Blog.

If you’re running Spark workloads on Amazon EMR today, we encourage you to test the EMR 7.9 Spark runtime with your production workloads and measure the improvements specific to your use case.
