Using AWS Glue Data Catalog views with Apache Spark in EMR Serverless and Glue 5.0

The AWS Glue Data Catalog has expanded its Data Catalog views feature, and now supports Apache Spark environments in addition to Amazon Athena and Amazon Redshift. This enhancement, launched in March 2025, makes it possible to create, share, and query multi-engine SQL views across Amazon EMR Serverless, Amazon EMR on Amazon EKS, and AWS Glue 5.0 Spark, as well as Athena and Amazon Redshift Spectrum. The multi-dialect views empower data teams to create SQL views one time and query them through supported engines, whether it’s Athena for ad-hoc analytics, Amazon Redshift for data warehousing, or Spark for large-scale data processing. This cross-engine compatibility means data engineers can focus on building data products rather than managing multiple view definitions or complex permission schemes. Using AWS Lake Formation permissions, organizations can share these views within the same AWS account, across different AWS accounts, and with AWS IAM Identity Center users and groups, without granting direct access to the underlying tables. Features of Lake Formation such as fine-grained access control (FGAC) using Lake Formation tag-based access control (LF-TBAC) can be applied to Data Catalog views, enabling scalable sharing and access control across organizations.

In an earlier blog post, we demonstrated the creation of Data Catalog views using Athena, adding a SQL dialect for Amazon Redshift, and querying the view using Athena and Amazon Redshift. In this post, we guide you through the process of creating a Data Catalog view using EMR Serverless, adding the SQL dialect to the view for Athena, sharing it with another account using LF-Tags, and then querying the view in the recipient account using a separate EMR Serverless workspace, an AWS Glue 5.0 Spark job, and Athena. This demonstration showcases the versatility and cross-account capabilities of Data Catalog views and access through various AWS analytics services.

Benefits of Data Catalog views

The following are key benefits of Data Catalog views for enterprise solutions:

Targeted data sharing and access control – Data Catalog views, combined with the sharing capabilities of Lake Formation, enable organizations to provide specific data subsets to different teams or departments without duplicating data. For example, a retail company can create views that show sales data to regional managers while restricting access to sensitive customer information. By applying LF-TBAC to these views, companies can efficiently manage data access across large, complex organizational structures, maintaining compliance with data governance policies while promoting data-driven decision-making.
Multi-service analytics integration – The ability to create a view in one analytics service and query it across Athena, Amazon Redshift, EMR Serverless, and AWS Glue 5.0 Spark breaks down data silos and promotes a unified analytics approach. This feature allows businesses to use the strengths of different services for various analytics needs. For instance, a financial institution could create a view of transaction data and use Athena for ad-hoc queries, Amazon Redshift for complex aggregations, and EMR Serverless for large-scale data processing, all without moving or duplicating the data. This flexibility accelerates insights and improves resource utilization across the analytics stack.
Centralized auditing and compliance – With views stored in the central Data Catalog, businesses can maintain a comprehensive audit trail of data access across connected accounts using AWS CloudTrail logs. This centralization is crucial for industries with strict regulatory requirements, such as healthcare or finance. Compliance officers can monitor and report on data access patterns, detect unusual activities, and demonstrate adherence to data protection regulations like GDPR or HIPAA. This centralized approach simplifies compliance processes and reduces the risk of regulatory violations.

These capabilities of Data Catalog views provide powerful solutions for businesses to enhance data governance, improve analytics efficiency, and maintain robust compliance measures across their data ecosystem.

Solution overview

An example company has multiple datasets containing their customers’ purchase details mixed with personally identifiable information (PII) data. They categorize their datasets based on the sensitivity of the information. The data steward wants to share a subset of their preferred customers’ data for further analysis downstream by their data engineering team.

To demonstrate this use case, we use sample Apache Iceberg tables customer and customer_address. We create a Data Catalog view from these two tables to filter by preferred customers. We then use LF-Tags to share restricted columns of this view to the downstream engineering team. The solution is represented in the following diagram.

Prerequisites

To implement this solution, you need two AWS accounts with an AWS Identity and Access Management (IAM) admin role. We use the role to run the provided AWS CloudFormation templates, and the same IAM role is also added as the Lake Formation administrator.

Set up infrastructure in the producer account

We provide a CloudFormation template that deploys the following resources and completes the data lake setup:

Two Amazon Simple Storage Service (Amazon S3) buckets: one for scripts, logs, and query results, and one for the data lake storage.
Lake Formation administrator and catalog settings. The IAM admin role that you provide is registered as Lake Formation administrator. Cross-account sharing version is set to 4. Default permissions for newly created databases and tables are set to use Lake Formation permissions only.
An IAM role with read, write, and delete permissions on the data lake bucket objects. The data lake bucket is registered with Lake Formation using this IAM role.
An AWS Glue database for the data lake.
Lake Formation tags. These tags are attached to the database.
CSV and Iceberg format tables in the AWS Glue database. The CSV tables point to s3://redshift-downloads/TPC-DS/2.13/10GB/ and the Iceberg tables are stored in the user account’s data lake bucket.
An Athena workgroup.
An IAM role and an AWS Lambda function to run Athena queries. Athena queries are run in the Athena workgroup to insert data from CSV tables to Iceberg tables. Relevant Lake Formation permissions are granted to the Lambda role.
An EMR Studio and related virtual private cloud (VPC), subnet, routing table, security groups, and EMR Studio service IAM role.
An IAM role with policies for the EMR Studio runtime. Relevant Lake Formation permissions are granted to this role on the Iceberg tables. This role is also used as the definer role to create the Data Catalog view. A definer role is the IAM role that has the necessary permissions to access the referenced tables and that runs the SQL statement defining the view.

Complete the following steps in your producer AWS account:

Sign in to the AWS Management Console as an IAM administrator role.
Launch the CloudFormation stack.

Allow approximately 5 minutes for the CloudFormation stack to complete creation. After the stack has finished launching, proceed with the following instructions.

If you’re using the producer account in Lake Formation for the first time, on the Lake Formation console, create a database named default and grant describe permission on the default database to runtime role GlueViewBlog-EMRStudio-RuntimeRole.

Create an EMR Serverless application

Complete the following steps to create an EMR Serverless application in your EMR Studio:

On the Amazon EMR console, under EMR Studio in the navigation pane, choose Studios.
Choose GlueViewBlog-emrstudio and choose the URL link of the Studio to open it.
On the EMR Studio dashboard, choose Create application.

You’ll be directed to the Create application page on EMR Studio. Let’s create a Lake Formation enabled EMR Serverless application.

Under Application settings, provide the following information:
1. For Name, enter a name (for example, emr-glueview-application).
2. For Type, choose Spark.
3. For Release version, choose emr-7.8.0.
4. For Architecture, choose x86_64.
Under Application setup options, select Use custom settings.
Under Interactive endpoint, select Enable endpoint for EMR studio.
Under Additional configurations, for Metastore configuration, select Use AWS Glue Data Catalog as metastore, then select Use Lake Formation for fine-grained access control.
Under Network connections, choose emrs-vpc for VPC, enter any two private subnets, and enter emr-serverless-sg for Security groups.
Choose Create and start the application.
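For readers who prefer scripting over the console, the settings above can be sketched as a request for the EMR Serverless CreateApplication API. This is a minimal sketch, not the post's own method: the subnet and security group IDs are hypothetical placeholders, and the use of the spark-defaults property spark.emr-serverless.lakeformation.enabled to enable Lake Formation is an assumption to verify against the EMR Serverless documentation.

```python
def build_create_application_request(name, subnet_ids, security_group_ids):
    """Build keyword arguments mirroring the console choices above."""
    return {
        "name": name,
        "type": "SPARK",
        "releaseLabel": "emr-7.8.0",
        "architecture": "X86_64",
        # Corresponds to "Enable endpoint for EMR studio".
        "interactiveConfiguration": {"studioEnabled": True},
        # Assumption: this spark-defaults property turns on Lake Formation
        # fine-grained access control for the application.
        "runtimeConfiguration": [
            {
                "classification": "spark-defaults",
                "properties": {
                    "spark.emr-serverless.lakeformation.enabled": "true"
                },
            }
        ],
        "networkConfiguration": {
            "subnetIds": list(subnet_ids),
            "securityGroupIds": list(security_group_ids),
        },
    }


# Hypothetical subnet and security group IDs, for illustration only.
request = build_create_application_request(
    "emr-glueview-application",
    ["subnet-aaaa1111", "subnet-bbbb2222"],
    ["sg-cccc3333"],
)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("emr-serverless").create_application(**request)
```

Building the request as a plain dictionary keeps the sketch runnable without AWS credentials and makes the settings easy to review before anything is created.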

Create an EMR Workspace

Complete the following steps to create an EMR Workspace:

On the EMR Studio console, choose Workspaces in the navigation pane and choose Create Workspace.
Enter a Workspace name (for example, emrs-glueviewblog-workspace).
Leave all other settings as default and choose Create Workspace.
Choose Launch Workspace. Your browser might ask you to allow pop-ups the first time you launch the Workspace.
After the Workspace is launched, in the navigation pane, choose Compute.
For Compute type, select EMR Serverless application and enter emr-glueview-application for the application and GlueViewBlog-EMRStudio-RuntimeRole for Interactive runtime role.
Make sure that the kernel attached to the Workspace is PySpark.

Create a Data Catalog view and verify

Complete the following steps:

Download the notebook glueviewblog_producer.ipynb. The code creates a Data Catalog view customer_nonpii_view from the two Iceberg tables, customer_iceberg and customer_address_iceberg, in the database glueviewblog__db.
In your EMR Workspace emrs-glueviewblog-workspace, go to the File browser section and choose Upload files.
Upload glueviewblog_producer.ipynb.
Update the data lake bucket name, AWS account ID, and AWS Region to match your resources.
Update the database_name, table1_name, and table2_name to match your resources.
Save the notebook.
Choose the double arrow icon to restart the kernel and rerun the notebook.

The Data Catalog view customer_nonpii_view is created and verified.
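The notebook itself is not reproduced here, so the following is only a sketch of the kind of Spark SQL statement such a notebook could run. It assumes the CREATE PROTECTED MULTI DIALECT VIEW ... SECURITY DEFINER syntax and reuses the columns and filter from the Athena dialect shown later in this post. The helper name build_create_view_sql is hypothetical.

```python
def build_create_view_sql(database_name, view_name, table1_name, table2_name):
    """Return a Spark SQL statement that defines a multi-dialect view."""
    return (
        f"CREATE PROTECTED MULTI DIALECT VIEW {database_name}.{view_name} "
        "SECURITY DEFINER AS "
        "SELECT c_customer_id, c_customer_sk, c_last_review_date, "
        "ca_country, ca_location_type "
        f"FROM {database_name}.{table1_name}, {database_name}.{table2_name} "
        "WHERE c_current_addr_sk = ca_address_sk "
        "AND c_preferred_cust_flag = 'Y'"
    )


create_view_sql = build_create_view_sql(
    "glueviewblog__db",
    "customer_nonpii_view",
    "customer_iceberg",
    "customer_address_iceberg",
)
# In a PySpark kernel attached to the EMR Serverless application:
#   spark.sql(create_view_sql)
#   spark.sql("SELECT * FROM glueviewblog__db.customer_nonpii_view LIMIT 10").show()
```

Because the view uses SECURITY DEFINER, the role that runs the statement (the runtime role in this walkthrough) becomes the definer and needs access to both referenced tables.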

In the navigation pane on the Lake Formation console, under Data Catalog, choose Views.
Choose the new view customer_nonpii_view.
On the SQL definitions tab, verify EMR with Apache Spark shows up for Engine name.
Choose the tab LF-Tags. The view should show the LF-Tag sensitivity=pii-confidential inherited from the database.
Choose Edit LF-Tags.
On the Values dropdown menu, choose confidential to overwrite the Data Catalog view’s key value of sensitivity from pii-confidential.
Choose Save.

With this, we have created a non-PII view to share with the data engineering team from the datasets that contain customer PII.

Add Athena SQL dialect to the view

Because the view customer_nonpii_view was created by the EMR runtime role GlueViewBlog-EMRStudio-RuntimeRole, the Admin has only describe permissions on it as a database creator and Lake Formation administrator. In this step, the Admin grants itself alter permissions on the view in order to add the Athena SQL dialect to the view.

On the Lake Formation console, in the navigation pane, choose Data permissions.
Choose Grant and provide the following information:
1. For Principals, enter Admin.
2. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
3. For Key, choose sensitivity.
4. For Values, choose confidential and pii-confidential.
5. Under Database permissions, select Super for Database permissions and Grantable permissions.
6. Under Table permissions, select Super for Table permissions and Grantable permissions.
7. Choose Grant.
Verify the LF-Tags based permissions for the Admin.
Open the Athena query editor, choose the Workgroup GlueViewBlogWorkgroup and choose the AWS Glue database glueviewblog__db.

Run the following query. Update the database name to match your account’s resources.

ALTER VIEW glueviewblog__db.customer_nonpii_view ADD DIALECT
AS
select c_customer_id, c_customer_sk, c_last_review_date, ca_country, ca_location_type
from glueviewblog__db.customer_iceberg, glueviewblog__db.customer_address_iceberg
where c_current_addr_sk = ca_address_sk and c_preferred_cust_flag='Y';

Verify the Athena dialect by running a preview on the view.
On the Lake Formation console, verify the SQL dialects on the view customer_nonpii_view.

Share the view to the consumer account

Complete the following steps to share the Data Catalog view to the consumer account:

On the Lake Formation console, in the navigation pane, choose Data permissions.
Choose Grant and provide the following information:
1. For Principals, select External accounts and enter the consumer account ID.
2. For LF-Tags or catalog resources, select Resources matched by LF-Tags.
3. For Key, choose sensitivity.
4. For Values, choose confidential.
5. Under Database permissions, select Describe for Database permissions and Grantable permissions.
6. Under Table permissions, select Describe and Select for Table permissions and Grantable permissions.
7. Choose Grant.
Verify granted permissions on the Data permissions page.
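The same LF-Tag based grant can be sketched as requests for the Lake Formation GrantPermissions API. This is an illustrative sketch only: the account ID 111122223333 is a hypothetical placeholder for the consumer account, and the helper name build_lf_tag_grant is not part of any AWS library.

```python
def build_lf_tag_grant(principal, resource_type, tag_key, tag_values, permissions):
    """Build a GrantPermissions request for resources matched by an LF-Tag."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": resource_type,  # "DATABASE" or "TABLE"
                "Expression": [
                    {"TagKey": tag_key, "TagValues": list(tag_values)}
                ],
            }
        },
        "Permissions": list(permissions),
        # Grantable permissions mirror the console selections above.
        "PermissionsWithGrantOption": list(permissions),
    }


consumer_account_id = "111122223333"  # hypothetical consumer account ID

database_grant = build_lf_tag_grant(
    consumer_account_id, "DATABASE", "sensitivity", ["confidential"], ["DESCRIBE"]
)
table_grant = build_lf_tag_grant(
    consumer_account_id, "TABLE", "sensitivity", ["confidential"],
    ["DESCRIBE", "SELECT"],
)
# With boto3 installed and credentials configured:
#   lf = boto3.client("lakeformation")
#   lf.grant_permissions(**database_grant)
#   lf.grant_permissions(**table_grant)
```

Granting on the tag expression rather than on named resources means any view later tagged sensitivity=confidential is shared under the same grant.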

With this, the producer account data steward has created a Data Catalog view of a subset of data from two tables in their Data Catalog, using the EMR runtime role as the definer role. They’ve shared it to their analytics account using LF-Tags to run further processing of the data downstream.

Set up infrastructure in the consumer account

We provide a CloudFormation template to deploy the following resources and set up the data lake as follows:

An S3 bucket for Amazon EMR and AWS Glue logs
Lake Formation administrator and catalog settings similar to the producer account setup
An AWS Glue database for the data lake
An EMR Studio and related VPC, subnet, routing table, security groups, and EMR Studio service IAM role
An IAM role with policies for the EMR Studio runtime

Complete the following steps in your consumer AWS account:

Sign in to the console as an IAM administrator role.
Launch the CloudFormation stack.

Allow approximately 5 minutes for the CloudFormation stack to complete creation. After the stack has finished launching, proceed with the following instructions.

If you’re using Lake Formation in the consumer account for the first time, on the Lake Formation console, create a database named default and grant describe permission on the default database to runtime role GlueViewBlog-EMRStudio-Consumer-RuntimeRole.

Accept AWS RAM shares in the consumer account

You can now log in to the AWS consumer account and accept the AWS RAM invitation:

Open the AWS RAM console with the IAM role that has AWS RAM access.
In the navigation pane, choose Resource shares under Shared with me.

You should see two pending resource shares from the producer account.

Accept both invitations.
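If you script this step, the pending invitations can be picked out of a GetResourceShareInvitations response before accepting them. The following is a minimal sketch: the sample response, account ID, and ARNs are made-up illustrations, and pending_invitation_arns is a hypothetical helper.

```python
def pending_invitation_arns(response, sender_account_id):
    """Return ARNs of pending invitations sent by the given account."""
    return [
        invitation["resourceShareInvitationArn"]
        for invitation in response.get("resourceShareInvitations", [])
        if invitation.get("status") == "PENDING"
        and invitation.get("senderAccountId") == sender_account_id
    ]


# Made-up response shaped like the output of get_resource_share_invitations.
sample_response = {
    "resourceShareInvitations": [
        {
            "resourceShareInvitationArn": "arn:aws:ram:us-east-1:999988887777:resource-share-invitation/example-1",
            "senderAccountId": "999988887777",
            "status": "PENDING",
        },
        {
            "resourceShareInvitationArn": "arn:aws:ram:us-east-1:999988887777:resource-share-invitation/example-2",
            "senderAccountId": "999988887777",
            "status": "PENDING",
        },
        {
            "resourceShareInvitationArn": "arn:aws:ram:us-east-1:123412341234:resource-share-invitation/other",
            "senderAccountId": "123412341234",
            "status": "ACCEPTED",
        },
    ]
}

arns = pending_invitation_arns(sample_response, "999988887777")
# With boto3 installed and credentials configured:
#   ram = boto3.client("ram")
#   for arn in pending_invitation_arns(ram.get_resource_share_invitations(), producer_id):
#       ram.accept_resource_share_invitation(resourceShareInvitationArn=arn)
```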

Create a resource link for the shared view

To access the view that was shared by the producer AWS account, you need to create a resource link in the consumer AWS account. A resource link is a Data Catalog object that is a link to a local or shared database, table, or view. After you create a resource link to a view, you can use the resource link name wherever you would use the view name. Additionally, you can grant permission on the resource link to the job runtime role GlueViewBlog-EMRStudio-Consumer-RuntimeRole to access the view through EMR Serverless Spark.

To create a resource link, complete the following steps:

Open the Lake Formation console as the Lake Formation data lake administrator in the consumer account.
In the navigation pane, choose Tables.
Choose Create and Resource link.
For Resource link name, enter the name of the resource link (for example, customer_nonpii_view_rl).
For Database, choose the glueviewblog_customer__db database.
For Shared table region, choose the Region of the shared table.
For Shared table, choose customer_nonpii_view.
Choose Create.
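As a scripted alternative to the console steps above, a resource link can be expressed as an AWS Glue CreateTable request whose TableInput carries only a name and a TargetTable. This is a sketch under assumptions: the producer account ID 999988887777 and the Region are hypothetical placeholders, and the database names are the ones shown in this post.

```python
def build_resource_link_request(consumer_database, link_name,
                                producer_account_id, producer_database,
                                shared_view_name, region=None):
    """Build a Glue CreateTable request that defines a resource link."""
    target = {
        "CatalogId": producer_account_id,  # the producer's Data Catalog
        "DatabaseName": producer_database,
        "Name": shared_view_name,
    }
    if region:
        # Only needed when the shared view lives in a different Region.
        target["Region"] = region
    return {
        "DatabaseName": consumer_database,
        "TableInput": {"Name": link_name, "TargetTable": target},
    }


link_request = build_resource_link_request(
    "glueviewblog_customer__db",
    "customer_nonpii_view_rl",
    "999988887777",  # hypothetical producer account ID
    "glueviewblog__db",
    "customer_nonpii_view",
)
# With boto3 installed and credentials configured:
#   boto3.client("glue").create_table(**link_request)
```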

Grant permissions on the database to the EMR job runtime role

Complete the following steps to grant permissions on the database glueviewblog_customer__db to the EMR job runtime role:

On the Lake Formation console, in the navigation pane, choose Databases.
Select the database glueviewblog_customer__db and on the Actions menu, choose Grant.
In the Principals section, select IAM users and roles, and choose GlueViewBlog-EMRStudio-Consumer-RuntimeRole.
In the Database permissions section, select Describe.
Choose Grant.

Grant permissions on the resource link to the EMR job runtime role

Complete the following steps to grant permissions on the resource link customer_nonpii_view_rl to the EMR job runtime role:

On the Lake Formation console, in the navigation pane, choose Tables.
Select the resource link customer_nonpii_view_rl and on the Actions menu, choose Grant.
In the Principals section, select IAM users and roles, and choose GlueViewBlog-EMRStudio-Consumer-RuntimeRole.
In the Resource link permissions section, select Describe for Resource link permissions.
Choose Grant.

This allows the EMR Serverless job runtime role to describe the resource link. We don’t make any selections for grantable permissions because runtime roles shouldn’t be able to grant permissions to other principals.

Grant permissions on the target for the resource link to the EMR job runtime role

Complete the following steps to grant permissions on the target for the resource link customer_nonpii_view_rl to the EMR job runtime role:

On the Lake Formation console, in the navigation pane, choose Tables.
Select the resource link customer_nonpii_view_rl and on the Actions menu, choose Grant on target.
In the Principals section, select IAM users and roles, and choose GlueViewBlog-EMRStudio-Consumer-RuntimeRole.
In the View permissions section, select Select and Describe.
Choose Grant.
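The three grants to the runtime role (database, resource link, and link target) can be summarized as Lake Formation GrantPermissions requests. This is a sketch, not the post's own tooling: the account IDs and role ARN are hypothetical placeholders, and runtime_role_grants is an illustrative helper. No grantable permissions are included, matching the console steps.

```python
def build_grant(principal_arn, resource, permissions):
    """Build a GrantPermissions request without grant option."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": resource,
        "Permissions": list(permissions),
    }


def runtime_role_grants(role_arn, consumer_account_id, consumer_database,
                        link_name, producer_account_id, producer_database,
                        view_name):
    """Return the database, resource link, and target grants in order."""
    database = {"Database": {"CatalogId": consumer_account_id,
                             "Name": consumer_database}}
    resource_link = {"Table": {"CatalogId": consumer_account_id,
                               "DatabaseName": consumer_database,
                               "Name": link_name}}
    # The target lives in the producer's catalog, not the consumer's.
    target = {"Table": {"CatalogId": producer_account_id,
                        "DatabaseName": producer_database,
                        "Name": view_name}}
    return [
        build_grant(role_arn, database, ["DESCRIBE"]),
        build_grant(role_arn, resource_link, ["DESCRIBE"]),
        build_grant(role_arn, target, ["SELECT", "DESCRIBE"]),
    ]


grants = runtime_role_grants(
    "arn:aws:iam::111122223333:role/ExampleRuntimeRole",  # hypothetical ARN
    "111122223333",  # hypothetical consumer account ID
    "glueviewblog_customer__db",
    "customer_nonpii_view_rl",
    "999988887777",  # hypothetical producer account ID
    "glueviewblog__db",
    "customer_nonpii_view",
)
# With boto3 installed and credentials configured:
#   lf = boto3.client("lakeformation")
#   for grant in grants:
#       lf.grant_permissions(**grant)
```

Keeping the link grant and the target grant separate reflects how resource links work: describe on the link lets the role resolve the name, while select on the target is what actually authorizes reading the view.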

Set up an EMR Serverless application and Workspace in the consumer account

Repeat the steps to create an EMR Serverless application in the consumer account.

Repeat the steps to create a Workspace in the consumer account. For Compute type, select EMR Serverless application and enter emr-glueview-application for the application and GlueViewBlog-EMRStudio-Consumer-RuntimeRole as the runtime role.

Verify access using interactive notebooks from EMR Studio

Complete the following steps to verify access in EMR Studio:

Download the notebook glueviewblog_emr_consumer.ipynb. The code runs a select statement on the view shared from the producer.
In your EMR Workspace emrs-glueviewblog-workspace, navigate to the File browser section and choose Upload files.
Upload glueviewblog_emr_consumer.ipynb.
Update the data lake bucket name, AWS account ID, and Region to match your resources.
Update the database to match your resources.
Save the notebook.
Choose the double arrow icon to restart the kernel with PySpark kernel and rerun the notebook.

Verify access using interactive notebooks from AWS Glue Studio

Complete the following steps to verify access using AWS Glue Studio:

Download the notebook glueviewblog_glue_consumer.ipynb.
Open the AWS Glue Studio console.
Choose Notebook and then choose Upload notebook.
Upload the notebook glueviewblog_glue_consumer.ipynb.
For IAM role, choose GlueViewBlog-EMRStudio-Consumer-RuntimeRole.
Choose Create notebook.
Update the data lake bucket name, AWS account ID, and Region to match your resources.
Update the database to match your resources.
Save the notebook.
Run all the cells to verify fine-grained access.

Verify access using the Athena query editor

Because the view from the producer account was shared to the consumer account, the Lake Formation administrator will have access to the view in the producer account. Also, because the lake admin role created the resource link pointing to the view, it will also have access to the resource link. Go to the Athena query editor and run a simple select query on the resource link.

The analytics team in the consumer account was able to access a subset of the data from a business data producer team, using their analytics tools of choice.

Clean up

To avoid incurring ongoing costs, clean up your resources:

In your consumer account, delete the AWS Glue notebook, stop and delete the EMR application, and then delete the EMR Workspace.
In your consumer account, delete the CloudFormation stack. This should remove the resources launched by the stack.
In your producer account, log in to the Lake Formation console and revoke the LF-Tags based permissions you had granted to the consumer account.
In your producer account, stop and delete the EMR application and then delete the EMR Workspace.
In your producer account, delete the CloudFormation stack. This should delete the resources launched by the stack.
Review and clean up any additional AWS Glue and Lake Formation resources and permissions.

Conclusion

In this post, we demonstrated a powerful, enterprise-grade solution for cross-account data sharing and analysis using AWS services. We walked you through how to do the following key steps:

Create a Data Catalog view using Spark in EMR Serverless within one AWS account
Securely share this view with another account using LF-TBAC
Access the shared view in the recipient account using Spark in both EMR Serverless and AWS Glue ETL
Implement this solution with Iceberg tables (it’s also compatible with other open table formats like Apache Hudi and Delta Lake)

The solution approach with multi-dialect Data Catalog views provided in this post is particularly useful for enterprises looking to modernize their data infrastructure while optimizing costs, improve cross-functional collaboration while enhancing data governance, and accelerate business insights while maintaining control over sensitive information.

Refer to the following information about Data Catalog views with individual analytics services, and try out the solution. Let us know your feedback and questions in the comments section.

About the Authors

Aarthi Srinivasan is a Senior Big Data Architect with Amazon SageMaker Lakehouse. As part of the SageMaker Lakehouse team, she works with AWS customers and partners to architect lake house solutions, enhance product features, and establish best practices for data governance.

Praveen Kumar is an Analytics Solutions Architect at AWS with expertise in designing, building, and implementing modern data and analytics platforms using cloud-based services. His areas of interest are serverless technology, data governance, and data-driven AI applications.

Dhananjay Badaya is a Software Developer at AWS, specializing in distributed data processing engines including Apache Spark and Apache Hadoop. As a member of the Amazon EMR team, he focuses on designing and implementing enterprise governance features for EMR Spark.
