Implement table-level access control on data lake tables using AWS Glue 5.0 with AWS Lake Formation


AWS Glue 5.0 now supports Full-Table Access (FTA) control in Apache Spark based on the policies you define in AWS Lake Formation. This new feature enables read and write operations from your AWS Glue 5.0 Spark jobs on Lake Formation registered tables when the job role has full table access. This level of control is ideal for use cases that need to comply with security regulations at the table level. In addition, you can now use Spark capabilities including Resilient Distributed Datasets (RDDs), custom libraries, and user-defined functions (UDFs) with Lake Formation tables. This capability enables Data Manipulation Language (DML) operations including CREATE, ALTER, DELETE, UPDATE, and MERGE INTO statements on Apache Hive and Iceberg tables from within the same Apache Spark application. Data teams can run complex, interactive Spark applications through Amazon SageMaker Unified Studio in compatibility mode while maintaining the table-level security boundaries provided by Lake Formation. This simplifies security and governance of your data lakes.

In this post, we show you how to implement FTA control on AWS Glue 5.0 through Lake Formation permissions.

How access control works on AWS Glue

AWS Glue 5.0 supports two features that achieve access control through Lake Formation:

  • Full-Table Access (FTA) control
  • Fine-grained access control (FGAC)

At a high level, FTA supports access control at the table level, while FGAC supports access control at the table, row, column, and cell levels. To support this more granular access control, FGAC uses a tight security model based on user/system space isolation. To maintain this extra level of security, only a subset of Spark core classes are allowlisted. Additionally, there is extra setup for enabling FGAC, such as passing the --enable-lakeformation-fine-grained-access parameter to the job. For more information about FGAC, see Implement fine-grained access control on data lake tables using AWS Glue 5.0 integrated with AWS Lake Formation.
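As a point of reference, the following is a minimal boto3 sketch of creating a job with FGAC enabled; the job name, role ARN, and script location are placeholders, and nothing else in this post requires this parameter:

import boto3

glue = boto3.client("glue")

# Placeholder job name, role ARN, and script location -- an FGAC job opts in
# through the --enable-lakeformation-fine-grained-access job parameter
glue.create_job(
    Name="fgac-example-job",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    GlueVersion="5.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://<your-bucket-name>/scripts/fgac-example.py",
        "PythonVersion": "3",
    },
    DefaultArguments={"--enable-lakeformation-fine-grained-access": "true"},
)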

While this level of granular control is necessary for organizations that need to comply with data governance or security regulations, or that deal with sensitive data, it's excessive for organizations that only need table-level access control. To provide customers with a way to enforce table-level access without the performance, cost, and setup overhead introduced by the tighter security model in FGAC, AWS Glue introduced FTA. Let's dive into FTA, the main topic of this post.

How Full-Table Access (FTA) works in AWS Glue

Until AWS Glue 4.0, Lake Formation-based data access worked through the GlueContext class, the utility class provided by AWS Glue. With the launch of AWS Glue 5.0, Lake Formation-based data access is available through native Spark SQL and Spark DataFrames.

With this launch, when you have full table access to your tables through Lake Formation permissions, you don't need to enable fine-grained access mode for your AWS Glue jobs or sessions. This eliminates the need to spin up a system driver and system executors, because they're designed to enforce fine-grained access, resulting in lower performance overhead and lower cost. In addition, although Lake Formation fine-grained access mode supports read operations, FTA supports not only read operations but also write operations through CREATE, ALTER, DELETE, UPDATE, and MERGE INTO commands.
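To illustrate the write side, the following sketch runs a MERGE INTO against the Iceberg table used later in this post; it assumes spark is a SparkSession already configured for FTA, as shown in the migration section:

# Assumes `spark` is a SparkSession configured for Lake Formation FTA
# (see the Glue 5.0 script later in this post)
changes = spark.createDataFrame(
    [("U", 1, "electronics", "Widget", 42, "2024-01-01 00:00:00")],
    ["op", "product_id", "category", "product_name", "quantity_available", "last_update_time"],
)
changes.createOrReplaceTempView("updates")

# MERGE INTO is one of the write operations FTA allows on Iceberg tables
spark.sql("""
    MERGE INTO glue5_fta_demo.iceberg_datalake t
    USING updates s
    ON t.product_id = s.product_id
    WHEN MATCHED THEN UPDATE SET t.quantity_available = s.quantity_available
    WHEN NOT MATCHED THEN INSERT *
""")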

To use FTA mode, you must allow third-party query engines to access data without the AWS Identity and Access Management (IAM) session tag validation in Lake Formation. To do this, follow the steps in Application integration for full table access.
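If you prefer to script that account-level setting, the following boto3 sketch flips the AllowFullTableExternalDataAccess flag in the Lake Formation data lake settings; treat it as a sketch under that assumption and defer to the linked documentation for the authoritative steps:

import boto3

lf = boto3.client("lakeformation")

# Read the current data lake settings, enable full-table external access,
# and write them back; run this as a Lake Formation administrator
settings = lf.get_data_lake_settings()["DataLakeSettings"]
settings["AllowFullTableExternalDataAccess"] = True
lf.put_data_lake_settings(DataLakeSettings=settings)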

Migrate an AWS Glue 4.0 GlueContext FTA job to AWS Glue 5.0 native Spark FTA

The high-level steps to enable the Spark native FTA feature are documented in Using AWS Glue with AWS Lake Formation for full table access. In this section, however, we go through an end-to-end example of how to migrate an AWS Glue 4.0 job that uses FTA through GlueContext to read an Iceberg table to an AWS Glue 5.0 job that uses Spark native FTA.

Prerequisites

Before you get started, make sure you have the following prerequisites:

  • An AWS account with AWS Identity and Access Management (IAM) roles as needed:
    • A Lake Formation data access IAM role that isn't a service-linked role.
    • An AWS Glue job execution role with the AWS managed policy AWSGlueServiceRole attached and the lakeformation:GetDataAccess permission. Be sure to include the AWS Glue service in the trust policy.
  • The required permissions to perform the actions in this post.
  • Lake Formation set up in the account and a Lake Formation administrator role or a similar role to follow along with the instructions in this post. To learn more about setting up permissions for a data lake administrator role, see Create a data lake administrator.

For this post, we use the us-east-1 AWS Region, but you can implement this in your preferred Region if the AWS services included in the architecture are available in that Region.

You'll walk through setting up test data and an example AWS Glue 4.0 job using GlueContext, but if you already have these and are only interested in how to migrate, proceed to Migrate an AWS Glue 4.0 GlueContext FTA job to AWS Glue 5.0 native Spark FTA. With the prerequisites in place, you're ready to begin the implementation steps.

Create an S3 bucket and upload a sample data file

To create an S3 bucket for the raw input datasets and Iceberg table, complete the following steps (or script them with the boto3 sketch after the list):

  1. On the AWS Management Console for Amazon S3, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. Enter the bucket name (for example, glue5-fta-demo-${AWS_ACCOUNT_ID}-${AWS_REGION_CODE}), and leave the remaining fields as default.
  4. Choose Create bucket.
  5. On the bucket details page, choose Create folder.
  6. Create two subfolders: raw-csv-input and iceberg-datalake.
  7. Upload the LOAD00000001.csv file into the raw-csv-input folder of the bucket.
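The following boto3 sketch performs the same setup; the bucket name and local file path are placeholders:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")
bucket = "glue5-fta-demo-123456789012-use1"  # placeholder bucket name

# In us-east-1, create_bucket takes no LocationConstraint; other Regions
# require CreateBucketConfiguration={"LocationConstraint": region}
s3.create_bucket(Bucket=bucket)

# S3 "folders" are key prefixes; the console's Create folder button
# writes an empty object whose key ends in "/"
for prefix in ("raw-csv-input/", "iceberg-datalake/"):
    s3.put_object(Bucket=bucket, Key=prefix)

# Upload the sample CSV into the raw input prefix
s3.upload_file("LOAD00000001.csv", bucket, "raw-csv-input/LOAD00000001.csv")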

Create an AWS Glue database and AWS Glue tables

To create input and output sample tables in the Data Catalog, complete the following steps:

  1. On the Athena console, navigate to the query editor.
  2. Run the following queries in sequence (provide your S3 bucket name):
-- Create database for the demo
CREATE DATABASE glue5_fta_demo;

-- Create external table on the input CSV files. Replace the S3 path with your bucket name
CREATE EXTERNAL TABLE glue5_fta_demo.raw_csv_input(
 op string, 
 product_id bigint, 
 category string, 
 product_name string, 
 quantity_available bigint, 
 last_update_time string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://<your-bucket-name>/raw-csv-input/'
TBLPROPERTIES (
  'areColumnsQuoted'='false', 
  'classification'='csv', 
  'columnsOrdered'='true', 
  'compressionType'='none', 
  'delimiter'=',', 
  'typeOfData'='file');
 
-- Create output Iceberg table with partitioning. Replace the S3 path with your bucket name
CREATE TABLE glue5_fta_demo.iceberg_datalake WITH (
  table_type='ICEBERG',
  format='parquet',
  write_compression = 'SNAPPY',
  is_external = false,
  partitioning=ARRAY['category', 'bucket(product_id, 16)'],
  location='s3://<your-bucket-name>/iceberg-datalake/'
) AS SELECT * FROM glue5_fta_demo.raw_csv_input;

  3. Run the following query to validate the raw CSV input data:

SELECT * FROM glue5_fta_demo.raw_csv_input;

The following screenshot shows the query result:

  4. Run the following query to validate the Iceberg table data:

SELECT * FROM glue5_fta_demo.iceberg_datalake;

The following screenshot shows the query result:

This step used DDL to create the table definitions. Alternatively, you can use the Data Catalog API, the AWS Glue console, the Lake Formation console, or an AWS Glue crawler.
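For example, the Data Catalog API route for the CSV table might look like the following boto3 sketch, which mirrors the CREATE EXTERNAL TABLE statement above (the bucket placeholder is yours to fill in):

import boto3

glue = boto3.client("glue")

# Data Catalog API equivalent of the CREATE EXTERNAL TABLE DDL above
glue.create_table(
    DatabaseName="glue5_fta_demo",
    TableInput={
        "Name": "raw_csv_input",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv", "delimiter": ","},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "op", "Type": "string"},
                {"Name": "product_id", "Type": "bigint"},
                {"Name": "category", "Type": "string"},
                {"Name": "product_name", "Type": "string"},
                {"Name": "quantity_available", "Type": "bigint"},
                {"Name": "last_update_time", "Type": "string"},
            ],
            "Location": "s3://<your-bucket-name>/raw-csv-input/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)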

The next step is to configure Lake Formation permissions on the iceberg_datalake table.

Configure Lake Formation permissions

To validate the capability, you need to define FTA permissions for the iceberg_datalake Data Catalog table you created. To start, enable read access to iceberg_datalake.

To configure Lake Formation permissions for the iceberg_datalake table, complete the following steps (a boto3 alternative for the registration follows the list):

  1. On the Lake Formation console, choose Data lake locations under Administration in the navigation pane.
  2. Choose Register location.
  3. For Amazon S3 path, enter the path of your S3 bucket to register the location.
  4. For IAM role, choose your Lake Formation data access IAM role, which isn't a service-linked role.
  5. For Permission mode, select Lake Formation.
  6. Choose Register location.
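The equivalent registration through boto3 might look like this sketch (the ARNs are placeholders):

import boto3

lf = boto3.client("lakeformation")

# Register the bucket with a named data access role instead of the
# service-linked role; HybridAccessEnabled=False keeps pure Lake
# Formation permission mode
lf.register_resource(
    ResourceArn="arn:aws:s3:::glue5-fta-demo-123456789012-use1",
    RoleArn="arn:aws:iam::123456789012:role/LakeFormationDataAccessRole",
    UseServiceLinkedRole=False,
    HybridAccessEnabled=False,
)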

Grant permissions on the Iceberg table

The next step is to grant table permissions on the iceberg_datalake table to the AWS Glue job role (to script the grant instead, see the boto3 sketch after the list).

  1. On the Lake Formation console, choose Data permissions under Permissions in the navigation pane.
  2. Choose Grant.
  3. For Principals, choose IAM users and roles.
  4. For IAM users and roles, choose the IAM role that's going to be used on an AWS Glue job.
  5. For LF-Tags or catalog resources, choose Named Data Catalog resources.
  6. For Catalogs, choose your account ID (the default catalog).
  7. For Databases, choose glue5_fta_demo.
  8. For Tables, choose iceberg_datalake.
  9. For Table permissions, choose Select and Describe.
  10. For Data permissions, choose All data access.
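Scripted, the same grant is a single boto3 call (the role ARN is a placeholder):

import boto3

lf = boto3.client("lakeformation")

# Grant SELECT and DESCRIBE on the table to the Glue job execution role
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/GlueJobRole"},
    Resource={"Table": {"DatabaseName": "glue5_fta_demo", "Name": "iceberg_datalake"}},
    Permissions=["SELECT", "DESCRIBE"],
)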

Next, create the AWS Glue PySpark job to process the input data.

Query the Iceberg table through an AWS Glue 4.0 job using GlueContext and DataFrames

Next, create a sample AWS Glue 4.0 job to load data from the iceberg_datalake table. You'll use this sample job as the source of the migration. Complete the following steps:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. For Create job, choose Script editor.
  3. For Engine, choose Spark.
  4. For Options, choose Start fresh.
  5. Choose Create script.
  6. In the script, replace the following parameters:
    1. Replace aws_region with your Region.
    2. Replace aws_account_id with your AWS account ID.
    3. Replace warehouse_path with your Amazon S3 warehouse path for the Iceberg table.

For more information about how to use Iceberg in AWS Glue 4.0 jobs, see Using the Iceberg framework in AWS Glue.

from awsglue.context import GlueContext
from pyspark.sql import SparkSession

catalog_name = "glue_catalog"
aws_region = "us-east-1"
aws_account_id = "123456789012"
warehouse_path = "s3://<your-bucket-name>/warehouse/"

# Initialize Spark and Glue contexts
spark = SparkSession.builder \
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.warehouse", f"{warehouse_path}") \
    .config(f"spark.sql.catalog.{catalog_name}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config(f"spark.sql.catalog.{catalog_name}.glue.lakeformation-enabled", "true") \
    .config(f"spark.sql.catalog.{catalog_name}.client.region", f"{aws_region}") \
    .config(f"spark.sql.catalog.{catalog_name}.glue.id", f"{aws_account_id}") \
    .getOrCreate()
glueContext = GlueContext(spark.sparkContext)

database_name = "glue5_fta_demo"
table_name = "iceberg_datalake"

# Read the Iceberg table
df = glueContext.create_data_frame.from_catalog(
    database=database_name,
    table_name=table_name,
)
df.show()

  7. On the Job details tab, for Name, enter glue-fta-demo-iceberg.
  8. For IAM Role, assign an IAM role that has the required permissions to run an AWS Glue job and to read and write to the S3 bucket.
  9. For Glue version, choose Glue 4.0 – Supports Spark 3.3, Scala 2, Python 3.
  10. For Job parameters, add the following parameters:
    1. Key: --conf
    2. Value: spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
    3. Key: --datalake-formats
    4. Value: iceberg
  11. Choose Save and then Run.
  12. When the job is complete, on the Run details tab, choose Output logs.

You're redirected to the Amazon CloudWatch console, where you can validate the output.

The output table is shown in the following screenshot. You see the same output that you saw in Athena when you verified that the Iceberg table was populated. This is because the AWS Glue job execution role has full table access through the Lake Formation permissions that you granted:
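If you'd rather define this job programmatically, the following boto3 sketch is an equivalent of the console steps above (the role ARN and script location are placeholders):

import boto3

glue = boto3.client("glue")

# Programmatic equivalent of the console job definition above
glue.create_job(
    Name="glue-fta-demo-iceberg",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    GlueVersion="4.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://<your-bucket-name>/scripts/glue-fta-demo-iceberg.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "--datalake-formats": "iceberg",
    },
)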

If you were to run this same AWS Glue job with another IAM role that wasn't granted access to the table in Lake Formation, you'd see the error Insufficient Lake Formation permission(s) on iceberg_datalake. Use the following steps to replicate this behavior:

  1. Create a new IAM role that's identical to the AWS Glue job execution role you already used, but don't grant permissions to this clone in Lake Formation.
  2. Change the role in the AWS Glue console for glue-fta-demo-iceberg to the new cloned role.
  3. Rerun the job. You should see the error.
  4. For the purposes of this post, change the role back to the original job execution role that's registered in Lake Formation so you can use it in the next steps.

You now have an FTA setup in AWS Glue 4.0 that uses GlueContext DataFrames for an Iceberg table. You saw how roles that are granted permission in Lake Formation can read, and how roles that aren't granted permission in Lake Formation can't. In the next section, we show you how to migrate from AWS Glue 4.0 GlueContext FTA to AWS Glue 5.0 native Spark FTA.

Migrate an AWS Glue 4.0 GlueContext FTA job to AWS Glue 5.0 native Spark FTA

The Lake Formation permission granting experience is identical regardless of the AWS Glue version and the Spark data structures used. Therefore, assuming you have a working Lake Formation setup for your AWS Glue 4.0 job, you don't need to modify those permissions during migration. Here are the migration steps, using the AWS Glue 4.0 example from the previous sections:

  1. Allow third-party query engines to access data without the IAM session tag validation in Lake Formation. Follow the step-by-step guide in Application integration for full table access.
  2. You shouldn't need to change the job runtime role if you have AWS Glue 4.0 FTA working (see the example permissions in the prerequisites). The main IAM permission to verify is that the AWS Glue job execution role has lakeformation:GetDataAccess.
  3. Modify the Spark session configurations in the script. Verify that the following Spark configurations are present:
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.warehouse=s3://<your-bucket-name>/warehouse/
--conf spark.sql.catalog.spark_catalog.client.region=REGION
--conf spark.sql.catalog.spark_catalog.glue.account-id=ACCOUNT_ID
--conf spark.sql.catalog.spark_catalog.glue.lakeformation-enabled=true
--conf spark.sql.catalog.dropDirectoryBeforeTable.enabled=true

For more information about the preceding three steps, see Using AWS Glue with AWS Lake Formation for full table access.

  4. Update the script so that GlueContext DataFrames are changed to native Spark DataFrames. For example, the updated script for the previous AWS Glue 4.0 job would now look like the following:
from pyspark.sql import SparkSession

catalog_name = "spark_catalog"
aws_region = "us-east-1"
aws_account_id = "<your-account-id>"
warehouse_path = "s3://<your-bucket-name>/warehouse/"

spark = SparkSession.builder \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.defaultCatalog", f"{catalog_name}") \
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkSessionCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.warehouse", f"{warehouse_path}") \
    .config(f"spark.sql.catalog.{catalog_name}.client.region", f"{aws_region}") \
    .config(f"spark.sql.catalog.{catalog_name}.glue.account-id", f"{aws_account_id}") \
    .config(f"spark.sql.catalog.{catalog_name}.glue.lakeformation-enabled", "true") \
    .config("spark.sql.catalog.dropDirectoryBeforeTable.enabled", "true") \
    .config(f"spark.sql.catalog.{catalog_name}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .getOrCreate()

database_name = "glue5_fta_demo"
table_name = "iceberg_datalake"

df = spark.sql(f"SELECT * FROM {database_name}.{table_name}")
df.show()

  • You can remove the --conf job argument that was added in the AWS Glue 4.0 job because it's now set in the script itself.
  5. For Glue version, choose Glue 5.0 – Supports Spark 3.5, Scala 2, Python 3.
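Scripted, the version switch might look like the following boto3 sketch; note that update_job replaces the job definition, so Role and Command must be re-sent (placeholders below):

import boto3

glue = boto3.client("glue")

# update_job replaces the definition, so re-specify Role and Command;
# --conf and --datalake-formats are intentionally dropped because the
# Glue 5.0 script now sets the Spark configuration itself
glue.update_job(
    JobName="glue-fta-demo-iceberg",
    JobUpdate={
        "Role": "arn:aws:iam::123456789012:role/GlueJobRole",
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": "s3://<your-bucket-name>/scripts/glue-fta-demo-iceberg.py",
            "PythonVersion": "3",
        },
        "GlueVersion": "5.0",
        "DefaultArguments": {},
    },
)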

To verify that roles without Lake Formation permissions granted to them aren't able to access the Iceberg table, you can repeat the same exercise you did in AWS Glue 4.0 and reuse the cloned job execution role to rerun the job. You should see the error message: AnalysisException: Insufficient Lake Formation permission(s) on glue5_fta_demo

You've completed the migration and now have an FTA setup in AWS Glue 5.0 that uses native Spark and reads from an Iceberg table. You saw that roles that are granted permission in Lake Formation can read and that roles that aren't granted permission in Lake Formation can't.

Clean up

Complete the following steps to clean up your resources (or use the boto3 sketch after the list):

  1. Delete the AWS Glue job glue-fta-demo-iceberg.
  2. Delete the Lake Formation permissions.
  3. Delete the bucket that you created for the input datasets, which will have a name similar to glue5-fta-demo-${AWS_ACCOUNT_ID}-${AWS_REGION_CODE}.
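A boto3 sketch of the same cleanup (the role ARN and bucket name are placeholders):

import boto3

glue = boto3.client("glue")
lf = boto3.client("lakeformation")
s3 = boto3.resource("s3")

# Delete the Glue job
glue.delete_job(JobName="glue-fta-demo-iceberg")

# Revoke the table grant (mirror of the earlier grant_permissions call)
lf.revoke_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/GlueJobRole"},
    Resource={"Table": {"DatabaseName": "glue5_fta_demo", "Name": "iceberg_datalake"}},
    Permissions=["SELECT", "DESCRIBE"],
)

# Empty and delete the bucket
bucket = s3.Bucket("glue5-fta-demo-123456789012-use1")
bucket.objects.all().delete()
bucket.delete()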

Conclusion

This post explained how you can enable Spark native FTA in AWS Glue 5.0 jobs to enforce access control defined using Lake Formation grant commands. In earlier AWS Glue versions, you needed to integrate AWS Glue DataFrames to enforce FTA in AWS Glue jobs, or migrate to AWS Glue 5.0 FGAC, which has comparatively limited functionality. With this launch, if you don't need fine-grained control, you can enforce FTA through Spark DataFrames or Spark SQL for more flexibility and performance. This capability is currently supported for Iceberg and Hive tables.

This feature can save you effort and improve portability when migrating Spark scripts across serverless environments such as AWS Glue and Amazon EMR.


About the authors

Layth Yassin is a Software Development Engineer on the AWS Glue team. He's passionate about tackling challenging problems at a large scale and building products that push the boundaries of the field. Outside of work, he enjoys playing and watching basketball, and spending time with friends and family.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He's also the author of the book Serverless ETL and Analytics with AWS Glue. He's responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Kartik Panjabi is a Software Development Manager on the AWS Glue team. His team builds generative AI solutions for data integration and distributed systems for data integration.

Matt Su is a Senior Product Manager on the AWS Glue team. He enjoys helping customers discover insights and make better decisions using their data with AWS Analytics services. In his spare time, he enjoys skiing and gardening.