Information scientists and ML engineers typically have to entry uncooked knowledge information in Amazon Easy Storage Service (Amazon S3) for machine studying coaching, knowledge exploration, and generative AI workflows. Nevertheless, when table-level entry is ruled by AWS Lake Formation, accessing the underlying S3 information has required sustaining separate permission mechanisms. S3 bucket insurance policies or AWS Identification and Entry Administration (IAM) function insurance policies create operational overhead and danger of permission drift.
Lake Formation now helps direct entry to S3 knowledge file places for tables whose permissions it manages. Beforehand, knowledge scientists with Lake Formation permissions on AWS Glue Information Catalog tables might question them utilizing spark.sql(). Now, they’ll additionally learn and write the underlying S3 knowledge information utilizing spark.learn.parquet() or spark.learn.csv() from Amazon EMR Spark jobs, Amazon SageMaker Unified Studio notebooks with EMR compute, and customized purposes. All entry is ruled by the identical Lake Formation permissions.
This functionality is powered by the brand new GetTemporaryDataLocationCredentials() API, which vends non permanent credentials scoped to registered S3 places when callers have applicable Lake Formation permissions on the corresponding Information Catalog tables. This eliminates the necessity to handle separate S3 bucket insurance policies for file-level entry whereas sustaining fine-grained entry management in Lake Formation for table-based entry. It permits your knowledge scientists to discover S3 datasets securely, speed up machine studying pipelines, and construct generative AI workflows with out compromising governance.
On this submit, we reveal studying from and writing to Lake Formation-managed S3 places utilizing Apache Spark jobs from EMR. Lake Formation credential merchandising for S3 location entry is out there in EMR launch label 7.13 and later, Boto3 1.42.29 and later, AWS Java SDK 2.41.32 and later, and AWS Command Line Interface (AWS CLI) model 2.33.1 and later.
Key use circumstances for Lake Formation permissions to S3 places
- Unified permissions for Analytics and Machine Studying pipelines – Information scientists can entry each structured tables by SQL queries and underlying knowledge information by programmatic APIs for machine studying and AI workloads. They’re empowered to make use of instruments of their alternative – for instance, use Amazon Athena for SQL analytics with the desk names whereas learn and write to the underlying information of their SageMaker pocket book or Spark utility with spark.learn.parquet(“s3://bucket/database_path/table_files/).
- Allow AI prepared knowledge lakes – Machine studying pipelines can learn coaching knowledge immediately from ruled knowledge lakes. Generative AI purposes can entry basis mannequin coaching datasets, and knowledge exploration workflows to make use of native file APIs whereas sustaining centralized governance and compliance.
- Decreased operational complexity – Operations groups don’t want to take care of separate permission insurance policies – one in Lake Formation for desk entry and one other in S3 bucket insurance policies or AWS Identification and Entry Administration (IAM) roles for file entry. This reduces the danger of permission mismatches and avoids inconsistent entry management.
- Unified audit functionality – Auditors don’t want to look at a number of log sources, similar to S3 Entry Logs, AWS CloudTrail occasions from totally different providers, to grasp who accessed what knowledge and when. With this function, you get a unified CloudTrail audit path exhibiting each desk entry by SQL engines and file entry by direct APIs, with every entry occasion linked to the Lake Formation permission grant.
What prospects are saying
“By means of our shut collaboration with AWS, Lake Formation’s new S3 location-based permissions have remodeled how we handle knowledge governance at Intuit. By unifying two separate entry mechanisms for a similar knowledge into one unified permission mannequin, we’ve dramatically diminished complexity and streamlined our auditing course of. That is precisely the form of simplification that lets our groups transfer quicker with out compromising safety, making certain we keep the strict compliance and governance requirements our regulators count on.”
— Tapan Upadhyay, Group Engineering Supervisor, Intuit
Lake Formation Credential Merchandising Plugin for AWS SDK v2 for Java
Lake Formation has made accessible a specialised library AWS Lake Formation Credential Merchandising Plugin for AWS SDK V2 for Java. The Java plugin intercepts S3 requests for knowledge, checks Lake Formation permissions for the requested location, and offers non permanent scoped credentials to the shopper if permissions are granted in Lake Formation. If the S3 location entry permissions usually are not managed by Lake Formation, the plugin checks for entry in Amazon S3 Entry Grants and lastly falls again to IAM permissions. The plugin is supported independently of Spark and comes as an enhancement to EMR Spark Full Desk Entry (FTA) mode, beginning in EMR 7.13 and later. The plugin is built-in on the S3A degree. Subsequently, any shopper of S3A can allow it by setting the S3A configurations, along with the EMR Lake Formation Full Desk Entry (FTA) configuration as follows:
With the Java plugin, you may allow governance for knowledge lake assets in your customized purposes with Lake Formation permissions – managing each tremendous grained entry for customers requiring restricted entry on Information Catalog tables whereas offering direct S3 object degree entry to use-cases that require them.
Word: (1) The principal that will probably be accessing direct S3 places of the tables would require full desk entry. That’s, Lake Formation SELECT permission on all columns and rows of the desk is required. (2) The Spark cluster wants FTA configuration. (3) At the moment, Apache Iceberg desk format isn’t supported with this plugin.
Resolution overview
A monetary providers firm runs day by day ETL jobs utilizing Spark in EMR. They course of uncooked transaction data in S3 and retailer the processed data in one other S3 location. The remodeled Parquet knowledge is registered with Lake Formation and cataloged as a desk in Information Catalog. The ETL job may have direct IAM entry to the uncooked knowledge location, whereas it makes use of Lake Formation permissions to jot down to and skim from the curated desk location. Downstream, a data-analyst function will question the curated desk, with restricted column entry. The answer is proven in Determine 1.
Determine 1 – Structure reveals EMR Spark writing curated data to the S3 location of a desk utilizing Lake Formation permissions whereas Information-Analyst queries the identical desk with Lake Formation tremendous grained entry management in Athena.

Conditions
To get began exploring this function, we advocate you could have the next setup.
Resolution walkthrough
First, we are going to get the setup prepared with S3, pattern database, desk, and knowledge. We are going to add a uncooked knowledge set to S3 location, create a desk with parquet knowledge in one other S3 location that represents the curated dataset for additional downstream consumption. We are going to register the desk knowledge location with Lake Formation and grant permissions for the EMR run time function and Information-Analyst function.
Your S3 bucket may have the next construction.
Uncooked knowledge – s3://
Course of knowledge for desk – s3://
Spark script – s3://
Logs for the EMR cluster – s3://
Step 1 – Create a parquet desk in Information Catalog
From the Athena console question editor, create a desk in Information Catalog.
Step 2 – Register S3 location and grant desk permission to IAM roles in Lake Formation
2.1 Register the desk knowledge location s3:// with Lake Formation in Lake Formation mode utilizing the customized S3 registration IAM function. For particulars on easy methods to register places with Lake Formation, refer Including an Amazon S3 location to your knowledge lake.
2.2 Grant DESCRIBE permission on the database finance_db and ALL permission on the desk transactions_processed to your EMR runtime function.
2.3 Grant Information location permission to EMR runtime function on the curated desk’s location. That is to permit writing to that location.
2.4 Grant DESCRIBE permission on the database finance_db and SELECT permission on the desk transactions_processed to your Information-Analyst function. Exclude the columns transaction_id and account_number whereas granting SELECT permissions on the desk to the Information-Analyst function.
For particulars on easy methods to grant Lake Formation permissions, refer Granting database permissions utilizing the named useful resource methodology; Granting desk permissions utilizing the named useful resource methodology and Granting knowledge location permissions.
Step 3 – Run ETL script in EMR
3.1 Obtain the script bdb-5860-script.py.
3.2 Edit the S3 bucket identify placeholder within the script (RAW_PATH and TABLE_PATH) to your useful resource names and add to your S3 path s3://.
3.3 Be sure that your EMR runtime function has entry to the script location in its IAM coverage permissions.
3.4 Submit and run the script as a step to the EMR cluster, following directions at Add a Spark step.
What does the script do?
It populates uncooked data of transaction knowledge right into a Spark knowledge body, writes to the uncooked knowledge bucket location utilizing IAM permissions on the EMR runtime function. We apply some transformations and write on to the S3 location of the desk that’s registered with Lake Formation, from the info body utilizing Spark’s native Parquet author.
The next determine reveals the stdout of the step.

The Java plugin built-in into EMR 7.13 routinely handles the entry for the desk’s knowledge location registered with Lake Formation, so that you don’t have to manually name the GetTemporaryDataLocationCredentials() API. On this instance, the desk knowledge location s3:// is registered with Lake Formation, for which EMR runtime function is granted ALL permissions. The direct S3 location entry assist by Lake Formation permits studying and writing to the placement immediately utilizing Spark knowledge body.
Step 4 – Run question as Information-Analyst utilizing Athena
Log in because the Information-Analyst function to the Athena console. Run a choose question on the desk as follows.
The Information-Analyst function ought to see all however two columns of the desk.

With these steps full, we’ve learn from and written to direct S3 places utilizing Spark knowledge frames with the syntax s3://bucketname/prefix/, and accessed the identical knowledge utilizing database_name.table_name syntax with Lake Formation permissions. This reveals fine-grained entry at desk degree and coarse-grained entry on the file path degree.
Clear up
To keep away from incurring prices, clear up the assets you created for this submit.
- Delete the Information Catalog database and tables. This removes the associated Lake Formation permissions too. Take away the S3 bucket registration from Lake Formation.
- Delete the info information, logs, and the PySpark script of this submit out of your S3 bucket.
- Terminate the EMR cluster.
Conclusion
On this submit, we confirmed easy methods to use Lake Formation’s direct S3 location entry to learn and write knowledge information utilizing Spark knowledge frames from Amazon EMR, whereas sustaining unified governance by Lake Formation permissions. We walked by the GetTemporaryDataLocationCredentials() API and the AWS Lake Formation Credential Merchandising Plugin for AWS SDK v2 for Java, which is built-in into EMR launch labels 7.13 and later.
This functionality unifies permission administration for each fine-grained table-based entry and direct S3 file path entry in Lake Formation. Your knowledge scientists can now use spark.learn.parquet() and spark.write alongside spark.sql(), ruled by the identical permissions, audited in the identical CloudTrail logs, and managed from a single console.
To get began, launch an EMR 7.13 cluster and begin exploring the function. Listed here are some further assets:
Acknowledgements: We wish to thank all of the staff members who labored to launch this function efficiently – Rajas Bhate, Akhil Yendluri, Kunal Parikh, Sharda Khubchandani, Dhananjay Badaya, Santhosh Padmanabhan, Nitin Agrawal and Sandeep Adwankar.
Concerning the authors