Find out how to use Parquet Column Indexes with Amazon Athena

Amazon Athena not too long ago added help for studying Parquet Column Indexes in Apache Iceberg tables on November 21, 2025. With this optimization, Athena can carry out page-level knowledge pruning to skip pointless knowledge inside Parquet row teams, doubtlessly lowering the quantity of knowledge scanned and bettering question runtime for queries with selective filters. For knowledge groups, this will likely assist allow sooner insights and assist scale back prices when analyzing large-scale knowledge lakes.

Knowledge groups constructing knowledge lakes usually select Apache Iceberg for its ACID transactions, schema evolution, and metadata administration capabilities. Athena is a serverless question engine that means that you can question Amazon S3-based knowledge lakes utilizing SQL, and also you don’t have to handle infrastructure. Primarily based on the kind of knowledge and question logic, Athena can apply a number of question optimizations to enhance efficiency and scale back prices.

On this weblog submit, we use Athena and Amazon SageMaker Unified Studio to discover Parquet Column Indexes and exhibit how they will enhance Iceberg question efficiency. We clarify what Parquet Column Indexes are, exhibit their efficiency advantages, and present you learn how to use them in your functions.

Overview of Parquet Column Indexes

Parquet Column Indexes retailer metadata that question engines can use to skip irrelevant knowledge with better precision than row group statistics alone. To grasp how they work, contemplate how knowledge is structured inside Parquet recordsdata and the way engines like Athena course of them.

Parquet recordsdata set up knowledge hierarchically by dividing knowledge into row teams (sometimes 128-512 MB every) and additional subdividing them into pages (sometimes 1 MB every). Historically, Parquet maintains metadata on the contents of every row group stage within the type of min/max statistics, permitting engines like Athena to skip row teams that don’t fulfill question predicates. Though this strategy reduces the bytes scanned and question runtime, it has limitations. If even a single web page inside a row group overlaps with the values you’re looking for, Athena scans all pages inside the row group.

Parquet Column Indexes assist deal with this downside by storing page-level min/max statistics within the Parquet file footer. Row group statistics present coarse-grained filtering, however Parquet Column Indexes allow finer-grained filtering by permitting question engines like Athena to skip particular person pages inside a row group. Contemplate a Parquet file with a single row group containing 5 pages for a column. The row group has min/max statistics of (1, 20), and every web page for that column has the next min/max statistics.

row-group-0: min=1, max=20
    page-0: min=1, max=10
    page-1: min=1, max=10
    page-2: min=5, max=15
    page-3: min=6, max=16
    page-4: min=10, max=20

When Athena runs a question filtering for values equal to 2, it first checks the row group statistics and confirms that 2 falls inside the vary (1, 20). Athena will then plan to scan the pages inside that row group. With out Parquet Column Indexes, Athena scans every of the 5 pages within the row group. With Parquet Column Indexes, Athena examines the page-level statistics and determines that solely page-0 and page-1 have to be learn, skipping the remaining 3 pages.

Find out how to use Parquet Column Indexes with Athena

Athena makes use of Parquet Column Indexes based mostly on desk kind:

Amazon S3 Tables: Athena routinely makes use of Parquet Column Indexes by default when they’re current.
Iceberg tables in S3 common function buckets: Athena doesn’t use Parquet Column Indexes by default. To permit Athena to make use of Parquet Column Indexes, add an AWS Glue desk property named use_iceberg_parquet_column_index and set it to true. Use the AWS Glue console or AWS Glue UpdateTable API to carry out these actions.

Learn extra about learn how to use this function in Use Parquet column indexing.

Measuring Athena efficiency good points when utilizing Parquet Column Indexes

Now that we perceive what Parquet Column Indexes are, we’ll exhibit the efficiency advantages of utilizing Parquet Column Indexes by analyzing the catalog_sales desk from a 3TB TPC-DS dataset. This desk comprises ecommerce transaction knowledge together with order dates, gross sales quantities, buyer IDs, and product data. This dataset is an effective proxy for the sorts of enterprise evaluation that you simply may carry out by yourself knowledge, reminiscent of figuring out gross sales traits, analyzing buyer buying patterns, and calculating income metrics. We evaluate question execution statistics with and with out Parquet Column Indexes to quantify the efficiency enchancment.

Stipulations

Earlier than you start, you need to have the next sources:

A SageMaker Unified Studio IAM-based area.
An Execution IAM Function configured inside the SageMaker Unified Studio IAM-based area with entry to S3, AWS Glue Knowledge Catalog, and Athena.
An S3 bucket in your account to retailer Iceberg desk knowledge and Athena question outcomes.

Create catalog_sales Iceberg desk

Full the next steps utilizing SageMaker Unified Studio notebooks. There, you should utilize SageMaker Unified Studio’s multi-dialect pocket book performance to work along with your knowledge utilizing the Athena SQL and Spark engines. To create a catalog_sales Iceberg desk in your account, observe these steps:

Navigate to Amazon SageMaker within the AWS Administration Console and select Open below Get began with Amazon SageMaker Unified Studio.
From the aspect navigation, choose Notebooks and select Create Pocket book. The following steps on this submit will execute scripts on this pocket book.
Create a brand new SQL cell within the pocket book and set the connection kind to Athena (Spark). Execute the next question to create a database for the tables on this submit.
```
CREATE DATABASE parquet_column_index_blog;
```

Create a brand new SQL cell within the pocket book and confirm the connection kind is Athena (Spark). Execute the next question to create a Hive desk pointing to the placement of the TPC-DS catalog_sales desk knowledge on the public S3 bucket.

CREATE TABLE IF NOT EXISTS parquet_column_index_blog.catalog_sales_hive (
	  cs_sold_time_sk int,
	  cs_ship_date_sk int,
	  cs_bill_customer_sk int,
	  cs_bill_cdemo_sk int,
	  cs_bill_hdemo_sk int,
	  cs_bill_addr_sk int,
	  cs_ship_customer_sk int,
	  cs_ship_cdemo_sk int,
	  cs_ship_hdemo_sk int,
	  cs_ship_addr_sk int,
	  cs_call_center_sk int,
	  cs_catalog_page_sk int,
	  cs_ship_mode_sk int,
	  cs_warehouse_sk int,
	  cs_item_sk int,
	  cs_promo_sk int,
	  cs_order_number bigint,
	  cs_quantity int,
	  cs_wholesale_cost decimal(7, 2),
	  cs_list_price decimal(7, 2),
	  cs_sales_price decimal(7, 2),
	  cs_ext_discount_amt decimal(7, 2),
	  cs_ext_sales_price decimal(7, 2),
	  cs_ext_wholesale_cost decimal(7, 2),
	  cs_ext_list_price decimal(7, 2),
	  cs_ext_tax decimal(7, 2),
	  cs_coupon_amt decimal(7, 2),
	  cs_ext_ship_cost decimal(7, 2),
	  cs_net_paid decimal(7, 2),
	  cs_net_paid_inc_tax decimal(7, 2),
	  cs_net_paid_inc_ship decimal(7, 2),
	  cs_net_paid_inc_ship_tax decimal(7, 2),
	  cs_net_profit decimal(7, 2))
	USING parquet
	PARTITIONED BY (cs_sold_date_sk int)
	LOCATION 's3://blogpost-sparkoneks-us-east-1/weblog/BLOG_TPCDS-TEST-3T-partitioned/catalog_sales/'
	TBLPROPERTIES (
	  'parquet.compression'='SNAPPY'
	);

Create a brand new SQL cell within the pocket book and confirm the connection kind is Athena (Spark). Execute the next question so as to add the Hive partitions to the AWS Glue metadata.
```
MSCK REPAIR TABLE parquet_column_index_blog.catalog_sales_hive;
```
Create a brand new SQL cell within the pocket book and confirm the connection kind is Athena (Spark). Change s3://amzn-s3-demo-bucket/athena_parquet_column_index_blog/catalog_sales/ with the S3 URI the place you wish to retailer your Iceberg desk knowledge, then execute the next question to create the catalog_sales Iceberg desk from the Hive desk.
```
CREATE TABLE parquet_column_index_blog.catalog_sales
	USING iceberg
	PARTITIONED BY (cs_sold_date_sk)
	LOCATION 's3://amzn-s3-demo-bucket/athena_parquet_column_index_blog/catalog_sales/'
	AS
	SELECT * FROM parquet_column_index_blog.catalog_sales_hive;
```
Create a brand new SQL cell within the pocket book and confirm the connection kind is Athena (Spark). Execute the next question to delete the catalog_sales_hive desk, which was solely wanted to create the catalog_sales Iceberg desk.
```
DROP TABLE parquet_column_index_blog.catalog_sales_hive;
```

Run an Athena question with out Parquet Column Indexes

After creating the catalog_sales Iceberg desk within the previous steps, we run a easy question that analyzes transport delays of the highest 10 most ordered gadgets. Such a evaluation might be crucial for ecommerce and retail operations. By figuring out which fashionable gadgets expertise the best delays, success groups can focus sources the place they matter most. For instance, you’ll be able to modify stock placement, change warehouse assignments, or deal with provider points. Moreover, fashionable gadgets with vital transport delays usually tend to lead to order cancellations or returns, so proactively figuring out these points helps shield income.

SELECT cs_item_sk,
    SUM(cs_quantity) as total_orders,
    AVG(cs_ship_date_sk - cs_sold_date_sk) as avg_ship_delay_days,
    MIN(cs_ship_date_sk - cs_sold_date_sk) as min_ship_delay,
    MAX(cs_ship_date_sk - cs_sold_date_sk) as max_ship_delay,
    SUM(
        CASE
            WHEN cs_ship_date_sk - cs_sold_date_sk > 7 THEN 1 ELSE 0
        END
    ) as late_shipments,
    SUM(
        CASE
            WHEN cs_ship_date_sk - cs_sold_date_sk > 7 THEN 1 ELSE 0
        END
    ) * 100.0 / COUNT(*) as late_shipment_pct,
    AVG(cs_ext_ship_cost) as avg_shipping_cost
FROM parquet_column_index_blog.catalog_sales
WHERE cs_item_sk IN (
        SELECT cs_item_sk
        FROM parquet_column_index_blog.catalog_sales
        WHERE cs_item_sk IS NOT NULL
        GROUP BY cs_item_sk
        ORDER BY SUM(cs_quantity) DESC
        LIMIT 10
    )
    AND cs_ship_date_sk IS NOT NULL
    AND cs_sold_date_sk IS NOT NULL
GROUP BY cs_item_sk
ORDER BY avg_ship_delay_days DESC;

Moreover, this question is an effective candidate for demonstrating the effectiveness of utilizing Parquet Column Indexes as a result of it has a selective filter predicate on a single column cs_item_sk. When Athena executes this question, it first identifies row teams whose min/max ranges overlap with the highest 10 most ordered gadgets. With out utilizing Parquet Column Indexes, Athena has to scan each web page of knowledge inside these matched row teams. Nonetheless, when utilizing Parquet Column Indexes, Athena can prune knowledge additional by skipping particular person pages inside these row teams whose min/max ranges don’t overlap with the ids. Full the next steps to ascertain baseline question efficiency when Athena doesn’t use Parquet Column Indexes throughout the question.

Create a brand new Python cell within the pocket book. Change s3://amzn-s3-demo-bucket/athena_parquet_column_index_blog/query_results/ with the S3 URI the place you wish to retailer your Athena question outcomes, then execute the next script. Observe the runtime and bytes scanned that will likely be printed. The script will run the question 5 instances with question consequence reuse disabled and chooses the minimal runtime and the corresponding bytes scanned amongst these iterations. See our numbers within the Run Athena question with Parquet Column Indexes part.

import boto3
import time

# Configuration
DATABASE = "parquet_column_index_blog"
OUTPUT_LOCATION = "s3://amzn-s3-demo-bucket/athena_parquet_column_index_blog/query_results/"

def run_athena_query(question: str, database: str, output_location: str):
    athena_client = boto3.consumer('athena')
    
    response = athena_client.start_query_execution(
        QueryString=question,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={'OutputLocation': output_location}
    )
    
    query_execution_id = response['QueryExecutionId']
    
    whereas True:
        consequence = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
        state = consequence['QueryExecution']['Status']['State']
        
        if state in ['SUCCEEDED', 'FAILED', 'CANCELLED']:
            break
        
        time.sleep(5)
    
    if state != 'SUCCEEDED':
        increase Exception(f"Question failed with state: {state}")
    
    stats = consequence['QueryExecution']['Statistics']
    
    return {
        'execution_time_sec': stats['EngineExecutionTimeInMillis'] / 1000,
        'data_scanned_gb': stats['DataScannedInBytes'] / (1024 ** 3)
    }


def benchmark_query(question: str, database: str, output_location: str, num_runs: int = 5):
    outcomes = []
    
    for i in vary(num_runs):
        stats = run_athena_query(question, database, output_location)
        outcomes.append(stats)
    
    best_run = min(outcomes, key=lambda r: r['execution_time_sec'])
    
    execution_time = spherical(best_run['execution_time_sec'], 1)
    data_scanned = spherical(best_run['data_scanned_gb'], 1)
    
    print(f"Execution time: {execution_time} sec")
    print(f"Knowledge scanned: {data_scanned} GB")


QUERY = """
SELECT cs_item_sk,
    SUM(cs_quantity) as total_orders,
    AVG(cs_ship_date_sk - cs_sold_date_sk) as avg_ship_delay_days,
    MIN(cs_ship_date_sk - cs_sold_date_sk) as min_ship_delay,
    MAX(cs_ship_date_sk - cs_sold_date_sk) as max_ship_delay,
    SUM(
        CASE
            WHEN cs_ship_date_sk - cs_sold_date_sk > 7 THEN 1 ELSE 0
        END
    ) as late_shipments,
    SUM(
        CASE
            WHEN cs_ship_date_sk - cs_sold_date_sk > 7 THEN 1 ELSE 0
        END
    ) * 100.0 / COUNT(*) as late_shipment_pct,
    AVG(cs_ext_ship_cost) as avg_shipping_cost
FROM parquet_column_index_blog.catalog_sales
WHERE cs_item_sk IN (
        SELECT cs_item_sk
        FROM parquet_column_index_blog.catalog_sales
        WHERE cs_item_sk IS NOT NULL
        GROUP BY cs_item_sk
        ORDER BY SUM(cs_quantity) DESC
        LIMIT 10
    )
    AND cs_ship_date_sk IS NOT NULL
    AND cs_sold_date_sk IS NOT NULL
GROUP BY cs_item_sk
ORDER BY avg_ship_delay_days DESC;
"""

# Run benchmark
benchmark_query(QUERY, DATABASE, OUTPUT_LOCATION, num_runs=5)

Kind the catalog_sales desk

Earlier than rerunning the question with Athena utilizing Parquet Column Indexes, you should type the catalog_sales desk by the cs_item_sk column. Within the previous question, there’s a dynamic filter as a subquery on the cs_item_sk column:

cs_item_sk IN (
        SELECT cs_item_sk
        FROM parquet_column_index_blog.catalog_sales
        WHERE cs_item_sk IS NOT NULL
        GROUP BY cs_item_sk
        ORDER BY SUM(cs_quantity) DESC
        LIMIT 10
    )

When executing this question, Athena pushes down the filter predicate to the info supply stage, fetching solely rows that match the highest 10 most ordered gadgets. To maximise web page pruning with Parquet Column Indexes, rows with the identical cs_item_sk values must be saved close to one another within the Parquet file. With out sorting, matching values might be scattered throughout many pages, forcing Athena to learn extra knowledge. Sorting the desk by cs_item_sk clusters comparable values collectively, enabling Athena to learn fewer pages.

Let’s look at the Parquet Column Indexes in one of many Parquet recordsdata to know how the info within the catalog_sales desk is at present organized. First, obtain the Parquet file from the cs_sold_date_sk = 2450815 partition and set up the open-source parquet-cli instrument in your native machine. Change with the trail to the downloaded Parquet file, then run the next command in your native machine:

This shows Parquet Column Indexes for all columns. For brevity, solely the primary 11 pages of the cs_item_sk column from the primary row group are proven within the following instance:

row-group 0:
column index for column cs_item_sk:
Boundary order: UNORDERED
         null_count  min  max
page-0            0    4  359989
page-1            0    2  359996
page-2            0   10  359995
page-3            0   13  359996
page-4            0   22  359989
page-5            0   25  359984
page-6            0   13  359989
page-7            0   56  359990
page-8            0   14  359984
page-9            0    7  359978
page-10           0    1  359998

Discover that just about each web page comprises a variety of values. This overlap means Athena can not eradicate pages when filtering with Parquet Column Indexes on cs_item_sk. For instance, looking for cs_item_sk = 100 requires scanning every of the 11 pages as a result of the worth 100 falls inside each web page’s min/max vary. With this overlap, enabling Athena to make use of Parquet Column Indexes would supply no efficiency profit. Sorting the info by cs_item_sk eliminates this overlap, creating distinct, non-overlapping ranges for every web page. To make Parquet Column Indexes simpler, type the desk by finishing the next step:

Create a brand new SQL cell within the pocket book and confirm the connection kind is Athena (Spark). Execute the question to type the cs_item_sk column values of the catalog_sales desk in ascending order and to place all of the null values in the previous few Parquet pages. New Iceberg knowledge recordsdata will likely be generated from this question.
```
CALL spark_catalog.system.rewrite_data_files(
desk => 'parquet_column_index_blog.catalog_sales', 
technique => 'type', 
sort_order => 'cs_item_sk ASC NULLS LAST', 
choices => map('target-file-size-bytes', '1073741824', 
'rewrite-all', 'true', 'max-concurrent-file-group-rewrites', '200'));
```

Operating the parquet column-index command on the sorted knowledge file from the cs_sold_date_sk = 2450815 partition reveals that the Parquet Column Indexes at the moment are sorted and have non-overlapping ranges. The primary 11 pages of the cs_item_sk column from the primary row group are proven within the following instance:

row-group 0:
column index for column cs_item_sk:
Boundary order: ASCENDING
         null_count  min    max
page-0           0      1   5282
page-1           0   5282  10556
page-2           0  10556  15842
page-3           0  15842  21154
page-4           0  21154  26434
page-5           0  26434  31669
page-6           0  31669  36916
page-7           0  36916  42205
page-8           0  42205  47528
page-9           0  47528  52808
page-10          0  52808  58189

Now when looking for cs_item_sk = 100, Athena solely must learn page-0, skipping the remaining 10 pages completely.

Run Athena question with Parquet Column Indexes

Now that the info is sorted to eradicate overlapping pages inside the row teams for the cs_item_sk column, we run two experiments on the sorted knowledge. The primary measures the influence of sorting alone, and the second measures the mixed impact of sorting with Parquet Column Indexes.

Create a brand new Python cell within the pocket book. Execute the identical script within the part Run Athena question with out Parquet Column Indexes and pay attention to the question runtime and bytes scanned outcomes. This measures the efficiency of querying sorted knowledge with out utilizing Parquet Column Indexes.

Create a brand new Python cell within the pocket book. Execute the next Python script to set the use_iceberg_parquet_column_index desk property to true for the catalog_sales desk within the AWS Glue Knowledge Catalog.

import boto3

def add_iceberg_parquet_column_index(database_name: str, table_name: str):
    glue_client = boto3.consumer('glue')
    
    # Get present desk definition
    response = glue_client.get_table(DatabaseName=database_name, Title=table_name)
    desk = response['Table']
    
    # Construct TableInput with solely allowed fields
    table_input = {'Title': desk['Name']}
    
    allowed_fields = [
        'Description', 'Owner', 'LastAccessTime', 'LastAnalyzedTime',
        'Retention', 'StorageDescriptor', 'PartitionKeys', 'ViewOriginalText',
        'ViewExpandedText', 'TableType', 'Parameters', 'TargetTable'
    ]
    
    for subject in allowed_fields:
        if subject in desk:
            table_input[field] = desk[field]
    
    # Add the property
    if 'Parameters' not in table_input:
        table_input['Parameters'] = {}
    table_input['Parameters']['use_iceberg_parquet_column_index'] = 'true'
    
    # Replace the desk
    glue_client.update_table(DatabaseName=database_name, TableInput=table_input)

# Utilization
add_iceberg_parquet_column_index("parquet_column_index_blog", "catalog_sales")

Create a brand new Python cell within the pocket book. Execute the identical script within the part Run Athena question with out Parquet Column Indexes and pay attention to the question runtime and bytes scanned outcomes. This measures the efficiency of querying sorted knowledge utilizing Parquet Column Indexes.

Athena question time and bytes scanned enchancment

The next desk summarizes the outcomes from every experiment. The share enhancements for the sorted experiments are measured towards the unsorted baseline.

Experiment	Runtime (sec)	Bytes Scanned (GB)
Unsorted with out Parquet Column Indexes	20.6	45.2
Sorted with out Parquet Column Indexes	15.4 (25.2% sooner)	27.8 (38.5% fewer bytes)
Sorted with Parquet Column Indexes	10.3 (50.0% sooner)	13.0 (71.2% fewer bytes)

Suggestions

To maximise Athena’s potential to make use of Parquet Column Indexes and obtain optimum question efficiency, we suggest the next.

Kind knowledge by often filtered columns. This enables Athena to effectively learn Parquet Column Indexes and skip irrelevant pages, doubtlessly lowering scan time. When knowledge is sorted by a filter column, comparable values are clustered collectively inside pages. As a result of Parquet Column Indexes retailer min/max values for every web page, Athena can shortly decide which pages include matching values and skip the remainder.
Kind knowledge by high-cardinality columns. This creates distinct worth ranges between pages, maximizing the chance for Athena to skip pages throughout question execution. Excessive-cardinality (many distinct values) columns produce non-overlapping min/max ranges throughout pages, permitting Athena to extra successfully filter out irrelevant pages. In distinction, low-cardinality columns reminiscent of boolean or standing fields lead to overlapping ranges throughout many pages, lowering the variety of skipped pages.

Clear up

When you will have completed the steps on this submit, full the next cleanup actions to keep away from incurring ongoing prices:

Create a brand new SQL cell within the pocket book and set the connection kind to Athena (Spark). Execute the next command to drop the parquet_column_index_blog database and the catalog_sales desk.
```
DROP DATABASE parquet_column_index_blog CASCADE;
```
Delete the Iceberg desk knowledge and the Athena question outcomes out of your S3 bucket.
Delete the SageMaker Unified Studio IAM-based area whether it is now not wanted.

Conclusion

On this submit, we confirmed you ways Athena makes use of Parquet Column Indexes to hurry up queries and scale back the variety of bytes scanned. By utilizing Parquet Column Indexes, Athena can skip irrelevant knowledge pages to enhance question efficiency, particularly for queries with selective filters on sorted knowledge. Check with Optimize Iceberg tables to study extra about this function and check out it out by yourself queries.

Find out how to use Parquet Column Indexes with Amazon Athena

Overview of Parquet Column Indexes

Find out how to use Parquet Column Indexes with Athena

Measuring Athena efficiency good points when utilizing Parquet Column Indexes

Stipulations

Create catalog_sales Iceberg desk

Run an Athena question with out Parquet Column Indexes

Kind the catalog_sales desk

Run Athena question with Parquet Column Indexes

Athena question time and bytes scanned enchancment

Suggestions

Clear up

Conclusion

In regards to the Creator

This Researcher Trains Robots to Make Educated Guesses

Donald Trump’s White Home UFC Occasion Would Be Embarrassing Wherever

Deloitte Japan Advances Safety Operations with Cisco Basis AI’s Open-Supply Mannequin

Was “Tik-Tok of Oz” the First Clever Robotic to Seem in Literature?

CrankGPT Is Assured to Make You Cranky

From Intelligence to Motion: Operationalizing MS-ISAC Risk Knowledge Throughout SLED Environments

UrbanV and Japan Airport Consultants (JAC) announce a strategicpartnership to develop AAM in Japan and past – sUAS Information

New Boson SX8 Brings Excessive-Decision Thermal Imaging to NDAA-Compliant Drone Payloads

The Mannequin Everybody Stated Could not Exist Is Now Accessible to Everybody |

The best way to Generate AI Movies utilizing Gemini

Claude AI coaching: Study prompting, real-world workflows & extra

The Mannequin Everybody Stated Could not Exist Is Now Accessible to Everybody |