Amazon Redshift supports querying data stored in Apache Iceberg tables managed by Amazon S3 Tables, which we previously covered in our getting started blog post. While that post helps you get started using Amazon Redshift with Amazon S3 Tables, there are additional steps to consider when working with your data in production environments, including who has access to your data and with what level of permissions.
In this post, we’ll build on the first post in this series to show you how to set up an Apache Iceberg data lake catalog using Amazon S3 Tables and provide different levels of access control to your data. Through this example, you’ll set up fine-grained access controls for multiple users and see how this works using Amazon Redshift. We’ll also review an example of querying data that resides in both Amazon Redshift and Amazon S3 Tables simultaneously, enabling a unified analytics experience.
Solution overview
In this solution, we show how to query a dataset stored in Amazon S3 Tables and analyze it together with data managed in Amazon Redshift. Specifically, we go through the steps shown in the following figure to load a dataset into Amazon S3 Tables, grant appropriate permissions, and finally run queries to analyze our dataset for trends and insights.
In this post, you walk through the following steps:
- Creating an Amazon S3 Table bucket: In the AWS Management Console for Amazon S3, create an Amazon S3 Table bucket and integrate it with other AWS analytics services
- Creating an S3 Table and loading data: Run Spark SQL in Amazon EMR to create a namespace and an S3 Table, then load diabetic patients’ visit data
- Granting permissions: Grant fine-grained access controls in AWS Lake Formation
- Running SQL analytics: Query S3 Tables using the automatically mounted S3 Tables catalog
This post uses data from a healthcare use case to analyze information about diabetic patients and identify the frequency of age groups admitted to the hospital. You’ll use the preceding steps to perform this analysis.
Prerequisites
To begin, you need to add the Amazon Redshift service-linked role (AWSServiceRoleForRedshift) as a read-only administrator in Lake Formation. You can run the following AWS Command Line Interface (AWS CLI) command to add the role. Replace <account-id> with your account number and <aws-region> with the AWS Region that you are using. You can run this command from AWS CloudShell or through the AWS CLI configured in your environment.
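The following is a minimal sketch of that command. Note that put-data-lake-settings replaces the entire Lake Formation settings document, so if your account already has administrators configured, first retrieve the current settings and merge the ReadOnlyAdmins entry into them instead of sending only the snippet below:

```bash
# Inspect current settings first; put-data-lake-settings replaces the whole document
aws lakeformation get-data-lake-settings --region <aws-region>

# Register the Redshift service-linked role as a Lake Formation read-only administrator
aws lakeformation put-data-lake-settings \
  --region <aws-region> \
  --data-lake-settings '{
    "ReadOnlyAdmins": [
      {
        "DataLakePrincipalIdentifier": "arn:aws:iam::<account-id>:role/aws-service-role/redshift.amazonaws.com/AWSServiceRoleForRedshift"
      }
    ]
  }'
```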
You also need to create or use an existing Amazon Elastic Compute Cloud (Amazon EC2) key pair that will be used for SSH connections to cluster instances. For more information, see Amazon EC2 key pairs.
The examples in this post rely on several AWS services and features. The CloudFormation template that follows creates the following resources:
- An Amazon EMR 7.6.0 cluster with Apache Iceberg packages
- An Amazon Redshift Serverless instance
- An AWS Identity and Access Management (IAM) instance profile, service role, and security groups
- IAM roles with required policies
- Two IAM users: nurse and analyst
Download the CloudFormation template, or use the Launch Stack button to deploy it directly in your AWS environment. Note that network routes are directed to 255.255.255.255/32 for security reasons. Replace the routes with your organization’s IP addresses. Also enter your IP or VPN range for Jupyter Notebook access in the SourceCidrForNotebook parameter in CloudFormation.
Download the diabetic encounters and patient datasets and upload them to your S3 bucket. These files are from a publicly available open dataset.
This sample dataset is used to highlight this use case; the techniques covered can be adapted to your own workflows. The following are more details about this dataset:
- diabetic_encounters_s3.csv: Contains information about patient visits for diabetic treatment.
  - encounter_id: Unique number to refer to an encounter with a patient who has diabetes.
  - patient_nbr: Unique number to identify a patient.
  - num_procedures: Number of medical procedures administered.
  - num_medications: Number of medications provided during the visit.
  - insulin: Insulin level observed. Valid values are steady, up, and no.
  - time_in_hospital: Duration of time in hospital, in days.
  - readmitted: Whether the patient was readmitted to the hospital within 30 days or after 30 days.
- diabetic_patients_rs.csv: Contains patient information such as age group, gender, race, and number of visits.
  - patient_nbr: Unique number to identify a patient.
  - race: Patient’s race.
  - gender: Patient’s gender.
  - age_grp: Patient’s age group. Valid values are 0-10, 10-20, 20-30, and so on.
  - number_outpatient: Number of outpatient visits.
  - number_emergency: Number of emergency room visits.
  - number_inpatient: Number of inpatient visits.
Now that you’ve set up the prerequisites, you’re ready to connect Amazon Redshift to query Apache Iceberg data stored in Amazon S3 Tables.
Create an S3 Table bucket
Before you can use Amazon Redshift to query the data in an Amazon S3 Table, you must create an S3 Table bucket.
- Sign in to the AWS Management Console and go to Amazon S3.
- Go to Amazon S3 Table buckets. This is an option in the Amazon S3 console.
- In the Table buckets view, there’s a section that describes Integration with AWS analytics services. Choose Enable Integration if you haven’t previously set this up. This sets up the integration with AWS analytics services, including Amazon Redshift, Amazon EMR, and Amazon Athena.
- Wait a few seconds for the status to change to Enabled.
- Choose Create table bucket and enter a bucket name. You can use any name that follows the naming conventions. In this example, we used the bucket name patient-encounter. When you’re finished, choose Create table bucket.
- After the S3 Table bucket is created, you’ll be redirected to the Table buckets list. Copy the Amazon Resource Name (ARN) of the table bucket you just created to use in the next section.
Now that your S3 Table bucket is set up, you can load data.
Create an S3 Table and load data
The CloudFormation template in the prerequisites created an Apache Spark cluster using Amazon EMR. You’ll use the Amazon EMR cluster to load data into Amazon S3 Tables.
- Connect to the Apache Spark primary node using SSH or through Jupyter Notebooks. Note that an Amazon EMR cluster was launched when you deployed the CloudFormation template.
- Enter the following command to launch the Spark shell and initialize a Spark session for Iceberg that connects to your S3 Table bucket. Replace <aws-region>, <account-id>, and <bucket-name> with your AWS Region, account ID, and table bucket name.
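A sketch of the launch command, following the pattern documented in Accessing Amazon S3 Tables with Amazon EMR. The catalog name s3tablesbucket is an arbitrary choice that the SQL in the next steps reuses, and the runtime package version shown may not be current:

```bash
spark-shell \
  --packages software.amazon.s3tables:s3-tables-catalog-for-iceberg-runtime:0.1.3 \
  --conf spark.sql.catalog.s3tablesbucket=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.s3tablesbucket.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog \
  --conf spark.sql.catalog.s3tablesbucket.warehouse=arn:aws:s3tables:<aws-region>:<account-id>:bucket/<bucket-name> \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```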
See Accessing Amazon S3 Tables with Amazon EMR for the latest software.amazon.s3tables package versions.
- Next, create a namespace that will link your S3 Table bucket with your Amazon Redshift Serverless workgroup. We chose encounters as the namespace for this example, but you can use a different name. Use the SparkSQL command shown in the sketch after this list.
- Create an Apache Iceberg table named diabetic_encounters.
- Load the CSV file into the S3 Table encounters.diabetic_encounters. Replace <s3-file-path> with the Amazon S3 file path of the diabetic_encounters_s3.csv file you uploaded earlier.
- Query the data to validate it using the Spark shell.
Grant permissions
In this section, you grant fine-grained access control to the two IAM users created as part of the prerequisites.
- nurse: Grant access to all columns in the diabetic_encounters table
- analyst: Grant access to only the encounter_id, patient_nbr, and readmitted columns
First, grant access to the diabetic_encounters table for the nurse user.
- In the AWS Lake Formation console, choose Data permissions, and then choose Grant.
- On the Grant Permissions page, under Principals, select IAM users and roles.
- Select the IAM user nurse.
- For Catalogs, select <account-id>:s3tablescatalog/patient-encounter.
- For Databases, select encounters.
- Scroll down. For Tables, select diabetic_encounters.
- For Table permissions, select Select.
- For Data permissions, select All data access.
- Choose Grant. This will grant select access on all the columns in diabetic_encounters to the nurse user.
Now grant access to the diabetic_encounters table for the analyst user.
- Repeat the same steps that you followed for the nurse user up to step 7 in the previous section.
- For Data permissions, select Column-based access. Select Include columns and select the encounter_id, patient_nbr, and readmitted columns.
columns - Choose Grant. This will grant select access on the
encounter_id
,patient_nbr
, andreadmitted
columns indiabetic_encounters
to the analyst
Run SQL analytics
In this section, you will access the data in the diabetic_encounters S3 Table as the nurse and analyst users to learn how fine-grained access control works. You will also combine the S3 Table data with a local table in Amazon Redshift using a single query.
- In the Amazon Redshift Query Editor V2, connect to serverless:rs-demo-wg, an Amazon Redshift Serverless instance created by the CloudFormation template.
- Select Database user name and password as the connection method and connect using the superuser awsuser. Provide the password you gave as an input parameter to the CloudFormation stack.
- Run the commands shown in the sketch after this list to create the IAM users nurse and analyst in Amazon Redshift.
- Amazon Redshift automatically mounts the Data Catalog as an external database named awsdatacatalog to simplify accessing your tables in the Data Catalog. You can grant usage access to this database for the IAM users, as shown together in the sketch that follows:
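The following is a minimal sketch of those two steps, run as awsuser. The IAM: prefix maps a Redshift user to an IAM identity, and the user names match the IAM users created by the CloudFormation template:

```sql
-- Create Redshift users mapped to the IAM identities (no local password)
CREATE USER "IAM:nurse" PASSWORD DISABLE;
CREATE USER "IAM:analyst" PASSWORD DISABLE;

-- Let both users access the auto-mounted Data Catalog database
GRANT USAGE ON DATABASE awsdatacatalog TO "IAM:nurse";
GRANT USAGE ON DATABASE awsdatacatalog TO "IAM:analyst";
```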
For the next steps, you must first sign in to the AWS Console as the nurse IAM user. You can find the IAM user’s password in the AWS Secrets Manager console by retrieving the value from the secret ending with iam-users-credentials. See Get a secret value using the AWS console for more information.
- After you’ve signed in to the console, navigate to the Amazon Redshift Query Editor V2.
- Sign in to your Amazon Redshift cluster as IAM:nurse. You can do this by connecting to serverless:rs-demo-wg as a Federated user. This applies the permissions granted in Lake Formation for accessing your data in Amazon S3 Tables.
- Run the following SQL to query the S3 Table diabetic_encounters:
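A sketch of the query. The qualified name below (table bucket catalog as the database, namespace as the schema) is an assumption; if it doesn’t match your environment, copy the exact path from the tree view in Query Editor v2:

```sql
SELECT *
FROM "s3tablescatalog/patient-encounter"."encounters"."diabetic_encounters";
```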
This returns all the data in the S3 Table for diabetic_encounters across every column in the table, as shown in the following figure:
Recall that you also created an IAM user called analyst that only has access to the encounter_id, patient_nbr, and readmitted columns. Let’s verify that the analyst user can only access those columns.
- Sign in to the AWS console as the analyst IAM user and open the Amazon Redshift Query Editor v2 using the same steps as above. Run the same query as before:
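Repeated for convenience, with the same naming assumption as before:

```sql
SELECT *
FROM "s3tablescatalog/patient-encounter"."encounters"."diabetic_encounters";
```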
This time, you should see only the encounter_id, patient_nbr, and readmitted columns:
Now that you’ve seen how you can access data in Amazon S3 Tables from Amazon Redshift while setting the levels of access required for your users, let’s see how we can join data in S3 Tables to tables that already exist in Amazon Redshift.
Combine data from an S3 Table and a local table in Amazon Redshift
For this section, you’ll load data into your local Amazon Redshift cluster. After this is complete, you can analyze the datasets in both Amazon Redshift and S3 Tables.
- First, sign in to your Amazon Redshift cluster using Amazon Redshift Query Editor v2 as the analyst federated user.
- Create a table that contains patient information using the SQL command shown as statement 1 in the sketch after this list.
- Copy patient information from the diabetic_patients_rs.csv file that’s stored in your Amazon S3 object bucket. Replace <s3-file-path> with the location of the file in your S3 bucket (statement 2 in the sketch).
- Review the sample data to verify that the command was successful (statement 3 in the sketch). This will show information from 10 patients, as shown in the following figure.
- Now combine data from the Amazon S3 Table diabetic_encounters and the Amazon Redshift table patient_info (statement 4 in the sketch). In this example, the query fetches information about which age group was most frequently readmitted to the hospital within 30 days of an initial hospital visit:
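A sketch of these four steps in Redshift SQL. The patient_info column types are assumptions based on the dataset description earlier, <s3-file-path> stands in for your uploaded diabetic_patients_rs.csv path, the workgroup’s default IAM role is assumed to have read access to that bucket, and the filter value '<30' assumes that’s how the dataset encodes readmission within 30 days:

```sql
-- 1. Create a local Redshift table for patient information
CREATE TABLE patient_info (
  patient_nbr       BIGINT,
  race              VARCHAR(32),
  gender            VARCHAR(16),
  age_grp           VARCHAR(10),
  number_outpatient INT,
  number_emergency  INT,
  number_inpatient  INT
);

-- 2. Copy the patient data from your S3 object bucket
COPY patient_info
FROM '<s3-file-path>'
IAM_ROLE default
FORMAT AS CSV
IGNOREHEADER 1;

-- 3. Review a sample of the loaded rows
SELECT * FROM patient_info LIMIT 10;

-- 4. Join the S3 Table with the local table:
--    readmissions within 30 days, counted by age group
SELECT p.age_grp,
       COUNT(*) AS readmission_count
FROM "s3tablescatalog/patient-encounter"."encounters"."diabetic_encounters" e
JOIN patient_info p
  ON e.patient_nbr = p.patient_nbr
WHERE e.readmitted = '<30'
GROUP BY p.age_grp
ORDER BY readmission_count DESC;
```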
This query returns results showing an age group and the number of re-admissions, as shown in the following figure.
Cleanup
To clean up your resources, delete the stack you deployed using AWS CloudFormation. For instructions, see Deleting a stack on the AWS CloudFormation console.
Conclusion
In this post, you walked through an end-to-end process for setting up security and governance controls for Apache Iceberg data stored in Amazon S3 Tables and accessing it from Amazon Redshift. This includes creating S3 Tables, loading data into them, registering the tables in a data lake catalog, setting up access controls, and querying the data using Amazon Redshift. You also learned how to combine data from Amazon S3 Tables and local Amazon Redshift tables stored in Redshift Managed Storage in a single query, enabling a seamless, unified analytics experience. Try out these features and see Working with Amazon S3 Tables and table buckets for more details. We welcome your feedback in the comments section.
About the Authors
Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specializing in building enterprise data platforms, data warehousing, and analytics solutions. He has over 19 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.
Jonathan Katz is a Principal Product Manager – Technical on the Amazon Redshift team and is based in New York. He is a Core Team member of the open source PostgreSQL project and an active open source contributor, including PostgreSQL and the pgvector project.