Amazon Redshift supports querying data stored in Apache Iceberg tables managed by Amazon S3 Tables, which we previously covered in our getting started blog post. While that post helps you get started using Amazon Redshift with Amazon S3 Tables, there are additional steps to consider when working with your data in production environments, including who has access to your data and with what level of permissions.
In this post, we’ll build on the first post in this series to show you how to set up an Apache Iceberg data lake catalog using Amazon S3 Tables and provide different levels of access control to your data. Through this example, you’ll set up fine-grained access controls for multiple users and see how this works using Amazon Redshift. We’ll also review an example of querying data that resides in both Amazon Redshift and Amazon S3 Tables simultaneously, enabling a unified analytics experience.
Solution overview
In this solution, we show how to query a dataset stored in Amazon S3 Tables and analyze it together with data managed in Amazon Redshift. Specifically, we go through the steps shown in the following figure to load a dataset into Amazon S3 Tables, grant appropriate permissions, and finally run queries to analyze our dataset for trends and insights.
In this post, you walk through the following steps:
- Creating an Amazon S3 Table bucket: In the AWS Management Console for Amazon S3, create an Amazon S3 Table bucket and integrate it with other AWS analytics services
- Creating an S3 Table and loading data: Run Spark SQL in Amazon EMR to create a namespace and an S3 Table, then load diabetic patients’ visit data
- Granting permissions: Grant fine-grained access controls in AWS Lake Formation
- Running SQL analytics: Query S3 Tables using the automatically mounted S3 Tables catalog
This post uses data from a healthcare use case to analyze information about diabetic patients and identify the frequency of age groups admitted to the hospital. You’ll use the preceding steps to perform this analysis.
Prerequisites
To begin, you need to add the Amazon Redshift service-linked role (AWSServiceRoleForRedshift) as a read-only administrator in Lake Formation. You can run the following AWS Command Line Interface (AWS CLI) command to add the role. Replace <account-id> with your account number and <aws-region> with the AWS Region that you are using. You can run this command from AWS CloudShell or through the AWS CLI configured in your environment.
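The following is a minimal sketch of that command. Note that put-data-lake-settings replaces the entire Lake Formation settings document, so if your account already has administrators configured, first retrieve the current settings and merge the ReadOnlyAdmins entry into them instead of sending only the snippet below:

```bash
# Inspect current settings first; put-data-lake-settings replaces the whole document
aws lakeformation get-data-lake-settings --region <aws-region>

# Register the Redshift service-linked role as a Lake Formation read-only administrator
aws lakeformation put-data-lake-settings \
  --region <aws-region> \
  --data-lake-settings '{
    "ReadOnlyAdmins": [
      {
        "DataLakePrincipalIdentifier": "arn:aws:iam::<account-id>:role/aws-service-role/redshift.amazonaws.com/AWSServiceRoleForRedshift"
      }
    ]
  }'
```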
You also need to create or use an existing Amazon Elastic Compute Cloud (Amazon EC2) key pair that will be used for SSH connections to cluster instances. For more information, see Amazon EC2 key pairs.
The examples in this post rely on several AWS services and features. The CloudFormation template that follows creates the following resources:
- An Amazon EMR 7.6.0 cluster with Apache Iceberg packages
- An Amazon Redshift Serverless instance
- An AWS Identity and Access Management (IAM) instance profile, service role, and security groups
- IAM roles with required policies
- Two IAM users: nurse and analyst
Download the CloudFormation template, or use the Launch Stack button to deploy it directly in your AWS environment. Note that network routes are directed to 255.255.255.255/32 for security reasons. Replace the routes with your organization’s IP addresses. Also enter your IP or VPN range for Jupyter Notebook access in the SourceCidrForNotebook parameter in CloudFormation.
Download the diabetic encounters and patient datasets and upload them to your S3 bucket. These files are from a publicly available open dataset.
This sample dataset is used to highlight this use case; the techniques covered can be adapted to your own workflows. The following are more details about this dataset:
- diabetic_encounters_s3.csv: Contains information about patient visits for diabetic treatment.
  - encounter_id: Unique number to refer to an encounter with a patient who has diabetes.
  - patient_nbr: Unique number to identify a patient.
  - num_procedures: Number of medical procedures administered.
  - num_medications: Number of medications provided during the visit.
  - insulin: Insulin level observed. Valid values are steady, up, and no.
  - time_in_hospital: Duration of time in hospital, in days.
  - readmitted: Whether the patient was readmitted to the hospital within 30 days or after 30 days.
- diabetic_patients_rs.csv: Contains patient information such as age group, gender, race, and number of visits.
  - patient_nbr: Unique number to identify a patient.
  - race: Patient’s race.
  - gender: Patient’s gender.
  - age_grp: Patient’s age group. Valid values are 0-10, 10-20, 20-30, and so on.
  - number_outpatient: Number of outpatient visits.
  - number_emergency: Number of emergency room visits.
  - number_inpatient: Number of inpatient visits.
Now that you’ve set up the prerequisites, you’re ready to connect Amazon Redshift to query Apache Iceberg data stored in Amazon S3 Tables.
Create an S3 Table bucket
Before you can use Amazon Redshift to query the data in an Amazon S3 Table, you must create an S3 Table bucket.
- Sign in to the AWS Management Console and go to Amazon S3.
- Go to Amazon S3 Table buckets. This is an option in the Amazon S3 console.
- In the Table buckets view, there’s a section that describes Integration with AWS analytics services. Choose Enable Integration if you haven’t previously set this up. This sets up the integration with AWS analytics services, including Amazon Redshift, Amazon EMR, and Amazon Athena.
- Wait a few seconds for the status to change to Enabled.
- Choose Create table bucket and enter a bucket name. You can use any name that follows the naming conventions. In this example, we used the bucket name patient-encounter. When you’re finished, choose Create table bucket.
- After the S3 Table bucket is created, you’ll be redirected to the Table buckets list. Copy the Amazon Resource Name (ARN) of the table bucket you just created to use in the next section.
Now that your S3 Table bucket is set up, you can load data.
Create an S3 Table and load data
The CloudFormation template in the prerequisites created an Apache Spark cluster using Amazon EMR. You’ll use the Amazon EMR cluster to load data into Amazon S3 Tables.
- Connect to the Apache Spark primary node using SSH or through Jupyter Notebooks. Note that an Amazon EMR cluster was launched when you deployed the CloudFormation template.
- Enter the following command to launch the Spark shell and initialize a Spark session for Iceberg that connects to your S3 Table bucket. Replace <aws-region>, <account-id>, and <bucket-name> with your AWS Region, account ID, and table bucket name.
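A sketch of the launch command, following the pattern documented in Accessing Amazon S3 Tables with Amazon EMR. The catalog name s3tablesbucket is an arbitrary choice that the SQL in the next steps reuses, and the runtime package version shown may not be current:

```bash
spark-shell \
  --packages software.amazon.s3tables:s3-tables-catalog-for-iceberg-runtime:0.1.3 \
  --conf spark.sql.catalog.s3tablesbucket=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.s3tablesbucket.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog \
  --conf spark.sql.catalog.s3tablesbucket.warehouse=arn:aws:s3tables:<aws-region>:<account-id>:bucket/<bucket-name> \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```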
See Accessing Amazon S3 Tables with Amazon EMR for the latest software.amazon.s3tables package versions.
- Next, create a namespace that will link your S3 Table bucket with your Amazon Redshift Serverless workgroup. We chose encounters as the namespace for this example, but you can use a different name. Use the SparkSQL command shown in the sketch after this list.
- Create an Apache Iceberg table named diabetic_encounters.
- Load the CSV file into the S3 Table encounters.diabetic_encounters. Replace <s3-file-path> with the Amazon S3 file path of the diabetic_encounters_s3.csv file you uploaded earlier.
- Query the data to validate it using the Spark shell.
Grant permissions
In this section, you grant fine-grained access control to the two IAM users created as part of the prerequisites.
- nurse: Grant access to all columns in the diabetic_encounters table
- analyst: Grant access to only the encounter_id, patient_nbr, and readmitted columns
First, grant access to the diabetic_encounters table for the nurse user.
- In the AWS Lake Formation console, choose Data permissions, and then choose Grant.
- On the Grant Permissions page, under Principals, select IAM users and roles.
- Select the IAM user nurse.
- For Catalogs, select <account-id>:s3tablescatalog/patient-encounter.
- For Databases, select encounters.
- Scroll down. For Tables, select diabetic_encounters.
- For Table permissions, select Select.
- For Data permissions, select All data access.
- Choose Grant. This will grant select access on all the columns in diabetic_encounters to the nurse user.
Now grant access to the diabetic_encounters table for the analyst user.
- Repeat the same steps that you followed for the nurse user up to step 7 in the previous section.
- For Data permissions, select Column-based access. Select Include columns and select the encounter_id, patient_nbr, and readmitted columns.
columns - Choose Grant. This will grant select access on the
encounter_id
,patient_nbr
, andreadmitted
columns indiabetic_encounters
to the analyst
Run SQL analytics
In this section, you will access the data in the diabetic_encounters S3 Table as the nurse and analyst users to learn how fine-grained access control works. You will also combine the S3 Table data with a local table in Amazon Redshift using a single query.
- In the Amazon Redshift Query Editor V2, connect to serverless:rs-demo-wg, an Amazon Redshift Serverless instance created by the CloudFormation template.
- Select Database user name and password as the connection method and connect using the superuser awsuser. Provide the password you gave as an input parameter to the CloudFormation stack.
- Run the commands shown in the sketch after this list to create the IAM users nurse and analyst in Amazon Redshift.
- Amazon Redshift automatically mounts the Data Catalog as an external database named awsdatacatalog to simplify accessing your tables in the Data Catalog. You can grant usage access to this database for the IAM users, as shown together in the sketch that follows:
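The following is a minimal sketch of those two steps, run as awsuser. The IAM: prefix maps a Redshift user to an IAM identity, and the user names match the IAM users created by the CloudFormation template:

```sql
-- Create Redshift users mapped to the IAM identities (no local password)
CREATE USER "IAM:nurse" PASSWORD DISABLE;
CREATE USER "IAM:analyst" PASSWORD DISABLE;

-- Let both users access the auto-mounted Data Catalog database
GRANT USAGE ON DATABASE awsdatacatalog TO "IAM:nurse";
GRANT USAGE ON DATABASE awsdatacatalog TO "IAM:analyst";
```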
For the next steps, you must first sign in to the AWS Console as the nurse IAM user. You can find the IAM user’s password in the AWS Secrets Manager console by retrieving the value from the secret ending with iam-users-credentials. See Get a secret value using the AWS console for more information.
- After you’ve signed in to the console, navigate to the Amazon Redshift Query Editor V2.
- Sign in to your Amazon Redshift cluster as IAM:nurse. You can do this by connecting to serverless:rs-demo-wg as a Federated user. This applies the permissions granted in Lake Formation for accessing your data in Amazon S3 Tables.
- Run the following SQL to query the S3 Table diabetic_encounters:
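A sketch of the query. The qualified name below (table bucket catalog as the database, namespace as the schema) is an assumption; if it doesn’t match your environment, copy the exact path from the tree view in Query Editor v2:

```sql
SELECT *
FROM "s3tablescatalog/patient-encounter"."encounters"."diabetic_encounters";
```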
This returns all the data in the S3 Table for diabetic_encounters across every column in the table, as shown in the following figure:
Recall that you also created an IAM user called analyst that only has access to the encounter_id, patient_nbr, and readmitted columns. Let’s verify that the analyst user can only access those columns.
- Sign in to the AWS console as the analyst IAM user and open the Amazon Redshift Query Editor v2 using the same steps as above. Run the same query as before:
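Repeated for convenience, with the same naming assumption as before:

```sql
SELECT *
FROM "s3tablescatalog/patient-encounter"."encounters"."diabetic_encounters";
```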
This time, you should see only the encounter_id, patient_nbr, and readmitted columns:
Now that you’ve seen how you can access data in Amazon S3 Tables from Amazon Redshift while setting the levels of access required for your users, let’s see how we can join data in S3 Tables to tables that already exist in Amazon Redshift.
Combine data from an S3 Table and a local table in Amazon Redshift
For this section, you’ll load data into your local Amazon Redshift cluster. After this is complete, you can analyze the datasets in both Amazon Redshift and S3 Tables.
- First, sign in to your Amazon Redshift cluster using Amazon Redshift Query Editor v2 as the analyst federated user.
- Create a table that contains patient information using the SQL command shown as statement 1 in the sketch after this list.
- Copy patient information from the diabetic_patients_rs.csv file that’s stored in your Amazon S3 object bucket. Replace <s3-file-path> with the location of the file in your S3 bucket (statement 2 in the sketch).
- Review the sample data to verify that the command was successful (statement 3 in the sketch). This will show information from 10 patients, as shown in the following figure.
- Now combine data from the Amazon S3 Table diabetic_encounters and the Amazon Redshift table patient_info (statement 4 in the sketch). In this example, the query fetches information about which age group was most frequently readmitted to the hospital within 30 days of an initial hospital visit:
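A sketch of these four steps in Redshift SQL. The patient_info column types are assumptions based on the dataset description earlier, <s3-file-path> stands in for your uploaded diabetic_patients_rs.csv path, the workgroup’s default IAM role is assumed to have read access to that bucket, and the filter value '<30' assumes that’s how the dataset encodes readmission within 30 days:

```sql
-- 1. Create a local Redshift table for patient information
CREATE TABLE patient_info (
  patient_nbr       BIGINT,
  race              VARCHAR(32),
  gender            VARCHAR(16),
  age_grp           VARCHAR(10),
  number_outpatient INT,
  number_emergency  INT,
  number_inpatient  INT
);

-- 2. Copy the patient data from your S3 object bucket
COPY patient_info
FROM '<s3-file-path>'
IAM_ROLE default
FORMAT AS CSV
IGNOREHEADER 1;

-- 3. Review a sample of the loaded rows
SELECT * FROM patient_info LIMIT 10;

-- 4. Join the S3 Table with the local table:
--    readmissions within 30 days, counted by age group
SELECT p.age_grp,
       COUNT(*) AS readmission_count
FROM "s3tablescatalog/patient-encounter"."encounters"."diabetic_encounters" e
JOIN patient_info p
  ON e.patient_nbr = p.patient_nbr
WHERE e.readmitted = '<30'
GROUP BY p.age_grp
ORDER BY readmission_count DESC;
```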
This query returns results showing an age group and the number of re-admissions, as shown in the following figure.
Cleanup
To clean up your resources, delete the stack you deployed using AWS CloudFormation. For instructions, see Deleting a stack on the AWS CloudFormation console.
Conclusion
In this post, you walked through an end-to-end process for setting up security and governance controls for Apache Iceberg data stored in Amazon S3 Tables and accessing it from Amazon Redshift. This includes creating S3 Tables, loading data into them, registering the tables in a data lake catalog, setting up access controls, and querying the data using Amazon Redshift. You also learned how to combine data from Amazon S3 Tables and local Amazon Redshift tables stored in Redshift Managed Storage in a single query, enabling a seamless, unified analytics experience. Try out these features and see Working with Amazon S3 Tables and table buckets for more details. We welcome your feedback in the comments section.
About the Authors
Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specializing in building enterprise data platforms, data warehousing, and analytics solutions. He has over 19 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.
Jonathan Katz is a Principal Product Manager – Technical on the Amazon Redshift team and is based in New York. He is a Core Team member of the open source PostgreSQL project and an active open source contributor, including PostgreSQL and the pgvector project.