Streamlining Table Creation with Precision: Automating Data Engineering

Kartik Bhatnagar
3 min readAug 18, 2023

--

Making Table Creation Easy: A Team Effort of AWS Services

In the realm of data engineering, the process of table creation has often been a complex and time-consuming endeavor. However, a solution emerges, powered by the synergy of AWS Glue and Amazon Redshift, that revolutionizes this process through automation, saving time and reducing the risk of errors.

Streamlining Your Automation: A New Approach

Step 1: Embracing Automation

Enter the realm of automated table creation, a process that bypasses the traditional hurdles. This innovative method leverages the capabilities of AWS Glue and Redshift to streamline and expedite table setup.

Step 2: Harnessing AWS Glue Crawler

The journey commences with AWS Glue Crawler, a tool that acts as an intelligent detective, analysing your data and crafting table definitions without requiring scripting. This negates the need for painstakingly defining data types and schema details. Create the Glue Crawler by pointing to an S3 location and also run it so all tables definition will be generated with in the Glue database my_glue_db. You can also set the advanced feature of Glue like what to do when the data type changes or new columns are added.

  1. In the AWS Glue Console, select “Crawlers” and click “Add Crawler”.
  2. Provide a name for the crawler and specify the S3 path as the data source. Use the database name as my_glue_db.
  3. Follow the prompts to complete the crawler setup.
  4. Once the setup is complete, select your crawler and click “Run Crawler” to start the data crawling process.

Step 3: Building Bridges with Redshift

The pivotal phase involves the establishment of an external database within Amazon Redshift. This bridge, known as an “external schema,” connects Glue’s organized data with Redshift’s storage infrastructure, enabling seamless data transition.

The bridge is formed through this SQL command:

CREATE EXTERNAL SCHEMA my_ext_db
FROM DATA CATALOG DATABASE 'my_glue_db'
IAM_ROLE 'arn:aws:iam:::role/my-redshift-role';

By this stage you can also use Redshift Spectrum to query all tables scanned by Glue Crawler and available in the external database — my_ext_db. If you are happy with Redshift Spectrum, no need to move to Step 4.

Step 4: Use Glue external schema to create a redshift internal table.

The following SQL command achieves this transformation and creates a Redshift Internal table for you without touching the schema bit.

CREATE TABLE my_internal_db.my_table AS SELECT * FROM my_ext_db.my_table;

Step 5: Filling Empty Data Canvas in Internal Table

With the table structure in place, the final touch involves utilizing the `INSERT INTO` command to populate it with data. This elegant command simplifies a once-cumbersome task and prepares the table for active use.

INSERT INTO my_internal_db.my_table
SELECT *
FROM source_schema.source_table;

Embracing Efficiency: A Vision of Tomorrow

In essence, this method revolutionizes the table creation process and provides data for consumption. Tedious manual scripting using YAML/SQL and schema definition are replaced by a seamless orchestration of automation. Data engineers are unburdened from intricate setup tasks and granted more time to delve into data insights.

This innovative synergy between AWS Glue and Amazon Redshift forms a bridge between precision and power, demonstrating the true potential of technology in data engineering. Automated table creation serves as a testament to the ever-evolving landscape of data management, where innovation paves the way for efficiency and excellence.

--

--

No responses yet