Introduction

This project involved building and benchmarking a scalable three-tier lakehouse solution on Google Cloud Platform (GCP) as part of a Terraform IaC DevOps Course. The course focused on applying Infrastructure as Code (IaC) principles, implementing CI/CD pipelines, and using various GCP services for big data processing and analysis.

Goals

  1. Infrastructure as Code (IaC): Provision computing resources for big data analysis using Terraform (a minimal configuration skeleton follows this list).
  2. CI/CD Pipelines: Set up automated pipelines for deploying cloud infrastructure using GitHub Actions.
  3. Security: Use linters to detect security issues in the infrastructure code and implement Workload Identity Federation for secure, keyless authentication between GitHub Actions and GCP.
  4. Big Data Processing: Run Apache Spark code in a distributed manner on a Dataproc Hadoop cluster, driven from Vertex AI notebooks.
  5. Benchmarking and Scalability: Perform benchmarking and scalability tests of the lakehouse solution using the TPC-DI benchmark, dbt, GCP Composer (managed Apache Airflow), and Dataproc.
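
As a minimal illustration of the IaC goal above, a Terraform skeleton along the following lines pins the provider version and keeps its state in a Cloud Storage bucket so that local runs and the CI pipeline share it; the remote backend is an assumed (though common) setup, and the bucket name, project ID and region are placeholders rather than the course's actual configuration.

    terraform {
      required_version = ">= 1.5"

      required_providers {
        google = {
          source  = "hashicorp/google"
          version = "~> 5.0"
        }
      }

      # Remote state in Cloud Storage so GitHub Actions and local runs share one state
      # (assumed setup; bucket name is a placeholder).
      backend "gcs" {
        bucket = "tf-state-lakehouse"
        prefix = "lakehouse"
      }
    }

    provider "google" {
      project = "my-gcp-project"   # placeholder project ID
      region  = "europe-west1"     # placeholder region
    }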

Architecture

The project implemented a three-tier lakehouse architecture on GCP, built around the following services (a Terraform sketch of several of them follows the list):

  • Vertex AI Workbench: Managed JupyterLab environment for interactive data analysis and development.
  • Dataproc: Managed Apache Spark and Hadoop service for distributed data processing.
  • Composer: Managed Apache Airflow service for orchestrating data pipelines.
  • Cloud Storage: Object storage for the lakehouse data.
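
A hedged sketch of how several of these services can be declared in Terraform; the resource names, region and image choices are placeholders rather than the course's actual modules, and the Dataproc cluster itself appears under Phase 2b, where its machine types are parameterised.

    # Object storage for the lakehouse data (bucket names are globally unique; placeholder here).
    resource "google_storage_bucket" "lakehouse" {
      name                        = "example-lakehouse-data"
      location                    = "EU"
      uniform_bucket_level_access = true
    }

    # JupyterLab environment (user-managed Vertex AI Workbench instance).
    resource "google_notebooks_instance" "workbench" {
      name         = "lakehouse-workbench"
      location     = "europe-west1-b"
      machine_type = "n1-standard-4"

      vm_image {
        project      = "deeplearning-platform-release"
        image_family = "common-cpu-notebooks"
      }
    }

    # Managed Airflow environment that orchestrates the dbt pipeline.
    resource "google_composer_environment" "orchestrator" {
      name   = "lakehouse-composer"
      region = "europe-west1"

      config {
        software_config {
          image_version = "composer-2-airflow-2"
        }
      }
    }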

Phases

Phase 1: Infrastructure Setup and CI/CD

This phase focused on setting up the GCP project, provisioning infrastructure using Terraform, and configuring CI/CD pipelines with GitHub Actions and Workload Identity Federation.
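
A hedged Terraform sketch of the Workload Identity Federation piece: a pool, an OIDC provider trusting GitHub's token issuer, and a service account that workflows from one repository are allowed to impersonate. The IDs, repository owner/name and attribute condition are placeholders, not the course's actual values.

    resource "google_iam_workload_identity_pool" "github" {
      workload_identity_pool_id = "github-pool"
    }

    resource "google_iam_workload_identity_pool_provider" "github" {
      workload_identity_pool_id          = google_iam_workload_identity_pool.github.workload_identity_pool_id
      workload_identity_pool_provider_id = "github-provider"

      # Map GitHub's OIDC token claims onto Google attributes and restrict the
      # provider to a single repository owner (placeholder value).
      attribute_mapping = {
        "google.subject"       = "assertion.sub"
        "attribute.repository" = "assertion.repository"
      }
      attribute_condition = "assertion.repository_owner == \"example-owner\""

      oidc {
        issuer_uri = "https://token.actions.githubusercontent.com"
      }
    }

    # Service account the pipeline impersonates to run terraform plan/apply.
    resource "google_service_account" "deployer" {
      account_id   = "terraform-deployer"
      display_name = "GitHub Actions deployer"
    }

    # Allow identities federated through the pool for one repository to impersonate it.
    resource "google_service_account_iam_member" "github_impersonation" {
      service_account_id = google_service_account.deployer.name
      role               = "roles/iam.workloadIdentityUser"
      member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github.name}/attribute.repository/example-owner/example-repo"
    }

On the GitHub Actions side, the google-github-actions/auth action exchanges the workflow's OIDC token against this provider, so no long-lived service account keys need to be stored in the repository.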

Phase 2a: Data Loading and Transformation

This phase involved:

  1. Using a provided Jupyter notebook to generate and load data into the lakehouse.
  2. Exploring the generated data files and understanding their format, content, and size.
  3. Analyzing the data loading process using SparkSQL (an illustrative Dataproc job sketch follows this list).
  4. Implementing data transformations using dbt and running them through the orchestrated data pipeline.
  5. Adding custom dbt tests to ensure data quality.
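
The SparkSQL analysis in step 3 was run interactively from the Workbench notebook, and the dbt models in step 4 ran through the Composer-orchestrated pipeline. Purely as an illustrative aside (not the project's actual method), a standalone Spark job can also be submitted to the same cluster declaratively; the script path and names below are placeholders.

    # Illustrative only: submits a PySpark script (placeholder path) to the existing cluster.
    resource "google_dataproc_job" "load_analysis" {
      region       = "europe-west1"
      force_delete = true

      placement {
        cluster_name = "lakehouse-cluster"   # placeholder; see the cluster sketch under Phase 2b
      }

      pyspark_config {
        main_python_file_uri = "gs://example-lakehouse-data/jobs/analyze_load.py"
      }
    }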

Phase 2b: Benchmarking and Scalability Testing

This phase focused on:

  1. Modifying the Dataproc cluster configuration to test different machine types (parameterised roughly as sketched after this list).
  2. Running dbt with varying numbers of Spark executors (1, 2, and 5).
  3. Collecting and analyzing the performance data, including total execution time and individual model execution times.
  4. Visualizing and discussing the scalability of the solution based on the collected data.
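
A hedged sketch of how the machine-type sweep in step 1 might be parameterised on the Terraform side; the variable names, defaults, region and staging bucket are illustrative, and the executor counts in step 2 were varied on the dbt/Spark side rather than here.

    variable "worker_machine_type" {
      description = "Machine type for Dataproc workers, swapped between benchmark runs"
      type        = string
      default     = "n2-standard-4"
    }

    variable "worker_count" {
      description = "Number of Dataproc worker nodes"
      type        = number
      default     = 2
    }

    resource "google_dataproc_cluster" "lakehouse" {
      name   = "lakehouse-cluster"
      region = "europe-west1"

      cluster_config {
        staging_bucket = "example-lakehouse-data"   # placeholder bucket

        master_config {
          num_instances = 1
          machine_type  = "n2-standard-4"
        }

        worker_config {
          num_instances = var.worker_count
          machine_type  = var.worker_machine_type
        }
      }
    }

Each configuration then amounts to a terraform apply with different variable values, followed by a fresh dbt run whose total and per-model timings are recorded for the comparison in steps 3 and 4.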

Technologies Used

  • Google Cloud Platform (GCP): Vertex AI, Dataproc, Composer, Cloud Storage, IAM
  • Terraform: Infrastructure as Code
  • GitHub Actions: CI/CD
  • Workload Identity Federation: Secure Authentication
  • Apache Spark: Distributed Data Processing
  • Hadoop: Distributed File System (HDFS) and Resource Management (YARN)
  • dbt: Data Build Tool
  • TPC-DI: Benchmark for Data Integration
  • Apache Airflow: Workflow Management Platform
  • JupyterLab: Interactive Development Environment
  • Python: Programming Language

Resources