Introduction

This project involved building and benchmarking a scalable three-tier lakehouse solution on Google Cloud Platform (GCP) as part of a Terraform IaC DevOps Course. The course focused on applying Infrastructure as Code (IaC) principles, implementing CI/CD pipelines, and using various GCP services for big data processing and analysis.

Goals

  1. Infrastructure as Code (IaC): Provision computing resources for big data analysis using Terraform (a minimal configuration skeleton follows this list).
  2. CI/CD Pipelines: Set up automated pipelines for deploying cloud infrastructure using GitHub Actions.
  3. Security: Use linters to detect security issues in the infrastructure code and implement Workload Identity Federation for secure, keyless authentication between GitHub Actions and GCP.
  4. Big Data Processing: Run Apache Spark code in a distributed manner on a Dataproc Hadoop cluster, driven from Vertex AI notebooks.
  5. Benchmarking and Scalability: Perform benchmarking and scalability tests of the lakehouse solution using the TPC-DI benchmark, dbt, GCP Composer (managed Apache Airflow), and Dataproc.
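
As a minimal illustration of the IaC goal above, a Terraform skeleton along the following lines pins the provider version and keeps its state in a Cloud Storage bucket so that local runs and the CI pipeline share it; the remote backend is an assumed (though common) setup, and the bucket name, project ID and region are placeholders rather than the course's actual configuration.

    terraform {
      required_version = ">= 1.5"

      required_providers {
        google = {
          source  = "hashicorp/google"
          version = "~> 5.0"
        }
      }

      # Remote state in Cloud Storage so GitHub Actions and local runs share one state
      # (assumed setup; bucket name is a placeholder).
      backend "gcs" {
        bucket = "tf-state-lakehouse"
        prefix = "lakehouse"
      }
    }

    provider "google" {
      project = "my-gcp-project"   # placeholder project ID
      region  = "europe-west1"     # placeholder region
    }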

Architecture

The project implemented a three-tier lakehouse architecture on GCP, built around the following services (a Terraform sketch of several of them follows the list):

  • Vertex AI Workbench: Managed JupyterLab environment for interactive data analysis and development.
  • Dataproc: Managed Apache Spark and Hadoop service for distributed data processing.
  • Composer: Managed Apache Airflow service for orchestrating data pipelines.
  • Cloud Storage: Object storage for the lakehouse data.
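
A hedged sketch of how several of these services can be declared in Terraform; the resource names, region and image choices are placeholders rather than the course's actual modules, and the Dataproc cluster itself appears under Phase 2b, where its machine types are parameterised.

    # Object storage for the lakehouse data (bucket names are globally unique; placeholder here).
    resource "google_storage_bucket" "lakehouse" {
      name                        = "example-lakehouse-data"
      location                    = "EU"
      uniform_bucket_level_access = true
    }

    # JupyterLab environment (user-managed Vertex AI Workbench instance).
    resource "google_notebooks_instance" "workbench" {
      name         = "lakehouse-workbench"
      location     = "europe-west1-b"
      machine_type = "n1-standard-4"

      vm_image {
        project      = "deeplearning-platform-release"
        image_family = "common-cpu-notebooks"
      }
    }

    # Managed Airflow environment that orchestrates the dbt pipeline.
    resource "google_composer_environment" "orchestrator" {
      name   = "lakehouse-composer"
      region = "europe-west1"

      config {
        software_config {
          image_version = "composer-2-airflow-2"
        }
      }
    }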

Phases

Phase 1: Infrastructure Setup and CI/CD

This phase focused on setting up the GCP project, provisioning infrastructure using Terraform, and configuring CI/CD pipelines with GitHub Actions and Workload Identity Federation.
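
A hedged Terraform sketch of the Workload Identity Federation piece: a pool, an OIDC provider trusting GitHub's token issuer, and a service account that workflows from one repository are allowed to impersonate. The IDs, repository owner/name and attribute condition are placeholders, not the course's actual values.

    resource "google_iam_workload_identity_pool" "github" {
      workload_identity_pool_id = "github-pool"
    }

    resource "google_iam_workload_identity_pool_provider" "github" {
      workload_identity_pool_id          = google_iam_workload_identity_pool.github.workload_identity_pool_id
      workload_identity_pool_provider_id = "github-provider"

      # Map GitHub's OIDC token claims onto Google attributes and restrict the
      # provider to a single repository owner (placeholder value).
      attribute_mapping = {
        "google.subject"       = "assertion.sub"
        "attribute.repository" = "assertion.repository"
      }
      attribute_condition = "assertion.repository_owner == \"example-owner\""

      oidc {
        issuer_uri = "https://token.actions.githubusercontent.com"
      }
    }

    # Service account the pipeline impersonates to run terraform plan/apply.
    resource "google_service_account" "deployer" {
      account_id   = "terraform-deployer"
      display_name = "GitHub Actions deployer"
    }

    # Allow identities federated through the pool for one repository to impersonate it.
    resource "google_service_account_iam_member" "github_impersonation" {
      service_account_id = google_service_account.deployer.name
      role               = "roles/iam.workloadIdentityUser"
      member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github.name}/attribute.repository/example-owner/example-repo"
    }

On the GitHub Actions side, the google-github-actions/auth action exchanges the workflow's OIDC token against this provider, so no long-lived service account keys need to be stored in the repository.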

Phase 2a: Data Loading and Transformation

This phase involved:

  1. Using a provided Jupyter notebook to generate and load data into the lakehouse.
  2. Exploring the generated data files and understanding their format, content, and size.
  3. Analyzing the data loading process using SparkSQL (an illustrative Dataproc job sketch follows this list).
  4. Implementing data transformations using dbt and running them through the orchestrated data pipeline.
  5. Adding custom dbt tests to ensure data quality.
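
The SparkSQL analysis in step 3 was run interactively from the Workbench notebook, and the dbt models in step 4 ran through the Composer-orchestrated pipeline. Purely as an illustrative aside (not the project's actual method), a standalone Spark job can also be submitted to the same cluster declaratively; the script path and names below are placeholders.

    # Illustrative only: submits a PySpark script (placeholder path) to the existing cluster.
    resource "google_dataproc_job" "load_analysis" {
      region       = "europe-west1"
      force_delete = true

      placement {
        cluster_name = "lakehouse-cluster"   # placeholder; see the cluster sketch under Phase 2b
      }

      pyspark_config {
        main_python_file_uri = "gs://example-lakehouse-data/jobs/analyze_load.py"
      }
    }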

Phase 2b: Benchmarking and Scalability Testing

This phase focused on:

  1. Modifying the Dataproc cluster configuration to test different machine types (parameterised roughly as sketched after this list).
  2. Running dbt with varying numbers of Spark executors (1, 2, and 5).
  3. Collecting and analyzing the performance data, including total execution time and individual model execution times.
  4. Visualizing and discussing the scalability of the solution based on the collected data.
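
A hedged sketch of how the machine-type sweep in step 1 might be parameterised on the Terraform side; the variable names, defaults, region and staging bucket are illustrative, and the executor counts in step 2 were varied on the dbt/Spark side rather than here.

    variable "worker_machine_type" {
      description = "Machine type for Dataproc workers, swapped between benchmark runs"
      type        = string
      default     = "n2-standard-4"
    }

    variable "worker_count" {
      description = "Number of Dataproc worker nodes"
      type        = number
      default     = 2
    }

    resource "google_dataproc_cluster" "lakehouse" {
      name   = "lakehouse-cluster"
      region = "europe-west1"

      cluster_config {
        staging_bucket = "example-lakehouse-data"   # placeholder bucket

        master_config {
          num_instances = 1
          machine_type  = "n2-standard-4"
        }

        worker_config {
          num_instances = var.worker_count
          machine_type  = var.worker_machine_type
        }
      }
    }

Each configuration then amounts to a terraform apply with different variable values, followed by a fresh dbt run whose total and per-model timings are recorded for the comparison in steps 3 and 4.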

Technologies Used

  • Google Cloud Platform (GCP): Vertex AI, Dataproc, Composer, Cloud Storage, IAM
  • Terraform: Infrastructure as Code
  • GitHub Actions: CI/CD
  • Workload Identity Federation: Secure Authentication
  • Apache Spark: Distributed Data Processing
  • Hadoop: Distributed File System (HDFS) and Resource Management (YARN)
  • dbt: Data Build Tool
  • TPC-DI: Benchmark for Data Integration
  • Apache Airflow: Workflow Management Platform
  • JupyterLab: Interactive Development Environment
  • Python: Programming Language

Resources