In this project, I built a small data warehouse example to track how actor data changes over time, using Slowly Changing Dimensions (SCD) Type 2. The goal was to capture historical changes while keeping the data model organized, scalable, and easy to query.
Highlights:
Designed base tables and custom SQL types for storing actor information
Implemented SCD Type 2 logic to preserve full history of changes
Applied window functions and CTEs to manage complex updates
Built incremental queries to handle new, changed, and unchanged records efficiently
Demonstrated core data warehousing concepts for dimensional modeling
A practical showcase of SQL-based data engineering, with a clear focus on data history and warehouse best practices.
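To make the SCD Type 2 step concrete, here is a minimal sketch of the incremental update. The project itself implements this in plain SQL with CTEs and window functions; the version below expresses the same pattern with PySpark DataFrames, and the table and column names (actors_history_scd, actors_staging, quality_class, start_date, end_date, is_current) are hypothetical stand-ins rather than the project's exact schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2_sketch").getOrCreate()

dim = spark.table("actors_history_scd")    # dimension with full history
incoming = spark.table("actors_staging")   # latest snapshot: actor_id, quality_class

current = dim.filter(F.col("is_current"))
joined = current.alias("c").join(
    incoming.alias("n"), F.col("c.actor_id") == F.col("n.actor_id"), "full_outer"
)

# Actors whose tracked attribute changed: close the old row, open a new one.
changed = joined.filter(
    F.col("c.actor_id").isNotNull()
    & F.col("n.actor_id").isNotNull()
    & (F.col("c.quality_class") != F.col("n.quality_class"))
)
closed_old = (changed.select("c.*")
              .withColumn("end_date", F.current_date())
              .withColumn("is_current", F.lit(False)))
opened_new = (changed.select("n.*")
              .withColumn("start_date", F.current_date())
              .withColumn("end_date", F.lit(None).cast("date"))
              .withColumn("is_current", F.lit(True)))

# Actors seen for the first time get their first open row.
brand_new = (joined.filter(F.col("c.actor_id").isNull()).select("n.*")
             .withColumn("start_date", F.current_date())
             .withColumn("end_date", F.lit(None).cast("date"))
             .withColumn("is_current", F.lit(True)))

# Everything else (history plus unchanged current rows) is carried over untouched.
carried = dim.join(closed_old.select("actor_id", "start_date"),
                   ["actor_id", "start_date"], "left_anti")

new_dim = (carried.unionByName(closed_old)
                  .unionByName(opened_new)
                  .unionByName(brand_new))
new_dim.write.mode("overwrite").saveAsTable("actors_history_scd_next")  # swap in after checks
```

The idea is the same in either dialect: close the current row for any actor whose tracked attribute changed, open a new row for changed and brand-new actors, and carry everything else over as-is.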
This project highlights the design and implementation of a fact table in a data warehouse, built to analyze movie ratings and performance metrics. The focus was on keeping the table scalable, fast to query, and flexible enough to serve both detailed and aggregated analysis.
Highlights:
Designed a fact table supporting both detailed and aggregated movie analytics
Implemented incremental loading patterns for efficient data refresh
Applied partitioning and indexing strategies to boost query performance
Integrated validation checks to maintain high data quality
Demonstrated aggregation techniques for flexible reporting and insights
A solid example of fact table modeling, combining performance optimization with real-world data engineering practices.
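As an illustration of the loading pattern, here is a small sketch of an incremental, partitioned fact load with a basic validation gate and a reporting rollup. The original project expresses these steps in warehouse SQL; the PySpark version below uses hypothetical table and column names (fact_movie_ratings, stg_ratings, movie_id, rating, rating_date).

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("fact_load_sketch").getOrCreate()

staged = spark.table("stg_ratings")
fact = spark.table("fact_movie_ratings")

# Incremental load: only pick up days newer than what is already in the fact table.
max_loaded = fact.agg(F.max("rating_date").alias("d")).collect()[0]["d"]
new_rows = staged if max_loaded is None else staged.filter(F.col("rating_date") > F.lit(max_loaded))

# Validation gate before loading: no null keys, ratings within range.
bad_count = new_rows.filter(
    F.col("movie_id").isNull() | ~F.col("rating").between(0, 10)
).count()
assert bad_count == 0, "validation failed: null keys or out-of-range ratings"

# Partition by date so queries and re-loads touch only the slices they need
# (the SQL original pairs this with indexes on the join keys).
new_rows.write.mode("append").partitionBy("rating_date").saveAsTable("fact_movie_ratings")

# Aggregated rollup for reporting: rating volume and average per movie per month.
monthly = (spark.table("fact_movie_ratings")
           .groupBy("movie_id", F.date_trunc("month", "rating_date").alias("month"))
           .agg(F.avg("rating").alias("avg_rating"),
                F.count("*").alias("num_ratings")))
```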
In this project, I explored gaming data using PySpark to uncover insights about player performance, map popularity, and medal achievements. A key focus was on organizing data efficiently using bucket-based strategies, which made joins and aggregations much faster when working with large-scale datasets.
Highlights:
Optimized joins so large datasets could be combined without costly shuffles
Organized data with bucket-based strategies for faster queries
Built aggregation pipelines to analyze player stats and trends
Tuned performance for scalable, near-real-time analytics
Combined Python, PySpark, and SQL for a flexible data workflow
A fun way to bring big data techniques into the world of gaming!
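The core of the bucket-based layout is sketched below: both sides of a frequently used join are written out bucketed on the join key, so the join and the downstream aggregations can run without a full shuffle. Table and column names (raw_match_details, raw_matches, match_id, player_gamertag, mapid, player_total_kills) are hypothetical stand-ins for the gaming dataset.

```python
from pyspark.sql import SparkSession, functions as F

# Disable broadcast joins here only so the bucketed sort-merge join is what runs.
spark = (SparkSession.builder
         .appName("bucketed_gaming_sketch")
         .config("spark.sql.autoBroadcastJoinThreshold", -1)
         .getOrCreate())

# Write both sides bucketed on the join key so the join can skip the shuffle.
spark.table("raw_match_details").write \
    .bucketBy(16, "match_id").sortBy("match_id") \
    .mode("overwrite").saveAsTable("match_details_bucketed")

spark.table("raw_matches").write \
    .bucketBy(16, "match_id").sortBy("match_id") \
    .mode("overwrite").saveAsTable("matches_bucketed")

details = spark.table("match_details_bucketed")
matches = spark.table("matches_bucketed")

# Bucketed join: both tables are pre-split into the same 16 buckets on match_id.
joined = details.join(matches, "match_id")

# Aggregation pipeline: per-player performance and per-map popularity.
player_stats = joined.groupBy("player_gamertag").agg(
    F.avg("player_total_kills").alias("avg_kills"),
    F.count("match_id").alias("matches_played"),
)
map_popularity = (joined.groupBy("mapid")
                  .agg(F.countDistinct("match_id").alias("num_matches"))
                  .orderBy(F.col("num_matches").desc()))
```

Because both tables are bucketed into the same number of buckets on match_id, the sort-merge join can read matching buckets directly instead of exchanging data across the cluster, which is where most of the speed-up comes from.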
In this project, I built Spark jobs to process and analyze actor performance data across multiple years, creating an ETL pipeline that classifies actors based on movie ratings and tracks their career history. To ensure reliability, I used pytest for test-driven development, validating transformations and data quality at every step.
Highlights:
Developed Spark ETL jobs to calculate metrics and quality classifications
Optimized Spark configurations for performance and scalability
Implemented data quality checks and historical tracking
Applied window functions and complex joins for year-over-year analysis
Used pytest to test Spark jobs, ensuring robust and reproducible pipelines
A hands-on project that blends big data engineering with reliable testing practices!
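To show how the pytest side fits together, here is a compact sketch: a Spark transformation plus a unit test that runs it on a local SparkSession. The function, columns, and rating cut-offs (classify_actors, actor_id, current_year, rating, quality_class) are hypothetical, not the project's exact rules.

```python
# test_actor_quality.py -- minimal sketch of testing a Spark transformation with pytest.
import pytest
from pyspark.sql import SparkSession, functions as F


def classify_actors(films_df):
    """Average each actor's ratings for the year and bucket them into quality classes."""
    rated = films_df.groupBy("actor_id", "current_year").agg(
        F.avg("rating").alias("avg_rating")
    )
    return rated.withColumn(
        "quality_class",
        F.when(F.col("avg_rating") > 8, "star")
         .when(F.col("avg_rating") > 7, "good")
         .when(F.col("avg_rating") > 6, "average")
         .otherwise("bad"),
    )


@pytest.fixture(scope="session")
def spark():
    # Small local session shared by all tests in the run.
    return SparkSession.builder.master("local[2]").appName("tests").getOrCreate()


def test_classify_actors(spark):
    films = spark.createDataFrame(
        [("a1", 2020, 9.0), ("a1", 2020, 8.5), ("a2", 2020, 6.5)],
        ["actor_id", "current_year", "rating"],
    )
    result = {r["actor_id"]: r["quality_class"]
              for r in classify_actors(films).collect()}
    assert result == {"a1": "star", "a2": "average"}
```

Running pytest spins up a local SparkSession once per session and checks the classification against a tiny, hand-built DataFrame, which is what keeps the pipeline's transformations reproducible as the job code evolves.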