DATAENGINEER101.COM

Essential Database Concepts (9 Cards)

SQL
Structured Query Language. Used for querying and managing data in relational database management systems (RDBMS). Foundation for Data Engineers.
NoSQL
Non-relational databases (e.g., MongoDB, Cassandra). Designed to scale horizontally and to store semi-structured or unstructured data with flexible schemas.
ACID
Atomicity, Consistency, Isolation, Durability. Ensures reliable database transactions.
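A minimal sketch of atomicity and consistency, using Python's built-in sqlite3 module. The accounts table, the names, and the transfer helper are hypothetical and exist only for illustration. The second transfer breaks a CHECK constraint halfway through, so the whole transaction is rolled back.

```python
import sqlite3

# Hypothetical two-account ledger in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 30)       # both updates are committed
failed = transfer(conn, "alice", "bob", 500)  # debit violates the CHECK, credit is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, bob's balance does not include the 500 credit, because the credit and the debit belong to the same transaction.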
Indexing
A data structure that improves the speed of data retrieval operations on a database table.
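One way to see an index at work is to compare the query plan before and after creating it. The sketch below uses sqlite3 with a hypothetical users table and index name. The exact wording of the plan text differs between SQLite versions, so treat the strings as indicative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(1000)],
)


def plan(conn, sql, params=()):
    """Return SQLite's query plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the plan text


query = "SELECT id FROM users WHERE email = ?"
args = ("user42@example.com",)

before = plan(conn, query, args)  # no index on email yet: full table scan
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(conn, query, args)   # the planner can now search the index

print(before)
print(after)
```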
Normalization
Organizing columns and tables to minimize data redundancy and avoid update anomalies, following the normal forms (1NF, 2NF, 3NF, and beyond).
Partitioning
Dividing a large table into smaller, more manageable pieces (partitions) for performance.
Foreign Key
A column (or set of columns) in one table that references the primary key of another table, enforcing referential integrity.
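A small sketch of referential integrity in practice, again with sqlite3. The customers and orders tables are hypothetical. Note that SQLite only enforces foreign keys when the foreign_keys pragma is switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    """CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers (customer_id)
    )"""
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (10, 1)")  # customer 1 exists, accepted

try:
    conn.execute("INSERT INTO orders VALUES (11, 999)")  # no customer 999
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```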
Stored Procedure
A prepared SQL code block that you can save and reuse, often to perform complex operations. Check: automation
CDC
Change Data Capture. Identifying and capturing changes (inserts, updates, deletes) in database records so downstream systems can apply them, often in near real time. Check: Fivetran

Workflow Management (9 Cards)

Apache Airflow
Platform to programmatically author, schedule, and monitor workflows (DAGs).
DAG
Directed Acyclic Graph. The core structure of an Airflow workflow, defining task dependencies.
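The idea behind a DAG can be sketched without Airflow, using the standard library's graphlib module (Python 3.9+). This is not Airflow code. The task names are hypothetical, and each task maps to the set of tasks it depends on.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: task -> tasks that must finish first.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# A valid execution order always runs a task after its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())

# A cycle means no valid order exists, which is why workflows must be acyclic.
cyclic = {"a": {"b"}, "b": {"a"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    cycle_detected = False
except CycleError:
    cycle_detected = True
```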
Task Instance
A specific run of a task for a specific DAG run (identified by its logical date) in Airflow.
Operator
A template for a specific kind of task (e.g., PythonOperator, PostgresOperator) in Airflow.
Sensor
A special type of Airflow operator that waits for a condition to be met (e.g., file existence).
Scheduler
The Airflow component that monitors DAGs and triggers tasks based on their schedule. Check: containers, pods
Kubernetes
Platform for deploying, scaling, and managing containerized applications. Also check: Docker
Prefect
A modern, open-source workflow orchestration tool emphasizing dynamic, reactive workflows.
Idempotency
A principle where an operation can be performed multiple times without changing the result beyond the initial application.
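A common way to make a load step idempotent is to replace a whole slice of data instead of appending to it. The sketch below assumes a hypothetical daily_sales table and a load_day helper. Rerunning the same day, for example after a retry, leaves the table unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, store TEXT, amount INTEGER)")


def load_day(conn, sale_date, rows):
    """Replace one day's rows so reruns leave the table in the same state."""
    with conn:  # the delete and the inserts commit together
        conn.execute("DELETE FROM daily_sales WHERE sale_date = ?", (sale_date,))
        conn.executemany(
            "INSERT INTO daily_sales VALUES (?, ?, ?)",
            [(sale_date, store, amount) for store, amount in rows],
        )


batch = [("north", 120), ("south", 80)]
load_day(conn, "2024-01-01", batch)
load_day(conn, "2024-01-01", batch)  # rerun of the same day

row_count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM daily_sales"
).fetchone()
```

A plain INSERT without the DELETE would double the rows on the second run, which is the failure mode idempotent design avoids.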

Data Transformation & Design (9 Cards)

dbt (data build tool)
Transformation tool. You define models as SQL SELECT statements and declare their dependencies; dbt builds them in the warehouse in dependency order.
Star Schema
Data warehouse design with a central Fact table and radiating Dimension tables.
Fact Table
Stores quantitative data (metrics) and foreign keys to dimension tables.
Dimension Table
Stores descriptive attributes (context) about the business process being analyzed.
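A tiny star schema can be sketched with one fact table and one dimension table. The table names, columns, and values are hypothetical. The query shows the typical pattern: aggregate metrics from the fact table, grouped by a descriptive attribute from the dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product (product_key),
        quantity INTEGER,
        revenue INTEGER
    );
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games'), (3, 'books');
    INSERT INTO fact_sales VALUES (1, 2, 40), (2, 1, 60), (3, 1, 15), (1, 1, 20);
    """
)

# Metrics come from the fact table, context comes from the dimension.
revenue_by_category = dict(
    conn.execute(
        """
        SELECT d.category, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS d ON d.product_key = f.product_key
        GROUP BY d.category
        """
    )
)
```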
Snowflake Schema
An extension of Star Schema where dimension tables are further normalized into sub-dimensions.
SCD Type 2
Slowly Changing Dimension Type 2. Tracks changes in dimension attributes by creating new rows with new effective dates.
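The row-versioning logic of SCD Type 2 can be sketched in plain Python. The CustomerRow fields and the apply_change helper are hypothetical. A real warehouse would usually do this with a MERGE statement or a dbt snapshot, but the steps are the same: close the current row, then add a new one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CustomerRow:
    surrogate_key: int        # warehouse-generated key, one per version
    customer_id: str          # natural key from the source system
    city: str                 # tracked attribute
    valid_from: str           # ISO date the version became effective
    valid_to: Optional[str]   # None marks the current version


def apply_change(rows: List[CustomerRow], customer_id: str, city: str, change_date: str) -> None:
    """Close the current version if the city changed, then append a new one."""
    current = next(
        (r for r in rows if r.customer_id == customer_id and r.valid_to is None), None
    )
    if current is not None:
        if current.city == city:
            return  # attribute unchanged, keep the current version
        current.valid_to = change_date
    next_key = max((r.surrogate_key for r in rows), default=0) + 1
    rows.append(CustomerRow(next_key, customer_id, city, change_date, None))


dim_customer: List[CustomerRow] = []
apply_change(dim_customer, "C1", "Berlin", "2023-01-01")
apply_change(dim_customer, "C1", "Berlin", "2023-06-01")  # no change, no new row
apply_change(dim_customer, "C1", "Munich", "2024-03-15")  # new version
```

The dimension ends up with two rows for the same customer_id, each with its own surrogate key, so facts recorded in 2023 can still join to the Berlin version.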
ELT
Extract, Load, Transform. Loading raw data first, then transforming it within the data warehouse.
Data Vault
A modeling technique designed for agility and historical tracking, composed of Hubs, Links, and Satellites.
Surrogate Key
A system-generated primary key used in data warehouse dimension tables, independent of the source system's key.

Data Lake Architecture (9 Cards)

Bronze Data
Raw, source-aligned data. Minimal cleaning, retains history.
Silver Data
Cleaned, filtered, and transformed data. Entity-aligned, ready for light analysis.
Gold Data
Highly curated, aggregated, and modeled data (e.g., Star Schema). Optimized for reporting.
Data Lake
A centralized repository to store large amounts of structured, semi-structured, and unstructured data.
Data Warehouse
A central repository of integrated data from one or more disparate sources, optimized for reporting and analysis.
Delta Lake
An open-source storage layer that brings ACID transactions and reliability to data lakes by adding a transaction log on top of Parquet files.
Schema-on-Read
Applying a structure (schema) to data only when it is read, not when it is written (typical in Data Lakes).
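As a rough illustration of schema-on-read, the raw records below are stored as untyped JSON lines and a schema is applied only when they are read. The field names, the SCHEMA mapping, and the read_with_schema helper are hypothetical.

```python
import json

# Raw lines as they might land in a data lake: everything is a string.
raw_lines = [
    '{"user_id": "7", "amount": "19.90", "note": "gift"}',
    '{"user_id": "8", "amount": "5"}',
    '{"user_id": "oops", "amount": "1.00"}',
]

# The schema lives with the reader, not with the stored data.
SCHEMA = {"user_id": int, "amount": float}


def read_with_schema(lines, schema):
    """Cast each record to the schema; collect lines that do not fit."""
    good, bad = [], []
    for line in lines:
        record = json.loads(line)
        try:
            good.append({field: cast(record[field]) for field, cast in schema.items()})
        except (KeyError, ValueError, TypeError):
            bad.append(line)
    return good, bad


rows, rejected = read_with_schema(raw_lines, SCHEMA)
```

Fields outside the schema (such as note) are simply ignored at read time, and a different reader could apply a different schema to the same files.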
Data Mesh
A decentralized, domain-oriented data architecture paradigm that treats data as a product.
Parquet / ORC
Columnar file formats optimized for analytical queries, storage efficiency, and performance in distributed systems.