dataengineer101.com - Old School Data Engineering Flashcards

Essential Database Concepts (9 Cards)

SQL

Structured Query Language. Used for querying and managing data in relational databases (RDBMS). Foundation for Data Engineers.

NoSQL

Non-relational databases (e.g., MongoDB, Cassandra). Scalable and flexible for unstructured data.

ACID

Atomicity, Consistency, Isolation, Durability. Ensures reliable database transactions.
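
Example (atomicity): a minimal sketch using Python's built-in sqlite3 module. The accounts table, the transfer helper, and the balances are hypothetical. A transfer that would overdraw an account fails partway through, and the whole transaction is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint makes an overdraft fail inside the transaction.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

The credit to bob succeeds first, but it is undone when the debit violates the constraint, so neither half of the transfer is visible afterwards.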

Indexing

A data structure that improves the speed of data retrieval operations on a database table.
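
Example: a small sketch with sqlite3 that compares the query plan before and after adding an index. The users table and the index name idx_users_email are made up for illustration, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

query = "SELECT * FROM users WHERE email = ?"


def plan(conn, sql, params):
    """Return SQLite's query plan as one string (last column holds the detail text)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


before = plan(conn, query, ("user42@example.com",))  # full table scan
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(conn, query, ("user42@example.com",))  # lookup through the index
print(before)
print(after)
```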

Normalization

Organizing columns and tables to minimize data redundancy and avoid update anomalies, typically by applying successive normal forms (1NF, 2NF, 3NF, and beyond).

Partitioning

Dividing a large table into smaller, more manageable pieces (partitions) for performance.
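
Example: a toy sketch of partitioning by month in plain Python. Real databases do this inside the storage engine; the helper names and rows here are hypothetical. The point is that a query for one month only has to read one partition.

```python
from collections import defaultdict
from datetime import date


def partition_key(event_date: date) -> str:
    """Derive the partition name (year-month) from the row's date."""
    return f"{event_date:%Y-%m}"


def partition_rows(rows):
    """Group rows into partitions keyed by month."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row["event_date"])].append(row)
    return dict(partitions)


def query_month(partitions, year, month):
    """Partition pruning: read only the partition that can contain matches."""
    return partitions.get(f"{year:04d}-{month:02d}", [])


rows = [
    {"id": 1, "event_date": date(2024, 1, 15)},
    {"id": 2, "event_date": date(2024, 1, 31)},
    {"id": 3, "event_date": date(2024, 2, 1)},
]
partitions = partition_rows(rows)
january = query_month(partitions, 2024, 1)
print(sorted(partitions), [r["id"] for r in january])
```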

Foreign Key

A field that links two tables, enforcing referential integrity.
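
Example: a minimal sqlite3 sketch of referential integrity. The customers and orders tables are hypothetical. Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    """
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers (id),
        total REAL
    )
    """
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders (customer_id, total) VALUES (1, 19.99)")  # parent exists

try:
    # No customer 999 exists, so the database rejects the orphan row.
    conn.execute("INSERT INTO orders (customer_id, total) VALUES (999, 5.00)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(rejected, order_count)
```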

Stored Procedure

A prepared SQL code block that you can save and reuse, often to perform complex operations. Check: automation

CDC

Change Data Capture. Identifying and tracking changes to database records, often in near real time, so downstream systems can apply them. Check: Fivetran
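
Example: a simplified sketch of the idea using snapshot comparison. This is an assumption made for brevity: production CDC tools more commonly read the database's transaction log instead of diffing snapshots. The function name and sample rows are hypothetical.

```python
def capture_changes(previous, current):
    """Compare two snapshots keyed by primary key and emit change events."""
    events = []
    for key, row in current.items():
        if key not in previous:
            events.append(("insert", key, row))
        elif previous[key] != row:
            events.append(("update", key, row))
    for key, row in previous.items():
        if key not in current:
            events.append(("delete", key, row))
    return events


yesterday = {1: {"name": "Ada"}, 2: {"name": "Bob"}, 3: {"name": "Cy"}}
today = {1: {"name": "Ada"}, 2: {"name": "Robert"}, 4: {"name": "Dee"}}
events = capture_changes(yesterday, today)
print(events)
```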

Workflow Management (9 Cards)

Apache Airflow

Platform to programmatically author, schedule, and monitor workflows (DAGs).

DAG

Directed Acyclic Graph. The core structure of an Airflow workflow, defining task dependencies.
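
Example: a sketch of how task dependencies determine run order, using the standard library's graphlib (Python 3.9+) rather than Airflow itself. The task names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}
order = list(TopologicalSorter(dependencies).static_order())


def has_cycle(graph):
    """A graph with a cycle has no valid run order, so it is not a DAG."""
    try:
        list(TopologicalSorter(graph).static_order())
        return False
    except CycleError:
        return True


print(order)
print(has_cycle({"a": {"b"}, "b": {"a"}}))
```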

Task Instance

A specific run of a task within a specific DAG run (identified by its logical date) in Airflow.

Operator

A template for a specific kind of task (e.g., PythonOperator, PostgresOperator) in Airflow.

Sensor

A special type of Airflow operator that waits for a condition to be met (e.g., file existence).
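
Example: the waiting behavior can be sketched as a plain polling loop. This is not the Airflow API; the function wait_for_file is hypothetical, and its poke_interval and timeout parameters are only named after the sensor settings they imitate.

```python
import tempfile
import time
from pathlib import Path


def wait_for_file(path, poke_interval=0.1, timeout=2.0):
    """Check for the file every poke_interval seconds until it exists or timeout passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if Path(path).exists():
            return True
        time.sleep(poke_interval)
    return False


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data.csv"
    missing = wait_for_file(target, poke_interval=0.01, timeout=0.05)  # file not there yet
    target.write_text("id\n1\n")
    found = wait_for_file(target, poke_interval=0.01, timeout=0.05)  # condition now met

print(missing, found)
```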

Scheduler

The Airflow component that monitors DAGs and triggers tasks based on their schedule. Check: containers, pods

Kubernetes

Open-source platform for deploying, scaling, and managing containerized applications. Also check: Docker

Prefect

A modern, open-source workflow orchestration tool emphasizing dynamic, reactive workflows.

Idempotency

A principle where an operation can be performed multiple times without changing the result beyond the initial application.
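
Example: one common way to make a load idempotent is to replace a whole slice of data instead of appending to it. A minimal sqlite3 sketch, with a hypothetical daily_sales table and load_day helper:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, product TEXT, amount INTEGER)")


def load_day(conn, day, rows):
    """Idempotent load: delete the day's slice, then insert it again in one transaction."""
    with conn:
        conn.execute("DELETE FROM daily_sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_sales (day, product, amount) VALUES (?, ?, ?)",
            [(day, product, amount) for product, amount in rows],
        )


batch = [("widget", 3), ("gadget", 5)]
load_day(conn, "2024-01-01", batch)
load_day(conn, "2024-01-01", batch)  # a retry or rerun leaves the table unchanged

row_count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM daily_sales"
).fetchone()
print(row_count, total)
```

A plain INSERT without the DELETE would double the rows on the second run, which is the failure mode idempotent tasks are meant to avoid.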

Data Transformation & Design (9 Cards)

dbt (data build tool)

Transformation tool. Builds models from SQL SELECT statements and resolves the dependencies between them.

Star Schema

Data warehouse design with a central Fact table and radiating Dimension tables.

Fact Table

Stores quantitative data (metrics) and foreign keys to dimension tables.

Dimension Table

Stores descriptive attributes (context) about the business process being analyzed.
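
Example: a minimal sqlite3 sketch of a fact table joined to a dimension table. The table names fact_sales and dim_product and all values are hypothetical. Metrics come from the fact table, and the grouping attribute comes from the dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        name TEXT,
        category TEXT
    );
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product (product_key),
        quantity INTEGER,
        revenue REAL
    );
    INSERT INTO dim_product VALUES
        (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'), (3, 'Ebook', 'Digital');
    INSERT INTO fact_sales VALUES
        (1, 2, 20.0), (2, 1, 15.0), (3, 4, 40.0), (1, 1, 10.0);
    """
)

# Aggregate the fact table's metric, sliced by a dimension attribute.
revenue_by_category = conn.execute(
    """
    SELECT d.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_key = f.product_key
    GROUP BY d.category
    ORDER BY d.category
    """
).fetchall()
print(revenue_by_category)
```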

Snowflake Schema

An extension of Star Schema where dimension tables are further normalized into sub-dimensions.

SCD Type 2

Slowly Changing Dimension Type 2. Tracks changes in dimension attributes by creating new rows with new effective dates.
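
Example: a sketch of the Type 2 pattern in plain Python. The field names (valid_from, valid_to) and the apply_scd2 helper are assumptions, since conventions differ between warehouses. A changed attribute closes the current row and appends a new one, so history is preserved.

```python
from datetime import date


def apply_scd2(history, key, new_attrs, effective):
    """Close the current row for `key` if its attributes changed, then append a new row."""
    current = next(
        (r for r in history if r["key"] == key and r["valid_to"] is None), None
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return history  # nothing changed, so nothing to record
        current["valid_to"] = effective  # end-date the old version
    history.append(
        {"key": key, "attrs": new_attrs, "valid_from": effective, "valid_to": None}
    )
    return history


history = []
apply_scd2(history, "C1", {"city": "Berlin"}, date(2023, 1, 1))
apply_scd2(history, "C1", {"city": "Berlin"}, date(2023, 6, 1))  # unchanged, no new row
apply_scd2(history, "C1", {"city": "Munich"}, date(2024, 3, 1))  # change, new row
for row in history:
    print(row)
```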

ELT

Extract, Load, Transform. Loading raw data first, then transforming it within the data warehouse.

Data Vault

A modeling technique designed for agility and historical tracking, composed of Hubs, Links, and Satellites.

Surrogate Key

A system-generated primary key used in data warehouse dimension tables, independent of the source system's key.
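
Example: a toy sketch of surrogate key assignment. The SurrogateKeyMap class and the source system names are hypothetical; warehouses usually rely on sequences or identity columns for this. The same natural key from two source systems gets two different surrogate keys, and repeated lookups stay stable.

```python
from itertools import count


class SurrogateKeyMap:
    """Hands out warehouse-owned integer keys, one per (source system, natural key)."""

    def __init__(self):
        self._next = count(1)
        self._keys = {}

    def key_for(self, source_system, natural_key):
        ident = (source_system, natural_key)
        if ident not in self._keys:
            self._keys[ident] = next(self._next)
        return self._keys[ident]


keys = SurrogateKeyMap()
crm_key = keys.key_for("crm", "42")
erp_key = keys.key_for("erp", "42")  # same natural key, different source system
crm_again = keys.key_for("crm", "42")  # repeated lookup returns the same key
print(crm_key, erp_key, crm_again)
```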

Data Lake Architecture (9 Cards)

Bronze Data

Raw, source-aligned data. Minimal cleaning, retains history.

Silver Data

Cleaned, filtered, and transformed data. Entity-aligned, ready for light analysis.

Gold Data

Highly curated, aggregated, and modeled data (e.g., Star Schema). Optimized for reporting.
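
Example: the three layers can be sketched as successive transformations in plain Python. The sample records and the to_silver and to_gold helpers are hypothetical; real pipelines typically run this on a query engine over files or tables.

```python
# Bronze: raw records exactly as they arrived, including a bad value.
bronze = [
    {"order_id": "1", "customer": " Ada ", "amount": "10.50"},
    {"order_id": "2", "customer": "Bob", "amount": "not-a-number"},
    {"order_id": "3", "customer": "ada", "amount": "4.50"},
]


def to_silver(rows):
    """Clean and type the raw rows, dropping records that cannot be parsed."""
    silver = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        silver.append(
            {
                "order_id": int(row["order_id"]),
                "customer": row["customer"].strip().lower(),
                "amount": amount,
            }
        )
    return silver


def to_gold(rows):
    """Aggregate cleaned rows into a reporting-ready total per customer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


silver = to_silver(bronze)
gold = to_gold(silver)
print(len(silver), gold)
```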

Data Lake

A centralized repository to store large amounts of structured, semi-structured, and unstructured data.

Data Warehouse

A central repository of integrated data from one or more disparate sources, optimized for reporting and analysis.

Delta Lake

An open-source storage layer that brings ACID transactions and reliability to data lakes, storing data as Parquet files alongside a transaction log.

Schema-on-Read

Applying a structure (schema) to data only when it is read from the source (typical in Data Lakes).
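
Example: a minimal sketch in which raw JSON lines are stored untouched and a schema is applied only when they are read. The schema dictionary and the read_with_schema helper are hypothetical.

```python
import json

# Raw data as stored: everything is a string, and fields vary between records.
raw_lines = [
    '{"id": "1", "amount": "19.90", "note": "first"}',
    '{"id": "2", "amount": "5"}',
]

# The reader decides which fields it wants and what types they should have.
schema = {"id": int, "amount": float}


def read_with_schema(lines, schema):
    """Parse each stored line and project/cast it to the reader's schema."""
    for line in lines:
        record = json.loads(line)
        yield {field: cast(record[field]) for field, cast in schema.items()}


rows = list(read_with_schema(raw_lines, schema))
print(rows)
```

A different reader could apply a different schema to the same stored lines, which is what distinguishes this from schema-on-write.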

Data Mesh

A decentralized, domain-oriented data architecture paradigm that treats data as a product.

Parquet / ORC

Columnar file formats optimized for analytical queries, storage efficiency, and performance in distributed systems.
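
Example: a toy illustration of the columnar idea in plain Python. This is not the Parquet or ORC file format, only the layout principle; the to_columns helper and the data are hypothetical. Storing values column by column lets an aggregate read just the one column it needs.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "FR", "amount": 20.0},
    {"id": 3, "country": "DE", "amount": 5.0},
]


def to_columns(rows):
    """Pivot row records into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


columns = to_columns(rows)
total = sum(columns["amount"])  # touches only the amount column
print(columns["country"], total)
```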