Data Engineering – Databricks
01: Introduction to Big Data Concepts
- Big Data introduction
- OLTP vs OLAP
- SQL vs NoSQL
- Data Warehouses vs Data Lakes
- Batch vs Streaming processing
02: Apache Spark Programming Essentials
- Spark architecture (Driver, Executors, Cluster Manager)
- RDD vs DataFrame vs Dataset
- Transformations vs Actions
- Lazy evaluation & Catalyst optimizer
- Writing Spark code in PySpark
03: Spark SQL & DataFrame Analytics
- DataFrame operations (select, where, groupBy)
- Joins, aggregations, window functions
- UDFs & performance considerations
- Temporary & global views
- Data exploration & profiling with Spark SQL
04: Azure Data Lake & Cloud Storage Foundations
- Azure Data Lake Storage Gen1 vs Gen2
- Hierarchical namespace, ACLs & RBAC
- Storage account configuration (Blob, ADLS)
- Data organization (Bronze/Silver/Gold)
- Accessing ADLS using Databricks & ADF
05: Data Movement & Transformation in Azure
- ADF overview (Linked Services, Datasets, Pipelines)
- Integration Runtime & ADF architecture
- ETL vs ELT in ADF
- Copy Activity, Lookup, ForEach, Conditional logic
- Mapping Data Flows (joins, transformations, data quality)
- Incremental loads & Change Data Capture (CDC)
- Orchestrating data pipelines end-to-end
- Data migration scenarios
06: Introduction to Azure Databricks & Lakehouse
- Databricks workspace & components
- Lakehouse architecture
- Clusters, Jobs, Notebooks, SQL Warehouses
- Databricks Repos & version control
- When to use Databricks vs ADF vs Synapse
07: Databricks Workspace, Clusters & Notebooks
- Workspace UI deep dive
- Cluster types & autoscaling
- Notebook management
- REST API & Databricks CLI setup
- Using Git (Repos) in Databricks
08: Data Ingestion Techniques for the Lakehouse
- Ingesting CSV, JSON, XML, Parquet
- Auto Loader (cloudFiles)
- Mounting ADLS/Blob
- Streaming ingestion basics
- Optimizing ingestion for high-volume data
09: Data Management, Governance & Unity Catalog
- DBFS vs external tables
- Metastore & data governance
- Unity Catalog: catalogs, schemas, tables, permissions
- Lineage & auditing
- Securing data lake access
10: Databricks Utilities, Widgets & Automation
- dbutils for file, secret, and job management
- Widgets for parameterized notebooks
- CI/CD with Databricks Repos
- Automating workflows with Jobs & Pipelines
- Operational best practices
11: Delta Lake Architecture & Operations
- ACID transactions
- Delta logs & versioning
- Schema enforcement & evolution
- Time travel & auditing
- Table optimization (OPTIMIZE, ZORDER)
12: LakeFlow & Modern Data Orchestration
- LakeFlow overview
- Orchestrating ingestion & transformation
- Integrating DLT, Auto Loader, and pipelines
- Monitoring, lineage, and governance
- Event-driven pipeline architectures
13: Power BI Integration
- Connecting Power BI to Databricks SQL Warehouse
- Import vs DirectQuery
- Optimizing queries for BI
- Using Delta tables for analytics
- Publishing & refreshing dashboards