Corporate Onsite Workshop Syllabus

Databricks Agentic Data Engineering Workshop

2-Day Intensive · Onsite · 12–20 Data Engineers

This workshop teaches data engineering teams to use agentic coding assistants—Claude Code, GitHub Copilot, Codex CLI, Cursor, and others—to build, maintain, and evolve data pipelines, dashboards, and data applications on the Databricks Data Intelligence Platform. We teach the full agentic development lifecycle through the lens of data engineering: Unity Catalog context injection, Databricks AI Dev Kit skills, MCP-based tool access, event-driven backpressure (deterministic tools wired to agent events for imperative self-correction), and production-grade deployment patterns.

The curriculum is tool-agnostic by design—agents come and go, but the context-engineering and backpressure patterns are durable.

Engagement Model	2 × 2-hour online discovery sessions + 2-day onsite intensive
Duration	2 days (16 hours instruction + 4 hours guided lab)
Format	Instructor-led, hands-on, cohort-based
Prerequisites	Python + SQL proficiency; active Databricks workspace; Git fluency; basic CI/CD familiarity
Deliverables	Databricks agentic dev environment config, BACKPRESSURE.md spec, agentic CI/CD pipeline

Pre-Workshop Discovery Sessions (Online)

Session 1 — Scoping & Stakeholder Alignment

2 hours · 2–3 weeks before workshop

Audience mapping: Identify participant roles, skill levels, and learning objectives
Databricks environment audit: Workspace configuration, Unity Catalog setup, existing pipelines and dashboards
Tool audit: Current agentic tools in use (Claude Code, Copilot, Cursor, Codex, etc.) and Databricks AI Dev Kit adoption status
Data landscape review: Key data sources, pipelines, quality challenges, and dashboard inventory
Pain point identification: Where does agentic data engineering currently break down?
Customization brief: Identify 2–3 real-world data scenarios from your backlog to use as capstone candidates

Deliverable: Customization brief and pre-workshop preparation checklist

Session 2 — Curriculum Customization & Environment Prep

2 hours · 1 week before workshop

Skill alignment: Identify which modules need more depth and which can move at a faster pace
Databricks environment validation: AI Dev Kit installation, MCP server configuration, Unity Catalog access, Git integration
BACKPRESSURE.md draft: Co-create initial data verification contract; identify deterministic tools to wire to agent events
Materials preview: Review custom playbooks, Databricks skills, and pipeline templates built for your team
(Optional) Scenario finalization: Lock in capstone project(s)

Deliverable: Finalized curriculum, customized materials, and environment readiness confirmation

Day 1 — Databricks Agentic Dev Environment & The Agentic Data Loop

Module 1: The Databricks Agentic Landscape

90 min

What agentic coding looks like on the Data Intelligence Platform

Databricks AI Dev Kit: the open-source toolkit for agentic Databricks development
- Python core library, MCP server with 50+ tools, 20+ Markdown skills
- Unity Catalog permissions enforced on every agent action
Databricks Coding Agent Integration: Unity AI Gateway for Cursor, Gemini CLI, Codex CLI
- Rate limiting, usage tracking, inference tables for governance
Agent Skills: task-specific Markdown instruction files for Databricks patterns
- "Build a Delta Live Tables pipeline", "Create a Unity Catalog function", "Optimize a Spark job"
MCP on Databricks: standardized, secure interface connecting agents to data, tools, workflows
Tool comparison: Claude Code (terminal-native), Copilot (IDE-embedded), Codex CLI (fast scaffolding), Cursor (composer mode)
Security model: service principals, token scoping, least-privilege MCP access

Hands-on: Install the Databricks AI Dev Kit; configure Claude Code with Databricks MCP; run a side-by-side comparison with Copilot on the same data task

Module 2: Context Engineering for Data Workloads

120 min

Unity Catalog as the agent's shared brain

The Navigation Paradox in data engineering: why agents drown in schema context
Unity Catalog as structured agent memory: schemas, tables, columns, lineage, tags
Context injection patterns:
- Schema-first prompting: "Here are the 12 tables you can see; pick the right ones"
- Lineage-aware generation: "This dashboard depends on these 3 pipelines"
- Data profile injection: column distributions, null rates, cardinality
Code graphs for data pipelines: structural dependency mapping (DLT graphs, task dependencies)
LSP for data code: SQL language servers, Spark language servers
.claude/skills/ for Databricks: project-level data engineering skills
AGENTS.md for data teams: data governance rules, naming conventions, compliance requirements
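
To make the context injection patterns above concrete, here is a minimal sketch of schema-first prompting combined with data profile and lineage injection. It assumes the table metadata has already been fetched (for example from Unity Catalog); the `ColumnProfile`, `TableContext`, and `render_context` names are hypothetical and not part of any Databricks API. The cap of 12 tables mirrors the example prompt above and is a tunable assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnProfile:
    """One column plus the profile stats named in the syllabus (hypothetical type)."""
    name: str
    dtype: str
    null_rate: float   # fraction of NULL values, 0.0 to 1.0
    cardinality: int   # number of distinct values


@dataclass
class TableContext:
    """Metadata for one table, assumed to be fetched beforehand (hypothetical type)."""
    full_name: str     # catalog.schema.table
    comment: str
    columns: list[ColumnProfile] = field(default_factory=list)
    upstream: list[str] = field(default_factory=list)  # lineage: source tables


def render_context(tables: list[TableContext], max_tables: int = 12) -> str:
    """Render a bounded, plain-text context block to prepend to an agent prompt."""
    visible = tables[:max_tables]  # cap the context so the agent is not flooded
    lines = [f"You can see {len(visible)} table(s). Pick only the ones you need."]
    for table in visible:
        lines.append(f"TABLE {table.full_name}: {table.comment}")
        for col in table.columns:
            lines.append(
                f"  - {col.name} {col.dtype} "
                f"(null_rate={col.null_rate:.1%}, distinct={col.cardinality})"
            )
        if table.upstream:
            lines.append(f"  depends on: {', '.join(table.upstream)}")
    return "\n".join(lines)


# Illustrative data only; the table and column names are made up.
orders = TableContext(
    full_name="main.sales.orders",
    comment="One row per order",
    columns=[ColumnProfile("order_id", "bigint", 0.0, 1000)],
    upstream=["main.raw.orders_json"],
)
print(render_context([orders]))
```

Rendering the context as plain text keeps the sketch tool-agnostic: the same string can be pasted into any agent's prompt or written into a skill file.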

Hands-on: Build a custom Databricks skill for "Create a GDPR-compliant ingestion pipeline"; configure Unity Catalog schema injection into agent prompts; compare agent output with/without lineage context

Module 3: The Agentic Data Loop — Pipelines, Dashboards, Apps

120 min

Agents that build data systems, not just SQL queries

The Agentic Data Loop: Requirement → Context Injection → Generate → Verify → Deploy → Monitor
Building Delta Live Tables with agents:
- Natural-language spec → DLT pipeline code → auto-generated expectations
- Schema evolution handling: agents that detect and adapt to schema changes
Dashboard generation: SQL → visualization spec → Databricks SQL dashboard
- Context injection: business metric definitions, stakeholder personas
Data application scaffolding: Delta tables → REST API → Streamlit/Gradio app
MCP tools for data agents:
- Unity Catalog query MCP: metadata, lineage, permissions
- Data profiling MCP: column stats, quality scores
- Job execution MCP: run, monitor, get results
- Notebook conversion MCP: .py ↔ .ipynb with Databricks magic commands
Prompt patterns: "Build a pipeline that ingests X, validates Y, and writes Z with these expectations"

Hands-on: Give an agent a natural-language spec for a 3-table ETL pipeline; have it generate DLT code with expectations; auto-generate a SQL dashboard from the output tables

Module 4: Backpressure I — Data Validation & Testing

90 min

BACKPRESSURE.md defines the data contract; deterministic tools enforce it with imperative feedback

BACKPRESSURE.md for data engineering: the verification contract
- Schema rules: column types, nullability, primary keys
- Data quality rules: row counts, null rates, distribution bounds
- Lineage rules: upstream tables must exist, downstream consumers notified
- Compliance rules: PII detection, GDPR deletion support, retention policy alignment
Deterministic tools as imperative feedback: SQL validators, schema checkers, and data quality tools wired to agent lifecycle events
- On-write: SQL compilation checks run immediately and return a specific directive ("column 'user_id' does not exist in table 'customers'", "add NOT NULL constraint")
- On-generate: schema validation gates block invalid pipeline code before review
- On-test: data diff tools return failing assertions as imperative corrections
- Spark plan analysis: detecting full table scans, shuffle explosions with specific optimization directives
- Delta Lake constraint validation: CHECK constraints, generated columns
Agentic test generation for data pipelines:
- Great Expectations / dbt tests generated from natural language
- Data diff testing: comparing agent-built pipeline output to known-good baselines
- Schema drift detection: agents that auto-adapt tests when schemas change
The 3-iteration rule for data tasks: schema validation → data quality → performance check
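
As an illustration of deterministic tools returning imperative feedback, the sketch below checks a proposed table schema against a contract and feeds the resulting directives back to a generator until the check passes. Everything here is an assumption made for illustration: the `CONTRACT` dictionary stands in for rules a BACKPRESSURE.md might declare, `fake_agent` stands in for a real agent call, and the retry cap of 3 is an arbitrary budget rather than a definition of the 3-iteration rule above.

```python
from typing import Callable

# Hypothetical schema rules, standing in for a parsed BACKPRESSURE.md contract.
CONTRACT = {
    "customers": {
        "customer_id": {"type": "bigint", "nullable": False},
        "email": {"type": "string", "nullable": True},
    }
}


def check_schema(table: str, proposed: dict[str, dict]) -> list[str]:
    """Compare a proposed schema to the contract and return imperative directives."""
    expected = CONTRACT[table]
    directives = []
    for col, rule in expected.items():
        if col not in proposed:
            directives.append(f"add column '{col}' ({rule['type']}) to table '{table}'")
            continue
        got = proposed[col]
        if got["type"] != rule["type"]:
            directives.append(
                f"change column '{col}' type from {got['type']} to {rule['type']}"
            )
        if got["nullable"] and not rule["nullable"]:
            directives.append(f"add NOT NULL constraint to column '{col}'")
    for col in proposed:
        if col not in expected:
            directives.append(f"remove column '{col}': not in the contract for '{table}'")
    return directives


def run_with_backpressure(
    generate: Callable[[list[str]], dict[str, dict]],
    table: str,
    max_iterations: int = 3,  # assumed retry budget
) -> tuple[int, dict[str, dict]]:
    """Call generate(feedback) until the check passes or the budget runs out."""
    feedback: list[str] = []
    for attempt in range(1, max_iterations + 1):
        proposed = generate(feedback)
        feedback = check_schema(table, proposed)
        if not feedback:
            return attempt, proposed
    raise RuntimeError(f"escalate to a human; unresolved directives: {feedback}")


def fake_agent(feedback: list[str]) -> dict[str, dict]:
    """Stand-in for an agent: fixes nullability only after being told to."""
    nullable = not any("NOT NULL" in directive for directive in feedback)
    return {
        "customer_id": {"type": "bigint", "nullable": nullable},
        "email": {"type": "string", "nullable": True},
    }


attempts, schema = run_with_backpressure(fake_agent, "customers")
print(f"passed after {attempts} attempt(s)")
```

The design point is that the checker never returns a bare pass/fail: each failure is phrased as an instruction the agent can act on directly.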

Hands-on: Write a BACKPRESSURE.md data contract; wire deterministic tools (SQL validator, schema checker, data diff) to agent events; have an agent generate a pipeline and iterate until all tools pass

Day 2 — Verification, Validation & Production DataOps

Module 5: Backpressure II — E2E, Visual & Performance Validation

120 min

From data correctness to production readiness

Agentic data testing pyramid:
- Unit: SQL/function logic, UDFs, transformation logic
- Integration: pipeline step chaining, DLT expectations, job dependencies
- E2E: full pipeline from ingestion to dashboard with data validation
E2E data pipeline testing:
- Snapshot testing: deterministic output comparison across runs
- Data diff tools: comparing agent-built output to golden datasets
- Playwright for data dashboards: visual regression of SQL dashboard rendering
- Screenshot comparison: dashboard before/after agent changes
Performance backpressure:
- Query plan analysis: detecting expensive joins, spills, skew
- Cost governance: token budgets for agent reasoning, compute cost estimation
- Auto-optimization: agents that rewrite pipelines for better Spark plans
Data validation testing:
- Row-count assertions, aggregate checks, referential integrity
- Temporal validation: freshness checks, late-arriving data handling
- Anomaly detection: statistical bounds on key metrics
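
The golden-dataset comparison and referential-integrity checks listed above can be sketched in a few lines. This version works on plain lists of dictionaries so it runs anywhere; in a workspace the same logic would more likely be expressed over Spark DataFrames or SQL. The function names and the sample rows are hypothetical.

```python
def diff_rows(golden: list[dict], candidate: list[dict], key: str) -> dict[str, list]:
    """Diff two row sets by primary key: missing, unexpected, and changed keys."""
    golden_by_key = {row[key]: row for row in golden}
    candidate_by_key = {row[key]: row for row in candidate}
    shared = golden_by_key.keys() & candidate_by_key.keys()
    return {
        "missing": sorted(golden_by_key.keys() - candidate_by_key.keys()),
        "unexpected": sorted(candidate_by_key.keys() - golden_by_key.keys()),
        "changed": sorted(k for k in shared if golden_by_key[k] != candidate_by_key[k]),
    }


def orphaned_keys(child: list[dict], fk: str, parent: list[dict], pk: str) -> list:
    """Referential integrity: foreign-key values with no matching parent row."""
    parent_keys = {row[pk] for row in parent}
    return sorted({row[fk] for row in child} - parent_keys)


# Illustrative data only.
golden_orders = [
    {"order_id": 1, "customer_id": 10, "total": 20.0},
    {"order_id": 2, "customer_id": 11, "total": 5.0},
]
agent_orders = [
    {"order_id": 1, "customer_id": 10, "total": 20.0},
    {"order_id": 2, "customer_id": 11, "total": 5.5},   # value drifted
    {"order_id": 3, "customer_id": 99, "total": 1.0},   # extra row, unknown customer
]
customers = [{"customer_id": 10}, {"customer_id": 11}]

report = diff_rows(golden_orders, agent_orders, key="order_id")
orphans = orphaned_keys(agent_orders, "customer_id", customers, "customer_id")
print(report, orphans)
```

A row-count assertion falls out of the same report: the counts match exactly when both `missing` and `unexpected` are empty.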

Hands-on: Build an E2E test for a DLT pipeline with golden-data comparison; configure Playwright screenshot tests for a SQL dashboard; have an agent optimize a slow query and validate the plan improvement

Module 6: Agentic Review, PRs & DataOps

90 min

Evidence-first review for data engineering changes

Evidence-first data PR review:
- Data diff as the primary review artifact
- Schema change impact analysis: downstream consumer notification
- Lineage-aware review: "This change affects 4 dashboards and 2 ML models"
GitHub Agentic Workflows for data repos:
- Automated PR description: schema changes, data diffs, lineage impact
- Data quality gate: fail PR if data tests don't pass
- Cost estimation: "This pipeline change will add $X/month in compute"
Databricks Git integration: notebooks in Git, agentic PR workflows
Reviewer confidence scoring for data PRs:
- Test coverage (unit + integration + E2E)
- Data diff size and direction
- Downstream blast radius (lineage depth)
- Backpressure pass rate
Agentic review of data pipelines: automated detection of common anti-patterns
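
One possible shape for the reviewer confidence score is a weighted sum of the four signals listed above. The weights, the lineage-depth cap, and the `PrEvidence` type are all assumptions for illustration; a team would calibrate them against its own review history. This sketch models diff size only, not diff direction.

```python
from dataclasses import dataclass


@dataclass
class PrEvidence:
    """Evidence attached to a data PR (hypothetical type)."""
    test_coverage: float           # 0.0 to 1.0 across unit + integration + E2E
    changed_row_fraction: float    # 0.0 to 1.0, share of rows that differ in the data diff
    lineage_depth: int             # downstream hops affected by the change
    backpressure_pass_rate: float  # 0.0 to 1.0


# Assumed weights (sum to 1.0) and an assumed cap on lineage depth.
WEIGHTS = {"coverage": 0.3, "diff": 0.2, "blast": 0.2, "backpressure": 0.3}
MAX_DEPTH = 5


def confidence(evidence: PrEvidence) -> float:
    """Return a score in [0, 1]; higher means the PR needs less reviewer scrutiny."""
    blast = 1.0 - min(evidence.lineage_depth, MAX_DEPTH) / MAX_DEPTH
    score = (
        WEIGHTS["coverage"] * evidence.test_coverage
        + WEIGHTS["diff"] * (1.0 - evidence.changed_row_fraction)
        + WEIGHTS["blast"] * blast
        + WEIGHTS["backpressure"] * evidence.backpressure_pass_rate
    )
    return round(score, 3)


# Illustrative PR: good coverage, small diff, two downstream hops, all checks green.
example = PrEvidence(
    test_coverage=0.8,
    changed_row_fraction=0.1,
    lineage_depth=2,
    backpressure_pass_rate=1.0,
)
print(confidence(example))
```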

Hands-on: Configure a GitHub Agentic Workflow for Databricks PRs; set up auto-generated PR descriptions with data diffs and lineage impact; build a review confidence scorecard

Module 7: DataOps, Staging & Blue/Green for Data

120 min

Production-grade deployment of agent-built data systems

CI/CD for agentic data engineering:
- Agent version pinning: model version, skill version, Unity Catalog snapshot
- Deterministic reproduction: prompt + context hash + catalog version = reproducible pipeline
Staging environments for data pipelines:
- Isolated staging schemas in Unity Catalog
- Data subsetting: synthetic or sampled data for fast staging validation
- Automated staging runs on agent PR creation
Blue/green deployment for data pipelines:
- Parallel pipeline runs: blue (current) vs. green (agent-built)
- Data diff comparison: validating green output matches blue semantics
- Instant rollback: Unity Catalog table swap or DLT pipeline rollback
Canary releases for data:
- 5% data volume to new pipeline version
- Metric comparison: row counts, aggregates, quality scores
- Automatic promotion or rollback based on data validation gates
Deployment validation:
- Health checks: pipeline completion, data freshness, quality thresholds
- Screenshot validation: dashboard rendering across environments
- Performance regression: query latency, cluster utilization
Observability: MLflow tracking for agent decisions, Databricks inference tables
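
The "prompt + context hash + catalog version" idea from the CI/CD topic above can be sketched as a single fingerprint recorded with each agent run. The field names and version strings are hypothetical. Note the limit of the technique: identical fingerprints show that two runs had identical inputs; they do not by themselves guarantee that a model produces identical output.

```python
import hashlib
import json


def run_fingerprint(
    prompt: str,
    context: str,
    model_version: str,
    skill_version: str,
    catalog_version: str,
) -> str:
    """Hash every pinned input of an agent run into one SHA-256 hex digest."""
    payload = {
        "prompt": prompt,
        # Hash the context separately so large context blobs stay out of the payload.
        "context_sha256": hashlib.sha256(context.encode("utf-8")).hexdigest(),
        "model_version": model_version,
        "skill_version": skill_version,
        "catalog_version": catalog_version,
    }
    # Canonical JSON: sorted keys and fixed separators give a stable byte string.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative values only.
fingerprint = run_fingerprint(
    prompt="Build a pipeline that ingests orders and validates totals",
    context="TABLE main.sales.orders: One row per order",
    model_version="model-2025-01",
    skill_version="skills-v3",
    catalog_version="snapshot-42",
)
print(fingerprint)
```

Storing this digest in the PR description or as a run tag lets a reviewer confirm that a rerun used the same pinned inputs.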

Hands-on: Build a GitHub Actions pipeline for Databricks: agent PR → data tests → staging schema deploy → blue/green diff → canary gate → human approval → production swap
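
The canary gate in this pipeline reduces to a deterministic promote-or-rollback decision over the compared metrics. The sketch below assumes both pipeline versions ran on the same canary slice and that metrics were already collected; the `PipelineMetrics` fields, the 1% tolerance, and the 0.95 quality floor are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class PipelineMetrics:
    """Metrics from one pipeline version on the canary slice (hypothetical type)."""
    row_count: int
    revenue_sum: float     # example aggregate; any key metric works
    quality_score: float   # 0.0 to 1.0


def relative_delta(baseline: float, candidate: float) -> float:
    """Relative difference between two values, safe when the baseline is zero."""
    if baseline == 0:
        return 0.0 if candidate == 0 else float("inf")
    return abs(candidate - baseline) / abs(baseline)


def canary_decision(
    blue: PipelineMetrics,
    green: PipelineMetrics,
    tolerance: float = 0.01,    # assumed max relative drift
    min_quality: float = 0.95,  # assumed quality floor
) -> tuple[str, list[str]]:
    """Return ("promote", []) or ("rollback", reasons) for the green pipeline."""
    reasons = []
    if relative_delta(blue.row_count, green.row_count) > tolerance:
        reasons.append("row_count drift")
    if relative_delta(blue.revenue_sum, green.revenue_sum) > tolerance:
        reasons.append("revenue_sum drift")
    if green.quality_score < min_quality:
        reasons.append("quality below threshold")
    return ("rollback", reasons) if reasons else ("promote", reasons)


# Illustrative metrics only.
blue = PipelineMetrics(row_count=5000, revenue_sum=125000.0, quality_score=0.99)
green_ok = PipelineMetrics(row_count=5000, revenue_sum=125400.0, quality_score=0.99)
green_bad = PipelineMetrics(row_count=4800, revenue_sum=125400.0, quality_score=0.90)

print(canary_decision(blue, green_ok))
print(canary_decision(blue, green_bad))
```

Returning the reasons alongside the decision gives the human-approval step something concrete to review when a rollback fires.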

Module 8: The Production Agentic Data Workflow

90 min

End-to-end: from business requirement to deployed data system

The production agentic data workflow blueprint:
1. Business requirement in natural language (ticket, spec, stakeholder request)
2. Context injection (Unity Catalog schema, lineage, AGENTS.md, skills)
3. Agent generation (Claude Code / Copilot with Databricks AI Dev Kit)
4. Event-driven backpressure (BACKPRESSURE.md contract + deterministic tools: schema, quality, lineage, compliance)
5. Data test generation & execution (unit SQL tests, integration DLT tests, E2E snapshot)
6. Agentic review (automated PR analysis: diff, lineage impact, cost estimate)
7. Human review (evidence-first: data diffs, test results, dashboard screenshots)
8. Staging deployment + blue/green data validation
9. Canary release + data quality monitoring
10. Full rollout or automatic rollback
When to escalate to human: complex schema migrations, PII changes, breaking downstream contracts
Data governance: Unity Catalog audit trails, data lineage capture, compliance documentation
Cost governance: agent token budgets, compute cost estimation, auto-optimization
Security: sandboxed agent execution, least-privilege MCP, PII detection in agent output

Hands-on: Teams run a full agentic data workflow: "Build a customer churn prediction pipeline with dashboard"—from spec to production-ready PR with all gates

Optional Capstone Day (Day 3 — 6 hours)

Add-on package: Core + Capstone Day ($32,500). Teams of 3–4 take a real data engineering task (from their backlog or a curated scenario) through the complete agentic data workflow, guided by an instructor. Each team must demonstrate:

Context engineering: Unity Catalog injection, AGENTS.md, Databricks skills, MCP config
Agentic generation: Using ≥2 tools to build a data pipeline + dashboard
Backpressure compliance: BACKPRESSURE.md contract enforced by deterministic tools (schema, quality, lineage, performance)
Test coverage: Unit SQL tests + DLT integration tests + E2E snapshot + Playwright dashboard screenshots
Agentic review: Automated PR with data diff, lineage impact, cost estimate
Deployment plan: Staging schema → blue/green → canary with automatic rollback triggers

Instructor feedback: Real-time critique using the neurex.dev Agentic Data Workflow Review framework

Post-Workshop Resources

Databricks agentic starter kit: AI Dev Kit config, custom skills template, BACKPRESSURE.md spec
MCP server templates: Unity Catalog query, data profiling, job execution, notebook conversion
CI/CD blueprints: GitHub Actions for Databricks agentic PRs with all data gates
Tool configuration guides: Claude Code + Databricks, Copilot + Databricks, Codex + Databricks
30-day Slack access: Databricks-specific follow-up and troubleshooting

Platform Requirements


Databricks Workspace	Unity Catalog enabled; Serverless SQL + Jobs available
AI Dev Kit	`pip install databricks-ai-dev-kit` or equivalent
Agent Tools	Claude Code, Copilot, and/or Codex CLI installed locally
Git	Repository connected to Databricks Git integration
CI/CD	GitHub Actions or Azure DevOps for pipeline exercises
Data	Workshop data provided; customer data may be substituted for capstone