Data Build Tool Summary¶
- dbt Core is an open-source, database-agnostic CLI. It lets data teams transform data inside their warehouse using SQL while applying software engineering best practices such as version control.
- dbt Cloud: A managed service with a web-based IDE, scheduler, job orchestration, and monitoring
Use Cases¶
- Modelling changes are easy to follow and revert
- Explicit dependencies between models
- Explore dependencies between models
- Data quality tests
- Incremental load of fact tables
- Track history of dimension tables
Major Concepts¶
- Models: the basic building blocks of the business logic. They are SQL files that dbt materializes as tables or views. Models can reference each other and use templates and macros.
Install¶
- Supported Python database
- Initialize a project with `dbt init`.
- Or, in a virtual environment created with uv and `uv sync`, use `dbt` directly.
pyproject.toml
The following dependencies are needed:
dbt_project.yml¶
Defines the structure of the project.
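As a rough illustration, a minimal project file could look like the sketch below. dbt expects this file to be named dbt_project.yml at the project root. The project name `airbnb` matches the `models:` key used later in these notes; the profile name and the folder lists are assumptions, not the author's actual configuration.

```yaml
# dbt_project.yml: minimal sketch, values are assumptions
name: airbnb            # project name, reused as the key under `models:`
version: "1.0.0"
config-version: 2
profile: airbnb         # hypothetical profile name; must match an entry in profiles.yml

model-paths: ["models"]        # sources, dimensions and facts folders live here
seed-paths: ["seeds"]          # CSV files loaded by dbt seed
snapshot-paths: ["snapshots"]  # snapshot definitions
clean-targets: ["target"]      # folders removed by dbt clean
```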
Work on Models¶
- Add a Kimball structure as sources, dimensions, and facts folders under the `models` folder.
- Add the SQL for each view as a `SELECT ...` statement. No `INSERT INTO` is needed.
- Validate each new SQL file: from the folder containing dbt_project.yml, run `dbt run` to build the view in Snowflake.
Example of output:
22:20:56 Found 1 model, 522 macros
22:20:56
22:20:56 Concurrency: 1 threads (target='dev')
22:20:56
22:20:57 1 of 1 START sql view model DEV.src_listings ................................... [RUN]
22:20:58 1 of 1 OK created sql view model DEV.src_listings .............................. [SUCCESS 1 in 1.17s]
22:20:59
22:20:59 Finished running 1 view model in 0 hours 0 minutes and 2.78 seconds (2.78s).
22:20:59
22:20:59 Completed successfully
22:20:59
22:20:59 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 NO-OP=0 TOTAL=1
And within Snowflake:

- `dbt run` creates the final SQL under the `target` folder.
Materialization¶
There are four materializations:
- View: a lightweight representation of the data, suited to models that are not reused. No table is recreated at each execution.
- Table: reusable data stored as a table, recreated at each run.
- Incremental: appends new records to an existing table, which suits fact tables and event data. The table is not recreated each time.
- Ephemeral (CTEs): used for aliasing and filtering data. Ephemeral models are not exposed in the data warehouse. For example, all the SQL under the `sources` folder becomes CTEs.
Materialization can be set globally in dbt_project.yml. Here all models are views, except those in the dimensions folder, which are tables, and those in the sources folder, which are ephemeral:
models:
  airbnb:
    +materialized: view
    dimensions:
      +materialized: table
    sources:
      +materialized: ephemeral
Incremental¶
Declare a fact table as incremental and add the condition that selects which records are appended to it. Here the review_date of a record must be later than the latest review_date already in the fct_reviews table:
{{
  config(
    materialized = 'incremental',
    on_schema_change='fail'
  )
}}
WITH src_reviews AS (
  SELECT * FROM {{ ref('src_reviews') }}
)
SELECT * FROM src_reviews
WHERE review_text is not null
{% if is_incremental() %}
  AND review_date > (select max(review_date) from {{ this }})
{% endif %}
- To rebuild the incremental table from scratch, run a full refresh with `dbt run --full-refresh`.
- With the sources set to ephemeral, the output of `dbt run` becomes:
23:16:16 1 of 4 START sql table model DEV.dim_hosts_cleansed ............................ [RUN]
23:16:18 1 of 4 OK created sql table model DEV.dim_hosts_cleansed ....................... [SUCCESS 14111 in 1.93s]
23:16:18 2 of 4 START sql table model DEV.dim_listings_cleansed ......................... [RUN]
23:16:20 2 of 4 OK created sql table model DEV.dim_listings_cleansed .................... [SUCCESS 17499 in 2.47s]
23:16:20 3 of 4 START sql incremental model DEV.fct_reviews ............................. [RUN]
23:16:23 3 of 4 OK created sql incremental model DEV.fct_reviews ........................ [SUCCESS 0 in 2.37s]
23:16:23 4 of 4 START sql table model DEV.dim_listings_with_hosts ....................... [RUN]
23:16:24 4 of 4 OK created sql table model DEV.dim_listings_with_hosts .................. [SUCCESS 17499 in 1.58s]
23:16:24
23:16:24 Finished running 1 incremental model, 3 table models in 0 hours 0 minutes and 9.83 seconds (9.83s).
`dbt compile` does not deploy to the target data warehouse.
Sources and Seeds¶
- Seeds are local files that dbt uploads to the data warehouse.
- Sources are an abstraction layer on top of the input tables. Source freshness can be checked automatically.
- Use `dbt seed` to load the seeds (CSV files) into the data warehouse.
- Sources may be defined in a YAML file.
- From there, the src_*.sql files need to be modified so that they reference the source aliases instead of table names in the data warehouse.
- For source freshness, pick one date column and add a config element to the table to define the freshness condition.
- Run `dbt source freshness` to validate the data freshness.
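The source definition and freshness check described above can be sketched as follows. This is an illustration only: the source name `airbnb`, the `raw` schema, the `raw_*` table identifiers, the `date` column, and the thresholds are all assumptions. Depending on the dbt version, `loaded_at_field` and `freshness` are placed either under `config`, as here, or directly on the table entry.

```yaml
# models/sources.yml: hypothetical file name and values
version: 2

sources:
  - name: airbnb              # alias used in source('airbnb', ...)
    schema: raw               # assumed schema holding the input tables
    tables:
      - name: listings        # alias referenced from src_listings.sql
        identifier: raw_listings
      - name: reviews
        identifier: raw_reviews
        config:
          loaded_at_field: date         # assumed date column used for the check
          freshness:
            warn_after: {count: 1, period: hour}
            error_after: {count: 24, period: hour}
```

With such a file, a src_*.sql model would select from `{{ source('airbnb', 'listings') }}` rather than from a hard-coded table name.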
Type-2 slowly changing dimensions¶
The goal is to keep the history of changes to each record over time, not just the latest record per key. dbt adds dbt_valid_from and dbt_valid_to columns to mark the period during which each version of a record is valid. The current version of a record has dbt_valid_to set to null.
Snapshots live in the snapshots folder. There are two strategies for detecting data changes:
- Timestamp: a unique key and an updated_at field are defined on the source model. These columns are used to determine changes.
- Check: any change in a set of columns (or all columns) is picked up as an update.
- To create a snapshot, define it in a YAML file under the snapshots folder.
- Running `dbt snapshot` creates a new table that holds the columns of the referenced table plus the added validity columns:
00:04:36 1 of 1 START snapshot DEV scd_raw_listings ..................................... [RUN]
00:04:40 1 of 1 OK snapshotted DEV.scd_raw_listings ..................................... [SUCCESS 17499 in 3.44s]
- After an existing record is updated, the next `dbt snapshot` run creates a historical record.
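The YAML snapshot definition mentioned above can be sketched as below, using the timestamp strategy. It assumes a dbt version that supports YAML-defined snapshots (1.9 or later). The source reference and the `id` and `updated_at` column names are assumptions, while `scd_raw_listings` is taken from the log output.

```yaml
# snapshots/scd_raw_listings.yml: hypothetical file name
snapshots:
  - name: scd_raw_listings                  # name of the table created by dbt snapshot
    relation: source('airbnb', 'listings')  # assumed source; ref('...') also works
    config:
      unique_key: id                        # assumed primary key column
      strategy: timestamp                   # or "check", together with check_cols
      updated_at: updated_at                # assumed last-modified column
```

For the check strategy, `updated_at` would be replaced by a `check_cols` entry listing the columns to compare, or `all`.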
Tests¶
Sources of Information¶
- Udemy training from Zoltan C. Toth with Git Repo. Example of data from Inside AirBnB.
- dbt Core
- Preset
- Snowflake username: jbcodeforce. Using key-pair authentication. Public key in Snowflake