DBT (Data Build Tool)
- a tool for transforming data within the data warehouse
- focuses solely on the Transform step of ETL/ELT
- lets us develop, test, document & deploy those transformations
ETL -> ELT
- Cost effective
- Scalability
- Flexibility
- Faster Time-to-Insight
- Improved Data Governance
Setup
- Depending on the warehouse, install the matching adapter (e.g. `dbt-postgres`, `dbt-snowflake`)
- Start with `dbt init`; the connection profile lives in `~/.dbt/profiles.yml`
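A minimal `profiles.yml` sketch for a Postgres warehouse (profile name, credentials, database, and schema are placeholders):

```yaml
# ~/.dbt/profiles.yml -- hypothetical Postgres profile; all values are placeholders
my_project:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      user: neo
      password: "{{ env_var('DBT_PASSWORD') }}"   # avoid committing plain-text passwords
      dbname: analytics
      schema: dbt_dev
      threads: 4
```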
Directories
- `models`: SQL queries that carry out the transformations on our data platform (data warehouse)
- `seeds`: for storing static CSV files
- `analysis`: only for temporary ad-hoc queries; not materialized in the warehouse
- `macros`: reusable SQL/Jinja code goes here
- `tests`: for data tests
- We can have a `sources.yml` in the `models` dir so that database/schema names are resolved dynamically in our models:
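A minimal `sources.yml` sketch (source, database, schema, and table names are placeholders):

```yaml
# models/sources.yml -- hypothetical source definition
version: 2

sources:
  - name: raw                 # how models refer to this source
    database: analytics       # actual database name
    schema: public            # actual schema name
    tables:
      - name: orders
      - name: customers
```

Models then refer to `{{ source('raw', 'orders') }}` instead of hard-coding `analytics.public.orders`.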
Modular Approach
- We can break a complex SQL query down into smaller models and then reference those models from the original query using `SELECT * FROM {{ ref('orders_stg') }}`, where `orders_stg` is a separate SQL model file.
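A sketch of the modular approach, with hypothetical source and model names:

```sql
-- models/staging/orders_stg.sql : a small, self-contained transformation
SELECT
    order_id,
    customer_id,
    amount,
    LOWER(status) AS status
FROM {{ source('raw', 'orders') }}
```

```sql
-- models/orders_clean.sql : reuses the staging model instead of repeating its SQL
SELECT *
FROM {{ ref('orders_stg') }}
WHERE status != 'cancelled'
```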
View vs Table
- A view is suitable for minor transformations; a table is better for heavier ones.
- Transformations in views are executed at runtime whenever they are referenced, while transformations in tables are pre-executed and the resulting data is persistently stored in the data platform.
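The choice is made per model with a `config()` block at the top of its SQL file; a sketch, reusing the hypothetical models from above:

```sql
-- models/orders_summary.sql : heavier aggregation, so persist it as a table
{{ config(materialized='table') }}   -- 'view' would re-run this SQL on every query

SELECT
    customer_id,
    SUM(amount) AS total_revenue
FROM {{ ref('orders_stg') }}
GROUP BY customer_id
```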
Seeds
- In dbt, seeds are CSV files that you can load into your data warehouse. They are particularly useful for handling static data that changes infrequently. For instance, you might have a CSV file containing a list of country codes and their corresponding country names.
Use Cases
- Mapping Codes to Descriptive Names: as we've seen in the previous example, you can use a seed to map product codes to product names. This can be extended to any scenario where you have codes that need to be mapped to more descriptive names, for example a seed that maps error codes to error descriptions, or abbreviations to their full forms.
- Excluding Certain Data from Analysis: suppose you have a list of test email addresses or user IDs that you want to exclude from your analysis. Create a seed with them, load it with `dbt seed`, and then exclude those rows in your dbt models (see the sketch after this list).
- Loading Small Reference Datasets: if you have small reference datasets that don't change often, it can be more efficient to load them into your warehouse as seeds rather than storing them as tables in your source databases, for example a small dataset of exchange rates used in your dbt models.
- Data Validation: a seed can contain the expected results of a certain calculation; a dbt test can then compare the actual results in your warehouse to the expected results in the seed.
- Machine Learning Model Testing: data scientists and ML engineers can use seeds to load test data into the warehouse and evaluate model performance against it.
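As an illustration of the exclusion use case, assume a hypothetical seed `seeds/excluded_emails.csv` with a single `email` column; after `dbt seed`, a model can anti-join against it (seeds are referenced with `ref()` just like models):

```sql
-- models/clean_users.sql : drop rows whose email appears in the excluded_emails seed
SELECT u.*
FROM {{ source('raw', 'users') }} AS u
WHERE u.email NOT IN (SELECT email FROM {{ ref('excluded_emails') }})
```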
Note
You may need elevated permissions when running dbt. If so: `sudo chown -R neo:neo /home/neo/Documents/projs/de/prac1/custom_postgres/target/run/custom_postgres/seeds` (replace the user and path accordingly).
Tests
- 2 types of tests:
  - Singular:
    - written to serve a single model
    - use hard-coded values
  - Generic:
    - not hard-coded (as the name suggests)
    - instead of writing many near-identical tests, write a few parameterized ones that take the standard `(model, column_name)` arguments
- Built-in generic tests:
  - `not_null`: ensures a column has no null values
  - `unique`: ensures every value in the column is unique
  - `accepted_values`: ensures the column's values fall within the specified set
  - `relationships`: ensures the relationships between tables are correct (referential integrity)
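A sketch of the built-in generic tests attached to a hypothetical `orders` model in a `schema.yml`:

```yaml
# models/schema.yml -- hypothetical model and columns
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: customer_id
```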
Singular Test
- Just create a `.sql` file under `tests` and write a normal SQL statement targeting the failing records, e.g. revenue should not be negative (see the sketch below).
- Then run `dbt test`.
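A sketch of the negative-revenue check as a singular test, reusing the hypothetical `orders_summary` model from earlier; `dbt test` fails the test if the query returns any rows:

```sql
-- tests/no_negative_revenue.sql
-- select the "bad" rows; zero rows returned = test passes
SELECT *
FROM {{ ref('orders_summary') }}
WHERE total_revenue < 0
```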
Generic Test
- Create a dir `generic` under `tests` and add a new test SQL file there; a sample is sketched below.
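A minimal generic-test sketch (the test name `non_negative` and its logic are hypothetical); it receives the standard `(model, column_name)` arguments and returns the failing rows:

```sql
-- tests/generic/non_negative.sql
{% test non_negative(model, column_name) %}

SELECT *
FROM {{ model }}
WHERE {{ column_name }} < 0

{% endtest %}
```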
- Then under `models`, create a `test_config.yaml` (any name) that applies the test to the relevant models/columns:
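Assuming the hypothetical model and test from the sketches above:

```yaml
# models/test_config.yaml -- attach the custom generic test to a column
version: 2

models:
  - name: orders_summary
    columns:
      - name: total_revenue
        tests:
          - non_negative
```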
- Generic tests can also be applied to the source data so problems are caught at an early stage, inside `sources.yml`:
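For instance, extending the earlier `sources.yml` sketch with column-level tests:

```yaml
# models/sources.yml -- tests applied directly to the raw source
version: 2

sources:
  - name: raw
    database: analytics
    schema: public
    tables:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
```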
Docs
- The `description` field in the YAML files is what ends up in the generated docs.
- We can create multiple doc blocks in the same file, e.g. an `overview.md` under `models`:
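A sketch, with a hypothetical doc block name:

```md
{% docs orders_overview %}
One row per order, enriched with customer details.
Revenue figures are in USD.
{% enddocs %}
```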
- Then we can reference a doc block in multiple descriptions by mentioning it with `doc()`:
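For example, reusing the hypothetical `orders_overview` block in a model's YAML:

```yaml
# models/schema.yml -- the same doc block can back any description
version: 2

models:
  - name: orders
    description: "{{ doc('orders_overview') }}"
```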
- Then run `dbt docs generate` followed by `dbt docs serve`.
Jinja
- 3 main languages in DBT:
- SQL: to create models and tests
- YAML: for configurations
- Jinja: to make SQL & YAML dynamic
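A small sketch of what Jinja adds on top of SQL: a loop that expands into one aggregate column per payment method (table and column names are assumptions):

```sql
-- models/payments_pivot.sql -- the Jinja loop renders one SUM(...) column per method
{% set payment_methods = ['card', 'cash', 'voucher'] %}

SELECT
    order_id,
    {% for method in payment_methods %}
    SUM(CASE WHEN payment_method = '{{ method }}' THEN amount ELSE 0 END) AS {{ method }}_amount
    {%- if not loop.last %},{% endif %}
    {% endfor %}
FROM {{ source('raw', 'payments') }}
GROUP BY order_id
```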
Macros
- Macros are defined under the `macros` directory, e.g.:
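A minimal macro sketch (name and logic are hypothetical), saved e.g. as `macros/cents_to_dollars.sql`:

```sql
{% macro cents_to_dollars(column_name, decimal_places=2) %}
    ROUND({{ column_name }} / 100.0, {{ decimal_places }})
{% endmacro %}
```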
- Then call the macro inside a model:
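A usage sketch (the column name `amount_cents` is assumed):

```sql
-- inside any model
SELECT
    order_id,
    {{ cents_to_dollars('amount_cents') }} AS amount_dollars
FROM {{ source('raw', 'orders') }}
```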
Note
One dbt model will always result in one db object (table/view).
Packages
- For community, by community.
- Create a `packages.yml` file at the project root and enter the relevant package names and versions:
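For example, pulling in `dbt_utils` (pin whichever version matches your dbt install; `1.1.1` here is only illustrative):

```yaml
# packages.yml at the project root
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
```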
- Then run `dbt deps`
- Then use the packaged macros as Jinja calls inside the SQL models!
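For instance, a `dbt_utils` macro in a model (assumes the package above is installed; `generate_surrogate_key` is the dbt_utils 1.x name):

```sql
-- build a surrogate key from two columns using a packaged macro
SELECT
    {{ dbt_utils.generate_surrogate_key(['order_id', 'order_date']) }} AS order_key,
    *
FROM {{ ref('orders_stg') }}
```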
Materializations
- ways to incorporate dbt models into the data warehouse
- on `dbt run`, dbt decides how each model is written to the warehouse: create it as a view or table, insert into it, merge into it, etc.
Materialization Types
- `materialized='view'`: lightweight transformation logic & reduced storage cost
- `materialized='table'`: improved query performance
- `materialized='ephemeral'`: doesn't materialize as an object in the data warehouse; it is inlined into the models that reference it
- `materialized='incremental'`: only processes the new data, not the old
- the materialization is set per model in its `config()` block, or project-wide in `dbt_project.yml`:
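A sketch of the project-wide form (project and folder names are assumptions):

```yaml
# dbt_project.yml -- folder-level defaults, overridable per model
models:
  my_project:
    staging:
      +materialized: view
    marts:
      +materialized: table
```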
Incremental & Snapshot
- `is_incremental()` returns True only when:
  - the destination table already exists in the database
  - dbt is not running in full-refresh mode (`--full-refresh`)
  - the model is configured with `materialized='incremental'`
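A sketch of an incremental model using `is_incremental()` (source table, unique key, and timestamp column are assumptions):

```sql
-- models/events_incremental.sql
{{ config(materialized='incremental', unique_key='event_id') }}

SELECT
    event_id,
    event_type,
    created_at
FROM {{ source('raw', 'events') }}

{% if is_incremental() %}
  -- on incremental runs, only pull rows newer than what is already in the target table
  WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
```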
The SNAPSHOT topic is left to cover.
Note
If a column name in the seed `.csv` file contains capital letters, wrap it in double quotes (`" "`) when referencing it in any SQL query in dbt.