
DBT (Data Build Tool)

  • A tool for transforming data within the data warehouse
  • Focuses solely on the Transformation step
  • We can develop, test, document & deploy transformations

ETL -> ELT

  • Cost effective
  • Scalability
  • Flexibility
  • Faster Time-to-Insight
  • Improved Data Governance

Setup

  • Depending on the warehouse, install the matching adapter:

```shell
pip install dbt-redshift
pip install dbt-bigquery
pip install dbt-snowflake
pip install dbt-postgres  # if an error comes for postgres, run sudo apt-get install libpq-dev python3-dev first
```
  • Start with dbt init
  • The profile is in: ~/.dbt/profiles.yml
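  • A minimal ~/.dbt/profiles.yml for a local Postgres target might look like the sketch below. The profile name dbt_prac, the credentials, and the database name are placeholders; adjust to your setup:

```yaml
dbt_prac:           # must match the profile name in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost        # placeholder connection details
      port: 5432
      user: neo
      password: secret
      dbname: destination_db
      schema: public
      threads: 4
```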

Directories

  • models: consists of SQL queries that carry out the transformations on our data platform (data warehouse)
  • seeds: for storing static CSV files
  • analysis: only for temp ad-hoc queries. Not materialized in the warehouse
  • macros: reusable SQL/Jinja snippets go here
  • We can have a sources.yml in the models dir so database and table names are referenced dynamically in our models:

```yaml
sources:
  - name: destination_db # source name used in dbt
    description: "Raw data from the destination PostgreSQL database."
    database: destination_db
    schema: public
    tables:
      - name: films # the name to reference in dbt
        identifier: films # the real table name in the db
        description: "List of films."
      - name: actors
        description: "List of actors."
      - name: film_actors
        description: "Mapping between films and actors."
```
  • tests: for db tests

Modular Approach

  • We can break a complex SQL query into smaller SQL statements and then reference those smaller statements from the original query using SELECT * FROM {{ ref('orders_stg') }} O, where orders_stg is a SQL model file.
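  • For instance, a small staging model plus a model that builds on it via ref() (file, source, and column names here are illustrative, not from an actual project):

```sql
-- models/orders_stg.sql: cleans the raw orders (hypothetical source table)
SELECT
    order_id,
    customer_id,
    amount
FROM {{ source('destination_db', 'orders') }}
WHERE amount IS NOT NULL
```

```sql
-- models/orders_fact.sql: references the staging model instead of the raw table
SELECT O.customer_id, SUM(O.amount) AS total_revenue
FROM {{ ref('orders_stg') }} O
GROUP BY O.customer_id
```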

View vs Table

  • A view is suitable for minor transformations; a table suits heavier ones.
  • Transformations in views are executed at runtime whenever the view is referenced. Transformations in tables are pre-executed and the resulting data is persistently stored in the data platform.
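  • The materialization can be chosen per model with a config block at the top of the model file:

```sql
-- at the top of a model file; swap to materialized='table' for heavy transformations
{{ config(materialized='view') }}

SELECT ...
```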

Seeds

  • In dbt, seeds are CSV files that you can load into your data warehouse. They are particularly useful for handling static data that changes infrequently. For instance, you might have a CSV file containing a list of country codes and their corresponding country names.
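  • For example, a hypothetical seeds/country_codes.csv with columns code,country_name can be loaded with dbt seed and then joined like any other model:

```sql
-- after running `dbt seed`, reference the CSV by its file name via ref()
-- (users_stg and the column names are illustrative)
SELECT u.*, c.country_name
FROM {{ ref('users_stg') }} u
JOIN {{ ref('country_codes') }} c
  ON u.country_code = c.code
```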

Use Cases

  • Mapping Codes to Descriptive Names
    As we’ve seen in the previous example, you can use a seed to map product codes to product names. This can be extended to any scenario where you have codes that need to be mapped to more descriptive names. For example, you might have a seed that maps error codes to error descriptions, or abbreviations to their full forms.

  • Excluding Certain Data from Analysis
    Suppose you have a list of test email addresses or user IDs that you want to exclude from your analysis. You can create a seed with these email addresses or user IDs, and then use the dbt seed command to load this data into your warehouse. In your dbt models, you can then exclude these test email addresses or user IDs from your analysis.

  • Loading Small Reference Datasets
    If you have small reference datasets that don’t change often, it can be more efficient to load them into your warehouse as seeds rather than storing them as tables in your source databases. For example, you might have a small dataset of exchange rates that you want to use in your dbt models.

  • Data Validation
    You can use seeds to validate the data in your warehouse. For example, you might have a seed that contains the expected results of a certain calculation. You can then create a dbt test that compares the actual results in your warehouse to the expected results in your seed.

  • Machine Learning Model Testing
    If you’re a data scientist or machine learning engineer, you can use seeds to load test data into your warehouse. You can then use this test data to evaluate the performance of your machine learning models.

Note

You may need elevated permissions for running dbt. Run: sudo chown -R neo:neo /home/neo/Documents/projs/de/prac1/custom_postgres/target/run/custom_postgres/seeds. Replace the user and path accordingly.

Tests

  • 2 types of tests:
    • Singular:
      • created to serve a single model
      • hard-coded values
    • Generic:
      • values are not hard-coded (as the name suggests)
      • instead of writing many similar tests, write a few reusable ones
      • receives (model, column_name) as arguments
      • Built-in generic tests:
        • not_null: ensures no columns have null values
        • unique: ensures each row in a table is unique
        • accepted_values: ensures the column values are within the specified values
        • relationships: ensures the relationships between tables are correct
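  • The built-in generic tests are attached to columns in a model's YAML file; a sketch with illustrative model and column names:

```yaml
models:
  - name: order_fact
    columns:
      - name: OrderID
        tests:
          - not_null
          - unique
      - name: CustomerID
        tests:
          - relationships:
              to: ref('customers_stg')
              field: CustomerID
```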

Singular Test

  • Just create a .sql file under tests and write a normal SQL statement targeting the failing records.
  • Example: revenue should not be negative.

```sql
SELECT orderID
FROM {{ ref('order_fact') }}
WHERE Revenue < 0
```

  • Then run dbt test

Generic Test

  • Create a dir generic under tests and then a new test SQL file.
  • Sample one (string_not_empty.sql):

```sql
{% test string_not_empty(model, column_name) %}
    select {{ column_name }}
    from {{ model }}
    where TRIM({{ column_name }}) = ''
{% endtest %}
```
  • Then under models, create a test_config.yaml (any name):

```yaml
models:
  - name: customers_stg
    columns:
      - name: Email
        tests:
          - string_not_empty

      # built-in ones can be directly applied
      - name: StatusCD
        tests:
          - accepted_values:
              values: ["01", "02", "03"]
```
  • Tests can also be applied to the source data to catch issues at an early stage. Inside the sources.yml:

```yaml
...
- name: emp
  identifier: employee
  columns:
    - name: address
      tests:
        - not_null
        - string_not_empty
...
```

Docs

  • description field -> docs
  • We can create multiple docs in the same file, e.g. overview.md under models:

```md
{% docs StatusCD %}

All the documentation related to the StatusCD attribute goes here.

{% enddocs %}
```
  • Then we can reference this doc block in multiple descriptions by mentioning:

```yaml
...
- name: StatusCD
  description: "{{ doc('StatusCD') }}"
  tests:
    - accepted_values:
        values: ["01", "02", "03"]
```
  • Then do: dbt docs generate
  • And dbt docs serve

Jinja

  • 3 main languages in DBT:
    • SQL: to create model and tests
    • YAML: for configurations
    • Jinja: to make SQL & YAML dynamic
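  • A quick taste of Jinja making SQL dynamic: a set list plus a for-loop that expands into one column per value (the payments_stg model and its columns are illustrative):

```sql
{% set payment_methods = ['card', 'cash', 'voucher'] %}

SELECT
    order_id,
    {% for method in payment_methods %}
    SUM(CASE WHEN payment_method = '{{ method }}' THEN amount END)
        AS {{ method }}_amount{% if not loop.last %},{% endif %}
    {% endfor %}
FROM {{ ref('payments_stg') }}
GROUP BY order_id
```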

Macros

  • Under macros:

```sql
{% macro to_celsius(fahrenheit_col, decimal_places) %}
    ROUND(({{ fahrenheit_col }} - 32) * 5/9, {{ decimal_places }})
{% endmacro %}

-- can have multiple macros within the same sql file
-- can also have whole sql queries as a macro
```

  • Then use as:

```sql
{{ to_celsius('avg_temp_fahrenheit', 2) }}
```
Note

One dbt model will always result in one db object (table/view).

Packages

  • For community, by community.
  • Create a packages.yml file at the root and enter the relevant packages name and version:
```yaml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
```
  • Then do dbt deps
  • Then use them as jinja templates inside the sql models!
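  • For example, dbt_utils ships a generate_surrogate_key macro that hashes a list of columns into one key (the model and column names below are illustrative):

```sql
SELECT
    {{ dbt_utils.generate_surrogate_key(['film_id', 'actor_id']) }} AS film_actor_key,
    *
FROM {{ ref('film_actors_stg') }}
```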

Materializations

  • ways to incorporate dbt models into the Data Warehouse
  • applied when you run dbt run
  • i.e., whether the model's results are inserted, rebuilt, or merged into the warehouse

Materialization Types

  • materialized='view' (lightweight transformation logic & reduced storage cost)
  • materialized='table' (improve performance)
  • materialized='ephemeral' (doesn't materialize as an object in the data warehouse; it is inlined into referencing models)
  • materialized='incremental' (only processes the new data, not the old)
  • or a snapshot:

```sql
{% snapshot snapshot_name %}
...
{% endsnapshot %}
```
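  • A minimal snapshot sketch using the timestamp strategy (the source, unique_key, and updated_at column are assumptions for illustration):

```sql
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}
{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}
SELECT * FROM {{ source('destination_db', 'customers') }}
{% endsnapshot %}
```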

Incremental & Snapshot

  • is_incremental() returns True only when:
    • the destination table already exists in the database
    • dbt is not running in full-refresh mode
    • the model is configured with materialized='incremental'

```sql
{{
    config(materialized="incremental",
           unique_key="ORDER_ID")
}}
```
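  • Putting it together, an incremental model typically filters to new rows inside an is_incremental() guard; {{ this }} refers to the already-materialized target table (the source and updated_at column are illustrative):

```sql
{{ config(materialized='incremental', unique_key='ORDER_ID') }}

SELECT *
FROM {{ source('destination_db', 'orders') }}
{% if is_incremental() %}
-- on incremental runs, only pick rows newer than what the target already holds
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
```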

The snapshot topic is still to be covered.

Note

If a column name in the .csv file contains capital letters, wrap it in double quotes (" ") in any dbt SQL query.