
Data Engineering with Snowflake + AWS

Skills

  • SQL
  • Python
  • Airflow (data orchestration)
  • Apache Spark (batch processing)
    • used before dbt, at the initial stages of a data engineering pipeline, to perform the heavy lifting
  • Apache Kafka (stream processing)

Postgres Docker setup

  • docker pull postgres
  • docker run --name db -e POSTGRES_PASSWORD=secret -d postgres
  • docker exec -u postgres db createdb postgres-db: creating the db
  • docker exec -it db psql -U postgres -d postgres-db: entering the db

Problems with Traditional Data Warehouses

  • Scalability
  • Disaster (single point of failure)
  • No concurrency
  • No distributed processing
  • Meant for OLTP Processing (row oriented and normalized)

Example

  • SQL Server
  • PostgreSQL
  • MySQL

Cloud Data Warehouses

  • On-demand scalability
  • OLAP (columnar dbs)

Example

  • BigQuery
  • Redshift
  • Snowflake
  • AWS Athena
  • Hadoop Hive

Two aspects of Data Ecosystem

1) Data Warehouse Management

  • Strong SQL & Data Modelling Skills
  • Billing/Cost Management
  • User/Access Management

2) ETL/Data Movement

  • Strong Script/Coding Skills
  • Writing pipelines using Python/Java/Scala
  • Data Orchestration Tools (Airflow, Step Functions)
  • Deployment & Integration of analytics/ML Models
  • Logging, alerting & Notifications

About Snowflake

  • SaaS offering on the cloud
  • Snowflake separates its architecture into the below 3 layers:
    • Cloud Services
    • Query Processing (Compute)
    • Database Storage
  • Snowflake architecture is a hybrid of "Shared Disk Database" and "Shared Nothing Database".

Shared Nothing vs Shared Disk DB

A comparison between Shared Disk Database and Shared Nothing Database:

| Feature | Shared Disk Database | Shared Nothing Database |
| --- | --- | --- |
| Architecture | All nodes share a common disk storage. | Each node has its own private storage. |
| Data Distribution | Data is centrally stored and accessible by all nodes. | Data is partitioned and distributed across nodes. |
| Scalability | Limited scalability due to shared disk bottleneck. | High scalability as each node operates independently. |
| Fault Tolerance | Potential single point of failure due to shared disk. | High fault tolerance; failure in one node doesn't affect others. |
| Data Consistency | Easier to maintain consistency with a central disk. | Requires complex mechanisms to maintain consistency across nodes. |
| Performance | Can suffer from contention and I/O bottlenecks. | Generally better performance due to distributed nature and no I/O contention. |
| Cost | Can be costlier due to need for high-performance shared storage. | Can be more cost-effective as it uses commodity hardware. |
| Complexity | Simpler to manage as it uses centralized storage. | More complex to manage due to data partitioning and distribution. |
| Example Systems | Oracle RAC, IBM DB2 PureScale | Google Spanner, Cassandra, Hadoop HDFS |
| Use Cases | Suitable for applications needing high availability and data integrity. | Suitable for large-scale data processing and distributed applications. |

Snowflake Architecture

  • Cloud Services
  • Query Processing (Virtual Warehouse)
  • Database Storage

Snowflake - Virtual Warehouses

  • Cluster of computing resources
  • Size: XS to 5XL
  • Can be of 2 types:
    • Single Cluster Warehouse
      • (Min Cluster, Max Cluster) = (1, 1)
    • Multi Cluster Warehouse
      • (Min, Max) = (>= 1, > 1)
      • can run in 2 modes:
        • Maximized: Min = Max (e.g., (2, 2))
        • Auto-scale: Min < Max (e.g., (1, 2))
  • Max Cluster: the maximum number of clusters that can be spun up as load increases
  • Min Cluster: the number of clusters Snowflake spins up as soon as the warehouse starts

Scaling Policy

  • Standard (default): prevents queuing of operations by starting additional warehouses
  • Economy: conserves credits by keeping warehouses fully-loaded rather than starting additional warehouses
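
A minimal sketch of creating a multi-cluster warehouse with a scaling policy (warehouse name, size, and limits are illustrative; multi-cluster warehouses require the Enterprise edition):

CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WAREHOUSE_SIZE = 'XSMALL'        -- XS up to the largest size available
  MIN_CLUSTER_COUNT = 1            -- clusters started when the warehouse starts
  MAX_CLUSTER_COUNT = 3            -- upper bound for auto-scaling
  SCALING_POLICY = 'STANDARD'      -- or 'ECONOMY'
  AUTO_SUSPEND = 300               -- seconds of inactivity before suspending
  AUTO_RESUME = TRUE;              -- resume automatically when a query arrives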

Snowflake Billing

  • Total cost = number of credits consumed * cost per credit
  • Storage cost: billed per TB per month
  • Compute cost: billed per second of warehouse run time, with a 60-second minimum each time a warehouse is resumed

Resource Monitor

  • We can use a resource monitor to monitor the consumption of credits
  • It can notify us if a certain threshold is exceeded
  • It can also suspend the warehouse if that happens
  • Monitoring can be set at either the account level or the warehouse level
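
A hedged sketch of a resource monitor that notifies at 80% and suspends at 100% of a credit quota, attached to a warehouse (names and quota are placeholders):

CREATE RESOURCE MONITOR monthly_quota
  WITH CREDIT_QUOTA = 100                    -- credits allowed per interval
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80 PERCENT DO NOTIFY           -- alert first
           ON 100 PERCENT DO SUSPEND;        -- then suspend running warehouses

ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = monthly_quota;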

Snowflake Tables

  • For tables, we have:
    • Permanent Tables
    • Temporary Tables - exist within the session only
    • Transient Tables - persist after the session ends
    • Temp & Transient tables have no Fail-safe, and their Time Travel retention is limited to at most 1 day.
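
A small sketch of the three table types; table names and columns are illustrative:

CREATE TABLE orders (id INT, amount NUMBER(10,2));                 -- permanent: Time Travel + Fail-safe
CREATE TEMPORARY TABLE orders_tmp (id INT, amount NUMBER(10,2));   -- dropped when the session ends
CREATE TRANSIENT TABLE orders_stg (id INT, amount NUMBER(10,2))    -- persists, but no Fail-safe
  DATA_RETENTION_TIME_IN_DAYS = 0;                                 -- optionally turn off Time Travel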

Time-Travel

  • Snowflake Time Travel allows accessing historical data (data that has been changed or deleted) at any point within a defined period. Used for:
    • restoring objects (db/schema/tables) [everything in snowflake is an object]
    • backing up from a specific time in the past
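
A hedged sketch of Time Travel in action (table name, offset, and timestamp are placeholders):

-- query the table as it looked 1 hour ago
SELECT * FROM orders AT(OFFSET => -3600);

-- query the table as of a specific point in time
SELECT * FROM orders AT(TIMESTAMP => '2024-01-01 00:00:00'::TIMESTAMP_LTZ);

-- restore a dropped object within the retention period
UNDROP TABLE orders;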

Fail Safe

  • a non-configurable 7-day period during which historical data may be recoverable by Snowflake.
  • This period starts immediately after the Time Travel retention period ends.

Snowflake Views

  • allows the result of a query to be accessed as if it were a table
  • is of 2 types:
    • Non-materialized
    • Materialized:
      • a pre-computed data set derived from a query specification (the SELECT in the view definition) and stored for later use
      • as the data is pre-computed, querying a materialized view is faster than executing the query against the base table of the view
      • can't perform joins in a materialized view (it can reference only a single base table)
      • data accessed through a materialized view is always current, regardless of the DML done on the base table
  • Create a materialized view if:
    • the query results don't change often
    • the results of the view are used often
    • the query consumes a lot of resources
  • Create a non-materialized view if:
    • the opposite of the above holds
  • Create secure views:
    • only when needed (e.g., to hide the view definition and the underlying data)
    • they are less performant than other views because some optimizations are bypassed
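
A minimal sketch of the three kinds of views over the illustrative orders table:

-- materialized view: pre-computed and auto-maintained by Snowflake
CREATE MATERIALIZED VIEW large_orders_mv AS
  SELECT id, amount FROM orders WHERE amount > 1000;

-- non-materialized view: just a stored query
CREATE VIEW large_orders_v AS
  SELECT id, amount FROM orders WHERE amount > 1000;

-- secure view: hides the definition; some optimizations are bypassed
CREATE SECURE VIEW large_orders_secure AS
  SELECT id, amount FROM orders WHERE amount > 1000;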

Partitions

  • Partitioning: process of dividing a huge dataset into smaller chunks so that they can be manipulated independently.
  • In Snowflake, data is auto partitioned into smaller chunks of data in the range of 50 MB to 500 MB each called micro-partitions.
  • The micro-partitions can overlap in the range of values.
  • The metadata for each of these partitions is stored in the cloud services layer.
  • The metadata contains:
    • range of values for each column in the micro-partition
    • number of unique values
    • additional properties used for efficient query processing and data scanning

Clustering

  • a process to optimize data retrieval
  • performed on micro-partitions to ensure similar data resides in the same micro-partition which can be fetched in a single query
  • Clustering Key: a column or a group of columns that are designated to locate the data in the same micro-partition
    • helpful for ensuring the data is sorted/ordered inside the micro-partition; good while querying large tables
  • Pruning is the process where Snowflake avoids scanning unnecessary micro-partitions, using the clustering keys in the WHERE clause, resulting in better performance.
  • All of these data warehouses use columnar storage because organizing the data into columns makes analytical queries more efficient.

In order to check the details of the micro-partitions in snowflake, do:

select system$clustering_information('<TABLE_NAME>');

The sample result:

{
  "cluster_by_keys" : "LINEAR(L_SHIPDATE)",
  "total_partition_count" : 10336,
  "total_constant_partition_count" : 8349,
  "average_overlaps" : 0.6908,
  "average_depth" : 1.4082,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 8310,
    "00002" : 599,
    "00003" : 844,
    "00004" : 417,
    "00005" : 149,
    "00006" : 17,
    "00007" : 0,
    "00008" : 0,
    "00009" : 0,
    "00010" : 0,
    "00011" : 0,
    "00012" : 0,
    "00013" : 0,
    "00014" : 0,
    "00015" : 0,
    "00016" : 0
  },
  "clustering_errors" : [ ]
}
  • The closer total_constant_partition_count is to total_partition_count, the better the clustering.
  • total_constant_partition_count is the number of micro-partitions whose clustering key values are constant, i.e. partitions that need no further re-arranging when new data is ingested and re-clustering runs. Higher is better.
  • partition_depth_histogram is a key:value map of depth:number of partitions. In the example, there are no partitions with depth 0 and 8310 partitions with depth 1. Low depth is a good thing.
  • NOTE: Clustering is handled automatically. You might need to turn it off if it gets expensive.
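
A hedged sketch of setting a clustering key and pausing/resuming automatic re-clustering if it gets expensive (the table and column mirror the sample output above):

-- designate a clustering key so similar values land in the same micro-partitions
ALTER TABLE lineitem CLUSTER BY (l_shipdate);

-- pause / resume the background automatic clustering service
ALTER TABLE lineitem SUSPEND RECLUSTER;
ALTER TABLE lineitem RESUME RECLUSTER;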

Snowpipe

  • Data ingestion pipeline
  • Can ingest from AWS, GCP, and Azure storage
  • Can be triggered manually (via the REST API) as well as automatically (auto-ingest)
  • Billing is not as clearly defined as warehouse billing; it depends on several factors

Snowpipe: Data Loading Considerations

  • The number of load operations that run in parallel cannot exceed the number of data files to be loaded
  • Snowflake recommends producing files of roughly 100-250 MB (or larger) compressed
  • Loading very large files (50 GB, 100 GB, etc.) is not recommended: SPLIT THEM
  • We also cannot have an unlimited number of small files; aggregate them to minimize the per-file processing overhead
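
Tying this together, a hedged sketch of a Snowpipe definition with auto-ingest (pipe, table, stage, and file-format settings are placeholders; auto-ingest also requires event notifications configured on the cloud storage side):

CREATE PIPE orders_pipe
  AUTO_INGEST = TRUE                 -- load automatically on cloud storage event notifications
AS
  COPY INTO orders
  FROM @orders_stage                 -- external stage pointing at an S3/GCS/Azure location
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);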

Tasks

  • Similar to a scheduler that can execute SQL Queries at a defined frequency or time
  • Can be used to execute:
    • A single SQL Query
    • Multiple SQL Queries (tree of tasks)
  • Compute resources:
    • Serverless (Snowflake managed)
    • User Managed (one of the existing virtual warehouses)
  • NO COST for TASKS, ONLY FOR THE COMPUTE.
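
A hedged sketch of a scheduled root task and a dependent child task (warehouse, task names, schedule, and SQL bodies are illustrative):

-- root task: runs on a schedule using a user-managed warehouse
CREATE TASK load_orders_task
  WAREHOUSE = analytics_wh
  SCHEDULE = 'USING CRON 0 * * * * UTC'      -- hourly
AS
  INSERT INTO orders SELECT id, amount FROM orders_stg;

-- child task: runs after the root task finishes (tree of tasks)
CREATE TASK refresh_summary_task
  WAREHOUSE = analytics_wh
  AFTER load_orders_task
AS
  INSERT INTO order_summary SELECT id, SUM(amount) FROM orders GROUP BY id;

-- tasks are created suspended; resume children before the root
ALTER TASK refresh_summary_task RESUME;
ALTER TASK load_orders_task RESUME;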

Partition Key: High or Low cardinality?

When choosing a partition key (also known as a cluster key in some contexts) for your data, the cardinality of the key is an important consideration. The cardinality refers to the uniqueness of the values within the column chosen as the key. Here's a detailed look at how the cardinality of a partition key impacts performance and management, and some guidelines for choosing the right cardinality:

High Cardinality vs. Low Cardinality

  • High Cardinality: This means the column has many unique values. For example, a column like "User ID" where each user has a unique identifier.
  • Low Cardinality: This means the column has few unique values. For example, a column like "Gender" which might only have values like "Male" and "Female".

Factors to Consider

  1. Query Patterns:
    • High Cardinality: Ideal if your queries often filter on unique values or small ranges of values. For example, if you frequently query by specific users or by time (with many distinct timestamps), high cardinality keys can help in quickly locating the relevant partitions.
    • Low Cardinality: Suitable if your queries often filter on broad categories. For example, if you frequently group or filter data by a few distinct categories like region or product type, low cardinality keys can be effective.
  2. Data Distribution:
    • High Cardinality: Ensures a more even distribution of data across partitions, which can help avoid hotspots and balance the load more effectively. However, very high cardinality might result in too many small partitions, which can be inefficient.
    • Low Cardinality: Can lead to uneven data distribution if certain values are significantly more common than others. This might result in some partitions being much larger than others, causing performance bottlenecks.
  3. Performance:
    • High Cardinality: May improve performance for highly selective queries (e.g., querying specific user activities), as the data to be scanned is minimized.
    • Low Cardinality: May benefit queries that aggregate or scan large portions of data, but can suffer if the partitioning scheme results in large, uneven partitions.
  4. Maintenance and Management:
    • High Cardinality: More partitions to manage, which can increase overhead but also provides finer granularity for managing and optimizing data.
    • Low Cardinality: Fewer partitions to manage, but requires careful consideration to avoid unbalanced partition sizes.

Query Processing & Caching

  • Snowflake caches three types of data:
    • Query Results: expire after 24 hours of non-usage or if underlying data changes
    • Warehouse Cache (ssd/local cache): dropped as soon as the warehouse is suspended
    • Metadata (Cloud Services Layer): always persisted and updated; never dropped.

There is also a search optimization feature.
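
As a hedged one-liner, search optimization can be enabled per table (table name illustrative); it speeds up highly selective point-lookup queries on large tables at extra storage/compute cost:

ALTER TABLE orders ADD SEARCH OPTIMIZATION;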

Streams

  • Change Data Capture (CDC): the process of capturing DML operations (inserts, updates, and deletes) on your tables
  • Achieved using Streams
  • A stream can be thought of as a table
  • helps in:
    • incremental data loads
    • updating search indexes/history
    • streaming analytics/anomaly detection
    • event-driven data platforms

Stream Types

  • Standard: tracks all DML changes to the source table
  • Append-only: tracks row inserts only. Updates and deletes are not recorded.
  • Insert-only: same as append-only but for external tables.
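
A hedged sketch of a standard stream and an append-only stream on the illustrative orders table, plus how consuming the stream in a DML statement advances its offset (the orders_history target table is hypothetical):

-- standard stream: records inserts, updates, and deletes
CREATE STREAM orders_stream ON TABLE orders;

-- append-only stream: records inserts only
CREATE STREAM orders_append_stream ON TABLE orders APPEND_ONLY = TRUE;

-- consuming the stream in a DML statement advances its offset;
-- metadata columns describe each captured change
INSERT INTO orders_history
  SELECT id, amount, METADATA$ACTION, METADATA$ISUPDATE
  FROM orders_stream;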

Streams Staleness

  • A stream becomes stale if it is not consumed within the data retention period of its source table
  • e.g., 2 days, 5 days, etc.

Change Tracking

  • Read-only alternative to streams

UDF (User Defined Functions)

  • Lets us extend Snowflake with operations that are not available through the system-defined functions.
  • UDFs are reusable code and can be written in:
    • SQL
    • JavaScript
    • Java
    • Python
  • 2 types:
    • Scalar: returns one output value for each input row
    • Tabular (UDTF): returns zero, one, or multiple rows for each input row
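
A minimal sketch of a scalar SQL UDF and a tabular UDF (UDTF); function names and logic are illustrative, and orders is the hypothetical table used earlier:

-- scalar UDF: one output value per input row
CREATE FUNCTION area_of_circle(radius FLOAT)
  RETURNS FLOAT
  AS 'PI() * radius * radius';

-- tabular UDF (UDTF): returns a set of rows
CREATE FUNCTION orders_above(threshold NUMBER)
  RETURNS TABLE (id INT, amount NUMBER(10,2))
  AS 'SELECT id, amount FROM orders WHERE amount > threshold';

SELECT area_of_circle(2.0);
SELECT * FROM TABLE(orders_above(1000));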

Pushdown

  • improves performance by filtering out unneeded rows as early as possible during query processing
  • can also reduce memory consumption.
  • However, pushdown can allow confidential data to be exposed indirectly.
  • In essence: don't load/read the records that don't match the filter conditions in the WHERE clause; then scan the remaining rows in memory and select the columns from the SELECT clause.

REMEMBER: The Snowflake query optimizer bypasses these optimizations (pushdowns) for secure UDFs.

External Functions

  • code executed outside Snowflake
  • remotely executed code
  • relayed through proxy service
  • type of UDF
  • unlike a regular UDF, an external function doesn't contain its own code; instead, it calls code that is stored and executed outside Snowflake
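
A heavily hedged sketch of an external function behind an AWS API Gateway proxy; the integration name, role ARN, endpoint URLs, and function are all placeholders:

-- integration object describing the proxy service (placeholders throughout)
CREATE API INTEGRATION my_api_integration
  API_PROVIDER = aws_api_gateway
  API_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-ext-fn-role'
  API_ALLOWED_PREFIXES = ('https://abc123.execute-api.us-east-1.amazonaws.com/prod/')
  ENABLED = TRUE;

-- the external function itself: no body, just the remote endpoint to call
CREATE EXTERNAL FUNCTION sentiment_score(text VARCHAR)
  RETURNS VARIANT
  API_INTEGRATION = my_api_integration
  AS 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/sentiment';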

Column-Level Data Masking

  • to mask or hide the values of a column (say an intern joins and you don't want to give them access to sensitive column data)
  • we can mask using stars, random numbers, etc., as in the sketch below
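
A hedged sketch of column-level masking (policy name, role, table, and column are illustrative):

-- mask email addresses for everyone except a privileged role
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('DATA_ADMIN') THEN val
    ELSE '*****'
  END;

-- attach the policy to a column
ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;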

Row-Level Data Security

  • instead of creating different views for different users, we can implement row-level access, i.e. filter the rows a query returns based on the accessing user (see the sketch below)
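
A hedged sketch of a row access policy; the policy name, the region_access_map mapping table, and the columns are all hypothetical:

-- return TRUE only for rows the current role is allowed to see
CREATE ROW ACCESS POLICY region_policy AS (region STRING) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'ACCOUNTADMIN'
  OR EXISTS (
    SELECT 1
    FROM region_access_map m           -- hypothetical mapping table: role -> allowed region
    WHERE m.role_name = CURRENT_ROLE()
      AND m.allowed_region = region
  );

-- attach the policy to a table column
ALTER TABLE orders ADD ROW ACCESS POLICY region_policy ON (region);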

Snowflake Vs Other solutions

Comparing Snowflake with other cloud provider solutions like Amazon Redshift, Google BigQuery, and Databricks involves evaluating several factors, including architecture, performance, scalability, ease of use, cost, and additional features. Here’s a detailed comparison:

Snowflake vs. Amazon Redshift

  1. Architecture:
    • Snowflake: Uses a multi-cluster, shared data architecture with separation of storage and compute, allowing independent scaling.
    • Redshift: Uses a single-cluster architecture where storage and compute are coupled, although Redshift Spectrum allows querying data in S3, providing some flexibility.
  2. Performance and Scalability:
    • Snowflake: Automatically scales compute resources up or down and can handle high concurrency.
    • Redshift: Requires manual resizing of clusters, though RA3 instances allow scaling compute and storage independently to some extent.
  3. Ease of Use:
    • Snowflake: Fully managed, no need for database administration tasks.
    • Redshift: Requires more manual administration, including vacuuming and tuning of sort/distribution keys.
  4. Cost:
    • Snowflake: Pay-as-you-go pricing for compute and storage independently.
    • Redshift: Pay-per-hour for the cluster, with additional charges for storage and Redshift Spectrum queries.
  5. Data Sharing and Collaboration:
    • Snowflake: Enables secure data sharing without data movement.
    • Redshift: Less advanced data sharing capabilities, often requiring data copying.
  6. Support for Semi-Structured Data:
    • Snowflake: Native support for JSON, Avro, Parquet.
    • Redshift: Supports JSON and other semi-structured data through Redshift Spectrum, but with more complexity.

Snowflake vs. Google BigQuery

  1. Architecture:
    • Snowflake: Separation of storage and compute with a multi-cluster, shared data architecture.
    • BigQuery: Serverless, fully managed data warehouse with a unique architecture that separates storage and compute automatically.
  2. Performance and Scalability:
    • Snowflake: Scales compute resources based on workloads with high concurrency support.
    • BigQuery: Scales automatically with virtually unlimited compute power and high concurrency support.
  3. Ease of Use:
    • Snowflake: Fully managed, straightforward to use with no infrastructure management.
    • BigQuery: Also fully managed and serverless, with seamless scaling and automatic query optimization.
  4. Cost:
    • Snowflake: Pay-as-you-go for compute and storage independently.
    • BigQuery: Charges based on the amount of data processed by queries (query-by-query pricing) and storage costs.
  5. Data Sharing and Collaboration:
    • Snowflake: Secure data sharing capabilities without data duplication.
    • BigQuery: Data sharing through datasets, with integration into Google's ecosystem for collaboration.
  6. Support for Semi-Structured Data:
    • Snowflake: Native support for JSON, Avro, Parquet.
    • BigQuery: Strong support for semi-structured data formats like JSON and Avro.

Snowflake vs. Databricks

  1. Architecture:
    • Snowflake: Multi-cluster, shared data architecture with separation of storage and compute.
    • Databricks: Built on Apache Spark, integrates with data lakes, and provides a unified analytics platform for data engineering, data science, and machine learning.
  2. Performance and Scalability:
    • Snowflake: Scales compute resources dynamically with high concurrency.
    • Databricks: Leverages Spark's distributed computing capabilities, highly scalable with optimized performance for big data and machine learning workloads.
  3. Ease of Use:
    • Snowflake: User-friendly, fully managed service requiring minimal administration.
    • Databricks: More complex due to the integration of various data processing and analytics tools, requires some Spark knowledge.
  4. Cost:
    • Snowflake: Pay-as-you-go for compute and storage independently.
    • Databricks: Pay for compute resources and data processing, with different pricing tiers for data engineering and analytics workloads.
  5. Data Sharing and Collaboration:
    • Snowflake: Secure and straightforward data sharing without data movement.
    • Databricks: Collaborative environment with notebooks for data scientists and engineers, integrates well with data lakes for sharing data.
  6. Support for Semi-Structured Data:
    • Snowflake: Native support for JSON, Avro, Parquet.
    • Databricks: Excellent support for semi-structured data through Spark, capable of handling complex data processing and transformations.

Summary

  • Snowflake: Best for ease of use, flexibility, and strong data sharing capabilities. Ideal for organizations needing a fully managed, highly scalable, and easy-to-use data warehouse with advanced features.
  • Amazon Redshift: Good for users already within the AWS ecosystem, offers powerful performance but requires more manual management and tuning.
  • Google BigQuery: Best for a serverless, fully managed data warehouse experience with powerful automatic scaling and cost-efficient query pricing. Excellent for big data analytics.
  • Databricks: Ideal for big data and machine learning workloads, offering a unified analytics platform that integrates deeply with data lakes. Best suited for data engineers and data scientists requiring advanced data processing capabilities.