
Data Engineering with Snowflake + AWS

Skills

  • SQL
  • Python
  • Airflow (data orchestration)
  • Apache Spark (batch processing)
    • used before dbt, at the initial stages of a data engineering pipeline, to perform the heavy lifting
  • Apache Kafka (stream processing)

Postgres Docker setup

  • docker pull postgres
  • docker run --name db -e POSTGRES_PASSWORD=secret -d postgres
  • docker exec -u postgres db createdb postgres-db: creating the db
  • docker exec -it db psql -U postgres -d postgres-db: entering the db

Problems with Traditional Data Warehouses

  • Scalability
  • Disaster (single point of failure)
  • No concurrency
  • No distributed processing
  • Meant for OLTP Processing (row oriented and normalized)

Example

  • SQL Server
  • PostgreSQL
  • MySQL

Cloud Data Warehouses

  • On-demand scalability
  • OLAP (columnar dbs)

Example

  • BigQuery
  • Redshift
  • Snowflake
  • AWS Athena
  • Hadoop Hive

Two aspects of Data Ecosystem

1) Data Warehouse Management

  • Strong SQL & Data Modelling Skills
  • Billing/Cost Management
  • User/Access Management

2) ETL/Data Movement

  • Strong Script/Coding Skills
  • Writing pipelines using Python/Java/Scala
  • Data Orchestration Tools (Airflow, Step Functions)
  • Deployment & Integration of analytics/ML Models
  • Logging, alerting & Notifications

About Snowflake

  • SaaS offering on the cloud
  • Snowflake separates its architecture into the below 3 layers:
    • Cloud Services
    • Query Processing (Compute)
    • Database Storage
  • Snowflake architecture is a hybrid of "Shared Disk Database" and "Shared Nothing Database".

Shared Nothing vs Shared Disk DB

A comparison between Shared Disk Database and Shared Nothing Database:

| Feature | Shared Disk Database | Shared Nothing Database |
| --- | --- | --- |
| Architecture | All nodes share a common disk storage. | Each node has its own private storage. |
| Data Distribution | Data is centrally stored and accessible by all nodes. | Data is partitioned and distributed across nodes. |
| Scalability | Limited scalability due to shared disk bottleneck. | High scalability as each node operates independently. |
| Fault Tolerance | Potential single point of failure due to shared disk. | High fault tolerance; failure in one node doesn't affect others. |
| Data Consistency | Easier to maintain consistency with a central disk. | Requires complex mechanisms to maintain consistency across nodes. |
| Performance | Can suffer from contention and I/O bottlenecks. | Generally better performance due to distributed nature and no I/O contention. |
| Cost | Can be costlier due to need for high-performance shared storage. | Can be more cost-effective as it uses commodity hardware. |
| Complexity | Simpler to manage as it uses centralized storage. | More complex to manage due to data partitioning and distribution. |
| Example Systems | Oracle RAC, IBM DB2 PureScale | Google Spanner, Cassandra, Hadoop HDFS |
| Use Cases | Suitable for applications needing high availability and data integrity. | Suitable for large-scale data processing and distributed applications. |

Snowflake Architecture

  • Cloud Services
  • Query Processing (Virtual Warehouse)
  • Database Storage

Snowflake - Virtual Warehouses

  • Cluster of computing resources
  • Size: XS to 5XL
  • Can be of 2 types:
    • Single Cluster Warehouse
      • (Min Cluster, Max Cluster) = (1, 1)
    • Multi Cluster Warehouse
      • (Min, Max) = (>= 1, > 1)
      • can run in 2 modes:
        • Maximized: Min = Max (e.g., (2, 2))
        • Auto-scale: Min < Max (e.g., (1, 2))
  • Max Cluster: the maximum number of clusters that can be spun up as load increases
  • Min Cluster: the number of clusters Snowflake spins up as soon as the warehouse starts

Scaling Policy

  • Standard (default): prevents queuing of operations by starting additional warehouses
  • Economy: conserves credits by keeping warehouses fully-loaded rather than starting additional warehouses
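
A minimal sketch of creating a multi-cluster warehouse with a scaling policy (warehouse name, size, and limits are illustrative; multi-cluster warehouses require the Enterprise edition):

CREATE WAREHOUSE IF NOT EXISTS analytics_wh
  WAREHOUSE_SIZE = 'XSMALL'        -- XS up to the largest size available
  MIN_CLUSTER_COUNT = 1            -- clusters started when the warehouse starts
  MAX_CLUSTER_COUNT = 3            -- upper bound for auto-scaling
  SCALING_POLICY = 'STANDARD'      -- or 'ECONOMY'
  AUTO_SUSPEND = 300               -- seconds of inactivity before suspending
  AUTO_RESUME = TRUE;              -- resume automatically when a query arrives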

Snowflake Billing

  • Total cost = number of credits consumed * cost per credit
  • Storage cost: billed per TB per month
  • Compute cost: billed per second of warehouse run time, with a 60-second minimum each time a warehouse is resumed

Resource Monitor

  • We can use a resource monitor to monitor the consumption of credits
  • It can notify us if a certain threshold is exceeded
  • It can also suspend the warehouse if that happens
  • Monitoring can be set at either the account level or the warehouse level
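
A hedged sketch of a resource monitor that notifies at 80% and suspends at 100% of a credit quota, attached to a warehouse (names and quota are placeholders):

CREATE RESOURCE MONITOR monthly_quota
  WITH CREDIT_QUOTA = 100                    -- credits allowed per interval
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80 PERCENT DO NOTIFY           -- alert first
           ON 100 PERCENT DO SUSPEND;        -- then suspend running warehouses

ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = monthly_quota;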

Snowflake Tables

  • For tables, we have:
    • Permanent Tables
    • Temporary Tables - exist within the session only
    • Transient Tables - persist after the session ends
    • Temp & Transient tables have no Fail-safe, and their Time Travel retention is limited to at most 1 day.
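
A small sketch of the three table types; table names and columns are illustrative:

CREATE TABLE orders (id INT, amount NUMBER(10,2));                 -- permanent: Time Travel + Fail-safe
CREATE TEMPORARY TABLE orders_tmp (id INT, amount NUMBER(10,2));   -- dropped when the session ends
CREATE TRANSIENT TABLE orders_stg (id INT, amount NUMBER(10,2))    -- persists, but no Fail-safe
  DATA_RETENTION_TIME_IN_DAYS = 0;                                 -- optionally turn off Time Travel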

Time-Travel

  • Snowflake Time Travel allows accessing historical data (data that has been changed or deleted) at any point within a defined period. Used for:
    • restoring objects (db/schema/tables) [everything in snowflake is an object]
    • backing up from a specific time in the past
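
A hedged sketch of Time Travel in action (table name, offset, and timestamp are placeholders):

-- query the table as it looked 1 hour ago
SELECT * FROM orders AT(OFFSET => -3600);

-- query the table as of a specific point in time
SELECT * FROM orders AT(TIMESTAMP => '2024-01-01 00:00:00'::TIMESTAMP_LTZ);

-- restore a dropped object within the retention period
UNDROP TABLE orders;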

Fail Safe

  • a non-configurable 7-day period during which historical data may be recoverable by Snowflake.
  • This period starts immediately after the Time Travel retention period ends.

Snowflake Views

  • allows the result of a query to be accessed as if it were a table
  • is of 2 types:
    • Non-materialized
    • Materialized:
      • a pre-computed data set derived from a query specification (the SELECT in the view definition) and stored for later use
      • as the data is pre-computed, querying a materialized view is faster than executing the query against the base table of the view
      • can't perform joins in a materialized view (it can reference only a single base table)
      • data accessed through a materialized view is always current, regardless of the DML done on the base table
  • Create a materialized view if:
    • the query results don't change often
    • the results of the view are used often
    • the query consumes a lot of resources
  • Create a non-materialized view if:
    • the opposite of the above holds
  • Create secure views:
    • only when needed (e.g., to hide the view definition and the underlying data)
    • they are less performant than other views because some optimizations are bypassed
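
A minimal sketch of the three kinds of views over the illustrative orders table:

-- materialized view: pre-computed and auto-maintained by Snowflake
CREATE MATERIALIZED VIEW large_orders_mv AS
  SELECT id, amount FROM orders WHERE amount > 1000;

-- non-materialized view: just a stored query
CREATE VIEW large_orders_v AS
  SELECT id, amount FROM orders WHERE amount > 1000;

-- secure view: hides the definition; some optimizations are bypassed
CREATE SECURE VIEW large_orders_secure AS
  SELECT id, amount FROM orders WHERE amount > 1000;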

Partitions

  • Partitioning: process of dividing a huge dataset into smaller chunks so that they can be manipulated independently.
  • In Snowflake, data is auto partitioned into smaller chunks of data in the range of 50 MB to 500 MB each called micro-partitions.
  • The micro-partitions can overlap in the range of values.
  • The metadata for each of these partitions is stored in the cloud services layer.
  • The metadata contains:
    • range of values for each column in the micro-partition
    • number of unique values
    • additional properties used for efficient query processing and data scanning

Clustering

  • a process to optimize data retrieval
  • performed on micro-partitions to ensure similar data resides in the same micro-partition which can be fetched in a single query
  • Clustering Key: a column or a group of columns that are designated to locate the data in the same micro-partition
    • helpful for ensuring the data is sorted/ordered inside the micro-partition; good while querying large tables
  • Pruning is the process where Snowflake avoids scanning unnecessary micro-partitions, using the clustering keys in the WHERE clause, resulting in better performance.
  • All of these data warehouses use columnar storage because organizing the data into columns makes analytical queries more efficient.

In order to check the details of the micro-partitions in snowflake, do:

select system$clustering_information('<TABLE_NAME>');

The sample result:

{
  "cluster_by_keys" : "LINEAR(L_SHIPDATE)",
  "total_partition_count" : 10336,
  "total_constant_partition_count" : 8349,
  "average_overlaps" : 0.6908,
  "average_depth" : 1.4082,
  "partition_depth_histogram" : {
    "00000" : 0,
    "00001" : 8310,
    "00002" : 599,
    "00003" : 844,
    "00004" : 417,
    "00005" : 149,
    "00006" : 17,
    "00007" : 0,
    "00008" : 0,
    "00009" : 0,
    "00010" : 0,
    "00011" : 0,
    "00012" : 0,
    "00013" : 0,
    "00014" : 0,
    "00015" : 0,
    "00016" : 0
  },
  "clustering_errors" : [ ]
}
  • The closer total_constant_partition_count is to total_partition_count, the better the clustering.
  • total_constant_partition_count is the number of micro-partitions whose clustering key values are constant, i.e. partitions that need no further re-arranging when new data is ingested and re-clustering runs. Higher is better.
  • partition_depth_histogram is a key:value map of depth:number of partitions. In the example, there are no partitions with depth 0 and 8310 partitions with depth 1. Low depth is a good thing.
  • NOTE: Clustering is handled automatically. You might need to turn it off if it gets expensive.
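
A hedged sketch of setting a clustering key and pausing/resuming automatic re-clustering if it gets expensive (the table and column mirror the sample output above):

-- designate a clustering key so similar values land in the same micro-partitions
ALTER TABLE lineitem CLUSTER BY (l_shipdate);

-- pause / resume the background automatic clustering service
ALTER TABLE lineitem SUSPEND RECLUSTER;
ALTER TABLE lineitem RESUME RECLUSTER;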

Snowpipe

  • Data ingestion pipeline
  • Can ingest from AWS, GCP, and Azure storage
  • Can be triggered manually (via the REST API) as well as automatically (auto-ingest)
  • Billing is not as clearly defined as warehouse billing; it depends on several factors

Snowpipe: Data Loading Considerations

  • The number of load operations that run in parallel cannot exceed the number of data files to be loaded
  • Snowflake recommends producing files of roughly 100-250 MB (or larger) compressed
  • Loading very large files (50 GB, 100 GB, etc.) is not recommended: SPLIT THEM
  • We also cannot have an unlimited number of small files; aggregate them to minimize the per-file processing overhead
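
Tying this together, a hedged sketch of a Snowpipe definition with auto-ingest (pipe, table, stage, and file-format settings are placeholders; auto-ingest also requires event notifications configured on the cloud storage side):

CREATE PIPE orders_pipe
  AUTO_INGEST = TRUE                 -- load automatically on cloud storage event notifications
AS
  COPY INTO orders
  FROM @orders_stage                 -- external stage pointing at an S3/GCS/Azure location
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);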

Tasks

  • Similar to a scheduler that can execute SQL Queries at a defined frequency or time
  • Can be used to execute:
    • A single SQL Query
    • Multiple SQL Queries (tree of tasks)
  • Compute resources:
    • Serverless (Snowflake managed)
    • User Managed (one of the existing virtual warehouses)
  • NO COST for TASKS, ONLY FOR THE COMPUTE.
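
A hedged sketch of a scheduled root task and a dependent child task (warehouse, task names, schedule, and SQL bodies are illustrative):

-- root task: runs on a schedule using a user-managed warehouse
CREATE TASK load_orders_task
  WAREHOUSE = analytics_wh
  SCHEDULE = 'USING CRON 0 * * * * UTC'      -- hourly
AS
  INSERT INTO orders SELECT id, amount FROM orders_stg;

-- child task: runs after the root task finishes (tree of tasks)
CREATE TASK refresh_summary_task
  WAREHOUSE = analytics_wh
  AFTER load_orders_task
AS
  INSERT INTO order_summary SELECT id, SUM(amount) FROM orders GROUP BY id;

-- tasks are created suspended; resume children before the root
ALTER TASK refresh_summary_task RESUME;
ALTER TASK load_orders_task RESUME;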

Partition Key: High or Low cardinality?

When choosing a partition key (also known as a cluster key in some contexts) for your data, the cardinality of the key is an important consideration. The cardinality refers to the uniqueness of the values within the column chosen as the key. Here's a detailed look at how the cardinality of a partition key impacts performance and management, and some guidelines for choosing the right cardinality:

High Cardinality vs. Low Cardinality

  • High Cardinality: This means the column has many unique values. For example, a column like "User ID" where each user has a unique identifier.
  • Low Cardinality: This means the column has few unique values. For example, a column like "Gender" which might only have values like "Male" and "Female".

Factors to Consider

  1. Query Patterns:
    • High Cardinality: Ideal if your queries often filter on unique values or small ranges of values. For example, if you frequently query by specific users or by time (with many distinct timestamps), high cardinality keys can help in quickly locating the relevant partitions.
    • Low Cardinality: Suitable if your queries often filter on broad categories. For example, if you frequently group or filter data by a few distinct categories like region or product type, low cardinality keys can be effective.
  2. Data Distribution:
    • High Cardinality: Ensures a more even distribution of data across partitions, which can help avoid hotspots and balance the load more effectively. However, very high cardinality might result in too many small partitions, which can be inefficient.
    • Low Cardinality: Can lead to uneven data distribution if certain values are significantly more common than others. This might result in some partitions being much larger than others, causing performance bottlenecks.
  3. Performance:
    • High Cardinality: May improve performance for highly selective queries (e.g., querying specific user activities), as the data to be scanned is minimized.
    • Low Cardinality: May benefit queries that aggregate or scan large portions of data, but can suffer if the partitioning scheme results in large, uneven partitions.
  4. Maintenance and Management:
    • High Cardinality: More partitions to manage, which can increase overhead but also provides finer granularity for managing and optimizing data.
    • Low Cardinality: Fewer partitions to manage, but requires careful consideration to avoid unbalanced partition sizes.

Query Processing & Caching

  • Snowflake caches three types of data:
    • Query Results: expire after 24 hours of non-usage or if underlying data changes
    • Warehouse Cache (ssd/local cache): dropped as soon as the warehouse is suspended
    • Metadata (Cloud Services Layer): always persisted and updated; never dropped.

There is also a search optimization feature.
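
As a hedged one-liner, search optimization can be enabled per table (table name illustrative); it speeds up highly selective point-lookup queries on large tables at extra storage/compute cost:

ALTER TABLE orders ADD SEARCH OPTIMIZATION;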

Streams

  • Change Data Capture (CDC): the process of capturing DML operations (inserts, updates, and deletes) on your tables
  • Achieved using Streams
  • A stream can be thought of as a table
  • helps in:
    • incremental data loads
    • updating search indexes/history
    • streaming analytics/anomaly detection
    • event-driven data platforms

Stream Types

  • Standard: tracks all DML changes to the source table
  • Append-only: tracks row inserts only. Updates and deletes are not recorded.
  • Insert-only: same as append-only but for external tables.
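
A hedged sketch of a standard stream and an append-only stream on the illustrative orders table, plus how consuming the stream in a DML statement advances its offset (the orders_history target table is hypothetical):

-- standard stream: records inserts, updates, and deletes
CREATE STREAM orders_stream ON TABLE orders;

-- append-only stream: records inserts only
CREATE STREAM orders_append_stream ON TABLE orders APPEND_ONLY = TRUE;

-- consuming the stream in a DML statement advances its offset;
-- metadata columns describe each captured change
INSERT INTO orders_history
  SELECT id, amount, METADATA$ACTION, METADATA$ISUPDATE
  FROM orders_stream;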

Streams Staleness

  • A stream becomes stale if it is not consumed within the data retention period of its source table
  • e.g., 2 days, 5 days, etc.

Change Tracking

  • Read-only alternative to streams

UDF (User Defined Functions)

  • Lets us extend Snowflake with operations that are not available through the system-defined functions.
  • UDFs are reusable code and can be written in:
    • SQL
    • JavaScript
    • Java
    • Python
  • 2 types:
    • Scalar: returns one output value for each input row
    • Tabular (UDTF): returns zero, one, or multiple rows for each input row
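
A minimal sketch of a scalar SQL UDF and a tabular UDF (UDTF); function names and logic are illustrative, and orders is the hypothetical table used earlier:

-- scalar UDF: one output value per input row
CREATE FUNCTION area_of_circle(radius FLOAT)
  RETURNS FLOAT
  AS 'PI() * radius * radius';

-- tabular UDF (UDTF): returns a set of rows
CREATE FUNCTION orders_above(threshold NUMBER)
  RETURNS TABLE (id INT, amount NUMBER(10,2))
  AS 'SELECT id, amount FROM orders WHERE amount > threshold';

SELECT area_of_circle(2.0);
SELECT * FROM TABLE(orders_above(1000));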

Pushdown

  • improves performance by filtering out unneeded rows as early as possible during query processing
  • can also reduce memory consumption.
  • However, pushdown can allow confidential data to be exposed indirectly.
  • In essence: don't load/read the records that don't match the filter conditions in the WHERE clause; then scan the remaining rows in memory and select the columns from the SELECT clause.

REMEMBER: The Snowflake query optimizer bypasses these optimizations (pushdowns) for secure UDFs.

External Functions

  • code executed outside Snowflake
  • remotely executed code
  • relayed through proxy service
  • type of UDF
  • unlike a regular UDF, an external function doesn't contain its own code; instead, it calls code that is stored and executed outside Snowflake
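
A heavily hedged sketch of an external function behind an AWS API Gateway proxy; the integration name, role ARN, endpoint URLs, and function are all placeholders:

-- integration object describing the proxy service (placeholders throughout)
CREATE API INTEGRATION my_api_integration
  API_PROVIDER = aws_api_gateway
  API_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-ext-fn-role'
  API_ALLOWED_PREFIXES = ('https://abc123.execute-api.us-east-1.amazonaws.com/prod/')
  ENABLED = TRUE;

-- the external function itself: no body, just the remote endpoint to call
CREATE EXTERNAL FUNCTION sentiment_score(text VARCHAR)
  RETURNS VARIANT
  API_INTEGRATION = my_api_integration
  AS 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/sentiment';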

Column-Level Data Masking

  • to mask or hide the values of a column (say an intern joins and you don't want to give them access to sensitive column data)
  • we can mask using stars, random numbers, etc., as in the sketch below
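
A hedged sketch of column-level masking (policy name, role, table, and column are illustrative):

-- mask email addresses for everyone except a privileged role
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('DATA_ADMIN') THEN val
    ELSE '*****'
  END;

-- attach the policy to a column
ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;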

Row-Level Data Security

  • instead of creating different views for different users, we can implement row-level access, i.e. filter the rows a query returns based on the accessing user (see the sketch below)
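
A hedged sketch of a row access policy; the policy name, the region_access_map mapping table, and the columns are all hypothetical:

-- return TRUE only for rows the current role is allowed to see
CREATE ROW ACCESS POLICY region_policy AS (region STRING) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'ACCOUNTADMIN'
  OR EXISTS (
    SELECT 1
    FROM region_access_map m           -- hypothetical mapping table: role -> allowed region
    WHERE m.role_name = CURRENT_ROLE()
      AND m.allowed_region = region
  );

-- attach the policy to a table column
ALTER TABLE orders ADD ROW ACCESS POLICY region_policy ON (region);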

Snowflake Vs Other solutions

Comparing Snowflake with other cloud provider solutions like Amazon Redshift, Google BigQuery, and Databricks involves evaluating several factors, including architecture, performance, scalability, ease of use, cost, and additional features. Here’s a detailed comparison:

Snowflake vs. Amazon Redshift

  1. Architecture:
    • Snowflake: Uses a multi-cluster, shared data architecture with separation of storage and compute, allowing independent scaling.
    • Redshift: Uses a single-cluster architecture where storage and compute are coupled, although Redshift Spectrum allows querying data in S3, providing some flexibility.
  2. Performance and Scalability:
    • Snowflake: Automatically scales compute resources up or down and can handle high concurrency.
    • Redshift: Requires manual resizing of clusters, though RA3 instances allow scaling compute and storage independently to some extent.
  3. Ease of Use:
    • Snowflake: Fully managed, no need for database administration tasks.
    • Redshift: Requires more manual administration, including vacuuming and tuning of sort/distribution keys.
  4. Cost:
    • Snowflake: Pay-as-you-go pricing for compute and storage independently.
    • Redshift: Pay-per-hour for the cluster, with additional charges for storage and Redshift Spectrum queries.
  5. Data Sharing and Collaboration:
    • Snowflake: Enables secure data sharing without data movement.
    • Redshift: Less advanced data sharing capabilities, often requiring data copying.
  6. Support for Semi-Structured Data:
    • Snowflake: Native support for JSON, Avro, Parquet.
    • Redshift: Supports JSON and other semi-structured data through Redshift Spectrum, but with more complexity.

Snowflake vs. Google BigQuery

  1. Architecture:
    • Snowflake: Separation of storage and compute with a multi-cluster, shared data architecture.
    • BigQuery: Serverless, fully managed data warehouse with a unique architecture that separates storage and compute automatically.
  2. Performance and Scalability:
    • Snowflake: Scales compute resources based on workloads with high concurrency support.
    • BigQuery: Scales automatically with virtually unlimited compute power and high concurrency support.
  3. Ease of Use:
    • Snowflake: Fully managed, straightforward to use with no infrastructure management.
    • BigQuery: Also fully managed and serverless, with seamless scaling and automatic query optimization.
  4. Cost:
    • Snowflake: Pay-as-you-go for compute and storage independently.
    • BigQuery: Charges based on the amount of data processed by queries (query-by-query pricing) and storage costs.
  5. Data Sharing and Collaboration:
    • Snowflake: Secure data sharing capabilities without data duplication.
    • BigQuery: Data sharing through datasets, with integration into Google's ecosystem for collaboration.
  6. Support for Semi-Structured Data:
    • Snowflake: Native support for JSON, Avro, Parquet.
    • BigQuery: Strong support for semi-structured data formats like JSON and Avro.

Snowflake vs. Databricks

  1. Architecture:
    • Snowflake: Multi-cluster, shared data architecture with separation of storage and compute.
    • Databricks: Built on Apache Spark, integrates with data lakes, and provides a unified analytics platform for data engineering, data science, and machine learning.
  2. Performance and Scalability:
    • Snowflake: Scales compute resources dynamically with high concurrency.
    • Databricks: Leverages Spark's distributed computing capabilities, highly scalable with optimized performance for big data and machine learning workloads.
  3. Ease of Use:
    • Snowflake: User-friendly, fully managed service requiring minimal administration.
    • Databricks: More complex due to the integration of various data processing and analytics tools, requires some Spark knowledge.
  4. Cost:
    • Snowflake: Pay-as-you-go for compute and storage independently.
    • Databricks: Pay for compute resources and data processing, with different pricing tiers for data engineering and analytics workloads.
  5. Data Sharing and Collaboration:
    • Snowflake: Secure and straightforward data sharing without data movement.
    • Databricks: Collaborative environment with notebooks for data scientists and engineers, integrates well with data lakes for sharing data.
  6. Support for Semi-Structured Data:
    • Snowflake: Native support for JSON, Avro, Parquet.
    • Databricks: Excellent support for semi-structured data through Spark, capable of handling complex data processing and transformations.

Summary

  • Snowflake: Best for ease of use, flexibility, and strong data sharing capabilities. Ideal for organizations needing a fully managed, highly scalable, and easy-to-use data warehouse with advanced features.
  • Amazon Redshift: Good for users already within the AWS ecosystem, offers powerful performance but requires more manual management and tuning.
  • Google BigQuery: Best for a serverless, fully managed data warehouse experience with powerful automatic scaling and cost-efficient query pricing. Excellent for big data analytics.
  • Databricks: Ideal for big data and machine learning workloads, offering a unified analytics platform that integrates deeply with data lakes. Best suited for data engineers and data scientists requiring advanced data processing capabilities.