
Azure Basics for DE

  • Regions and Availability Zones (AZs) work much like their AWS counterparts.
  • Region Pairs are Azure-specific: each region is paired with another region in the same geography for replication and disaster recovery.

Resources

  • Anything created in Azure is a resource
  • Compute, databases, virtual networks (VNets), etc.

Resource Groups

  • groupings of resources
  • if creating a resource, need to place it into a resource group
  • a single resource can only be in one resource group at a time

Subscriptions

  • unit of management, billing and scale.

Management Groups

  • Grouping subscriptions.
  • basically hierarchy is: resources -> resource groups -> subscriptions -> management groups
AWS → Azure equivalents:

  • EC2 → Azure Virtual Machines (one OS per VM)
  • ECS → Azure Container Instances
  • EKS → Azure Kubernetes Service (AKS)
  • Lambda → Azure Functions
  • VPC → Azure Virtual Network (VNet)
  • Direct Connect → Azure ExpressRoute
  • Route 53 → Azure DNS
  • S3 → Blob Storage
  • Redshift → Azure Synapse Analytics (Data Lake Storage is closer to S3 for data lakes)
  • EFS → Azure Files
  • SQS → Queue Storage
  • DynamoDB → Table Storage (or Azure Cosmos DB)

Azure Virtual Desktops

  • enables us to use a cloud-hosted version of Windows from any location

Azure Storage

  • Services:
    • Blob Storage: object storage, like S3
    • Data Lake Storage Gen2: Blob Storage with a hierarchical namespace, for data lakes
    • Azure Files: managed file shares, like EFS
    • Queue Storage: messaging, like SQS
    • Table Storage: key-value/NoSQL, like DynamoDB
  • Redundancy options (e.g. LRS, ZRS, GRS, GZRS) are available across all of these services

Azure Data Fundamentals

Identify Data Formats

Structured Data

  • adheres to a schema
  • tabular format
  • fixed schema: all entities of a type share the same fields or properties
  • relational model

Semi-structured Data

  • has some structure but allows for variation between entity instances
  • e.g. most customer records may include an email address, but some may not
  • eg. JSON
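As a quick illustration of the point above, here is a hypothetical pair of JSON customer records where one has an `email` field and the other does not (all field names are made-up examples):

```python
import json

# Two customer records sharing most fields; "email" appears only on
# the first -- semi-structured data tolerates this variation.
records = [
    {"id": 1, "name": "Asha", "email": "asha@example.com"},
    {"id": 2, "name": "Ben"},  # no email field
]

payload = json.dumps(records)      # serialize
loaded = json.loads(payload)       # parse back

# Readers must handle the missing field explicitly
emails = [r.get("email", "<missing>") for r in loaded]
print(emails)  # ['asha@example.com', '<missing>']
```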

Unstructured Data

  • eg. documents, images, audios, video data

Data Stores

  • for storing the above types of data
  • Two broad categories of data store in common use:
    • File Store
    • Database
Info

Parquet is a columnar data format. Created by Cloudera and Twitter.

  • contains row groups
  • data for each column is stored together in the same row group
  • supports efficient compression and encoding schemes

Exploring Transactional Data Processing

  • considered to be the primary function of business computing
  • records transactions that encapsulates specific events that the org. wants to track
  • follows ACID:
    • Atomicity
    • Consistency
    • Isolation
    • Durability
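Atomicity in particular is easy to demonstrate with `sqlite3` from the Python standard library (a stand-in here for any transactional engine). The account names and amounts are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # opens a transaction; rolls back on exception
        conn.execute(
            "UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
        # the matching credit to bob never happens -- simulate a crash
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

# The debit was rolled back: both legs commit, or neither does.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50}
```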

Exploring Analytical Data Processing

  • General Steps:
    • Operational data is extracted, transformed, and loaded (ETL) into a data lake for analysis.
    • Data is loaded into a schema of tables — typically in a Spark-based data lakehouse with tabular abstractions over files in the data lake, or in a data warehouse with a fully relational SQL engine.
    • Data in the data warehouse may be aggregated and loaded into an online analytical processing (OLAP) model, or cube. Aggregated numeric values (measures) from fact tables are calculated for intersections of dimensions from dimension tables. For example, sales revenue might be totaled by date, customer, and product.
    • The data in the data lake, data warehouse, and analytical model can be queried to produce reports, visualizations, and dashboards.
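The extract/transform/load/aggregate steps above can be sketched end-to-end in miniature, again using `sqlite3` as a stand-in for a warehouse engine (the sample sales records are invented):

```python
import sqlite3

# Extract: operational data, e.g. pulled from an OLTP source
raw_sales = [
    {"date": "2024-01-01", "product": "widget", "amount": "20.00"},
    {"date": "2024-01-01", "product": "gadget", "amount": "5.00"},
    {"date": "2024-01-02", "product": "widget", "amount": "19.50"},
]

# Transform: cast strings to proper types
transformed = [(r["date"], r["product"], float(r["amount"])) for r in raw_sales]

# Load: into a read-optimized fact table
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE fact_sales (sale_date TEXT, product TEXT, revenue REAL)")
dw.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", transformed)

# Aggregate for reporting: revenue totaled by date
report = dw.execute(
    "SELECT sale_date, SUM(revenue) FROM fact_sales "
    "GROUP BY sale_date ORDER BY sale_date"
).fetchall()
print(report)  # [('2024-01-01', 25.0), ('2024-01-02', 19.5)]
```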

Data Lakes

  • large-scale data analytical processing scenarios, where a large volume of file-based data must be collected and analyzed.

Data Warehouses

  • a relational database optimized for read operations
  • established way to store data in a relational schema that is optimized for read operations – primarily queries to support reporting and data visualization.
  • Data Lakehouses are a more recent innovation that combine the flexible and scalable storage of a data lake with the relational querying semantics of a data warehouse.
  • The table schema may require some denormalization of data in an OLTP data source (introducing some duplication to make queries perform faster).

OLAP Model

  • aggregated type of data storage that is optimized for analytical workloads
  • because OLAP data is pre-aggregated, queries that return the summaries it contains can run quickly
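The pre-aggregation idea can be sketched in a few lines: measures from fact rows are summed up front for each combination of dimension values, so a summary query becomes a lookup instead of a scan. This is only an illustrative model, not how a real OLAP engine stores cubes:

```python
from collections import defaultdict

# Fact rows: (month, customer, product, revenue) -- invented sample data
fact_rows = [
    ("2024-01", "alice", "widget", 10.0),
    ("2024-01", "bob",   "widget", 15.0),
    ("2024-02", "alice", "gadget", 20.0),
]

# Pre-aggregate the revenue measure at the (month, product) intersection
cube = defaultdict(float)
for month, customer, product, revenue in fact_rows:
    cube[(month, product)] += revenue

# Summary queries are now constant-time lookups
print(cube[("2024-01", "widget")])  # 25.0
```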

Azure Services

  • Azure SQL: based on Microsoft SQL Server
  • Azure Database for:
    • MySQL
    • PostgreSQL
    • MariaDB
  • Azure Cosmos DB: NoSQL
  • Azure Storage: Blob, file, table (used to host data lakes)
  • Azure Data Factory: ETL solution
  • Azure Synapse Analytics: combines ETL, data warehouse storage and data lake storage. Includes:
    • Pipelines
    • SQL
    • Apache Spark: distributed processing
    • Azure Synapse Data Explorer: high-performance data analytics solution that is optimized for real-time querying of log and telemetry data using Kusto Query Language (KQL)
  • Azure Databricks: Azure-integrated version of Databricks (combines Apache Spark with SQL)
    • used for creating analytical data stores in Azure Databricks
  • Azure HDInsight: Azure-hosted clusters for open-source frameworks:
    • Apache Spark
    • Apache Hadoop
    • Apache HBase
    • Apache Kafka
  • Azure Stream Analytics: real-time stream processing

SQL

Statements

  • 3 groups:
    • Data Definition Language (DDL)
      • create, modify & remove tables in a db (CREATE, ALTER, DROP, RENAME)
    • Data Control Language (DCL)
      • used by admins
      • to grant access, deny & revoke access to specific users and groups (GRANT, DENY, REVOKE)
    • Data Manipulation Language (DML)
      • manipulate rows in a table
      • insert, modify, retrieve (SELECT, INSERT, UPDATE, DELETE)
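The DDL and DML groups can be exercised with `sqlite3` from the Python standard library (SQLite has no DCL — GRANT/DENY/REVOKE apply to server databases such as SQL Server). Table and column names here are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the table
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")

# DML: insert, modify, remove, and retrieve rows
conn.execute("INSERT INTO employees (name) VALUES ('Ada')")
conn.execute("INSERT INTO employees (name) VALUES ('Lin')")
conn.execute("UPDATE employees SET name = 'Ada L.' WHERE id = 1")
conn.execute("DELETE FROM employees WHERE id = 2")

names = [row[0] for row in conn.execute("SELECT name FROM employees")]
print(names)  # ['Ada L.']

# DDL again: remove the table
conn.execute("DROP TABLE employees")
```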

DB Concepts

Normalization

  • a schema design process that minimizes data duplication and enforces data integrity
  • Basic def:
    • take the duplicate entity out (col)
    • then form the relationship between the col and the others using foreign key columns
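A minimal sketch of the two steps above, using `sqlite3` and an invented employees/departments schema: the duplicated department name moves to its own table, and a foreign key column replaces it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (
        dept_id INTEGER PRIMARY KEY,
        dept_name TEXT UNIQUE          -- stored once, not per employee
    );
    CREATE TABLE employees (
        emp_id INTEGER PRIMARY KEY,
        name TEXT,
        dept_id INTEGER REFERENCES departments(dept_id)  -- foreign key
    );
    INSERT INTO departments VALUES (1, 'Engineering');
    INSERT INTO employees VALUES (1, 'Ada', 1), (2, 'Lin', 1);
""")

# A join recovers the pre-normalization (denormalized) view on demand
rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employees e JOIN departments d ON e.dept_id = d.dept_id
""").fetchall()
print(rows)  # [('Ada', 'Engineering'), ('Lin', 'Engineering')]
```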

View

  • virtual table based on the results of the SELECT query
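For example (invented `orders` table, via `sqlite3`), a view is defined once and then queried like any table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "open", 10.0), (2, "shipped", 25.0), (3, "open", 5.0)])

# The view is a saved SELECT; it stores no rows of its own
conn.execute(
    "CREATE VIEW open_orders AS "
    "SELECT id, total FROM orders WHERE status = 'open'")

open_ids = [row[0] for row in conn.execute("SELECT id FROM open_orders")]
print(open_ids)  # [1, 3]
```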

Stored Procedure

  • defines SQL Statements that can be run on command
  • to encapsulate programmatic logic in a database for actions that applications need to perform while working with data

Index

  • speeds up searching for data in a table
  • like the index at the back of a book
  • created on one or more specified columns of the table
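A small `sqlite3` demonstration (invented `books` table): after creating an index on `title`, the query planner seeks through the index instead of scanning every row, which `EXPLAIN QUERY PLAN` confirms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO books (title) VALUES (?)",
                 [("Dune",), ("Emma",), ("Ulysses",)])

# Index on the column we search by
conn.execute("CREATE INDEX idx_books_title ON books(title)")

# Ask the planner how it will execute an equality lookup on title
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM books WHERE title = 'Emma'"
).fetchone()
print(plan[-1])  # detail string mentions idx_books_title
```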

Azure SQL

  • Includes services like:
    • SQL Server on Azure VMs (IaaS)
    • Azure SQL Managed Instance (fully managed, near-complete SQL Server compatibility)
    • Azure SQL Database (fully managed)
    • Azure SQL Edge (for IoT/edge scenarios)

Azure Blob Storage

  • stores massive amounts of unstructured data
  • as blobs (binary large objects)
  • blobs are organized into containers within a storage account

Azure DataLake Storage Gen2

  • hierarchical data storage for analytical data lakes
  • for big data
  • for work with structured, unstructured, semi-structured

Azure Tables

  • NoSQL storage solution

Azure CosmosDB

  • stores:
    • Documents
    • Graphs
    • Key-value data
    • Column-family data

Data Engineering

  • We can run ETL jobs using:

    • Azure Data Factory or
    • Azure Synapse Analytics or
    • Microsoft Fabric (for unified workplace)
  • Now, there are 2 types of data stores associated with ETL:

    • Data Warehouse:
      • relational db
      • data stored in a schema that is optimized for data analytics rather than transactional workloads
    • Data Lakehouse:
      • file store, usually on a distributed file system for high performance data access
      • tech. like Spark or Hadoop is used for processing queries
      • hybrid of data lake and data warehouse
      • Data Lakes are great for all types of data, without the need for schema enforcement when the data is written to the store
  • On Azure, there are 3 main PaaS services for implementing large-scale analytical store:
    • Azure Synapse Analytics
      • unified, end-to-end solution
    • Azure DataBricks
      • built on Apache Spark
      • Azure implementation of DataBricks
    • Azure HDInsight
      • supports open-source data analytics cluster types
      • not as user-friendly as the two above

Microsoft Fabric

  • SaaS
  • unifies the services above in a single, simplified experience

For real-time analytics, we have Azure Stream Analytics

Delta Lake

Delta Lake is an open-source storage layer that brings ACID (atomicity, consistency, isolation, and durability) transactions to Apache Spark and big data workloads.

The current version of Delta Lake included with Azure Synapse has language support for Scala, PySpark, and .NET and is compatible with Linux Foundation Delta Lake.