Azure Basics for DE
- Regions and Availability Zones (AZs), similar to AWS.
- Region Pairs are Azure-specific: each region is paired with another in the same geography for cross-region replication and disaster recovery.
Resources
- Anything created in Azure is a resource
- e.g. compute, databases, virtual networks (VNets)
Resource Groups
- logical groupings of resources
- every resource must be placed into a resource group when it is created
- a resource can belong to only one resource group at a time
Subscriptions
- unit of management, billing and scale.
Management Groups
- Grouping subscriptions.
- the hierarchy, from bottom to top:
resources -> resource groups -> subscriptions -> management groups
| AWS | Azure |
|---|---|
| EC2 | Virtual Machines (one OS per VM) |
| ECS | Azure Container Instances |
| EKS | Azure Kubernetes Service (AKS) |
| Lambda | Azure Functions |
| VPC | Azure Virtual Network (VNet) |
| Direct Connect | Azure ExpressRoute |
| Route 53 | Azure DNS |
| S3 | Blob Storage |
| S3 (as a data lake) | Data Lake Storage Gen2 |
| Redshift | Azure Synapse Analytics |
| EFS | Azure Files |
| SQS | Queue Storage |
| DynamoDB | Table Storage |
Azure Virtual Desktop
- enables using a cloud-hosted Windows desktop from any location
Azure Storage
- Services:
- Blob Storage: analogous to S3
- Data Lake Storage Gen2: S3-style storage with a hierarchical namespace, for data lakes
- Azure Files: analogous to EFS
- Queue Storage: analogous to SQS
- Table Storage: analogous to DynamoDB
- redundancy options (LRS, ZRS, GRS, GZRS) are available across these services
Azure Data Fundamentals
Identifying Data Formats
Structured Data
- adheres to a schema
- tabular format
- fixed schema: every entity instance has the same fields or properties
- relational model
Semi-structured Data
- has some structure, but allows variation between entity instances
- e.g. most customer records include an email address, but some may not
- e.g. JSON documents
Unstructured Data
- e.g. documents, images, audio, and video data
Data Stores
- for storing the above types of data
- Two broad categories of data store in common use:
- File Store
- Database
Info
Parquet is a columnar data format, created by Cloudera and Twitter.
- a file contains row groups
- within each row group, the data for each column is stored together
- supports efficient compression and encoding schemes
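A minimal sketch with PyArrow (the file name and columns are made up) to show the row-group layout: setting a small row_group_size makes the groups visible in the file metadata.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table and write it to Parquet.
# row_group_size controls how many rows land in each row group.
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["NP", "US", "NP", "DE"],
})
pq.write_table(table, "users.parquet", row_group_size=2)

# Inspect the row groups; within each group, column values are stored together.
pf = pq.ParquetFile("users.parquet")
print(pf.metadata.num_row_groups)        # 2
print(pf.read_row_group(0).to_pydict())  # only the first two rows
```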
Exploring Transactional Data Processing
- considered to be the primary function of business computing
- records transactions that encapsulate specific events the organization wants to track
- follows ACID:
- Atomicity
- Consistency
- Isolation
- Durability
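A minimal sketch of atomicity using Python's built-in sqlite3 (the accounts table and values are hypothetical): the two updates of a transfer commit or roll back as one unit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

# A transfer is one transaction: both updates happen, or neither does.
try:
    with conn:  # commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("simulated crash before bob is credited")
except RuntimeError:
    pass

# Atomicity: alice's debit was rolled back, so no money disappeared.
print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# -> [('alice', 100), ('bob', 0)]
```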
Exploring Analytical Data Processing
- General Steps:
- Operational data is extracted, transformed, and loaded (ETL) into a data lake for analysis.
- Data is loaded into a schema of tables - typically in a Spark-based data lakehouse with tabular abstractions over files in the data lake, or a data warehouse with a fully relational SQL engine.
- Data in the data warehouse may be aggregated and loaded into an online analytical processing (OLAP) model, or cube. Aggregated numeric values (measures) from fact tables are calculated for intersections of dimensions from dimension tables. For example, sales revenue might be totaled by date, customer, and product.
- The data in the data lake, data warehouse, and analytical model can be queried to produce reports, visualizations, and dashboards.
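A hedged PySpark sketch of the aggregation step (the fact/dimension names and values are made up): a measure from a fact table is totaled for intersections of dimensions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Hypothetical fact and dimension data standing in for files in a data lake.
fact_sales = spark.createDataFrame(
    [(1, "2024-01-01", 100.0), (2, "2024-01-01", 50.0), (1, "2024-01-02", 70.0)],
    ["product_id", "order_date", "revenue"],
)
dim_product = spark.createDataFrame(
    [(1, "Widget"), (2, "Gadget")], ["product_id", "product_name"]
)

# Aggregate the measure (revenue) over dimension intersections (date x product).
summary = (
    fact_sales.join(dim_product, "product_id")
    .groupBy("order_date", "product_name")
    .agg(F.sum("revenue").alias("total_revenue"))
)
summary.show()
```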
Data Lakes
- used in large-scale analytical data processing scenarios, where a large volume of file-based data must be collected and analyzed
Data Warehouses
- a relational database with a schema optimized for read operations, primarily queries to support reporting and data visualization
- Data Lakehouses are a more recent innovation that combine the flexible and scalable storage of a data lake with the relational querying semantics of a data warehouse.
- The table schema may require some denormalization of data in an OLTP data source (introducing some duplication to make queries perform faster).
OLAP Model
- a pre-aggregated data store optimized for analytical workloads
- because the data is pre-aggregated, queries that return the summaries it contains run quickly
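A rough illustration with pandas (hypothetical data): a pivot table plays the role of a tiny pre-aggregated cube, with one measure precomputed for every dimension intersection.

```python
import pandas as pd

# Hypothetical fact rows: one measure (revenue), two dimensions (date, product).
sales = pd.DataFrame({
    "order_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "product": ["Widget", "Gadget", "Widget"],
    "revenue": [100.0, 50.0, 70.0],
})

# Pre-aggregate revenue for every date x product intersection, cube-style.
cube = sales.pivot_table(values="revenue", index="order_date",
                         columns="product", aggfunc="sum", fill_value=0)
print(cube)  # lookups against this summary are fast: totals are precomputed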
Azure Services
- Azure SQL: based on Microsoft SQL Server
- Azure Database for:
- MySQL
- PostgreSQL
- MariaDB
- Azure Cosmos DB: NoSQL
- Azure Storage: Blob, file, table (used to host data lakes)
- Azure Data Factory: ETL solution
- Azure Synapse Analytics: combines ETL, data warehouse storage and data lake storage. Includes:
- Pipelines
- SQL
- Apache Spark: distributed processing
- Azure Synapse Data Explorer: high-performance data analytics solution that is optimized for real-time querying of log and telemetry data using Kusto Query Language (KQL)
- Azure Databricks: Azure-integrated version of Databricks (combines Apache Spark with SQL)
- used for creating analytical data stores
- Azure HDInsight: Azure-hosted clusters for open-source frameworks:
- Apache Spark
- Apache Hadoop
- Apache HBase
- Apache Kafka
- Azure Stream Analytics: real-time stream processing
SQL
Statements
- 3 groups:
- Data Definition Language (DDL)
- create, modify & remove tables in a db (CREATE, ALTER, DROP, RENAME)
- Data Control Language (DCL)
- used by admins
- to grant, deny, and revoke access for specific users and groups (GRANT, DENY, REVOKE)
- Data Manipulation Language (DML)
- manipulate rows in a table
- insert, update, delete, and retrieve rows (SELECT, INSERT, UPDATE, DELETE)
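A minimal sketch of DDL and DML using Python's sqlite3 (the table and values are made up); DCL is omitted because SQLite has no user accounts, but GRANT/DENY/REVOKE work the same way in Azure SQL / SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define and modify structure.
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("ALTER TABLE employees ADD COLUMN team TEXT")

# DML: manipulate rows.
conn.execute("INSERT INTO employees (name, team) VALUES ('Asha', 'Data')")
conn.execute("UPDATE employees SET team = 'Platform' WHERE name = 'Asha'")
print(conn.execute("SELECT id, name, team FROM employees").fetchall())
conn.execute("DELETE FROM employees WHERE name = 'Asha'")

# DDL again: remove the table.
conn.execute("DROP TABLE employees")
```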
DB Concepts
Normalization
- a schema design process that minimizes data duplication and enforces data integrity
- basic idea:
- move duplicated attributes (columns) out into their own tables
- then relate the tables back together using foreign key columns
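A small sketch of the idea in SQLite (hypothetical customers/orders schema): instead of repeating customer details on every order row, customers get their own table and orders reference them by foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Customer details live in one place; orders point at them by key,
# so no customer data is duplicated across order rows.
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    email       TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL
);
""")
```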
View
- a virtual table based on the results of a SELECT query
Stored Procedure
- defines SQL statements that can be run on demand
- encapsulates programmatic logic in the database for actions that applications need to perform on the data
Index
- speeds up searching for data in a table
- works like the index at the back of a book
- created on one or more specified columns
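A quick sqlite3 sketch of views and indexes (hypothetical schema). SQLite itself has no stored procedures; in Azure SQL you would use CREATE PROCEDURE ... AS BEGIN ... END and run it with EXEC.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT, amount REAL)")

# View: a virtual table defined by a SELECT query.
conn.execute("CREATE VIEW open_orders AS SELECT * FROM orders WHERE status = 'open'")

# Index: speeds up searches on the indexed column, like a book's index.
conn.execute("CREATE INDEX idx_orders_status ON orders (status)")

print(conn.execute("SELECT * FROM open_orders").fetchall())
```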
Azure SQL
- Includes services like:
- SQL Server on Azure VM
- Azure SQL Managed Instance (near-complete compatibility with on-prem SQL Server)
- Azure SQL Database (fully managed)
- Azure SQL Edge (for IoT/edge scenarios)
Azure Blob Storage
- stores massive amounts of unstructured data as blobs (binary large objects)
- a storage account holds multiple blob containers, with blobs (files) inside them
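A hedged upload sketch using the azure-storage-blob SDK (the connection string, container, and file names are placeholders; requires `pip install azure-storage-blob`).

```python
from azure.storage.blob import BlobServiceClient

# Connection string and names are placeholders.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("raw-data")

# Upload a local file as a blob into the container.
with open("events.json", "rb") as f:
    container.upload_blob(name="2024/01/events.json", data=f, overwrite=True)

# List blobs in the container.
for blob in container.list_blobs():
    print(blob.name)
```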
Azure DataLake Storage Gen2
- hierarchical data storage for analytical data lakes
- for big data
- works with structured, semi-structured, and unstructured data
Azure Tables
- NoSQL storage solution
Azure Cosmos DB
- store:
- Documents
- Graphs
- Key-value stores
- Column-family stores
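A hedged sketch with the azure-cosmos SDK (the endpoint, key, and names are placeholders; requires `pip install azure-cosmos`), showing schemaless documents whose shape can vary between items.

```python
from azure.cosmos import CosmosClient, PartitionKey

# Endpoint and key are placeholders.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.create_database_if_not_exists("appdb")
container = db.create_container_if_not_exists(
    id="users", partition_key=PartitionKey(path="/country")
)

# Documents are schemaless JSON; instances can vary in shape.
container.upsert_item({"id": "1", "country": "NP", "name": "Asha"})
container.upsert_item({"id": "2", "country": "US", "name": "Bo", "email": "bo@example.com"})
```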
Data Engineering
We can run ETL jobs using:
- Azure Data Factory or
- Azure Synapse Analytics or
- Microsoft Fabric (a unified SaaS workspace)
There are two types of data store commonly associated with ETL:
- Data Warehouse:
- relational db
- data stored in a schema that is optimized for data analytics rather than transactional workloads
- Data Lakehouse:
- file store, usually on a distributed file system for high performance data access
- technologies like Spark or Hadoop are used to process queries
- hybrid of data lake and data warehouse
- data lakes are great for all types of data, without the need for schema enforcement when the data is written to the store
- On Azure, there are 3 main PaaS services for implementing a large-scale analytical store:
- Azure Synapse Analytics
- unified, end-to-end solution
- Azure DataBricks
- built on Apache Spark
- Azure implementation of DataBricks
- Azure HDInsight
- supports open-source data analytics cluster types
- not as user-friendly as the two above
Microsoft Fabric
- SaaS
- simplifies everything into a single SaaS experience
For real-time analytics, there is Azure Stream Analytics.
Delta Lake
Delta Lake is an open-source storage layer that brings ACID (atomicity, consistency, isolation, and durability) transactions to Apache Spark and big data workloads.
The current version of Delta Lake included with Azure Synapse has language support for Scala, PySpark, and .NET and is compatible with Linux Foundation Delta Lake.
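A minimal PySpark sketch of writing and reading a Delta table (the path is a placeholder; in Synapse or Databricks the Delta configuration is preprovisioned, while locally you would need the delta-spark package configured on the session).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

df = spark.createDataFrame([(1, "open"), (2, "closed")], ["order_id", "status"])

# Writes become ACID transactions against the Delta table.
df.write.format("delta").mode("overwrite").save("/tmp/delta/orders")

# Read the table back; concurrent readers see a consistent snapshot.
spark.read.format("delta").load("/tmp/delta/orders").show()
```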