Azure Basics for DE
- Regions and Availability Zones (AZs), similar to AWS.
- Region Pairs are Azure-specific: each region is paired with another in the same geography for cross-region replication and disaster recovery.
Resources
- Anything created in Azure is a resource
- e.g. compute, databases, virtual networks (VNets)
Resource Groups
- logical groupings of resources
- every resource must be placed into a resource group when it is created
- a resource can belong to only one resource group at a time
Subscriptions
- unit of management, billing and scale.
Management Groups
- Grouping subscriptions.
- the hierarchy, from bottom to top:
resources -> resource groups -> subscriptions -> management groups
| AWS | Azure |
|---|---|
| EC2 | Virtual Machines (one OS per VM) |
| ECS | Azure Container Instances |
| EKS | Azure Kubernetes Service (AKS) |
| Lambda | Azure Functions |
| VPC | Azure Virtual Network (VNet) |
| Direct Connect | Azure ExpressRoute |
| Route 53 | Azure DNS |
| S3 | Blob Storage |
| S3 (as a data lake) | Data Lake Storage Gen2 |
| Redshift | Azure Synapse Analytics |
| EFS | Azure Files |
| SQS | Queue Storage |
| DynamoDB | Table Storage |
Azure Virtual Desktop
- enables using a cloud-hosted Windows desktop from any location
Azure Storage
- Services:
- Blob Storage: analogous to S3
- Data Lake Storage Gen2: S3-style storage with a hierarchical namespace, for data lakes
- Azure Files: analogous to EFS
- Queue Storage: analogous to SQS
- Table Storage: analogous to DynamoDB
- redundancy options (LRS, ZRS, GRS, GZRS) are available across these services
Azure Data Fundamentals
Identifying Data Formats
Structured Data
- adheres to a schema
- tabular format
- fixed schema: every entity instance has the same fields or properties
- relational model
Semi-structured Data
- has some structure, but allows variation between entity instances
- e.g. most customer records include an email address, but some may not
- e.g. JSON documents
Unstructured Data
- e.g. documents, images, audio, and video data
Data Stores
- for storing the above types of data
- Two broad categories of data store in common use:
- File Store
- Database
Info
Parquet is a columnar data format, created by Cloudera and Twitter.
- a file contains row groups
- within each row group, the data for each column is stored together
- supports efficient compression and encoding schemes
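A minimal sketch with PyArrow (the file name and columns are made up) to show the row-group layout: setting a small row_group_size makes the groups visible in the file metadata.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table and write it to Parquet.
# row_group_size controls how many rows land in each row group.
table = pa.table({
    "user_id": [1, 2, 3, 4],
    "country": ["NP", "US", "NP", "DE"],
})
pq.write_table(table, "users.parquet", row_group_size=2)

# Inspect the row groups; within each group, column values are stored together.
pf = pq.ParquetFile("users.parquet")
print(pf.metadata.num_row_groups)        # 2
print(pf.read_row_group(0).to_pydict())  # only the first two rows
```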
Exploring Transactional Data Processing
- considered to be the primary function of business computing
- records transactions that encapsulate specific events the organization wants to track
- follows ACID:
- Atomicity
- Consistency
- Isolation
- Durability
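A minimal sketch of atomicity using Python's built-in sqlite3 (the accounts table and values are hypothetical): the two updates of a transfer commit or roll back as one unit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

# A transfer is one transaction: both updates happen, or neither does.
try:
    with conn:  # commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
        raise RuntimeError("simulated crash before bob is credited")
except RuntimeError:
    pass

# Atomicity: alice's debit was rolled back, so no money disappeared.
print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# -> [('alice', 100), ('bob', 0)]
```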
Exploring Analytical Data Processing
- General Steps:
- Operational data is extracted, transformed, and loaded (ETL) into a data lake for analysis.
- Data is loaded into a schema of tables - typically in a Spark-based data lakehouse with tabular abstractions over files in the data lake, or a data warehouse with a fully relational SQL engine.
- Data in the data warehouse may be aggregated and loaded into an online analytical processing (OLAP) model, or cube. Aggregated numeric values (measures) from fact tables are calculated for intersections of dimensions from dimension tables. For example, sales revenue might be totaled by date, customer, and product.
- The data in the data lake, data warehouse, and analytical model can be queried to produce reports, visualizations, and dashboards.
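A hedged PySpark sketch of the aggregation step (the fact/dimension names and values are made up): a measure from a fact table is totaled for intersections of dimensions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Hypothetical fact and dimension data standing in for files in a data lake.
fact_sales = spark.createDataFrame(
    [(1, "2024-01-01", 100.0), (2, "2024-01-01", 50.0), (1, "2024-01-02", 70.0)],
    ["product_id", "order_date", "revenue"],
)
dim_product = spark.createDataFrame(
    [(1, "Widget"), (2, "Gadget")], ["product_id", "product_name"]
)

# Aggregate the measure (revenue) over dimension intersections (date x product).
summary = (
    fact_sales.join(dim_product, "product_id")
    .groupBy("order_date", "product_name")
    .agg(F.sum("revenue").alias("total_revenue"))
)
summary.show()
```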
Data Lakes
- used in large-scale analytical data processing scenarios, where a large volume of file-based data must be collected and analyzed
Data Warehouses
- a relational database with a schema optimized for read operations, primarily queries to support reporting and data visualization
- Data Lakehouses are a more recent innovation that combine the flexible and scalable storage of a data lake with the relational querying semantics of a data warehouse.
- The table schema may require some denormalization of data in an OLTP data source (introducing some duplication to make queries perform faster).
OLAP Model
- a pre-aggregated data store optimized for analytical workloads
- because the data is pre-aggregated, queries that return the summaries it contains run quickly
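A rough illustration with pandas (hypothetical data): a pivot table plays the role of a tiny pre-aggregated cube, with one measure precomputed for every dimension intersection.

```python
import pandas as pd

# Hypothetical fact rows: one measure (revenue), two dimensions (date, product).
sales = pd.DataFrame({
    "order_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "product": ["Widget", "Gadget", "Widget"],
    "revenue": [100.0, 50.0, 70.0],
})

# Pre-aggregate revenue for every date x product intersection, cube-style.
cube = sales.pivot_table(values="revenue", index="order_date",
                         columns="product", aggfunc="sum", fill_value=0)
print(cube)  # lookups against this summary are fast: totals are precomputed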
Azure Services
- Azure SQL: based on Microsoft SQL Server
- Azure Database for:
- MySQL
- PostgreSQL
- MariaDB
- Azure Cosmos DB: NoSQL
- Azure Storage: Blob, file, table (used to host data lakes)
- Azure Data Factory: ETL solution
- Azure Synapse Analytics: combines ETL, data warehouse storage and data lake storage. Includes:
- Pipelines
- SQL
- Apache Spark: distributed processing
- Azure Synapse Data Explorer: high-performance data analytics solution that is optimized for real-time querying of log and telemetry data using Kusto Query Language (KQL)
- Azure Databricks: Azure-integrated version of Databricks (combines Apache Spark with SQL)
- used for creating analytical data stores
- Azure HDInsight: Azure-hosted clusters for open-source frameworks:
- Apache Spark
- Apache Hadoop
- Apache HBase
- Apache Kafka
- Azure Stream Analytics: real-time stream processing
SQL
Statements
- 3 groups:
- Data Definition Language (DDL)
- create, modify & remove tables in a db (CREATE, ALTER, DROP, RENAME)
- Data Control Language (DCL)
- used by admins
- to grant, deny, and revoke access for specific users and groups (GRANT, DENY, REVOKE)
- Data Manipulation Language (DML)
- manipulate rows in a table
- insert, update, delete, and retrieve rows (SELECT, INSERT, UPDATE, DELETE)
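A minimal sketch of DDL and DML using Python's sqlite3 (the table and values are made up); DCL is omitted because SQLite has no user accounts, but GRANT/DENY/REVOKE work the same way in Azure SQL / SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define and modify structure.
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("ALTER TABLE employees ADD COLUMN team TEXT")

# DML: manipulate rows.
conn.execute("INSERT INTO employees (name, team) VALUES ('Asha', 'Data')")
conn.execute("UPDATE employees SET team = 'Platform' WHERE name = 'Asha'")
print(conn.execute("SELECT id, name, team FROM employees").fetchall())
conn.execute("DELETE FROM employees WHERE name = 'Asha'")

# DDL again: remove the table.
conn.execute("DROP TABLE employees")
```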
DB Concepts
Normalization
- a schema design process that minimizes data duplication and enforces data integrity
- basic idea:
- move duplicated attributes (columns) out into their own tables
- then relate the tables back together using foreign key columns
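A small sketch of the idea in SQLite (hypothetical customers/orders schema): instead of repeating customer details on every order row, customers get their own table and orders reference them by foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Customer details live in one place; orders point at them by key,
# so no customer data is duplicated across order rows.
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    email       TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL
);
""")
```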
View
- a virtual table based on the results of a SELECT query
Stored Procedure
- defines SQL statements that can be run on demand
- encapsulates programmatic logic in the database for actions that applications need to perform on the data
Index
- speeds up searching for data in a table
- works like the index at the back of a book
- created on one or more specified columns
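A quick sqlite3 sketch of views and indexes (hypothetical schema). SQLite itself has no stored procedures; in Azure SQL you would use CREATE PROCEDURE ... AS BEGIN ... END and run it with EXEC.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT, amount REAL)")

# View: a virtual table defined by a SELECT query.
conn.execute("CREATE VIEW open_orders AS SELECT * FROM orders WHERE status = 'open'")

# Index: speeds up searches on the indexed column, like a book's index.
conn.execute("CREATE INDEX idx_orders_status ON orders (status)")

print(conn.execute("SELECT * FROM open_orders").fetchall())
```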
Azure SQL
- Includes services like:
- SQL Server on Azure VM
- Azure SQL Managed Instance (near-complete compatibility with on-prem SQL Server)
- Azure SQL Database (fully managed)
- Azure SQL Edge (for IoT/edge scenarios)
Azure Blob Storage
- stores massive amounts of unstructured data as blobs (binary large objects)
- a storage account holds multiple blob containers, with blobs (files) inside them
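A hedged upload sketch using the azure-storage-blob SDK (the connection string, container, and file names are placeholders; requires `pip install azure-storage-blob`).

```python
from azure.storage.blob import BlobServiceClient

# Connection string and names are placeholders.
service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("raw-data")

# Upload a local file as a blob into the container.
with open("events.json", "rb") as f:
    container.upload_blob(name="2024/01/events.json", data=f, overwrite=True)

# List blobs in the container.
for blob in container.list_blobs():
    print(blob.name)
```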
Azure DataLake Storage Gen2
- hierarchical data storage for analytical data lakes
- for big data
- works with structured, semi-structured, and unstructured data
Azure Tables
- NoSQL storage solution
Azure Cosmos DB
- store:
- Documents
- Graphs
- Key-value stores
- Column-family stores
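A hedged sketch with the azure-cosmos SDK (the endpoint, key, and names are placeholders; requires `pip install azure-cosmos`), showing schemaless documents whose shape can vary between items.

```python
from azure.cosmos import CosmosClient, PartitionKey

# Endpoint and key are placeholders.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.create_database_if_not_exists("appdb")
container = db.create_container_if_not_exists(
    id="users", partition_key=PartitionKey(path="/country")
)

# Documents are schemaless JSON; instances can vary in shape.
container.upsert_item({"id": "1", "country": "NP", "name": "Asha"})
container.upsert_item({"id": "2", "country": "US", "name": "Bo", "email": "bo@example.com"})
```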
Data Engineering
We can run ETL jobs using:
- Azure Data Factory or
- Azure Synapse Analytics or
- Microsoft Fabric (a unified SaaS workspace)
There are two types of data store commonly associated with ETL:
- Data Warehouse:
- relational db
- data stored in a schema that is optimized for data analytics rather than transactional workloads
- Data Lakehouse:
- file store, usually on a distributed file system for high performance data access
- technologies like Spark or Hadoop are used to process queries
- hybrid of data lake and data warehouse
- data lakes are great for all types of data, without the need for schema enforcement when the data is written to the store
- On Azure, there are 3 main PaaS services for implementing a large-scale analytical store:
- Azure Synapse Analytics
- unified, end-to-end solution
- Azure DataBricks
- built on Apache Spark
- Azure implementation of DataBricks
- Azure HDInsight
- supports open-source data analytics cluster types
- not as user-friendly as the two above
Microsoft Fabric
- SaaS
- simplifies everything into a single SaaS experience
For real-time analytics, there is Azure Stream Analytics.
Delta Lake
Delta Lake is an open-source storage layer that brings ACID (atomicity, consistency, isolation, and durability) transactions to Apache Spark and big data workloads.
The current version of Delta Lake included with Azure Synapse has language support for Scala, PySpark, and .NET and is compatible with Linux Foundation Delta Lake.
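A minimal PySpark sketch of writing and reading a Delta table (the path is a placeholder; in Synapse or Databricks the Delta configuration is preprovisioned, while locally you would need the delta-spark package configured on the session).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

df = spark.createDataFrame([(1, "open"), (2, "closed")], ["order_id", "status"])

# Writes become ACID transactions against the Delta table.
df.write.format("delta").mode("overwrite").save("/tmp/delta/orders")

# Read the table back; concurrent readers see a consistent snapshot.
spark.read.format("delta").load("/tmp/delta/orders").show()
```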