DP-203 : Data Engineering on Azure

  • Learning
    • Concepts
      • Data Storage in Azure
        • Storage accounts, blob storage
      • Azure Data Factory / Synapse Pipelines
        • ADF components
        • Data Ingestion
        • Transformation
        • Slowly Changing Dimensions
        • Orchestration
        • SSIS
        • Operationalization
      • Azure Synapse Analytics
        • Components
        • Synapse Studio
        • Modern data warehouse
      • Working with Synapse Analytics
        • Analyze data in Synapse Analytics
        • Data Loading practices
        • Optimize
        • Integrate SQL and Spark
        • Developer features
        • Manage, monitor, optimize and secure
      • Synapse Spark pools
        • Spark in Synapse
        • Delta Lake in Synapse
        • SQL and Spark advanced
        • Monitor and manage Spark
      • Hybrid transactional and analytical processing (HTAP)
        • Patterns
        • Synapse link
      • Synapse Link for Cosmos DB and SQL
      • Azure Databricks (ADB)
        • Spark on ADB
        • Read and Write data
        • Dataframes
        • Performance features
        • Dataframe columns
        • Architecture and Security
        • Delta Lake
        • Streaming
        • CI/CD
        • Integrate with Synapse
        • Best Practices
      • ADLS Gen 2
      • Azure Stream Analytics
    • Facts (Stuff to memorize)
    • Procedures (Stuff to practice)
  • Content
    • Store Data in Azure
      • Data Storage approach in Azure
        • Classify Data and Operational Needs
          • Questions
            • What are the 3 structural types of data?
          • Answers
            • The 3 types are
              • Structured : tables and columns
              • Semi-structured : JSON, XML, key-value pairs
              • Unstructured : images
        • Choose a storage solution on Azure
          • Questions
            • Advantages of ADLS G2?
            • What APIs does Cosmos DB offer?
          • Answers
            • Hadoop-compatible access, a superset of POSIX permissions, cost effectiveness and an optimized driver for big data analytics
            • Cosmos DB offers the following APIs
              • SQL API for most semi-structured (JSON) data
              • MongoDB API for migrations from MongoDB
              • Cassandra API for wide-column data sets
              • Gremlin API for graph data
              • Table API for key-value pairs
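          • Example : a minimal sketch of a SQL API query; the container alias c is conventional, but the properties (city, customerName) are hypothetical
            SELECT c.id, c.customerName
            FROM c
            WHERE c.city = 'Seattle'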
      • Azure Storage Accounts
        • Questions
          • Which 4 services comprise Azure Storage?
          • What is a storage account?
          • What is ADLS G2 ?
          • What are the 2 tiers available for storage accounts?
          • Where is the archive storage tier available?
          • What do you need to do to access an archived file?
        • Answers
          • The 4 primitive cloud-based data services offered as part of Azure Storage are Azure Blobs, Azure Files, Azure Queues and Azure Tables
          • A storage account is a container that groups one or more instances of Azure Storage services together
          • A service built on top of Blob storage that lets you host a data lake on Azure. It adds a hierarchical namespace to Blob storage, so files and objects can be organized in a hierarchy of directories for efficient data access
          • Hot and Cool
          • At the blob level
          • Rehydrate it, i.e. change the access tier to hot or cool. Rehydration takes time.
    • Azure Data Factory
      • Progress
        • Completed Copy activity; starting Mapping Data Flows
      • Questions
        • What is a pipeline?
        • What is a linked service?
        • What is a dataset?
      • Answers
        • A logical grouping of activities that together perform a task
        • Defines the connection to sources and destinations; holds the connection string
        • A named view of the data, used to reference the data within the activities
    • Work with Data Warehouses using Azure Synapse Analytics
      • General
        • Questions
          • What are Data warehouse units (DWU)?
        • Answers
          • Analytic resources are defined as CPU + Memory + IO. These are bundled into units of compute scale called data warehouse units (DWU).
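        • Example : a minimal T-SQL sketch of scaling DWUs, run against the master database; the pool name MySqlDw is hypothetical
          -- Scale the dedicated SQL pool to 300 cDWUs
          ALTER DATABASE MySqlDw
          MODIFY (SERVICE_OBJECTIVE = 'DW300c');

          -- Check the current service objective
          SELECT db.[name], dso.service_objective
          FROM sys.database_service_objectives AS dso
          JOIN sys.databases AS db ON db.database_id = dso.database_id;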
      • Analyse data in a relational data warehouse
        • Design a data warehouse schema
          • Background Questions
            • Why do we need two keys in a dimension table?
              We have the numeric surrogate key, which performs better in queries and joins. The natural or business key is used to identify the specific instance of the entity in the source system.
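            • Example : a minimal T-SQL sketch of a dimension table carrying both keys; table and column names are hypothetical
              CREATE TABLE dbo.DimCustomer
              (
                  CustomerKey    INT IDENTITY(1,1) NOT NULL,  -- numeric surrogate key, used in joins
                  CustomerAltKey NVARCHAR(20)      NOT NULL,  -- natural/business key from the source system
                  CustomerName   NVARCHAR(100)     NULL
              )
              WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);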
          • Questions
            • What is the grain of the fact table?
            • Why do we de-normalize dimensions?
            • When does it make sense to snowflake?
          • Answers
            • The dimension key columns included in the fact table establish its grain
            • To reduce the number of joins and to simplify the data model
            • When an entity has a large number of attributes that must be shared among multiple dimensions, e.g. address
        • Creating data warehouse tables – dedicated SQL pool
          • Background questions
            • Describe the Synapse SQL (dedicated SQL pool) architecture
              Dedicated SQL pool (formerly SQL DW) uses a node-based architecture. Applications connect and issue T-SQL commands to a Control node. The Control node hosts the distributed query engine, which optimizes queries for parallel processing and then passes operations to Compute nodes to do their work in parallel. The Compute nodes run the parallel queries; user data itself is stored in Azure Storage, decoupled from compute. The Data Movement Service (DMS) is a system-level internal service that moves data across the nodes as necessary to run queries in parallel and return accurate results.
            • What is the advantage of decoupled storage and compute?
              With decoupled storage and compute, a dedicated SQL pool (formerly SQL DW) lets you:
              • Independently size compute power irrespective of your storage needs
              • Grow or shrink compute power without moving data
              • Pause compute capacity while leaving data intact, so you only pay for storage
              • Resume compute capacity during operational hours
            • How do you choose a distribution column?
              To balance the parallel processing, select a distribution column that:
              • Has many unique values. The column can have duplicate values; all rows with the same value are assigned to the same distribution. Since there are 60 distributions, some distributions can have more than 1 unique value while others may end up with zero values.
              • Does not have NULLs, or has only a few. In the extreme case, if all values in the column are NULL, all rows are assigned to the same distribution; query processing is then skewed to one distribution and does not benefit from parallel processing.
              • Is not a date column. All data for the same date lands in the same distribution, so if several users all filter on the same date, only 1 of the 60 distributions does all the processing work.
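            • Example : a minimal T-SQL sketch; FactSales and CustomerKey are hypothetical, with CustomerKey standing in for a high-cardinality, rarely-NULL, non-date column
              CREATE TABLE dbo.FactSales
              (
                  OrderDateKey INT           NOT NULL,
                  CustomerKey  INT           NOT NULL,
                  Amount       DECIMAL(18,2) NOT NULL
              )
              WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);

              -- After loading, check for data skew across the 60 distributions
              DBCC PDW_SHOWSPACEUSED("dbo.FactSales");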
          • Questions
            • Are primary keys supported in dedicated sql pools?
            • Are unique constraints supported ?
            • What happens when primary keys are violated?
            • What is the default type of index?
            • What do MPP and SMP stand for?
            • What distributions are supported?
            • What types of distribution do we use for dimension tables?
            • What type of distribution do we use for fact tables?
            • What type of distribution do we use for staging tables?
            • What is DMS?
            • What is the unit of scale for compute?
            • Which node has the distributed query engine?
            • How many compute nodes does Synapse have?
            • What data types does clustered columnstore indexes not support?
            • What is a table without a clustered index called and advantage of this type of table?
          • Answers
            • Yes, as long as NONCLUSTERED and NOT ENFORCED are both specified
            • Yes, as long as NOT ENFORCED is specified. Used mainly for query optimization.
            • Since the constraint is not enforced, duplicate keys can be inserted and queries may return incorrect results
            • Clustered columnstore
            • Massively parallel processing and Symmetric Multi Processing
            • Hash, round robin and replicated
            • Replicated for small tables and hash for large tables
            • Hash distribution with clustered columnstore index
            • Use round robin to distribute evenly across compute nodes
            • Data Movement service which moves data across compute nodes to run queries in parallel
            • DWU data warehouse unit
            • Control node
            • From 1 to 60 nodes depending on service level
            • varchar(max), nvarchar(max) and varbinary(max)
            • Heap table. Faster to load.
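          • Example : a minimal T-SQL sketch tying these answers together; the stg schema and table names are hypothetical
            -- Round-robin heap staging table : fastest to load, no index maintenance
            CREATE TABLE stg.SalesLoad
            (
                OrderDateKey INT,
                CustomerKey  INT,
                Amount       DECIMAL(18,2)
            )
            WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

            -- Primary keys must be NONCLUSTERED and NOT ENFORCED; the engine does not
            -- check them, so avoiding duplicates is the loader's responsibility
            ALTER TABLE dbo.DimCustomer
            ADD CONSTRAINT PK_DimCustomer PRIMARY KEY NONCLUSTERED (CustomerKey) NOT ENFORCED;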
      • Using data loading best practices in Azure Synapse Analytics
        • Load methods into Synapse Analytics
        • Workload Management in Synapse
          • Questions
            • What are resource classes?
            • What are the 2 types of resource classes?
            • What are the 3 concepts that make up workload management in dedicated sql pool for Azure Synapse?
            • How many custom workload groups can you define in dedicated sql pools?
            • What are the configurable options for Workload Groups?
            • What is a classifier?
            • What is inside a classifier?
          • Answers
            • They are pre-determined resource limits defined for a user or role. They govern the memory allocated to each query and control the number of concurrent queries that can run. Workload management with workload groups is now the recommended approach.
            • Static and dynamic. Static resource classes allocate the same amount of memory regardless of the current performance level of the SQL pool, while dynamic resource classes allocate memory that scales with the performance level. The 4 dynamic classes are the pre-defined database roles smallrc, mediumrc, largerc and xlargerc.
            • Workload Classification, Workload Importance and Workload Isolation
            • 8 custom workload groups
            • These are the configurable options
              • Min Resources %
              • Min & Max Resources % per Request
              • Cap Resources %
              • Concurrency Range
              • Effective Values
              • Classifiers
              • Request Importance
              • Query Execution Timeout
            • A classifier assigns an incoming request to a workload group. To ensure the correct requests are properly assigned, multiple classifiers can be defined per workload group.
            • A Classifier consists of
              • Name
              • Member (user/role)
              • Request importance (1-5)
              • Label (used in a query)
              • Start and end time : the window during which the classifier is active
              • Session context : replaces the label; the session name applies to the classifier for the entire session
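          • Example : a minimal T-SQL sketch; the group, classifier and login names are hypothetical
            -- Workload group reserving resources for data loads
            CREATE WORKLOAD GROUP wgDataLoads
            WITH
            (
                MIN_PERCENTAGE_RESOURCE            = 20,
                CAP_PERCENTAGE_RESOURCE            = 40,
                REQUEST_MIN_RESOURCE_GRANT_PERCENT = 5
            );

            -- Classifier routing requests from LoaderLogin into the group
            CREATE WORKLOAD CLASSIFIER wcDataLoads
            WITH
            (
                WORKLOAD_GROUP = 'wgDataLoads',
                MEMBERNAME     = 'LoaderLogin',
                IMPORTANCE     = HIGH
            );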
        • Copy Activity
          • Questions
            • What are the 2 ways of ingesting data when the source is ADLS G2?
            • Advantages of Copy over PolyBase?
          • Answers
            • PolyBase and Copy
            • Custom column and row delimiters
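          • Example : a minimal T-SQL sketch of COPY with custom delimiters; the storage URL and table name are placeholders
            COPY INTO dbo.FactSales
            FROM 'https://myaccount.blob.core.windows.net/mycontainer/sales/*.csv'
            WITH
            (
                FILE_TYPE       = 'CSV',
                FIELDTERMINATOR = '|',      -- custom column delimiter
                ROWTERMINATOR   = '0x0A',   -- custom row delimiter
                FIRSTROW        = 2,        -- skip the header row
                CREDENTIAL      = (IDENTITY = 'Managed Identity')
            );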
      • Optimize data warehouse query performance in Synapse Analytics
        • Performance Issues
          • Questions
            • What are concurrency slots?
            • What command helps you view queries and the resources assigned to them?
            • What is the first indication of poor performance?
            • What are the indications of poor load performance?
            • What is the one of the root causes of poor query performance?
          • Answers
            • The number of queries that can execute in your database at the same time. It depends on how many resources you have allocated to the pool, i.e. the DWUs allocated.
            • Run a select query from sys.dm_pdw_exec_requests
            • Users reporting long running queries
            • Telemetry data from Azure Synapse pipelines, or reports that are out of date
            • Inter-distribution data movement. One of the most common movements is the shuffle, where for each input row a hash is computed over the join columns and the row is sent to the node that owns that hash value
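          • Example : a minimal T-SQL sketch for spotting expensive queries and shuffle moves; QID1234 is a placeholder request_id
            -- Find the 10 longest-running requests
            SELECT TOP 10 request_id, [status], total_elapsed_time, command
            FROM sys.dm_pdw_exec_requests
            ORDER BY total_elapsed_time DESC;

            -- Inspect one request's steps for data movement such as ShuffleMoveOperation
            SELECT step_index, operation_type, total_elapsed_time, row_count
            FROM sys.dm_pdw_request_steps
            WHERE request_id = 'QID1234'
            ORDER BY step_index;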
        • Table distribution design
          • Questions
            • What are the 3 types of table distributions?
            • What distribution delivers highest query performance?
            • What are the considerations in choosing a hash function?
            • What is the ideal size for replicated tables?
            • What are the advantages and disadvantages of replicated tables?
            • What kind of query would help you to determine which replicated tables have been modified but not rebuilt?
            • What would force a rebuild of a modified replicated table?
            • What situations will impact replicated tables negatively?
          • Answers
            • Round robin, hash and replicated
            • Hash distribution. The data is sharded using a hash function and each row is assigned to one distribution. In the table definition, one of the columns is designated as the distribution column.
            • We need to choose a column that will evenly distribute the data across all nodes. These are the considerations
              • Choose a column with the highest number of unique values
              • Choose a column with very few NULL values
              • Do not choose a date column, as all data for a specific date will land in the same place; if a daily report is needed, all business users will hit only that node
            • With compression, the size should be less than 2 GB
            • The advantage is the fastest performance for small tables. The disadvantage is additional overhead when writing data, since the data needs to be copied to every node.
            • A query such as :
              SELECT [ReplicatedTable] = t.[name]
              FROM sys.tables t
              JOIN sys.pdw_replicated_table_cache_state c
                ON c.object_id = t.object_id
              JOIN sys.pdw_table_distribution_properties p
                ON p.object_id = t.object_id
              WHERE c.[state] = 'NotReady'
                AND p.[distribution_policy_desc] = 'REPLICATE'
            • Any query that reads the table, e.g. select top 1 * from [ReplicatedTable]
            • Following situations will impact replicated tables negatively
              • Frequent inserts, deletes and updates, which trigger rebuilds
              • The SQL pool being scaled frequently, which changes the number of compute nodes and rebuilds the replicated tables
              • The table having a large number of columns; it is more efficient to distribute the table and create indexes on the frequently used columns
        • Indexes to improve performance
          • Questions
            • What is a table with no indexes called?
            • What is the default index created when no index options are specified on a table?
            • What is the segment size for columnstore index?
            • What is the difference between a nonclustered columnstore index and a clustered columnstore index?
            • What is the use of a nonclustered columnstore index?
          • Answers
            • Heap
            • Clustered Columnstore index (CCI)
            • 1,048,576 rows compressed and optimized by column
            • A nonclustered columnstore index is a secondary index that is created on a rowstore table, but a clustered columnstore index is the primary storage for the entire table.
            • A nonclustered columnstore index is useful for running real-time operational analytics. The OLTP workload uses the underlying clustered index (rowstore) while the analytics can run concurrently on the columnstore index.
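          • Example : a minimal sketch of the real-time operational analytics pattern. Note this is SQL Server syntax; dedicated SQL pools do not support nonclustered columnstore indexes. Table and column names are hypothetical.
            -- OLTP table keeps its clustered rowstore index for transactions
            CREATE TABLE dbo.Orders
            (
                OrderId    INT           NOT NULL PRIMARY KEY CLUSTERED,
                CustomerId INT           NOT NULL,
                OrderDate  DATE          NOT NULL,
                Amount     DECIMAL(18,2) NOT NULL
            );

            -- Secondary columnstore index lets analytic scans run concurrently
            CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Orders
            ON dbo.Orders (CustomerId, OrderDate, Amount);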
        • Partitioning
          • Notes
            • Any partitioning added to a table is in addition to the distributions created behind the scenes. When creating partitions on clustered columnstore tables, it is important to consider how many rows belong to each partition: for optimal compression, each partition needs at least 1 million rows per distribution, i.e. roughly 60 million rows per partition.
          • Questions
            • What are the benefits of partitioning in dedicated SQL pool?
            • What type of column is mostly used for partitioning?
            • What is the drawback of partitions?
          • Answers
            • Following are the benefits:
              • To improve the efficiency and performance of data loading through partition deletion, switching and merging; e.g. for a rolling 36-month window we can delete the oldest month very quickly each month
              • To improve query performance: queries that filter on the partitioning column run faster because they do not have to scan all partitions
            • Date
            • A table with too many partitions will hurt performance
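          • Example : a minimal T-SQL sketch of a date-partitioned fact table; the table name and monthly boundary values are illustrative
            CREATE TABLE dbo.FactSales_Partitioned
            (
                OrderDateKey INT           NOT NULL,
                CustomerKey  INT           NOT NULL,
                Amount       DECIMAL(18,2) NOT NULL
            )
            WITH
            (
                DISTRIBUTION = HASH(CustomerKey),
                CLUSTERED COLUMNSTORE INDEX,
                PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20240101, 20240201, 20240301))
            );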
    • Databricks
      • Questions
        • What are the 2 types of clusters?
        • What are the 2 types of interactive clusters?
        • What is a Databricks Workspace?
      • Answers
        • The 2 types are :
          • Interactive Cluster
            • Analyse data using interactive notebooks
            • Multiple users can use a cluster and collaborate
          • Job Cluster
            • Work runs as a job. ADB starts the cluster, and once the job is complete the cluster is terminated
        • Standard clusters and high-concurrency clusters. A standard cluster is recommended for a single user; it provides no fault isolation, no workload sharing and no table access control. A high-concurrency cluster supports multiple users and provides fault isolation and table access control.
        • A workspace can refer to two different things
          • Workspace = Logical environment with Clusters + Data in DBFS + Server resources and
          • Workspace = Root Folder with notebooks, libraries and dashboards