A Deep Dive Into CDC With Azure Data Factory

Share This Post

Introduction

Change Data Capture (CDC) in SQL Server is a powerful feature designed to track and capture changes made to data within a database. It provides a reliable and efficient way to identify alterations to tables, allowing for the extraction of valuable insights into data modifications over time. By enabling CDC with Azure Data Factory, SQL Server enables a systematic and automated approach to monitoring and capturing changes, facilitating better data management, auditing, and analysis within the database environment.

Most Common Use-Cases: CDC With Azure Data Factory

Common scenarios where the CDC with Azure Data Factory proves beneficial include:

Audit Trail and Analytics: Monitoring data alterations for audit trails and conducting analytical assessments on change data.

Downstream Propagation: Efficiently propagating changes to downstream subscribers for synchronized data updates.

ETL Operations: Facilitating Extract, Transform, Load (ETL) operations to seamlessly transfer data changes from the Online Transaction Processing (OLTP) system to a data lake or data warehouse. Tools like Azure Data Factory can be employed for this purpose.

Event-Driven Programming: Enabling event-based programming for instantaneous responses triggered by data changes, enhancing real-time system interactions.

Usage: Some Queries

Here are SQL queries and commands for managing Change Data Capture (CDC) in SQL Server:

  • Check if CDC is enabled for the database:

Select  name, is_cdc_enabled from sys.databases;

  • Check which tables have CDC enabled::

Select  name, is_tracked_by_cdc from sys.tables;

  • First, the database needs to be enabled:
    • EXEC sys.sp_cdc_enable_db
  • Then enable all the tables to be audited:
  Stages of agile retrospective & 7 popular techniques

EXECUTE sys.sp_cdc_enable_table

        @source_schema = N’dbo’,

        @source_name = N’PslMaterials’,

        @role_name     = NULL;

  • To disable the database:
    • EXEC sys.sp_cdc_disable_db
  • To disable a table:
    • EXEC sys.sp_cdc_disable_table
    •     @source_schema = N’dbo’,
    •     @source_name   = N’MyTable’,
    •     @capture_instance = N’dbo_MyTable’

When CDC is enabled for a database, a dedicated schema named CDC is established. Within this schema, several essential tables are created to manage and store change data. It’s crucial to note that disabling CDC for a table or the entire database can lead to the removal of these tables, resulting in the loss of historical changes. To preserve this historical data, it is necessary to copy the changes to another table or file.

CDC Schema

The key tables within the CDC schema include:

  • cdc.change_tables: the list of tables with cdc enabled
  • cdc.captured_columns : the list of captured columns for each table
  • cdc.ddl_history: Documents Data Definition Language (DDL) statements that modify the source tables. These changes aren’t immediately applied to CDC tables; a restart of the CDC instance is required for the changes to take effect.
  • cdc.index_columns: Defines the primary key of CDC tables.
  • cdc.lsn_time_mapping: Manages long block sequence number time mapping.

Additionally, when a table is enabled for CDC, two more tables are created:

  • cdc.cdc_jobs: Handles CDC-related jobs.
  • cdc.SchemaName_TableName_CT: Represents the change table for a specific schema and table, for instance, dbo_PslVendors_CT.

Mirrors all fields from the original table with some extra columns needed for CDC: 

  • __$start_lsn: Binary code that keeps track of when changes were committed, helping maintain the order in which changes occurred.
  • __$seqval: Another binary code used to organize changes to a row within a transaction.
  • __$operation: A number indicating the type of change made to the data. 1 represents a deletion, 2 is for insertion, and 3 and 4 are for updates (capturing column values before and after the update).
  • __$update_mask: A series of bits indicating which columns were changed during an update.
  • <captured source table columns>: The remaining columns represent the specific data captured during the creation of the capture instance. If no columns were specified, all columns from the source table are included.
  Domotics with Home Assistant: concepts

CDC Implementation Details:

  • Every source table enabled for the CDC has its dedicated CDC table.
  • Ensure sufficient database space to accommodate the additional tables generated, preventing potential space shortages.
  • The SQL Server Agent capture job retrieves changes from the transaction log and incorporates them into the corresponding change tables.
  • Cleanup jobs manage the change tables, adhering to a retention policy to remove outdated data.
  • Query functions provide a means to access and utilize change data from the CDC change tables.
  • In Azure SQL databases, where SQL Server Agent is unavailable, the CDC scheduler assumes the role of capturing and cleaning up data.

Performance Considerations: Factors Impacting Performance

  • Number of CDC-Enabled Tables
    • The more tables enabled for CDC, the higher the processing overhead. Evaluate necessity against performance impact.
  • Frequency of Changes in Tracked Tables
    • Tables undergoing frequent changes increase the volume of captured data. Regularly changing data may impact performance.
  • Space Availability in the Source Database
    • CDC captures changes and stores them. Ensure adequate space in the source database to accommodate change tables without risking space shortages.
CTA Software

CDC with Azure Data Factory

In Azure cloud, Data Factory is a powerful tool for various needs, and now includes a preview for Change Data Capture (CDC), which simplifies the process, offering the seamless power of CDC. Let’s explore the steps to leverage this feature:

Step to create CDC in Data factory:

1. Let’s create a CDC: CDC can be executed as a standalone resource, eliminating the need for a pipeline as it is needed for example for running Data flows.

  Software requirements prioritization techniques you should know
new Cdc

2. Assign a name to the resource (it must be alphanumeric): Choose the source type, ranging from various types of databases to files. In the case of Azure SQL database, select the tables. CDC-enabled tables are automatically detected; otherwise, specify a field defining row modifications (typically a modified date field).

create cdc v2

3. Choose the destination: In this case, the same as the origin types: databases and also some storage where to store the files with the changes.

create cdc destination

4. Define the destination: The destination table will be created automatically with the Auto map option selected. Choose a key for the destination table.

new mapping

5. Define a latency among the given options: Real-time, 15-minute, 30-minute, 1 hour, 2 hours. Initiate the process, and the agent will read data at defined intervals.

6. Monitor: The green dots signify the instances when CDC was executed, occurring every 15 minutes in this example. The blue dots represent the captured changes during each execution, providing a clear monitoring interface.

monitor cdc

Conclusion

CDC stands out as a robust and influential tool, offering valuable capabilities for tracking and managing changes in databases. With the advent of CDC with Azure Data Factory, this power is seamlessly harnessed in a user-friendly and practical manner. The combination of CDC and Data Factory presents an efficient and accessible solution for implementing Change Data Capture with utmost satisfaction.


Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

Subscribe To Our Newsletter

Get updates from our latest tech findings

Have a challenging project?

We Can Work On It Together

apiumhub software development projects barcelona
Secured By miniOrange