Slowly Changing Dimension
What is SCD?.
In Data Ware Housing world SCD plays important role.
SCD stands for Slowly Changing Dimension.
It means changes in Dimension table.
Dimension table would contains Customer table. Initially consider this is a master table with only one Entry for each customer.
If there are any changes in the customer details. We need to make new entry for the same customer with the changed details.
For this we have two options.
1) Create new entry in the same table by creating a current record status flag
2) Update the modifications in the master table and create the entry for old records in history table with Foreign key relation.
Type-1 Dimension
In this it overwrites the old data with new data. No track for history data.
Type-2 Dimension
The Type 2 method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers.
Type-3 Dimension
The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns designated for storing historical data.
Type-4 Dimension
The Type 4 method is usually referred to as using "history tables", where one table keeps the current data, and an additional table is used to keep a record of some or all changes.
Type-6 Dimension
It is combination of 1,2,3 Dimensions.
In this changes will be done in the same table by inserting new record and maintaining old records. Each record is identified by its own surrogate key. The same shall be inserted in the fact table to know the facts for that surrogate key.
Best article http://en.wikipedia.org/wiki/Slowly_changing_dimension
What is Surrogate Key, how can we use this SCD?
Surrogate key is not for the user. This is to identify each individual record in the database.
Surrogate key cannot be used as storage model. Which means we cannot use the Surrogate key to insert any data based on surrogate key. This only for logical relation.
In slowly changing dimension if we have one master table with current customer details and another one to store the history of Customer details.
To identity each record uniquely in history table we can create a unique key which is called Surrogate Key in the combination of CustomerId as foreign key.
Check the Article: http://en.wikipedia.org/wiki/Surrogate_key
How to trace the slowly changing dimensions in ETL.
Often in these situations the data warehouse receives a complete updated copy of the customer dimension as frequently as once per day. Of course, it would be wonderful if only the "deltas" (the changed records) were delivered to the data warehouse, but more typically the data warehouse has to find the changed records by carefully searching the whole file. This comparison step of each field in each record between yesterday's version and today's version is messy and slow.
Here's a technique that accomplishes this comparison step at blinding speeds and has the added bonus of making your ETL program simpler. The technique relies on a simple CRC code that is computed for each record (not each field) in the incoming customer file. More on CRC's in a moment.
Here are the processing steps:
1. Read each record of today's new customer file and compute that record's CRC code.
2. Compare this record's CRC code with the same record's CRC code from yesterday's run, which you saved. You will need to match on the source system's native key (customer ID) to make sure you are comparing the right records.
3. If the CRC codes are the same, you can be sure that the entire 100 fields of the two records exactly match. YOU DON'T HAVE TO CHECK EACH FIELD.
4. If the CRC codes differ, you can immediately create a new surrogate customer key and place the updated record in the data warehouse customer dimension. This is a Type 2 slowly changing dimension (SCD). Or, a more elaborate version could search the 100 fields one by one in order to decide what to do. Maybe some of the fields trigger an overwrite of the data warehouse dimension record, which is a Type 1 SCD.
CRC stands for Cyclic Redundancy Checksum and it is a mathematical technique for creating a unique
code for every distinguishable input. The CRC code can be implemented in Basic or C
In Data Ware Housing world SCD plays important role.
SCD stands for Slowly Changing Dimension.
It means changes in Dimension table.
Dimension table would contains Customer table. Initially consider this is a master table with only one Entry for each customer.
If there are any changes in the customer details. We need to make new entry for the same customer with the changed details.
For this we have two options.
1) Create new entry in the same table by creating a current record status flag
2) Update the modifications in the master table and create the entry for old records in history table with Foreign key relation.
Type-1 Dimension
In this it overwrites the old data with new data. No track for history data.
Type-2 Dimension
The Type 2 method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers.
Type-3 Dimension
The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns designated for storing historical data.
Type-4 Dimension
The Type 4 method is usually referred to as using "history tables", where one table keeps the current data, and an additional table is used to keep a record of some or all changes.
Type-6 Dimension
It is combination of 1,2,3 Dimensions.
In this changes will be done in the same table by inserting new record and maintaining old records. Each record is identified by its own surrogate key. The same shall be inserted in the fact table to know the facts for that surrogate key.
Best article http://en.wikipedia.org/wiki/Slowly_changing_dimension
What is Surrogate Key, how can we use this SCD?
Surrogate key is not for the user. This is to identify each individual record in the database.
Surrogate key cannot be used as storage model. Which means we cannot use the Surrogate key to insert any data based on surrogate key. This only for logical relation.
In slowly changing dimension if we have one master table with current customer details and another one to store the history of Customer details.
To identity each record uniquely in history table we can create a unique key which is called Surrogate Key in the combination of CustomerId as foreign key.
Check the Article: http://en.wikipedia.org/wiki/Surrogate_key
How to trace the slowly changing dimensions in ETL.
Often in these situations the data warehouse receives a complete updated copy of the customer dimension as frequently as once per day. Of course, it would be wonderful if only the "deltas" (the changed records) were delivered to the data warehouse, but more typically the data warehouse has to find the changed records by carefully searching the whole file. This comparison step of each field in each record between yesterday's version and today's version is messy and slow.
Here's a technique that accomplishes this comparison step at blinding speeds and has the added bonus of making your ETL program simpler. The technique relies on a simple CRC code that is computed for each record (not each field) in the incoming customer file. More on CRC's in a moment.
Here are the processing steps:
1. Read each record of today's new customer file and compute that record's CRC code.
2. Compare this record's CRC code with the same record's CRC code from yesterday's run, which you saved. You will need to match on the source system's native key (customer ID) to make sure you are comparing the right records.
3. If the CRC codes are the same, you can be sure that the entire 100 fields of the two records exactly match. YOU DON'T HAVE TO CHECK EACH FIELD.
4. If the CRC codes differ, you can immediately create a new surrogate customer key and place the updated record in the data warehouse customer dimension. This is a Type 2 slowly changing dimension (SCD). Or, a more elaborate version could search the 100 fields one by one in order to decide what to do. Maybe some of the fields trigger an overwrite of the data warehouse dimension record, which is a Type 1 SCD.
CRC stands for Cyclic Redundancy Checksum and it is a mathematical technique for creating a unique
code for every distinguishable input. The CRC code can be implemented in Basic or C
Comments
Post a Comment