EXPERIMENT-1
AIM: Study of ETL process and its tools.
What is ETL?
ETL is an abbreviation of Extract, Transform and Load. In this process, an ETL tool extracts
the data from different RDBMS source systems, then transforms the data by applying
calculations, concatenations, etc., and then loads the data into the Data Warehouse system.
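As a rough illustration, the sketch below walks one batch of rows through the three stages using Python's built-in sqlite3 module. The database files, table, and column names are all hypothetical; extraction from any other RDBMS would follow the same pattern.

import sqlite3

# Hypothetical database files and schema; any RDBMS source works the same way.
SOURCE_DB = "sales_source.db"
WAREHOUSE_DB = "warehouse.db"

def extract(conn):
    # Extract: pull raw rows from the source system.
    return conn.execute(
        "SELECT first_name, last_name, quantity, unit_price FROM orders"
    ).fetchall()

def transform(rows):
    # Transform: apply concatenations and calculations, as described above.
    return [(f"{first} {last}", qty * price) for first, last, qty, price in rows]

def load(conn, rows):
    # Load: write the transformed rows into a Data Warehouse fact table.
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales (customer TEXT, total REAL)")
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)

with sqlite3.connect(SOURCE_DB) as src, sqlite3.connect(WAREHOUSE_DB) as dw:
    load(dw, transform(extract(src)))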
It's tempting to think that creating a Data warehouse is simply a matter of extracting data
from multiple sources and loading it into the database of a Data warehouse. This is far from
the truth; it requires a complex ETL process. The ETL process requires active inputs from
various stakeholders, including developers, analysts, testers, and top executives, and is
technically challenging.
In order to maintain its value as a tool for decision-makers, a Data warehouse system needs to
change with business changes. ETL is a recurring activity (daily, weekly, monthly) of a Data
warehouse system and needs to be agile, automated, and well documented.
Why do you need ETL?
There are many reasons for adopting ETL in the organization:
● It helps companies to analyze their business data for taking critical business decisions.
● ETL provides a method of moving the data from various sources into a data
warehouse.
● As data sources change, the ETL process keeps the Data Warehouse up to date.
● It allows verification of data transformation, aggregation, and calculation rules.
● ETL process allows sample data comparison between the source and the target
system.
ETL Process in Data Warehouses
Step 1) Extraction
In this step, data is extracted from the source system into the staging area. Transformations,
if any, are done in the staging area so that the performance of the source system is not
degraded. Also, if corrupted data is copied directly from the source into the Data warehouse
database, rollback will be a challenge. The staging area gives an opportunity to validate
extracted data before it moves into the Data warehouse.
Hence, one needs a logical data map before data is extracted and loaded physically. This data
map describes the relationship between source and target data.
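In practice, a logical data map can be as simple as a configuration structure that records, for each target column, its source and the transformation rule applied. A minimal sketch in Python, with hypothetical table and column names:

# Hypothetical logical data map: each warehouse column is traced back
# to its source columns and the transformation rule applied to them.
LOGICAL_DATA_MAP = {
    "fact_sales.customer": {
        "source": "crm.orders(first_name, last_name)",
        "rule": "concatenate with a space",
    },
    "fact_sales.total": {
        "source": "crm.orders(quantity, unit_price)",
        "rule": "quantity * unit_price",
    },
    "fact_sales.order_date": {
        "source": "crm.orders(created_at)",
        "rule": "truncate timestamp to date",
    },
}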
Three Data Extraction methods (a brief sketch follows the list):
1. Full Extraction
2. Partial Extraction - without update notification
3. Partial Extraction - with update notification
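The first two methods can be sketched as follows, assuming a hypothetical last_updated timestamp column on the source table that records when each row last changed:

import sqlite3

def full_extraction(conn):
    # Full extraction: re-read the entire source table on every run.
    return conn.execute("SELECT * FROM orders").fetchall()

def partial_extraction(conn, last_run_ts):
    # Partial extraction without update notification: filter on a timestamp
    # column so only rows changed since the previous run are picked up.
    return conn.execute(
        "SELECT * FROM orders WHERE last_updated > ?", (last_run_ts,)
    ).fetchall()

# With update notification, the source system itself flags or pushes changed
# rows (e.g. via triggers or change-data-capture), so no table scan is needed.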
Irrespective of the method used, extraction should not affect the performance and response
time of the source systems. These source systems are live production databases; any
slowdown or locking could affect the company's bottom line.
Some validations are done during Extraction (sketched in code after the list):
● Reconcile records with the source data
● Make sure that no spam/unwanted data is loaded
● Data type check
● Remove all types of duplicate/fragmented data
● Check whether all the keys are in place or not
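A few of these checks can be combined into a single validation pass over the extracted records. The sketch below is illustrative only; a real feed would define its own expected types and key fields:

def validate(records, expected_types, key_fields):
    # records: list of dicts; expected_types: field -> type; key_fields: list.
    seen, clean = set(), []
    for rec in records:
        # Data type check: every field must match its expected type.
        if any(not isinstance(rec.get(f), t) for f, t in expected_types.items()):
            continue
        # Key check: all key fields must be present and non-null.
        if any(rec.get(k) is None for k in key_fields):
            continue
        # Duplicate check: skip records whose key has already been seen.
        key = tuple(rec[k] for k in key_fields)
        if key in seen:
            continue
        seen.add(key)
        clean.append(rec)
    return clean

For example, validate(rows, {"order_id": int, "customer": str}, ["order_id"]) would drop ill-typed, keyless, and duplicate order records in one pass.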
Step 2) Transformation
Data extracted from the source server is raw and not usable in its original form. Therefore, it
needs to be cleansed, mapped, and transformed. In fact, this is the key step where the ETL
process adds value and changes data so that insightful BI reports can be generated.
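To make this concrete, the sketch below applies three common kinds of transformation (cleansing, standardization, and derivation) to hypothetical order rows; the column names and code mappings are illustrative:

def transform(rows):
    out = []
    for first, last, qty, price, country in rows:
        # Cleansing: drop rows with missing or nonsensical values.
        if qty is None or price is None or qty <= 0:
            continue
        # Standardization: map inconsistent country codes onto one convention.
        country = {"UK": "GB", "USA": "US"}.get(country, country)
        # Derivation: concatenation and calculation, as described earlier.
        out.append((f"{first.strip()} {last.strip()}", country, qty * price))
    return out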