extract, transform, load (ETL)
Extract, transform, load (ETL) is a data warehousing process that combines data from multiple sources into a single data set, pre-processed with consistent business rules. The result is a single data model that is uniform across all of those sources.
The ETL process involves three main steps:
- Extract: Data, which can be structured or unstructured, is collected from various source systems and then copied or exported to a staging area.
- Transform: In the staging area, the raw data is processed and transformed. This may include cleaning, renaming, deduplicating, and applying business rules to make the data suitable for analytical purposes. The transformation step ensures that the data meets the necessary quality standards and is in the correct format for querying and analysis.
- Load: Finally, the transformed data is loaded into a target system, such as a data warehouse, data lake, or another type of data repository.
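The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the sample CSV, the `customers` table, the function names, and the "uppercase the country code" business rule are all hypothetical, and an in-memory SQLite database stands in for the target data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source export; a real pipeline would read files, APIs, or databases.
CRM_CSV = """customer_id,Full Name,country
1, Ada Lovelace ,uk
2,Grace Hopper,US
2,Grace Hopper,US
"""


def extract(raw_csv):
    """Extract: copy the raw rows into a staging list without changing them."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(staged_rows):
    """Transform: rename fields, clean values, deduplicate, apply a business rule."""
    seen_ids = set()
    cleaned = []
    for row in staged_rows:
        customer_id = int(row["customer_id"])
        if customer_id in seen_ids:  # deduplicate on the customer key
            continue
        seen_ids.add(customer_id)
        cleaned.append({
            "customer_id": customer_id,
            "name": row["Full Name"].strip(),          # rename and trim whitespace
            "country": row["country"].strip().upper(),  # assumed rule: uppercase codes
        })
    return cleaned


def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (:customer_id, :name, :country)", rows
    )
    conn.commit()


def run_etl(conn):
    """Run the three steps in order and return what ended up in the target."""
    load(transform(extract(CRM_CSV)), conn)
    return conn.execute(
        "SELECT customer_id, name, country FROM customers ORDER BY customer_id"
    ).fetchall()


if __name__ == "__main__":
    print(run_etl(sqlite3.connect(":memory:")))
```

Keeping each step in its own function mirrors the separation described above: the staging data is never modified in place, and only cleaned rows reach the target table.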
ETL is essential for data analytics and machine learning because it organizes data in a way that addresses specific business intelligence needs. ETL processes are typically automated and can be scheduled to run during off-peak hours to minimize the impact on source systems, although many vendors now also offer real-time data integration.
ETL is often compared to ELT (extract, load, transform), a similar process with a different order of operations. In ELT, raw data is loaded directly into the target data store and transformed there afterwards, which can be more suitable for handling large volumes of unstructured data.
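The difference in ordering can be illustrated with a short Python sketch of the ELT variant. It is only a sketch under assumptions: the sample rows, the table names `raw_customers` and `customers`, and the function name `run_elt` are hypothetical, and an in-memory SQLite database stands in for the target data store. The raw rows are loaded first, and the cleaning and deduplication happen afterwards as SQL inside the target.

```python
import sqlite3

# Hypothetical rows already extracted from a source system, still uncleaned.
RAW_ROWS = [
    (1, " Ada Lovelace ", "uk"),
    (2, "Grace Hopper", "US"),
    (2, "Grace Hopper", "US"),
]


def run_elt(conn):
    # Load: the raw rows go into the target store untouched.
    conn.execute(
        "CREATE TABLE raw_customers "
        "(customer_id INTEGER, full_name TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO raw_customers VALUES (?, ?, ?)", RAW_ROWS)

    # Transform: performed afterwards, inside the target, using its SQL engine.
    conn.execute(
        """
        CREATE TABLE customers AS
        SELECT DISTINCT
            customer_id,
            TRIM(full_name) AS name,
            UPPER(country)  AS country
        FROM raw_customers
        """
    )
    conn.commit()
    return conn.execute(
        "SELECT customer_id, name, country FROM customers ORDER BY customer_id"
    ).fetchall()


if __name__ == "__main__":
    print(run_elt(sqlite3.connect(":memory:")))
```

Because the untransformed table stays in the target store, the transformation can be revised and rerun later without extracting from the source systems again.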