
What is the development process of ETL?


Many novices are not very clear about the development process of ETL. To help solve this problem, the following article explains it in detail; anyone who needs it can learn from it, and hopefully you will gain something.

What is ETL? ETL is the acronym for Extract, Transform, and Load. In short, ETL means copying data from one location to another.

Extract: reads data from different types of data sources, including databases.

Transform: converts the extracted data into a specific format. Transformation also includes enriching the data with other data already in the system.

Load: writes the data to the target database, data warehouse, or another system.
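As a concrete illustration, the following is a minimal Python sketch of these three steps. The file name, table layout, and the currency-conversion enrichment are hypothetical; SQLite merely stands in for a real target warehouse.

```python
import csv
import sqlite3

# --- Extract: read raw rows from a (hypothetical) CSV export ---
def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

# --- Transform: normalize fields and enrich with a derived column ---
def transform(rows):
    for row in rows:
        yield {
            "order_id": int(row["order_id"]),
            "country": row["country"].strip().upper(),           # standardize the format
            "amount_usd": round(float(row["amount"]) * 1.1, 2),  # illustrative enrichment
        }

# --- Load: write the transformed rows into the target store ---
def load(rows, conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, country TEXT, amount_usd REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (:order_id, :country, :amount_usd)", list(rows)
    )
    conn.commit()

if __name__ == "__main__":
    target = sqlite3.connect("warehouse.db")  # stand-in for a real warehouse
    load(transform(extract("orders.csv")), target)
```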

Depending on the infrastructure, ETL can be divided into two broad categories.

Traditional ETL

Previously, data was usually stored in operational systems, files, and data warehouses, and it moved between these locations several times a day. Ready-made ETL tools and hand-written scripts handled these movements.

(Figure: Workflow of traditional ETL)
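To make the batch nature of this workflow concrete, here is a minimal, hypothetical sketch of a traditionally scheduled ETL job; in practice the scheduling is usually done by cron or a dedicated ETL tool rather than a hand-rolled loop.

```python
import time

def nightly_etl_job():
    # Extract -> transform -> load over yesterday's files and tables,
    # as in the basic ETL sketch earlier in the article.
    print("running batch ETL over yesterday's data...")

# A traditional pipeline wakes up on a fixed schedule (for example once a day)
# rather than reacting to each record as it arrives.
while True:
    nightly_etl_job()
    time.sleep(24 * 60 * 60)  # sleep until the next batch window
```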

This architecture is complex and very difficult to manage. Here are some shortcomings of the traditional ETL architecture:

Processing between databases, files, and data warehouses is carried out in batches.

Currently, most companies need to analyze and manipulate real-time data. However, traditional tools are not suitable for analyzing logs, sensor data, measurement data and so on.

A very large domain data model requires a global, shared structure.

Traditional ETL processing is slow, time-consuming, and requires a lot of resources.

Traditional architectures only focus on existing technologies. As a result, applications and tools are rewritten every time a new technology is introduced.

As time went by, big data changed the order of processing. The data is extracted and loaded into a warehouse in its original format, and transformation is performed whenever data analysts or other systems need the data. This process is called ELT. However, it is best suited to processing inside a data warehouse; systems such as Oracle Data Integration Platform Cloud provide this capability.
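A minimal sketch of the ELT pattern just described: raw data is landed as-is, and the transformation runs later as a query inside the store. The table names and the sensor payload are illustrative, and SQLite (assuming a build with the JSON1 functions) merely stands in for a real warehouse.

```python
import json
import sqlite3

conn = sqlite3.connect("warehouse.db")

# --- Extract + Load: land the raw events in their original (JSON) format ---
conn.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
raw = [{"sensor": "s1", "temp_c": 21.4}, {"sensor": "s2", "temp_c": 19.8}]
conn.executemany("INSERT INTO raw_events VALUES (?)", [(json.dumps(e),) for e in raw])
conn.commit()

# --- Transform on demand: the data is shaped only when someone needs it ---
avg_temp = conn.execute(
    "SELECT AVG(json_extract(payload, '$.temp_c')) FROM raw_events"
).fetchone()[0]
print(f"average temperature: {avg_temp:.1f} C")
```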

The current state of ETL

Compared with a decade ago, the state of data and data processing in the world today has changed dramatically. Traditional ETL processes are no longer enough to deal with modern data. Some of the reasons are as follows:

Modern data processing usually includes real-time processing, and organizations need real-time insight into the data as it is processed.

The system needs to perform ETL on data streams rather than in batches, and it should scale automatically to handle higher data traffic.

Some single-server databases have been replaced by distributed data platforms (such as Cassandra, MongoDB, Elasticsearch, SaaS applications, etc.), messaging systems (Kafka, ActiveMQ, etc.), and several other types of endpoints.

The system should be able to add additional data sources or destinations in a manageable manner.

Duplicate data processing due to a "write-and-use" architecture should be avoided.

Change data capture (CDC) technology is needed so that traditional ETL sources and legacy operations can still be integrated.

Data sources are diverse and maintainability of new requirements needs to be taken into account.

The source and target endpoints should be decoupled from the business logic. A data mapping layer is used to seamlessly connect new sources and endpoints without affecting the data conversion process.

(Figure: Data mapping layer)

The data received should be standardized before transformation (or before business rules are enforced).

After transformation, and before publishing to the endpoint, the data should be converted into the format the destination expects.

Data cleaning is no longer the only data conversion needed in the modern world; data transformation also has to meet many of the organization's business needs.

Current data processing usually includes filtering, joins, aggregation, sequences, pattern matching, and enrichment to implement complex business logic (a minimal sketch follows below).

(Figure: Data processing workflow)
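The sketch below illustrates the mapping-layer idea under stated assumptions: two hypothetical sources report the same reading with different field names, both are normalized into a common schema, and a simple business rule then runs only against that schema. All names and thresholds are invented for the example.

```python
from datetime import datetime, timezone

# Hypothetical raw events from two differently structured sources.
legacy_event = {"ts": "2024-05-01 10:15:00", "machine": "M-7", "temperature": "78.5"}
modern_event = {"time": 1714558500, "machineId": "M-9", "tempC": 81.2}

def map_legacy(e):
    """Normalize a legacy file/CDC record into the common schema."""
    return {
        "machine_id": e["machine"],
        "temp_c": float(e["temperature"]),
        "event_time": datetime.strptime(e["ts"], "%Y-%m-%d %H:%M:%S"),
    }

def map_modern(e):
    """Normalize a modern HTTP/JSON event into the same common schema."""
    return {
        "machine_id": e["machineId"],
        "temp_c": e["tempC"],
        "event_time": datetime.fromtimestamp(e["time"], tz=timezone.utc),
    }

# Business rules see only the common schema, never the raw source formats.
def overheating(event, threshold=80.0):
    return event["temp_c"] > threshold

for event in (map_legacy(legacy_event), map_modern(modern_event)):
    if overheating(event):
        print(f"ALERT: {event['machine_id']} at {event['temp_c']} C")
```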

Streaming ETL to the rescue

New data requirements are the driving force behind organizations moving forward. In many organizations, the vast majority of traditional systems still work, using databases and file systems, while the same organizations are also experimenting with new systems and technologies that can handle big data and ever faster data rates (such as tens of thousands of records per second), for example Kafka and ActiveMQ. With a streaming ETL integration architecture, organizations do not need to plan, design, and implement a complex architecture just to fill the gap between traditional and modern systems. Streaming ETL architectures are scalable, manageable, and can handle high-volume, structurally diverse real-time data. By decoupling data extraction and loading from data transformation, a source-to-destination model is formed that keeps the system compatible with future technologies. This functionality is available in many systems, such as Apache Kafka (with KSQL), Talend, Hazelcast, Striim, and WSO2 Streaming Integrator (with Siddhi IO).
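As one possible illustration of such decoupling, here is a minimal sketch using the third-party kafka-python client. The broker address and the topic names raw-events and clean-events are assumptions for the example, and this is not the specific architecture of any of the products named above.

```python
import json
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

# Source and sink are just topic names: swapping either one does not
# touch the transformation logic in the loop below.
consumer = KafkaConsumer(
    "raw-events",                               # hypothetical source topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Continuous, record-at-a-time ETL instead of an hourly or nightly batch.
for message in consumer:
    event = message.value
    cleaned = {
        "sensor_id": event["sensor_id"],
        "reading": float(event["reading"]),
        "unit": event.get("unit", "C"),
    }
    producer.send("clean-events", cleaned)      # hypothetical destination topic
```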

Modern ETL functionality

As mentioned above, traditional systems typically put all data into databases and file systems for batch processing. The scenario below illustrates how traditional event sources (such as files and Change Data Capture (CDC)) are integrated with a new streaming integration platform. Let's consider a practical application scenario in a factory with the following characteristics and requirements. Traditional systems:

Put all production data into file systems and databases in different formats.

The data is processed hourly or daily.

Handle events from CDC.

Handle event-centric data received from new systems through HTTP.

Send processed events to multiple destinations.

Monitor current inventory and send notifications when new inventory is needed.

Use the inventory quantity to view the analysis results.

With traditional ETL tools:

The ETL logic is duplicated in the following cases:

For each file and database with a different structure.

When the number of destination or source endpoints increases.

Repetitive business logic is difficult to manage and scale.

The data calculations required for analysis and monitoring are duplicated.

How the streaming platform architecture solves the modern ETL problem:

(Figure: Workflow of a modern streaming platform)

The source endpoints (for example, files, CDC, HTTP) and the destination endpoints (such as Kafka, Elasticsearch, email) are decoupled from the processing:

The destination, source, and store APIs connect the pipeline to multiple data sources and sinks.

Even if the data structures at the source and destination differ, the data mapping layer (such as a data mapper) and streaming SQL (such as Query1) convert events received from multiple sources into a common stream definition (such as Stream1) for later processing.

Streaming platform architecture can connect traditional types of data sources (such as files and CDC) to widely used modern data sources (such as HTTP).

Events generated by both traditional and modern systems are received and analyzed using the same workflow.

Aggregations, such as Aggregation1, are calculated on a per-minute or per-hour basis for the desired attributes (see the sketch after this list).

The data is summarized on demand at any time, so there is no need to process and summarize the entire data set. Applications, visualization tools, and monitoring tools can access the aggregated data through the API provided.

One or more pieces of business logic (such as BusinessRule1) can be added and changed seamlessly.

Any logic can be added without changing existing components. In the example above, according to BusinessRule1, an email message is triggered when the level of urgency rises.
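A rough Python analogue of the Stream1 / Aggregation1 / BusinessRule1 pipeline described in this list; the field names, sample events, and urgency threshold are invented for illustration and are not taken from any particular streaming SQL product.

```python
from collections import defaultdict
from datetime import datetime

# Stream1: events already normalized by the data mapping layer.
stream1 = [
    {"item": "bolts", "quantity": 40, "urgency": 1, "time": datetime(2024, 5, 1, 10, 0, 12)},
    {"item": "bolts", "quantity": 25, "urgency": 2, "time": datetime(2024, 5, 1, 10, 0, 48)},
    {"item": "nuts",  "quantity": 10, "urgency": 4, "time": datetime(2024, 5, 1, 10, 1, 5)},
]

# Aggregation1: per-minute inventory totals, computed incrementally so the
# full data set never has to be reprocessed for monitoring or dashboards.
aggregation1 = defaultdict(int)
for event in stream1:
    minute = event["time"].replace(second=0, microsecond=0)
    aggregation1[(event["item"], minute)] += event["quantity"]

# BusinessRule1: trigger a notification when urgency exceeds a threshold.
def business_rule_1(event, urgency_threshold=3):
    if event["urgency"] > urgency_threshold:
        print(f"EMAIL: restock {event['item']} (urgency {event['urgency']})")

for event in stream1:
    business_rule_1(event)

print(dict(aggregation1))  # what a dashboard would read through the exposed API
```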

Through the above architecture, we can see how a streaming platform integrates traditional systems such as files and CDC with modern systems that use Kafka and HTTP for ETL data processing.

Was the above content helpful to you? If you want to learn more about the relevant knowledge or read more related articles, please follow the industry information channel. Thank you for your support.
