2025-01-16 Update From: SLTechnology News & Howtos
Shulou(Shulou.com)06/03 Report--
If you ask a company, "Are ETL tools important?", I think the answer must be yes. If you ask, "Do you have to use commercial ETL tools?", the answers may not be so uniform. Are ETL vendors good enough to survive in a changing data environment? ETL originated in the data warehouse world, and although it has a steep learning curve for developers, it provides many benefits, such as distributed processing, maintainability, and UI-based development rather than scripting.
Coupling is an old concept in programming, but it is still relatively new when applied to data processing. It is well known that ETL flows are tightly coupled, whereas today's data-flow pipelines are loosely coupled. That approach has drawbacks of its own, such as creating data swamps full of dark data.
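To make the contrast concrete, here is a minimal, hypothetical sketch (the function names are illustrative, not from any particular tool): a tightly coupled ETL job welds extract, transform, and load into one flow, while a loosely coupled pipeline composes independent stages that only agree on the shape of the data passing between them.

```python
# Hypothetical sketch: tightly coupled ETL vs. loosely coupled pipeline stages.

def etl_tightly_coupled(rows):
    # Extract, transform, and load are welded into one flow: changing the
    # transform logic means editing (and retesting) the whole job.
    cleaned = [r.strip().lower() for r in rows if r.strip()]
    return {"loaded": cleaned}

def make_pipeline(*stages):
    # Loosely coupled: each stage is an independent callable; stages can be
    # added, removed, or swapped without touching the others.
    def run(data):
        for stage in stages:
            data = stage(data)
        return data
    return run

strip_blanks = lambda rows: [r for r in rows if r.strip()]
normalize = lambda rows: [r.strip().lower() for r in rows]

pipeline = make_pipeline(strip_blanks, normalize)

rows = ["  Alice ", "", "BOB"]
assert etl_tightly_coupled(rows) == {"loaded": ["alice", "bob"]}
assert pipeline(rows) == ["alice", "bob"]
```

The loose version is also where the article's warning applies: nothing forces a stage to validate its input, which is exactly how dark data slips through.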
Standardized transformations can still follow the ETL process, but for newer concepts such as data self-service, the old processes and practices no longer apply. Standard ETL disciplines such as data quality, data security, metadata management, and data governance remain relevant to data-driven organizations.
The influence of the data lake
The arrival of big data has had an impact on the overall ETL process. ETL must transform itself and begin to support big data ecosystem technologies. Here is how ETL is affected by big data:
1. ETL is still tied to the DW environments in use. At present, the DW and the data lake complement each other, extending and improving the architecture, and this will probably continue, because new use cases are being built on the data lake.
2. Compared with using ETL tools/engines for processing and an RDBMS as storage for standard transformations, using a data lake to process and store data provides a single platform that is easier to use and cheaper.
3. The data lake extends analytics beyond standardized ETL: data can be ingested first and prepared later, which enables the self-service and ad hoc analysis that ETL does not offer.
4. The data lake is used for data landing/archiving at a scale that an RDBMS cannot handle as a storage solution. Therefore, how ETL tools are implemented needs to be rethought.
5. ETL is not suited to unstructured environments, but big data platforms can store semi-structured and unstructured data, so ETL must adapt in these directions as well.
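The "ingest first, prepare later" pattern in points 3 and 5 is often called schema-on-read. A minimal sketch, assuming a raw zone of JSON lines (the records and field names here are invented for illustration): semi-structured records land untouched, and a schema is applied only when the data is read.

```python
import json

# Hypothetical schema-on-read sketch: raw semi-structured records land in
# the lake unchanged; a schema is projected onto them only at read time.

RAW_ZONE = [
    '{"id": 1, "name": "Alice", "email": "a@example.com"}',
    '{"id": 2, "name": "Bob"}',                      # missing field is fine at landing time
    '{"id": 3, "name": "Eve", "extra": "ignored"}',  # unexpected field is fine too
]

def read_with_schema(raw_lines, schema):
    # Project the wanted fields, defaulting missing ones to None instead of
    # rejecting the record on ingest, as a schema-on-write ETL job would.
    for line in raw_lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in schema}

rows = list(read_with_schema(RAW_ZONE, ["id", "name", "email"]))
assert rows[1] == {"id": 2, "name": "Bob", "email": None}
```

Classic ETL enforces the schema before loading; here the same three records all land successfully and the schema decision is deferred to each reader.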
As the new architectures and technologies that emerged with big data gradually weaken the role of traditional ETL, ETL tools need to support these new technologies to remain valuable. They need to shift toward Hadoop and other open architectures, which also means the role of traditional ETL vendors is shrinking.
Here is what needs attention in reshaping ETL:
1. Degree of integration with open source tools
Proprietary technologies for data processing and storage are losing relevance, and ETL vendors should be able to support open source projects such as Spark, MapReduce, and HDFS.
2. Cloud-centric
ETL tools should support cloud-native architectures alongside their on-premises versions. New cloud-native ETL tools, such as SnapLogic, Informatica Cloud, and Talend Integration Cloud, provide integration platform as a service (iPaaS), which addresses many infrastructure challenges but still has some functional limitations. Compared with emerging tools, these ETL tools are not self-service; going forward, more attention should be paid to self-service and machine learning, so that these tools support ad hoc use and self-training as much as possible.
3. Converge with data preparation
ETL is a developer-centric data transformation tool, while data preparation is a self-service-focused one. As more and more users turn to the data lake for analysis, both ad hoc and standardized, standalone ETL risks becoming irrelevant as self-service grows more common. Merging the two yields a single data transformation tool category that can be used for any standard or ad hoc transformation.
4. AI / ML
AI/ML is an enabler that helps data engineers and developers get their work done easily and quickly through automated processes. It creates a bridge between AI algorithms and data workers: once a suggestion is accepted by a developer, the AI learns from it and adjusts its classification and transformation recommendations accordingly.
As a result, AI will continue to affect many parts of the data architecture through self-learning algorithms for data classification, data modeling, data storage, and so on, and ETL tools need to support AI-based solutions. Some vendors have begun to provide AI capabilities, but these are far from standard.
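The accept-and-learn loop described above can be sketched without any real ML. Below is a hypothetical, rule-based stand-in (all names and patterns are invented for illustration): columns are tagged by pattern, and a suggestion accepted by a developer is remembered so future runs apply it automatically.

```python
import re

# Hypothetical rule-based stand-in for ML-assisted data classification:
# columns are tagged by pattern, and an accepted suggestion is remembered,
# a crude form of the feedback loop an AI-assisted tool would provide.

PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?[\d\-\s]{7,}$"),
}

accepted_tags = {}  # column name -> tag confirmed by a developer

def suggest_tag(column, values):
    if column in accepted_tags:          # previously accepted suggestions win
        return accepted_tags[column]
    for tag, pattern in PATTERNS.items():
        if all(pattern.match(v) for v in values):
            return tag
    return "unclassified"

def accept(column, tag):
    accepted_tags[column] = tag          # feedback: developer confirms the tag

assert suggest_tag("contact", ["a@x.com", "b@y.org"]) == "email"
accept("contact", "email")
assert suggest_tag("contact", ["not-an-email"]) == "email"  # accepted tag sticks
```

A real implementation would replace the regex table with a trained classifier, but the workflow — suggest, accept, remember — is the same.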
5. Self-service design capability
ETL tools should support the creation of new self-service designs/processes, both by enhancing existing tools and by providing new tools for such designs. This will help enterprises create new self-service use cases.
6. Real-time support
Provide real-time support through open source technologies, by re-architecting existing tools or creating new ones for this purpose, so that the tools support all big data use cases in real time.
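The shift from nightly batch ETL to real time can be sketched as micro-batching, the model popularized by streaming engines. This is a minimal, hypothetical sketch (the event shape and transform are invented): events are transformed in small windows as they arrive instead of in one large batch.

```python
from collections import deque

# Hypothetical micro-batch sketch: instead of one nightly ETL batch, events
# are transformed in small windows as they arrive, approximating real time.

def transform(event):
    return {"user": event["user"].lower(), "amount": round(event["amount"], 2)}

def micro_batches(events, batch_size):
    buffer = deque()
    for event in events:
        buffer.append(event)
        if len(buffer) == batch_size:
            yield [transform(e) for e in buffer]
            buffer.clear()
    if buffer:                       # flush the final partial window
        yield [transform(e) for e in buffer]

events = [{"user": "A", "amount": 1.25},
          {"user": "B", "amount": 2.5},
          {"user": "C", "amount": 3.0}]
batches = list(micro_batches(events, batch_size=2))
assert len(batches) == 2
assert batches[1] == [{"user": "c", "amount": 3.0}]
```

Production systems such as Spark Structured Streaming apply the same idea with fault tolerance and windowing semantics layered on top.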
7. Big data quality
There is still no ETL tool that handles data quality well at big data scale. Few can profile big data processes clearly, and there is no rule-based engine to enforce quality at that scale. ETL vendors should focus on this key area so that they can compete with new platform-based tools on Hadoop.
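The rule-based engine the article calls for can be sketched in a few lines. This is a hypothetical minimal version (the rule names and record fields are invented): each rule is a named predicate applied per record, producing a pass/fail report rather than silently loading bad data.

```python
# Hypothetical rule-based data quality engine: each rule is a named predicate
# applied per record, producing a pass/fail report instead of silently
# loading bad data.

RULES = {
    "id_present":    lambda r: r.get("id") is not None,
    "amount_nonneg": lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    "country_known": lambda r: r.get("country") in {"US", "CN", "DE"},
}

def quality_report(records):
    report = {name: {"passed": 0, "failed": 0} for name in RULES}
    for record in records:
        for name, rule in RULES.items():
            key = "passed" if rule(record) else "failed"
            report[name][key] += 1
    return report

records = [
    {"id": 1, "amount": 10.0, "country": "US"},
    {"id": None, "amount": -5, "country": "XX"},
]
report = quality_report(records)
assert report["id_present"] == {"passed": 1, "failed": 1}
assert report["amount_nonneg"]["failed"] == 1
```

The hard part at big data scale is not the rule evaluation itself but running it distributed across the lake, which is where vendors would need to invest.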
8. Match-and-merge support for big data
Match and merge sits in a gray area between MDM and ETL, and support for matching and merging data fetched from the data lake is required. This is another key area that vendors could readily provide by using ML techniques.
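A minimal match-and-merge sketch, using only string similarity from the standard library rather than real ML (the record fields, threshold, and "golden source wins" policy are illustrative assumptions): incoming records are matched against a golden source on fuzzy name similarity, then merged without overwriting trusted fields.

```python
from difflib import SequenceMatcher

# Hypothetical match-and-merge sketch: records from two sources are matched
# on fuzzy name similarity, then merged with the golden source taking priority.

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_and_merge(golden, incoming, threshold=0.8):
    merged = [dict(g) for g in golden]
    for record in incoming:
        best = max(merged, key=lambda g: similarity(g["name"], record["name"]))
        if similarity(best["name"], record["name"]) >= threshold:
            for field, value in record.items():   # fill gaps, never overwrite
                best.setdefault(field, value)
        else:
            merged.append(dict(record))           # no match: new master record
    return merged

golden = [{"name": "Acme Corp", "country": "US"}]
incoming = [{"name": "ACME Corp.", "phone": "555-0100"},
            {"name": "Globex", "country": "DE"}]
result = match_and_merge(golden, incoming)
assert len(result) == 2
assert result[0]["phone"] == "555-0100" and result[0]["country"] == "US"
```

An ML-based matcher would replace `similarity` with a learned model over many fields, but the match-then-merge survivorship logic is the same.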
9. Unified metadata catalog support
In the big data era, enterprises need access to a catalog of all their data. Because ETL tools are already repositories of metadata, they can support requirements such as automatically populating the catalog, automatically classifying/tagging data, and enabling search and group/expert ratings.
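Auto-populating a catalog can be sketched as inferring column metadata from sample data into a searchable structure. A hypothetical minimal version (table names, fields, and the type rules are invented for illustration):

```python
# Hypothetical auto-populated catalog sketch: column types are inferred from
# sample rows and stored in a searchable catalog structure.

def infer_type(values):
    if all(isinstance(v, bool) for v in values):
        return "boolean"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numeric"
    return "string"

def catalog_table(name, rows):
    columns = rows[0].keys()
    return {
        "table": name,
        "columns": {c: infer_type([r[c] for r in rows]) for c in columns},
    }

def search(catalog, term):
    # Naive catalog search over table and column names.
    return [entry["table"] for entry in catalog
            if term in entry["table"] or term in entry["columns"]]

catalog = [catalog_table("orders", [{"id": 1, "total": 9.5, "note": "rush"}])]
assert catalog[0]["columns"] == {"id": "numeric", "total": "numeric", "note": "string"}
assert search(catalog, "total") == ["orders"]
```

A production catalog would add lineage, tags, and ratings on top of the same basic structure, which is why ETL tools, already holding the metadata, are well placed to provide it.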
10. Reusability-centered data lake design
The requirement that ETL tools be designed around reusable components has been around for a long time, and it is time to take it seriously.
Conclusion
With the arrival of the big data era, enterprises pay more attention to mastering their data and hope to get better insights at lower cost. ETL tools need to be reworked to meet the new requirements, and some vendors may gradually fade out of the ETL world, but ETL will still be offered as a basic tool for data transformation activities. ETL providers such as Talend and Informatica have recognized these challenges and created new products specifically for big data and the cloud.