Why Big Data Needs a Data Lake
This article analyzes why big data needs a data lake. The discussion is detailed and meant to be easy to follow; readers interested in the question can work through it step by step, and I hope it proves helpful.
Since the concept of the "data lake" was first proposed in 2011, the industry has developed a wide range of understandings and definitions of it.
"the data lake is a platform for centralized storage of massive, multi-source, multi-type data, and rapid processing and analysis of the data. it is essentially an advanced enterprise data architecture." This is a clear and complete definition of the data lake. However, we can not see the importance of the data lake to the enterprise from the definition. The following illustrates the value of the data lake to the enterprise from the point of view of the development of the data lake architecture, the importance of the data platform to the enterprise, and Huawei's data lake scheme.
I. The development of the data lake architecture
The data lake architecture has evolved continuously, and in many scenarios the data lake is easily confused with the data warehouse. Data lake solutions were originally designed to address the bulk, high cost, and long analysis cycles of data warehouses, but there are clear differences between the two. At the same time, as cloud computing, big data, and artificial intelligence have developed, the data lake has steadily integrated with them, and its architecture keeps improving.
The difference between a data lake and a data warehouse
There are many similarities between the data lake and the data warehouse, and the two are easily confused, but two differences matter most:
Type of data stored: a data warehouse stores modeled, structured data; a data lake stores large volumes of raw data in its native format, including structured, semi-structured, and unstructured data, with no structure or requirements defined before the data is needed.
Data processing model: before data can be loaded into a data warehouse, its structure must be defined, an approach known as schema-on-write. In a data lake, the raw data is simply loaded, and a structure is applied only when the data is about to be used, an approach known as schema-on-read. These are two fundamentally different processing methods. Because a data lake defines the model only at the point of use, the data model is far more flexible to define and can efficiently serve the analysis needs of many different upper-layer business applications.
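To make the contrast concrete, here is a minimal PySpark sketch of the two approaches; the paths and column names are illustrative placeholders, not part of any product discussed in this article.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# Schema-on-write (warehouse style): the structure is declared before
# loading; Spark infers nothing and returns null for non-matching fields.
warehouse_schema = StructType([
    StructField("user_id", LongType(), False),
    StructField("event", StringType(), False),
])
curated = spark.read.schema(warehouse_schema).json("/warehouse/events/")

# Schema-on-read (lake style): raw files are stored as-is, and Spark
# infers a structure only at the moment this consumer reads the data.
raw = spark.read.json("/lake/raw/events/")
raw.printSchema()  # structure discovered at read time, per consumer
```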
Integration and development of the data lake with new technologies
1. Integration of the data lake with big data technology
Hadoop technology has matured over more than a decade, and the data lake, as the most important data platform of the second data plane, has become ever more tightly integrated with it; the two complement each other. For example, HBase lets the data lake store huge volumes of data; Spark lets it batch-analyze massive data sets faster; and Storm, Flink, and NiFi let it ingest and process IoT data in real time. Hadoop itself focuses on data processing and applications and pays less attention to the underlying storage layer. Traditional Hadoop, for instance, protects data with three-way replication, so storage utilization is only 33% and the cost of keeping data is high. Meanwhile, customers demand ever higher reliability for the data Hadoop hosts, and the need for data protection (backup, disaster recovery, and so on) keeps growing. Hadoop 3.x began the trend toward separating storage from compute, but that alone cannot fully meet users' needs; the data lake must continue to develop in areas such as data storage and data governance.
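The replication numbers above are easy to verify. Here is a back-of-the-envelope Python comparison of classic three-way replication against the Reed-Solomon RS-6-3 erasure-coding policy that ships with Hadoop 3.x; the 100 TB figure is purely illustrative.

```python
logical_tb = 100  # logical data to store, in TB (illustrative figure)

# Three-way replication: every block is stored three times.
replica_raw = logical_tb * 3
replica_util = logical_tb / replica_raw          # 1/3, i.e. 33%

# RS-6-3 erasure coding: 6 data blocks protected by 3 parity blocks.
ec_raw = logical_tb * (6 + 3) / 6
ec_util = logical_tb / ec_raw                    # 6/9, i.e. 67%

print(f"3-replica: {replica_raw} TB raw, {replica_util:.0%} utilization")
print(f"RS-6-3 EC: {ec_raw:.0f} TB raw, {ec_util:.0%} utilization")
```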
2. Integration of the data lake with cloud computing technology
Cloud computing uses virtualization, multi-tenancy, and related techniques to maximize the use of basic resources such as servers, networks, and storage, reducing IT infrastructure costs and delivering significant savings; at the same time, it enables rapid provisioning of hosts, storage, and other resources, which brings further management convenience. Under the traditional construction model, big data systems are deployed on physical machines with compute and storage converged. Faced with the elastic compute demands of many business types and wide variation in compute and storage requirements, that converged deployment model is neither flexible enough nor able to offer the best price-performance ratio. Cloud technology addresses this: big data compute is deployed on the cloud, storage is separated from compute, and each can scale independently and elastically. The data lake architecture is already well implemented and widely used on public clouds. For example, Microsoft Azure launched its Data Lake cloud services in 2016, and on Amazon AWS a data lake can be assembled quickly from basic cloud services such as S3 and Glue. Google's internal system for managing and searching massive data sets also points the way for data lake data management (see "Managing Google's data lake: an overview of the GOODS system," a paper on searching and managing massive data sets inside Google).
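As a hedged sketch of the AWS pattern just mentioned, the following registers raw S3 data in the Glue catalog so that downstream engines can query it. The bucket name, IAM role ARN, and crawler name are placeholders, not real resources.

```python
import boto3

glue = boto3.client("glue")

# Create a catalog database for the raw zone of the lake.
glue.create_database(DatabaseInput={"Name": "raw_lake"})

# A crawler scans the S3 prefix and infers table schemas into the catalog.
glue.create_crawler(
    Name="raw-events-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="raw_lake",
    Targets={"S3Targets": [{"Path": "s3://example-lake/raw/events/"}]},
)
glue.start_crawler(Name="raw-events-crawler")
```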
3. Integration of the data lake with artificial intelligence technology
In recent years artificial intelligence has developed rapidly. Training and inference must handle large, and often multiple, data sets at once, typically unstructured data such as video, images, and text drawn from many industries, organizations, and projects. Collecting, storing, cleaning, transforming, and extracting features from these data is a long and complex undertaking. The data lake must give AI applications a platform for fast data collection, management, and analysis, along with high bandwidth, access to massive numbers of small files, multi-protocol interoperability, and data sharing, all of which can greatly accelerate data mining and deep learning.
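One of the pain points named above, massive small-file access, can be illustrated with a short sketch: per-request latency dominates when files are small, so fetching them in parallel (or caching them, as a lake platform would) matters. The bucket and prefix below are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")  # boto3 clients are safe to share across threads
BUCKET = "example-lake"

def fetch(key: str) -> bytes:
    """Download one small object (an image, a text snippet, ...)."""
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

resp = s3.list_objects_v2(Bucket=BUCKET, Prefix="raw/images/")
keys = [obj["Key"] for obj in resp.get("Contents", [])]

# Parallel fetches hide per-request latency before feeding a training loop.
with ThreadPoolExecutor(max_workers=16) as pool:
    samples = list(pool.map(fetch, keys))
```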
II. The importance of the data lake to the enterprise
Many people say the data lake is "old wine in a new bottle": a patchwork of concepts with no real technical innovation. In fact, the term "data lake" itself is not what matters; what matters is whether it can genuinely help enterprises transform technologically and cope with the new problems that keep emerging in a fast-moving business environment.
The core value of the data lake is that it brings a data-platform operating mechanism to the enterprise, and many enterprises have not yet realized what such a platform can do for them. Today's business environment, driven by relentless technological change, is shifting dramatically: traditional industries are being disrupted by Internet companies, putting many firms under severe survival pressure. Internet companies can keep disrupting traditional industries not only because of their business models, but because many of them adopt platform strategies, folding the latest technology and competitive capability into a platform that empowers their operations, lets the business grow by leaps and bounds, and squeezes the space of other enterprises across industry boundaries. Traditional enterprises urgently need to change; like Internet companies, they need the sharp tools of informatization, digitization, and new technology, assembled into a platform that empowers their people and business to respond quickly to challenges.
III. The Huawei data lake solution
The Huawei data lake solution keeps pace with the times, helping enterprises use the data lake, the sharp tool of the data platform, to drive rapid business development. Built on an advanced cloud-based system architecture, it focuses on the hard problems offline enterprises meet in digital transformation: data that cannot drive the business, high costs, and wasted infrastructure resources.
[Figure: basic architecture of the Huawei data lake solution]
The following introduces the Huawei data lake solution in detail along four dimensions: centralized data storage and sharing, data governance, the compute-side cache, and rapid data analysis.
Centralized data storage and sharing
Many enterprises overlook the value of accumulated data. Data must be collected and stored continuously from every part of the enterprise before value can be mined from it to guide business decisions and drive the company's growth. The Huawei data lake solution achieves centralized data storage and sharing on the basis of Huawei's FusionInsight big data solution and Huawei's massive object storage architecture, delivering reliable storage and efficient analysis at the scale of trillions of objects.
A single pooled data storage resource can effectively eliminate data silos within the enterprise, provide a unified namespace with multi-protocol access to the same data, enable efficient sharing of data resources, and reduce data movement. For example, many automobile manufacturers are researching autonomous driving. Files generated by sensors, radars, and other IoT devices on vehicles can be imported offline in batches or ingested over a high-speed network, analyzed and processed by Hadoop (HDFS), passed into an HPC cluster (NFS) for simulation, and read by a GPU cluster (S3) for training. Throughout the whole process the data never needs to be copied or moved, so sharing is efficient.
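A hedged illustration of the multi-protocol idea: the same file is read through an NFS mount, the S3 API, and HDFS without ever being copied. Every endpoint and path below is hypothetical, standing in for whatever a unified-namespace storage system actually exposes.

```python
import boto3
import pyarrow.fs as pafs

# 1. POSIX/NFS view, e.g. for the HPC simulation cluster.
with open("/mnt/lake/vehicles/run42/lidar.bin", "rb") as f:
    nfs_bytes = f.read()

# 2. S3 object view, e.g. for the GPU training cluster.
s3 = boto3.client("s3", endpoint_url="https://lake.example.com")  # placeholder
s3_bytes = s3.get_object(Bucket="vehicles", Key="run42/lidar.bin")["Body"].read()

# 3. HDFS view, e.g. for Hadoop batch analysis.
hdfs = pafs.HadoopFileSystem(host="namenode.example.com", port=8020)
with hdfs.open_input_file("/vehicles/run42/lidar.bin") as f:
    hdfs_bytes = f.read()

assert nfs_bytes == s3_bytes == hdfs_bytes  # one copy of the data, three views
```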
Centralized storage and sharing in effect pools storage resources and separates compute from data. Many people still cannot accept a big data architecture that separates compute from data, assuming that separation must degrade performance. In fact, separation greatly reduces storage cost, improves the utilization of compute resources, and makes both compute and storage clusters more flexible. Separation is not appropriate in every case, however. Based on our experience across projects in government, telecom operators, finance, enterprises, and other industries, the following situations suit separation:
1. Storage and compute resource utilization grow seriously out of balance as data volume increases. In user-retention analysis within user behavior analytics, for example, the volume of stored data keeps growing while the compute requirement stays roughly constant.
2. Business departments request compute or storage resources separately from the platform department; a separated architecture lets resources be allocated more flexibly.
In addition, suitable stages can be identified along the data life cycle: the data cleaning, processing and integration, and archive/backup scenarios are well suited to storage-compute separation.
Note: storage-compute separation usually goes hand in hand with big data as a service, which requires managing resources from a cloud perspective and scheduling them flexibly.
Data governance
Data must not only be kept but also managed well; otherwise the data lake degenerates into a data swamp that wastes IT resources. Whether a platform-style data lake architecture can drive business development depends heavily on data governance. The data an enterprise collects, internally or from other industries, comes in many types and formats and is mostly stored raw. The enterprise must continuously integrate and process this raw data, organizing it by business unit, scenario, and need into clean data that is easy to analyze, so that as many people as possible can access analysis-ready data. Data governance spans a long list of complex tasks; here we focus on metadata management.
The Huawei data lake solution provides a centralized metadata management system for the enterprise's massive data sets, offering a global catalog of data resources, complete metadata descriptions, and data lineage, so that employees can quickly find and understand data and analysis is better supported. The metadata manager extracts metadata from data services asynchronously, minimizing impact on the original systems.
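As a toy model of what such a catalog records (not Huawei's actual metadata service), each data set can carry a description, an owner, and lineage links back to its sources:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    name: str
    location: str                                  # where the data lives
    schema: dict                                   # column name -> type
    owner: str
    upstream: list = field(default_factory=list)   # lineage: source data sets

catalog: dict[str, DatasetEntry] = {}

catalog["raw.events"] = DatasetEntry(
    name="raw.events", location="s3://example-lake/raw/events/",
    schema={"user_id": "long", "event": "string"}, owner="ingest-team")

catalog["clean.sessions"] = DatasetEntry(
    name="clean.sessions", location="s3://example-lake/clean/sessions/",
    schema={"user_id": "long", "session_len": "double"}, owner="dw-team",
    upstream=["raw.events"])                       # derived from raw.events

print(catalog["clean.sessions"].upstream)          # ['raw.events']
```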
Compute-side cache
Separating compute from data inevitably adds some network I/O overhead, and a cache on the compute side can sharply reduce the number of network I/O round trips. At the same time, 10-gigabit (and faster) networks are now commonplace, so the network's impact on compute is already limited. The compute-side cache uses a variety of algorithms to keep data close to the compute nodes, which in many scenarios lets the separated architecture match or even exceed the performance of the converged one.
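A minimal read-through LRU cache sketch shows the mechanism: a block crosses the network once on first access, and repeated reads are served locally. `remote_read` is a hypothetical stand-in for an S3 or HDFS fetch.

```python
from collections import OrderedDict

class BlockCache:
    """Compute-side cache: keep the most recently used blocks in memory."""

    def __init__(self, capacity_blocks, remote_read):
        self.capacity = capacity_blocks
        self.remote_read = remote_read            # function: block_id -> bytes
        self.blocks = OrderedDict()

    def get(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)     # mark as recently used
            return self.blocks[block_id]          # hit: no network I/O at all
        data = self.remote_read(block_id)         # miss: one remote fetch
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)       # evict least recently used
        return data
```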
Rapid data analysis
Much of the work described above ultimately serves one goal: faster data analysis. Rapid analysis requires a choice of analysis engines. Built on Huawei's FusionInsight big data solution, the platform offers several, including Spark, HBase, ES, and LibrA (a distributed relational database accessed through SQL). Cleaned and integrated data can be analyzed from LibrA, but the engines can also access the data in massive object storage directly, with no extraction step and less data conversion; high-concurrency reads improve the efficiency of real-time analysis, and self-service exploratory analysis is supported as well.
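A short PySpark sketch of the "no extraction" point: the query runs against object storage in place, with no load step into a separate store first. The s3a path is a placeholder, and LibrA itself is not shown.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-query").getOrCreate()

# Query Parquet files directly where they live in object storage.
events = spark.read.parquet("s3a://example-lake/clean/events/")
events.createOrReplaceTempView("events")

daily = spark.sql("""
    SELECT event_date, COUNT(*) AS n
    FROM events
    GROUP BY event_date
    ORDER BY event_date
""")
daily.show()
```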
The Huawei data lake solution provides complete data architecture support, building a one-stop data processing experience for enterprises, and is already in use across many industries and customers. For example, it underpins the "one cloud, one lake, one platform" architecture of Safe City projects, giving public-security customers a data governance architecture that is physically distributed (data scattered across prefectures, cities, districts, and counties) yet logically unified.
That is the case for why big data needs the data lake. I hope the discussion above has been helpful.