How to Analyze the Selection of a Data Lake Storage Architecture

2025-01-17 Update. From: SLTechnology News&Howtos

Shulou (Shulou.com), 05/31

This article looks at how to approach the selection of a data lake storage architecture. It covers the topic in some detail and is intended as a reference for interested readers.

1. The data lake is a trend

Put simply, a data lake means that, from the enterprise's point of view, all data is stored together in one place, and the raw data is computed on and analyzed mainly with BI and AI tools. The data is not only structured and semi-structured; it also includes unstructured material such as audio and video.

Why transform toward a data lake, and what benefits does it bring?

First, it breaks down data silos. We do not need to decide up front how the raw data will be processed or analyzed, or even whether it will solve a major business problem. Putting it all together first removes the silos and may create opportunities for later business development, evolution, and computation.

Second, a unified, centralized data set can support many kinds of computation.

Third, elasticity. The data lake itself is elastic, and so is the computation it supports. On the cloud, this elasticity leaves considerable room to scale costs, which makes it possible to optimize both storage and compute costs.

Fourth, management. With the data in one place, we can provide unified, centralized management and control.

Readers familiar with the Hadoop ecosystem will remember the large, complex ecosystem diagram that was often shown in the past, with many components and intricate relationships between them. An architecture based on the data lake can be much simpler. As shown in the architecture figure, the data lake itself sits at the bottom. On top of this storage layer, a unified metadata service handles the creation and management of the data lake; data management and development are organized around the lake, which is also integrated with the various data sources. None of this is the goal in itself, though. The most important function is computation.

Computation on the data lake simply means that there are a variety of open source BI and AI engines, each possibly running on its own cluster, and each handles its own computing scenarios against the data lake. These in turn serve the applications at the top, such as big data dashboards, data reports, data mining, and machine learning.

2. Lake storage/acceleration: the challenges are great

In a big data lake architecture, storage faces major challenges.

First, and most significant, is data volume. By definition a data lake puts all the data together, so the scale is very large and can grow to the PB or even EB level.

Second, the scale of the file namespace. From the storage system's point of view, the set of files can be very large, and the directory tree is either very deep or very flat. Flat means that a single directory may hold millions of files, forming one very large directory.

Third, cost. With so much data collected and all the raw data kept together, how do we optimize cost?

Another challenge is that the essence of the data lake architecture is the separation of storage and compute. This is a specialized division of labor: storage handles storage and compute handles computing, which greatly improves R&D efficiency. After the separation, however, meeting the throughput and performance requirements of computation becomes a major challenge in its own right.

In addition, a data lake solution has to serve a rich set of computing scenarios in complex computing environments. On the big data side, we need to support analytical, interactive, and real-time computing. On the AI side, there are various engines of its own for training.
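To make the large-directory challenge concrete, here is a minimal sketch of why a flat directory with many files can be expensive on a plain key-value object namespace. It is a toy model and not the behavior of any specific product: the classes `FlatObjectStore` and `HierarchicalNamespace`, the `warehouse/tmp_job/` paths, and the operation counts are all assumptions made for illustration. Real systems differ, and some object stores offer hierarchical namespaces or faster renames.

```python
class FlatObjectStore:
    """Toy object store: one flat mapping from full key to bytes."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def list_prefix(self, prefix):
        # There are no real directories: listing scans keys for a prefix.
        return sorted(k for k in self.objects if k.startswith(prefix))

    def rename_prefix(self, old, new):
        """'Rename a directory' by rewriting every key under it.

        Returns the number of per-object operations performed.
        """
        keys = self.list_prefix(old)
        for key in keys:
            self.objects[new + key[len(old):]] = self.objects.pop(key)
        return len(keys)


class HierarchicalNamespace:
    """Toy tree namespace: directories are nested dicts."""

    def __init__(self):
        self.root = {}

    def put(self, path, data):
        *dirs, name = path.strip("/").split("/")
        node = self.root
        for d in dirs:
            node = node.setdefault(d, {})
        node[name] = data

    def rename_dir(self, parent_path, old_name, new_name):
        """Rename a directory by moving one entry in its parent.

        Returns the number of metadata operations performed.
        """
        node = self.root
        for d in [p for p in parent_path.strip("/").split("/") if p]:
            node = node[d]
        node[new_name] = node.pop(old_name)
        return 1


# Hypothetical job output: 10,000 files in one flat directory.
flat = FlatObjectStore()
tree = HierarchicalNamespace()
for i in range(10_000):
    path = f"warehouse/tmp_job/part-{i:05d}"
    flat.put(path, b"")
    tree.put(path, b"")

flat_ops = flat.rename_prefix("warehouse/tmp_job/", "warehouse/final/")
tree_ops = tree.rename_dir("warehouse", "tmp_job", "final")
print(flat_ops, tree_ops)  # 10000 1
```

In this model the flat store touches every object under the prefix, while the tree moves a single directory entry. That gap is one reason the ability to operate on large-directory metadata is worth checking when comparing storage options.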

The computing environments are varied as well: EMR, self-built clusters on ECS, cloud native, and hybrid cloud. The question is how to provide a unified, centralized storage solution that serves such a rich set of computing scenarios and environments.

Suppose we can overcome the data volume challenge, serve the various computing environments, provide cache acceleration, and deliver the required storage performance. Once the architect has decided to migrate, what are the implementation challenges? A large amount of data has to be migrated and then verified for correctness. There may also be tens of thousands of jobs to migrate, for example Hive warehouse and Spark jobs, and their results have to be compared after the migration. Finally, we may already have a mature governance and operations system; under the new architecture, how do we keep it supported with as few changes as possible? These are the implementation challenges.

3. A checklist for the ideal option

From the storage and acceleration perspective, the data lake architecture poses the challenges described above. Here is a summary of what the ideal selection looks like and which factors to consider.

First, object-based storage with large-scale capacity.
Second, the ability to handle metadata operations on large directories.
Third, cache acceleration with flexible policies.
Fourth, optimizations that integrate with the compute engines.
Fifth, support for new forms of storage in the data lake.
Sixth, archiving, compression, and secure storage.
Seventh, comprehensive support for the big data and AI ecosystem.
Eighth, strong migration capability, ideally seamless.

This is the checklist that an ideal data lake storage and acceleration solution would best satisfy. Architects considering an upgrade to a data lake architecture can use it to compare candidate solutions.
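One simple way to use the checklist in a selection exercise is to record which items each candidate covers and list the gaps. The sketch below is only an illustration of that bookkeeping: the eight items are paraphrased from the checklist above, while the `evaluate` helper, the candidate names `candidate-a` and `candidate-b`, and their capabilities are hypothetical. Equal weighting is also an assumption; a real evaluation would likely weight items by business priority.

```python
# The eight checklist items, paraphrased from the article.
CHECKLIST = [
    "object-based storage with large-scale capacity",
    "metadata operations on large directories",
    "cache acceleration with flexible policies",
    "optimizations that integrate with compute engines",
    "new forms of storage in the data lake",
    "archiving, compression, and secure storage",
    "big data and AI ecosystem support",
    "strong (ideally seamless) migration capability",
]


def evaluate(name, supported):
    """Return (name, score, missing items) for one candidate solution."""
    unknown = set(supported) - set(CHECKLIST)
    if unknown:
        raise ValueError(f"not on the checklist: {sorted(unknown)}")
    missing = [item for item in CHECKLIST if item not in supported]
    return name, len(CHECKLIST) - len(missing), missing


# Hypothetical candidates, invented purely for illustration.
candidates = {
    "candidate-a": set(CHECKLIST) - {CHECKLIST[1], CHECKLIST[7]},
    "candidate-b": set(CHECKLIST[:4]),
}

# Rank candidates by how many checklist items they cover.
results = sorted(
    (evaluate(name, supported) for name, supported in candidates.items()),
    key=lambda result: -result[1],
)
for name, score, missing in results:
    print(f"{name}: {score}/{len(CHECKLIST)}; missing: {missing}")
```

Listing the missing items alongside the score keeps the comparison honest: two candidates with the same score can still differ in which gaps they leave open.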
That concludes this look at how to analyze the selection of a data lake storage architecture. I hope the content above is of some help. If you found the article useful, feel free to share it so that more people can see it.

© 2024 shulou.com SLNews company. All rights reserved.
