2025-01-30 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article covers "how to understand Vineyard's entry into CNCF Sandbox": the problems Vineyard sets out to solve, its core functions, and how to get started with it.
Project introduction
In existing big data analysis scenarios, the subtasks of an end-to-end job usually share intermediate data through a distributed file system or object store such as HDFS, S3 or OSS. This approach has many problems in terms of both runtime efficiency and development efficiency. Take the workflow of a risk control job shown in the following figure as an example:
To share intermediate data between tasks in the workflow, the earlier task writes its result to the file system, and once it has finished, the later task reads the file as input. This brings additional serialization and deserialization, memory copies, and network and IO overhead. We have observed that in more than 60% of historical jobs, this overhead accounted for more than 40% of the execution time.
In production, a new system (such as a distributed graph computing engine) is often introduced to solve problems of one particular paradigm efficiently. Such a system is often hard to connect seamlessly with the other systems in the workflow, and integrating it requires a lot of repetitive work on IO, data format conversion and adaptation.
Sharing data through an external file system also interrupts the workflow: usually the next task can start reading and computing only after the previous task has written all of its results, so pipelined parallelism across tasks cannot be applied.
When sharing intermediate data, especially in cloud native environments, existing distributed file systems do not handle the locality of distributed data well. This wastes network bandwidth and reduces end-to-end execution efficiency.
To solve these problems in existing big data analysis workflows, we designed and implemented Vineyard, a distributed in-memory data sharing engine.
Vineyard addresses these issues from the following perspectives:
To make data sharing between the tasks of an end-to-end workflow more efficient, Vineyard supports zero-copy data sharing between systems through memory mapping, which avoids the additional IO overhead.
To simplify the adaptation and development work needed to connect a new computing engine to existing systems, Vineyard provides out-of-the-box abstractions for common data types such as Tensor, DataFrame and Graph, so that sharing intermediate results between different computing engines no longer requires additional serialization and deserialization. Vineyard also implements reusable components such as IO, data migration and snapshots as plug-ins, which can be registered with a computing engine on demand. This reduces development cost that is unrelated to the computing engine itself.
Vineyard provides a series of operators for more efficient and flexible data sharing. For example, the Pipeline operator implements pipelined parallelism across tasks, so that a subsequent task can start computing on the output of the preceding task as it is produced, which improves overall end-to-end efficiency.
Vineyard integrates with Kubernetes. Through a Scheduler Plugin, task scheduling becomes aware of the locality of the data a task needs. In Kubernetes, a task's Pod is scheduled, whenever possible, onto the machine that holds the input data the Pod needs, which reduces the network overhead of data migration and improves end-to-end performance.
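The idea behind pipelining across tasks can be illustrated with a minimal single-process sketch. It is not Vineyard's Pipeline operator: the function names `upstream` and `downstream` are hypothetical, and a Python generator stands in for the stream of chunks. It only shows that the downstream task can consume each chunk as soon as it is produced, instead of waiting for the whole result to be written.

```python
def upstream(n_chunks, log):
    """Hypothetical upstream task: yields each chunk as soon as it is ready."""
    for i in range(n_chunks):
        chunk = [i * 10 + j for j in range(3)]
        log.append(f"produce {i}")
        yield chunk


def downstream(chunks, log):
    """Hypothetical downstream task: processes chunks as they arrive."""
    total = 0
    for i, chunk in enumerate(chunks):
        log.append(f"consume {i}")
        total += sum(chunk)
    return total


def run_pipeline(n_chunks):
    """Wire the two tasks together and record the order of events."""
    log = []
    total = downstream(upstream(n_chunks, log), log)
    return total, log
```

In the recorded log, every "consume" entry directly follows the matching "produce" entry, whereas file-based sharing would force all "produce" steps to finish first.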
In a preliminary comparative experiment against sharing intermediate data through HDFS, Vineyard greatly reduced the extra overhead introduced by exchanging intermediate results in the evaluated tasks, and improved the end-to-end time of the whole workflow by a factor of 1.34.
Core functions
The following introduces the core functions of Vineyard from two aspects: the design and implementation of Vineyard's core, and how Vineyard supports big data analysis tasks in cloud native environments.
1. Distributed in-memory data sharing
Vineyard represents data in memory as Objects. An Object can be Local or Global. Take the distributed execution engines Mars and Dask as examples: a DataFrame is often divided into many Chunks to take advantage of the computing power of multiple machines, and each machine holds several Chunks. These Chunks are LocalObjects in Vineyard. Together they form a global view, namely a GlobalDataFrame. This GlobalDataFrame can be shared directly with other computing engines, such as GraphScope, as the input for graph data. With these data type abstractions, different computing engines on Vineyard can seamlessly share intermediate results, using the output of one task directly as the input of the next.
More specifically, how is a specific type of Object represented in Vineyard so that it can be easily adapted to different computing engines? This comes from the flexibility Vineyard provides in the representation of Objects. In Vineyard, an Object consists of two parts: Metadata and a set of Blobs. The actual data is stored in the Blobs, and the Metadata is used to interpret the semantics of these Blobs. For a Tensor, for example, the Blob is a contiguous piece of memory that stores all the elements of the Tensor, while the Metadata records the element type and shape of the Tensor, along with attributes such as row-major or column-major order. In Python, the Object can be interpreted as a NumPy ndarray, while in C++ it can be interpreted as a tensor in xtensor. Sharing this Tensor between the SDKs of these two programming languages incurs no additional overhead for IO, copying, serialization/deserialization, or type conversion.
At the same time, Metadata in Vineyard can be nested, which makes it easy to describe arbitrarily complex data types as Objects in Vineyard without limiting the expressive power of the computing engine. Taking GlobalDataFrame as an example, the following figure shows the structure of its Metadata.
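The split between Metadata and Blob described above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not Vineyard's SDK: the helpers `put_tensor` and `get_tensor` and the metadata keys are hypothetical, and `multiprocessing.shared_memory` stands in for Vineyard's shared memory. The reader attaches to the blob by name and views it through the metadata without copying the elements.

```python
import math
import struct
from multiprocessing import shared_memory


def put_tensor(values, shape):
    """Write float64 values into a shared-memory blob; return (metadata, handle).

    The metadata layout here is hypothetical and only mirrors the idea that
    metadata describes how to interpret the raw bytes of the blob.
    """
    assert len(values) == math.prod(shape)
    nbytes = len(values) * struct.calcsize("d")
    blob = shared_memory.SharedMemory(create=True, size=nbytes)
    struct.pack_into(f"{len(values)}d", blob.buf, 0, *values)
    metadata = {
        "typename": "tensor",
        "dtype": "d",           # struct/memoryview code for float64
        "shape": list(shape),
        "order": "row-major",
        "blob": blob.name,      # name used by readers to attach
        "nbytes": nbytes,
    }
    return metadata, blob


def get_tensor(metadata):
    """Attach to the blob named in the metadata and return (view, handle).

    The view is a multi-dimensional memoryview over the shared bytes, so no
    element is copied or deserialized.
    """
    blob = shared_memory.SharedMemory(name=metadata["blob"])
    raw = blob.buf[: metadata["nbytes"]]  # the OS may round the size up
    view = raw.cast(metadata["dtype"], shape=metadata["shape"])
    return view, blob
```

A caller should release the view before closing the handle (`view.release()`, then `close()` on both handles and `unlink()` on the creator's handle), because a shared-memory segment cannot be closed while views onto it are still exported.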
2. Co-scheduling of data and tasks in cloud native environments
For a real deployment of a big data analysis pipeline, data sharing between tasks alone is far from enough. In a cloud environment, when Kubernetes schedules the subtasks of an end-to-end pipeline, it considers only their resource constraints. Co-location of two consecutive tasks cannot be guaranteed, so sharing intermediate results between them still incurs network overhead from data migration. As shown in the following figure, when Task B runs, the Pods of the two tasks are not aligned, and data shards A3 and A4 need to be migrated to the Vineyard instance on the node where the Pod resides.
To address this, Vineyard represents the Vineyard Objects in the cluster as observable resources through CRDs, and designs and implements a data-locality-aware scheduler plug-in based on the Kubernetes Scheduler Framework. After Task A completes, the scheduler plug-in can learn the location of all shards from the Metadata of the result object. When starting the next task, the scheduler gives a higher priority to the nodes where the data resides (Node 1 and Node 2 in the figure), so that Task B is scheduled onto those nodes whenever possible. This saves the extra overhead introduced by data migration and improves end-to-end performance.
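The scoring step of such a locality-aware scheduler can be sketched as follows. This is a minimal illustration, not the actual plug-in: the function names, the shard-to-node mapping and the 0 to 100 scale are assumptions made for the example. Each candidate node is scored by the share of the required shards it already holds, and the highest-scoring node is preferred.

```python
from collections import Counter


def score_nodes(shard_locations, required_shards, nodes):
    """Score each node by the percentage of required shards it already holds.

    shard_locations maps shard name -> node name (hypothetical data, standing
    in for what a scheduler would read from the result object's metadata).
    """
    counts = Counter(
        shard_locations[s] for s in required_shards if s in shard_locations
    )
    total = len(required_shards) or 1  # avoid division by zero
    return {node: 100 * counts.get(node, 0) // total for node in nodes}


def pick_node(scores):
    """Return the highest-scoring node; ties are broken by node name."""
    return max(sorted(scores), key=lambda node: scores[node])
```

For example, with shards A1 and A2 on one node and A3 and A4 on another, a Pod that needs A3 and A4 scores the second node at 100 and every other node at 0, so no shard has to be migrated.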
Get started quickly
Vineyard integrates with Helm to make installation and deployment easier:
helm repo add vineyard https://vineyard.oss-ap-southeast-1.aliyuncs.com/charts/
helm install vineyard vineyard/vineyard
After installation, a Vineyard DaemonSet is deployed in the cluster, and a UNIX domain socket is exposed for shared memory and IPC communication with the application's task Pods.
© 2024 shulou.com SLNews company. All rights reserved.