

Schedulerx2.0 Distributed Computing: Principles and Best Practices


This article walks through the principles and best practices of Schedulerx2.0 distributed computing. The content is concise and easy to follow, and we hope the detailed introduction below gives you something to take away.

1. Preface

The Schedulerx2.0 client provides frameworks for distributed execution, multiple task types, unified logging, and more. Users only need to depend on the schedulerx-worker jar; with the programming model provided by schedulerx2.0, a highly reliable distributed execution engine can be built in just a few lines of code.

The focus of this article is the principles and best practices of the schedulerx2.0 distributed execution engine. After reading it, you should be able to write efficient distributed jobs, perhaps several times faster :)

2. Scalable execution engine

The overall architecture of the worker is modeled on Yarn's architecture and is divided into three layers: TaskMaster, Container, and Processor.

TaskMaster: similar to Yarn's AppMaster. It manages the jobInstance lifecycle and container resources, provides failover capability, and is designed as an extensible distributed execution framework. Four implementations are provided by default: StandaloneTaskMaster (standalone execution), BroadcastTaskMaster (broadcast execution), MapTaskMaster (parallel computing, memory grid, grid computing), and MapReduceTaskMaster (parallel computing, memory grid, grid computing).

Container: a container framework for executing business logic; it supports threads, processes, docker, actors, and so on.

Processor: the business-logic framework; different processors represent different task types.

Take MapTaskMaster as an example; the general principle is shown in the following figure:

3. Distributed programming models: the Map model

Schedulerx2.0 provides several distributed programming models. This article focuses on the Map model (a later article will cover the MapReduce model, which suits even more business scenarios): with just a few lines of code, a batch job can be distributed across multiple machines, and it is very easy to use. A minimal sketch follows the list of execution modes below.

For different batch scenarios, Map-model jobs provide three execution modes: parallel computing, memory grid, and grid computing.

Parallel computing: fewer than 300 subtasks; provides a subtask list.

Memory grid: fewer than 50,000 subtasks; no subtask list; fast.

Grid computing: fewer than 1,000,000 subtasks; no subtask list.
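To make the programming model concrete, here is a minimal sketch of a Map-model job. It uses only the calls that appear in the real examples in section 6 (MapJobProcessor, JobContext, map, ProcessResult, WorkerConstants.MAP_TASK_ROOT_NAME); the class name, task name, and payloads are illustrative, and the schedulerx-worker import statements are omitted.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a Map-model job; schedulerx-worker imports omitted.
public class HelloMapJobProcessor extends MapJobProcessor {

    @Override
    public ProcessResult process(JobContext context) {
        String taskName = context.getTaskName();
        if (WorkerConstants.MAP_TASK_ROOT_NAME.equals(taskName)) {
            // Root task: build the subtask payloads and fan them out.
            List<String> taskList = new ArrayList<>();
            for (int i = 0; i < 10; i++) {
                taskList.add("subtask-" + i);
            }
            return map(taskList, "HelloTask"); // dispatched across the workers
        } else if ("HelloTask".equals(taskName)) {
            // Subtask: runs on whichever worker it was dispatched to.
            String payload = (String) context.getTask();
            System.out.println("processing " + payload);
            return new ProcessResult(true);
        }
        return new ProcessResult(false);
    }
}

Note that the same processor code serves all three execution modes above; which mode is used is a job configuration, not a code change.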

4. Principle of parallel computing

Parallel computing is distinguished by its subtask list:

As shown in the figure above, the subtask list shows the status and machine of each subtask, and supports operations such as reruns and log viewing.

Because parallel computing needs subtask-level visualization, and manual reruns must still work after a worker crashes and restarts, the tasks need to be persisted to the server:

As shown in the above figure:

The server triggers a jobInstance to one of the workers and selects a master.

The MapTaskMaster selects a worker to execute the root task; when that task calls the map method, the call is routed back to the MapTaskMaster.

On receiving the map call, the MapTaskMaster persists the subtasks to the server.

At the same time, the MapTaskMaster runs a pull thread that continuously pulls subtasks in the INIT state and dispatches them to other workers for execution, as sketched below.
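The pull-and-dispatch step can be pictured with a short sketch. Everything here is illustrative: TaskStore, WorkerPool, and the status strings are invented names that only model the behavior just described (pull INIT tasks, hand them to workers, mark them running); they are not real SchedulerX internals.

import java.util.List;

interface TaskStore {
    List<String> pullByStatus(String status, int batchSize); // e.g. backed by the server DB
    void updateStatus(String taskId, String status);
}

interface WorkerPool {
    String selectWorker();                      // pick a worker with free capacity
    void dispatch(String taskId, String worker);
}

class PullThreadSketch implements Runnable {
    private final TaskStore store;
    private final WorkerPool workers;

    PullThreadSketch(TaskStore store, WorkerPool workers) {
        this.store = store;
        this.workers = workers;
    }

    @Override
    public void run() {
        while (true) {
            List<String> initTasks = store.pullByStatus("INIT", 100);
            if (initTasks.isEmpty()) {
                break; // simplified termination; the real master also tracks running tasks
            }
            for (String taskId : initTasks) {
                String worker = workers.selectWorker();
                workers.dispatch(taskId, worker);
                store.updateStatus(taskId, "RUNNING");
            }
        }
    }
}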

5. Principle of grid computing

Grid computing needs to support millions of tasks. If every task were written back to the server, the server could not cope, so grid computing actually stores its tasks on the users' own machines:

As shown in the above figure:

The server triggers a jobInstance to one of the workers and selects a master.

The MapTaskMaster selects a worker to execute the root task; when that task calls the map method, the call is routed back to the MapTaskMaster.

On receiving the map call, the MapTaskMaster persists the subtasks to a local h2 database (a minimal sketch of embedded h2 follows this list).

At the same time, the MapTaskMaster runs a pull thread that continuously pulls subtasks in the INIT state and dispatches them to other workers for execution.
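For what "persist to the local h2 database" means mechanically: h2 is an embedded Java database that can run inside the worker JVM and store its files on the worker's own disk, which is exactly what keeps millions of tasks off the server. Below is a minimal sketch, assuming the com.h2database:h2 jar is on the classpath; the table layout is an assumption for illustration, since SchedulerX's real schema is internal.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LocalH2Sketch {
    public static void main(String[] args) throws Exception {
        // Embedded mode: the database lives in files under the user's home dir,
        // on the worker machine itself rather than on the server.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:~/schedulerx-demo");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS task ("
                     + "id BIGINT PRIMARY KEY, status VARCHAR(16), body BLOB)");
            // MERGE keeps the sketch idempotent across runs.
            st.execute("MERGE INTO task KEY(id) VALUES (1, 'INIT', NULL)");
        }
    }
}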

6. Best practices

6.1 Requirements

For example:

Read the rows with status=0 from table A.

Process the data and insert the results into table B.

Update the processed rows in table A to status=1.

The data volume is 400 million+ rows, and the goal is to shorten the processing time (a sketch of these steps follows).
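To make the three steps concrete, here is a hedged per-row sketch in plain JDBC. The table and column names (tableA, tableB, id, payload, status) are assumptions for illustration, and toUpperCase stands in for the real processing.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class RequirementSketch {
    // Steps 2 and 3 for a single row; step 1 (reading the rows with
    // status = 0) is what the rest of this article distributes.
    static void processOneRow(Connection conn, long id, String payload) throws SQLException {
        // Step 2: process the row and insert the result into table B.
        try (PreparedStatement ins = conn.prepareStatement(
                "INSERT INTO tableB (a_id, result) VALUES (?, ?)")) {
            ins.setLong(1, id);
            ins.setString(2, payload.toUpperCase()); // stand-in for the real processing
            ins.executeUpdate();
        }
        // Step 3: mark the row as processed in table A.
        try (PreparedStatement upd = conn.prepareStatement(
                "UPDATE tableA SET status = 1 WHERE id = ?")) {
            upd.setLong(1, id);
            upd.executeUpdate();
        }
    }
}

Running this row by row in a single process over 400 million+ rows is the bottleneck; the question below is how to distribute the read side well.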

6.2 Negative example

Let's first see whether there is anything wrong with the following code.

public class ScanSingleTableProcessor extends MapJobProcessor {
    private static int pageSize = 1000;

    @Override
    public ProcessResult process(JobContext context) {
        String taskName = context.getTaskName();
        Object task = context.getTask();
        if (WorkerConstants.MAP_TASK_ROOT_NAME.equals(taskName)) {
            int recordCount = queryRecordCount();
            int pageAmount = recordCount / pageSize;       // calculate the number of pages
            for (int i = 0; i < pageAmount; i++) {
                List<Record> recordList = queryRecord(i);  // query one page of records
                map(recordList, "record记录");              // distribute the subtasks for parallel processing
            }
            return new ProcessResult(true);                // true means success, false means failure
        } else if ("record记录".equals(taskName)) {
            // TODO: process one record
            return new ProcessResult(true);
        }
        return new ProcessResult(false);
    }
}

As the code above shows, the root task reads every record out of the database; each row becomes a Record, which is then distributed to different workers for execution. The logic is correct, but the performance is actually very poor. Combining this with the principle of grid computing, the code above can be drawn as the following figure:

As shown in the figure above, the root task first reads table A in full, then stores everything into h2; the pull thread then reads all tasks from h2 once more in full and distributes them to all the clients. So the data in table A is actually:

Read in full twice.

Written in full once.

Transferred in full once.

This is extremely inefficient.

6.3 Positive example

The code for the positive example is given below:

public class ScanSingleTableJobProcessor extends MapJobProcessor {
    private static final int pageSize = 100;

    static class PageTask {
        private int startId;
        private int endId;

        public PageTask(int startId, int endId) {
            this.startId = startId;
            this.endId = endId;
        }

        public int getStartId() {
            return startId;
        }

        public int getEndId() {
            return endId;
        }
    }

    @Override
    public ProcessResult process(JobContext context) {
        String taskName = context.getTaskName();
        Object task = context.getTask();
        if (taskName.equals(WorkerConstants.MAP_TASK_ROOT_NAME)) {
            System.out.println("start root task");
            Pair<Integer, Integer> idPair = queryMinAndMaxId();
            int minId = idPair.getFirst();
            int maxId = idPair.getSecond();
            List<PageTask> taskList = Lists.newArrayList();
            int step = (int) ((maxId - minId) / pageSize); // calculate the page count
            for (int i = minId; i < maxId; i += step) {
                taskList.add(new PageTask(i, (i + step > maxId ? maxId : i + step)));
            }
            return map(taskList, "Level1Dispatch");
        } else if (taskName.equals("Level1Dispatch")) {
            PageTask record = (PageTask) task;
            long startId = record.getStartId();
            long endId = record.getEndId();
            // TODO: process the data in table A whose id lies in [startId, endId]
            return new ProcessResult(true);
        }
        return new ProcessResult(true);
    }

    @Override
    public void postProcess(JobContext context) {
        // TODO
        System.out.println("all tasks are finished.");
    }

    private Pair<Integer, Integer> queryMinAndMaxId() {
        // TODO: select min(id), max(id) from xxx
        return null;
    }
}

As shown in the code above:

Each subtask is no longer a full database row (a Record) but a PageTask with only two fields: startId and endId.

Instead of reading table A in full, the root task only reads the table's minId and maxId and then constructs PageTasks for paging. For example, task1 represents PageTask[1, 1000] and task2 represents PageTask[1001, 2000]. Each task handles a different slice of table A.

At the next level, a task that receives a PageTask processes table A's data within that id interval (a hedged sketch follows this list).
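For illustration, here is a hedged sketch of what the TODO in the "Level1Dispatch" branch could look like: read only this subtask's id range from table A, then reuse the per-row logic from the sketch in section 6.1. Again, the table and column names are assumptions.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class PageTaskHandler {
    // Process one PageTask's id range [startId, endId]; reuses
    // RequirementSketch.processOneRow from section 6.1's sketch
    // for steps 2 and 3 (insert into tableB, set status = 1).
    static void processRange(Connection conn, long startId, long endId) throws SQLException {
        try (PreparedStatement sel = conn.prepareStatement(
                "SELECT id, payload FROM tableA WHERE status = 0 AND id >= ? AND id <= ?")) {
            sel.setLong(1, startId);
            sel.setLong(2, endId);
            try (ResultSet rs = sel.executeQuery()) {
                while (rs.next()) {
                    RequirementSketch.processOneRow(conn, rs.getLong("id"), rs.getString("payload"));
                }
            }
        }
    }
}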

Combining the code above with the principle of grid computing yields the following picture:

As shown in the figure above:

Table A only needs to be read once in full.

The number of subtasks is thousands or tens of thousands of times smaller than in the negative example.

The body of each subtask is very small; if the records contain large fields, the volume transferred shrinks by another factor of thousands or tens of thousands.

To sum up, the number of accesses to table A drops several-fold, and the storage pressure on h2 drops by a factor of tens of thousands. For instance, with 400 million rows the negative example persists one task per record, while the positive example persists only on the order of a hundred PageTasks of two ints each. Not only does the job run much faster, it is also guaranteed not to crash the local h2 database.

That is what the principles and best practices of Schedulerx2.0 distributed computing look like. Hopefully you have picked up some useful knowledge or skills along the way.
