
How to Build High-Performance Models in TensorFlow across Multiple Systems and Network Topologies


This article shows how to build high-performance models in TensorFlow across multiple systems and network topologies. The content is concise and easy to follow, and I hope the detailed walkthrough below proves useful.

Input pipeline

The Performance Guide explains how to diagnose possible problems with input pipelines and how to resolve them. We found that with large inputs and high samples-per-second processing, tf.FIFOQueue and tf.train.queue_runner cannot saturate multiple current-generation GPUs, for example when training AlexNet on ImageNet. This is because they use Python threads as the underlying implementation, and Python threads introduce too much overhead.

The approach taken in the script is to build the input pipeline with TensorFlow's native parallelism. Our method consists of three stages:

I/O reads: choose and read image files from disk.

Image processing: decode image records into images, preprocess them, and organize them into mini-batches.

Data transfer from CPU to GPU: transfer images from CPU to GPU.

The dominant part of each stage is executed in parallel with the other stages by leveraging data_flow_ops.StagingArea. StagingArea is a queue-like operator, similar to tf.FIFOQueue; the difference is that it offers simpler functionality and can execute on both CPU and GPU in parallel with other stages. Splitting the input pipeline into three independent, parallel stages is scalable and takes full advantage of large multi-core environments. The rest of this section describes each stage and the use of data_flow_ops.StagingArea in detail.

Parallel I/O reads

data_flow_ops.RecordInput is used to read from disk in parallel. Given a list of input files representing TFRecords, RecordInput continuously reads records using background threads. The records are placed in its own large internal pool, and once the pool has loaded at least half of its capacity, it produces output tensors. This op has its own internal threads that are dominated by I/O time and consume minimal CPU resources, which allows it to run in parallel with the rest of the model.
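As a concrete illustration, here is a minimal sketch of how RecordInput might be set up; the file pattern and the parallelism/buffer_size values are illustrative assumptions, not taken from the benchmark script.

import tensorflow as tf
from tensorflow.python.ops import data_flow_ops

# Hypothetical TFRecord shards; parameter values are illustrative.
record_input = data_flow_ops.RecordInput(
    file_pattern='/data/imagenet/train-*-of-*',
    parallelism=16,       # background threads reading from disk
    buffer_size=10000,    # internal pool of records
    batch_size=256,
    name='record_input')
records = record_input.get_yield_op()            # a batch of serialized records
records = tf.split(records, num_or_size_splits=256, axis=0)
records = [tf.reshape(record, []) for record in records]  # one scalar string each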

Parallel image processing

After the images are read by RecordInput, they are passed as tensors to the image processing pipeline. To make the pipeline easier to explain, assume that the input pipeline targets 8 GPUs with a total batch size of 256 (32 per GPU). The 256 image records are read and processed independently and in parallel: each of the 256 reads from RecordInput is followed by a matching image preprocessing operation, and these operations run independently and in parallel with one another. The preprocessing includes operations such as image decoding, distortion, and resizing.
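A minimal per-record preprocessing function might look like the sketch below; the TFRecord feature keys and the particular distortion shown are illustrative assumptions, and the real script applies a richer set of distortions.

import tensorflow as tf

def preprocess(serialized_record):
  # Assumed TFRecord feature keys; adjust to match the actual dataset.
  features = tf.parse_single_example(
      serialized_record,
      features={'image/encoded': tf.FixedLenFeature([], tf.string),
                'image/class/label': tf.FixedLenFeature([], tf.int64)})
  image = tf.image.decode_jpeg(features['image/encoded'], channels=3)
  image = tf.image.resize_images(image, [224, 224])        # resize
  image = tf.image.random_flip_left_right(image)           # a simple distortion
  label = tf.cast(features['image/class/label'], tf.int32)
  return image, label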

Once the images come out of the preprocessor, they are joined into eight tensors, each with a batch size of 32. tf.parallel_stack is used for this instead of tf.concat, which would be implemented as a single op that waits for all of its inputs to be ready before joining them. tf.parallel_stack allocates an uninitialized tensor as its output, and as soon as each input tensor is available, it is written to its designated portion of the output tensor.

When all of the input tensors have been written, the output tensor is passed along in the graph. This effectively hides the memory latency caused by the long tail of producing all the input tensors.
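The sketch below illustrates the joining step with tf.parallel_stack, using random tensors as stand-ins for the preprocessed images.

import tensorflow as tf

# 256 preprocessed images (random stand-ins), joined into 8 per-GPU batches of 32.
images = [tf.random_uniform([224, 224, 3]) for _ in range(256)]
split_size = 32
per_gpu_batches = [
    tf.parallel_stack(images[i * split_size:(i + 1) * split_size])
    for i in range(8)]
# Each batch has shape [32, 224, 224, 3]. Unlike tf.concat, tf.parallel_stack
# writes each input into the output tensor as soon as that input is ready.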

Parallel data transfer from CPU to GPU

Continuing with the assumption of 8 GPUs and a total batch size of 256 (32 per GPU): once the input images have been processed and joined on the CPU, we have 8 tensors, each with a batch size of 32. TensorFlow allows tensors from one device to be used directly on any other device; to make a tensor available on another device, TensorFlow inserts an implicit copy. Copies are scheduled to run between devices before the tensors are actually used. If a copy cannot finish in time, the computation that needs those tensors stalls, resulting in decreased performance.

In this implementation, data_flow_ops.StagingArea is used to explicitly schedule the copies in parallel. The end result is that when computation starts on the GPU, all the tensors are already available.
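A minimal sketch of this staging pattern is shown below, assuming a single available GPU and random stand-in data; the real script stages one (images, labels) pair per GPU.

import tensorflow as tf
from tensorflow.python.ops import data_flow_ops

with tf.device('/cpu:0'):
  # Stand-ins for the preprocessed per-GPU batch produced on the CPU.
  images = tf.random_uniform([32, 224, 224, 3])
  labels = tf.random_uniform([32], maxval=1000, dtype=tf.int32)

with tf.device('/gpu:0'):
  # put() triggers the CPU-to-GPU copy; get() hands the already-transferred
  # tensors to the compute graph.
  stage = data_flow_ops.StagingArea(
      dtypes=[tf.float32, tf.int32],
      shapes=[[32, 224, 224, 3], [32]])
  stage_put = stage.put([images, labels])
  gpu_images, gpu_labels = stage.get()
  loss = tf.reduce_mean(gpu_images)   # placeholder for the real model

with tf.Session() as sess:
  sess.run(stage_put)                 # warm-up: one batch is already on the GPU
  for _ in range(5):
    # Each step consumes one staged batch while the next copy runs in parallel.
    print(sess.run([loss, stage_put]))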

Software pipelining

Since all stages can run on different processors, placing data_flow_ops.StagingArea between them lets them run in parallel. StagingArea is a queue-like operator similar to tf.FIFOQueue, but it offers simpler functionality and can execute on both CPU and GPU.

Before the model starts running all stages, the input pipeline stages are warmed up so that each staging buffer between them holds one set of data. During each run step, one set of data is read from the staging buffer at the beginning of the step, and one set is pushed at the end.

For example, suppose there are three stages A, B, and C, with two staging areas S1 and S2 between them. During warm-up we run:

Warm up:
Step 1: A0
Step 2: A1 B0
Actual execution:
Step 3: A2 B1 C0
Step 4: A3 B2 C1
Step 5: A4 B3 C2

After warm-up, S1 and S2 each hold one set of data. For every actual execution step, one set of data is consumed from each staging area and one new set is added to it.
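The schedule above can be reproduced with a tiny, TensorFlow-independent simulation, purely to illustrate how the two staging buffers let the stages overlap:

def pipeline_schedule(num_steps):
  # Stage B lags A by one step (buffered by S1); C lags B by one step (S2).
  schedule = []
  for step in range(num_steps):
    stages = ['A%d' % step]
    if step >= 1:
      stages.append('B%d' % (step - 1))
    if step >= 2:
      stages.append('C%d' % (step - 2))
    schedule.append(' '.join(stages))
  return schedule

for i, stages in enumerate(pipeline_schedule(5), start=1):
  print('Step %d: %s' % (i, stages))
# Steps 1-2 are the warm-up; from step 3 on, A, B, and C all run in parallel.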

The benefits of this scheme are:

All stages are non-blocking, because after warm-up each staging area always holds one set of data.

All stages can run in parallel, because each of them can start immediately.

The staging buffers have a fixed memory overhead, holding at most one extra set of data.

Only a single session.run() call is needed to run all stages of a step, which makes profiling and debugging much easier.

Best practices for building high-performance models

The following is a collection of additional practices that can improve model performance and increase model flexibility.

Modeling using NHWC and NCHW

The vast majority of TensorFlow operations used by CNNs support both the NHWC and NCHW data formats. On GPU, NCHW is faster; on CPU, NHWC is sometimes faster.

Building a model that supports both data formats increases its flexibility and lets it run well on any platform. The benchmark script was written to support both NCHW and NHWC. NCHW should normally be used when training with GPUs, while NHWC is sometimes faster on CPU. A flexible model can be trained with NCHW on GPU and then used for inference with NHWC on CPU, using the weights obtained from training.
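A minimal sketch of a data_format-parameterized layer is shown below, assuming the tf.layers API; 'channels_first' corresponds to NCHW and 'channels_last' to NHWC, and in practice the two graphs would share weights through a variable scope or a checkpoint restore.

import tensorflow as tf

def conv_block(inputs, data_format):
  # 'channels_first' == NCHW, 'channels_last' == NHWC.
  return tf.layers.conv2d(
      inputs, filters=64, kernel_size=7, strides=2, padding='same',
      data_format=data_format, activation=tf.nn.relu)

# NCHW input for GPU training.
nchw_images = tf.random_uniform([32, 3, 224, 224])
train_out = conv_block(nchw_images, data_format='channels_first')

# NHWC input for CPU inference, built from the same model function.
nhwc_images = tf.random_uniform([32, 224, 224, 3])
infer_out = conv_block(nhwc_images, data_format='channels_last')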

Using fused batch normalization

The default batch normalization in TensorFlow is implemented as a composite of several operations. This is very general, but its performance is often not ideal. An alternative is fused batch normalization, which usually performs much better on GPU. Below is an example of using tf.contrib.layers.batch_norm with the fused option:

bn = tf.contrib.layers.batch_norm(input_layer, fused=True, data_format='NCHW', scope=scope)

Variable distribution and gradient aggregation

During training, variable values are updated using aggregated gradients and deltas. In the benchmark script, we show that a variety of high-performance distribution and aggregation schemes can be built from flexible, general-purpose TensorFlow primitives.

The benchmark script includes three examples of variable distribution and aggregation:

Parameter server: each replica of the training model reads variables from a parameter server and updates them independently. When a replica needs a variable, it is copied over via the standard implicit copies added by the TensorFlow runtime. The example script shows how to use this method for local training, distributed synchronous training, and distributed asynchronous training.

Replicated: each GPU holds an identical copy of every training variable, and forward and backward computation can start as soon as the variable data is available. Gradients are accumulated across all GPUs, and the aggregated sum is applied to each GPU's copy of the variables to keep them in sync.

Distributed replicated: each GPU holds a copy of the training parameters, while the master copies live on the parameter servers, and forward and backward computation can start as soon as the variable data is available. The gradients of all GPUs on a server are accumulated, and each server's aggregated gradients are then applied to the master copies. After all workers have done this, each worker updates its variable copies from the master copies.

The following are other details about each method.

Parameter server variables

The most common way to manage variables in TensorFlow models is the parameter server mode.

In a distributed system, each worker process runs the same model, and the parameter server processes own the master copies of the variables. When a worker needs a variable from a parameter server, it simply refers to it; the TensorFlow runtime adds implicit copies to the graph at run time, which makes the variable value available on the computing device that needs it. When a gradient is computed on a worker, it is sent to the parameter server that owns the corresponding variable, and that variable's optimizer is used to update the master copy.
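A minimal sketch of parameter-server variable placement is shown below, using tf.train.replica_device_setter with a hypothetical two-ps, two-worker cluster; the benchmark script builds this placement itself, so the sketch only illustrates the idea.

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    'ps': ['10.0.0.1:50000', '10.0.0.2:50000'],
    'worker': ['10.0.0.1:50001', '10.0.0.2:50001']})

# Variables created in this scope are assigned to the ps tasks (round-robin by
# default), while the ops stay on the worker; implicit copies move the values.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
  weights = tf.get_variable('weights', shape=[1024, 1000])
  biases = tf.get_variable('biases', shape=[1000])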

Here are some techniques to improve throughput:

To balance the load, variables are spread across the parameter servers according to their size.

When a worker has multiple GPUs, the gradients from each GPU are accumulated and a single aggregated gradient is sent to the parameter server (see the sketch after this list). This reduces network bandwidth and the workload on the parameter servers.

For coordinating workers, an asynchronous update mode is often used, where each worker updates the master copies of the variables without synchronizing with the other workers. In our model, we show that it is quite easy to introduce synchronization across workers, so that all workers finish their updates before the next step begins.
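The per-worker gradient aggregation mentioned above can be sketched as follows; tower_grads is assumed to be a list, one entry per GPU, of (gradient, variable) pairs as returned by Optimizer.compute_gradients.

import tensorflow as tf

def aggregate_gradients(tower_grads):
  """Sum the gradients for each variable across all GPUs of one worker."""
  aggregated = []
  for grads_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grads_and_vars]
    var = grads_and_vars[0][1]           # the same variable on every tower
    aggregated.append((tf.add_n(grads), var))
  return aggregated                      # one (summed_grad, var) pair per variable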

The parameter server method can also be used for local training; in that case, the master copies of the variables are not spread across parameter servers, but instead live on the CPU or are distributed across the available GPUs.

Because of the simplicity of this setup, this architecture has gained wide adoption in the community.

You can use this mode in the script by passing the parameter variable_update=parameter_server.

Variable replication

In this design, each GPU on the server has its own copy of every variable. The copies are kept in sync across GPUs by applying the fully aggregated gradient to each GPU's copy of the variables.

Because the variables and data are available right at the start of training, the forward pass can begin immediately. The gradients from all devices are aggregated into a fully aggregated gradient, which is then applied to each local copy.

Gradient aggregation across the devices can be done in different ways:

Using standard TensorFlow operations to accumulate the total on a single device (CPU or GPU) and then copying it back to all GPUs.

Using NVIDIA NCCL, described in the NCCL section below.

Variable replication in distributed training

The variable replication method above can be extended to distributed training. One option would be to fully aggregate the gradients across the cluster and apply them to each local copy of the variables; this may appear in a future version of the script, but the current script takes a different approach, described below.

In this mode, in addition to each GPU's copy of the variables, a master copy is stored on the parameter servers. As with the replicated mode, training can start immediately using the local copies of the variables.

As the weight gradients become available, they are sent back to the parameter servers, and all local copies are updated:

All the GPU gradients on a worker are aggregated together.

The aggregated gradients from each worker are sent to the parameter server that owns the corresponding variable, where the variable's optimizer is used to update the master copy.

Each worker then updates its local copies of the variables from the master copies. In the example model, this is done inside a barrier across replicas that waits for all workers to finish updating the variables, and a worker fetches the new variables only after the barrier has been released by all replicas. Once all the variables have been copied, this marks the end of one training step and the start of the next.

Although this sounds similar to the standard use of parameter servers, its performance is often better. This is largely because computation can proceed without delay, and much of the copy latency of the early gradients can be hidden by later computation layers.

You can use this mode in the script by passing the parameter variable_update=distributed_replicated.

NCCL

To broadcast variables and aggregate gradients across the GPUs on a single host, we can use TensorFlow's default implicit copy mechanism.

Alternatively, we can use NCCL (tf.contrib.nccl). NCCL is an NVIDIA library for efficiently transferring and aggregating data across GPUs. It schedules a cooperative kernel on each GPU that knows how to exploit the underlying hardware topology; this kernel occupies a single SM of the GPU.

Experiments show that although NCCL usually speeds up data aggregation, it does not necessarily speed up training. Our hypothesis is that implicit copies are essentially free, since they use the copy engine on the GPU and their latency can be hidden by the main computation itself. Although NCCL transfers data faster, it takes up an SM and puts more pressure on the underlying L2 cache. Our results show that NCCL performs better with 8 GPUs, but implicit copies usually perform better with fewer GPUs.
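A minimal sketch of NCCL-based aggregation is shown below, assuming two GPUs and a single random stand-in gradient tensor; the real script applies this per variable across the per-GPU gradients.

import tensorflow as tf
from tensorflow.contrib import nccl

per_gpu_grads = []
for i in range(2):                      # assume 2 GPUs for this sketch
  with tf.device('/gpu:%d' % i):
    per_gpu_grads.append(tf.random_uniform([1024]))   # stand-in gradient

# all_sum returns one tensor per GPU, each holding the sum across all GPUs.
summed_grads = nccl.all_sum(per_gpu_grads)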

Staged variables

We further introduce a staged-variable mode, in which staging areas are used for both variable reads and updates. Similar to the software pipelining of the input pipeline, this hides the data-copy latency. If the computation takes longer than the copying and aggregation, the copying itself can be considered essentially free.

The downside is that all the weights used come from the previous training step, so it is a different algorithm from SGD, but it is still possible to get good convergence by adjusting the learning rate and other hyperparameters.

Executing the script

This section lists the core command-line arguments of the main script (tf_cnn_benchmarks.py) and gives some basic examples.

Note: tf_cnn_benchmarks.py uses the config option force_gpu_compatible, which was introduced after TensorFlow 1.1; until version 1.2 is released, building TensorFlow from source is recommended.

Main command line arguments

model: the model to use, e.g. resnet50, inception3, vgg16, or alexnet.

num_gpus: the number of GPUs to use.

data_dir: the path to the data to process; if not set, synthetic data is used. To use ImageNet data, you can use these instructions (https://github.com/tensorflow/tensorflow/blob/master/tensorflow_models/inception#getting-started) as a starting point.

batch_size: the batch size for each GPU.

variable_update: the method for managing variables: parameter_server, replicated, distributed_replicated, or independent.

local_parameter_device: the device to use as a parameter server: cpu or gpu.

Single-machine examples

# VGG16 training ImageNet with 8 GPUs using arguments that optimize for
# Google Compute Engine.
python tf_cnn_benchmarks.py --local_parameter_device=cpu --num_gpus=8 \
--batch_size=32 --model=vgg16 --data_dir=/home/ubuntu/imagenet/train \
--variable_update=parameter_server --nodistortions

# VGG16 training synthetic ImageNet data with 8 GPUs using arguments that
# optimize for the NVIDIA DGX-1.
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=vgg16 --variable_update=replicated --use_nccl=True

# VGG16 training ImageNet data with 8 GPUs using arguments that optimize for
# Amazon EC2.
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=vgg16 --variable_update=parameter_server

# ResNet-50 training ImageNet data with 8 GPUs using arguments that optimize for
# Amazon EC2.
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=replicated --use_nccl=False

Distributed command line arguments

1) ps_hosts: a comma-separated list of hosts to use as parameter servers, in host:port format (for example, 10.0.0.2:50000).

2) worker_hosts: a comma-separated list of hosts to use as workers, in host:port format (for example, 10.0.0.2:50001).

3) task_index: the index of the host being started within the ps_hosts or worker_hosts list.

4) job_name: the type of job, either ps or worker.

Distributed examples

The following is an example of training ResNet-50 on two hosts, host_0 (10.0.0.1) and host_1 (10.0.0.2). The example uses synthetic data; pass the data_dir argument to use real data.

# Run the following commands on host_0 (10.0.0.1):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=0

# Run the following commands on host_1 (10.0.0.2):
python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=worker --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=8 \
--batch_size=64 --model=resnet50 --variable_update=distributed_replicated \
--job_name=ps --ps_hosts=10.0.0.1:50000,10.0.0.2:50000 \
--worker_hosts=10.0.0.1:50001,10.0.0.2:50001 --task_index=1

The above is how to build high-performance models in TensorFlow across multiple systems and network topologies. Did you pick up any new knowledge or skills? If you want to learn more, you are welcome to follow the industry information channel.
