Many people have little experience with real-time multi-person 2D pose estimation based on TensorFlow 2.x. This article summarizes the underlying ideas and walks through a working setup, step by step, so that you can try it yourself.
Introduction
As Zhe Cao described in his 2017 paper, real-time multi-person 2D pose estimation is critical for machines to understand people in images and videos.
However, what is pose estimation?
As the name implies, it is a technique for estimating the position of a person's body, for example standing, sitting, or lying down. One way to obtain this estimate is to find the 18 "body joints", or "keypoints" as they are called in the artificial intelligence field. The following image shows our goal: finding these points in an image.
The keypoints run from point 0 (top of the neck) down through the body joints, back up to the head, and end at point 17 (right ear).
The first meaningful work using artificial intelligence methods was DeepPose, a 2014 paper by Google's Toshev and Szegedy. It proposed a human pose estimation method based on deep neural networks (DNNs), in which pose estimation is formulated as a DNN-based regression problem over the body joints.
The model consists of an AlexNet backend (7 layers) plus an extra final layer that outputs 2k joint coordinates. An important problem with this approach is that the model must first detect each person (classic object detection). Every human body found in the image then has to be processed separately, which greatly increases the processing time.
This method is called "top-down" because you first find the body and then find the joints associated with it.
The challenges of pose estimation
Pose estimation has several open problems, such as:
Each image may contain an unknown number of people, who can appear at any location or scale.
Interactions between people create complex spatial interference due to contact and limb articulation, which makes associating joints with the right person difficult.
Runtime complexity tends to increase with the number of people in the image, which makes real-time performance a challenge.
A more exciting approach to these problems is OpenPose, introduced in 2016 by Zhe Cao and his colleagues at the Robotics Institute at Carnegie Mellon University.
OpenPose
The method proposed by OpenPose uses a non-parametric representation called Part Affinity Fields (PAFs) to "connect" the body joints found in an image and associate them with individual people.
In other words, unlike DeepPose, OpenPose first finds all the joints in the image and then searches "upward" for the body most likely to contain each joint, without using a person detector (a "bottom-up" approach). OpenPose finds the keypoints in an image regardless of how many people appear in it. The following image, taken from the OpenPose demo at the ILSVRC and COCO Workshop 2016, gives an idea of the process.
The following figure shows the two-branch, multi-stage CNN architecture used for training. First, a feedforward network simultaneously predicts a set of 2D confidence maps (S) of body part locations (keypoints annotated from dataset/COCO/annotations/) and a set of 2D part affinity fields (L).
After each stage, the predictions of the two branches, together with the image features, are concatenated and fed to the next stage. Finally, the confidence maps and affinity fields are parsed by greedy inference to output the 2D keypoints of all people in the image.
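To make the PAF idea more concrete, below is a minimal, illustrative Python sketch (not the library's actual code) of how a candidate connection between two detected joints can be scored by sampling the affinity field along the line joining them, approximating the line integral described in the OpenPose paper. The function name and the (row, col) array layout are assumptions made here for illustration only.

import numpy as np

def paf_connection_score(paf_x, paf_y, joint_a, joint_b, n_samples=10):
    # paf_x, paf_y: 2D arrays with the x and y components of one part affinity field
    # joint_a, joint_b: (row, col) positions of the two candidate joints
    a = np.array(joint_a, dtype=float)
    b = np.array(joint_b, dtype=float)
    limb = b - a
    norm = np.linalg.norm(limb)
    if norm == 0:
        return 0.0
    unit = limb / norm                      # unit vector along the candidate limb
    score = 0.0
    for t in np.linspace(0.0, 1.0, n_samples):
        r, c = (a + t * limb).astype(int)   # sample a point on the segment
        # dot product between the field vector at that point and the limb direction
        score += paf_x[r, c] * unit[1] + paf_y[r, c] * unit[0]
    return score / n_samples

Candidate joint pairs with a high average score are then assembled into individual people by the greedy parsing step mentioned above.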
We will return to some of these concepts for clarification while implementing the project. However, it is strongly recommended that you go through the OpenPose ILSVRC and COCO Workshop 2016 presentation (http://image-net.org/challenges/talks/2016/Multi-person%20pose%20estimation-CMU.pdf) and the CVPR 2017 video recording (https://www.youtube.com/watch?v=OgQLDEAjAZ8&list=PLvsYSxrlO0Cl4J_fgMhj2ElVmGR5UWKpB) for a better understanding.
TensorFlow 2 OpenPose (tf-pose-estimation)
The original OpenPose was developed with a VGG pre-trained network using the Caffe framework. Here, however, we will follow Ildoo Kim's TensorFlow implementation, tf-pose-estimation, and look at it in detail.
Github link: https://github.com/ildoonet/tf-pose-estimation
What is tf-pose-estimation?
tf-pose-estimation is an implementation of the OpenPose algorithm in TensorFlow. It also provides several variants that modify the network structure for real-time processing on a CPU or on low-power embedded devices.
The tf-pose-estimation GitHub page shows experiments with several different models, such as:
cmu: the weights of the VGG pre-trained network described in the original paper, provided in Caffe format and converted for use in TensorFlow.
dsconv: the same architecture as the cmu version, but with MobileNet's depthwise separable convolutions.
mobilenet: based on the MobileNet V1 paper, using 12 convolutional layers as feature extraction layers.
mobilenet v2: similar to mobilenet, but based on the improved MobileNet V2.
The experiments in this article were carried out with MobileNet V1 ("mobilenet_thin"), which offers a reasonable trade-off between computational budget and latency:
Part 1-install tf-pose-estimation
We referred to Gunjan Seth's article Pose Estimation with TensorFlow 2.0 (https://medium.com/@gsethi2409/pose-estimation-with-tensorflow-2-0-a51162c095ba).
Open a terminal, create a working directory (for example, "Pose_Estimation"), and move into it:
mkdir Pose_Estimation
cd Pose_Estimation
Create a virtual environment (for example, Tf2_Py37) and activate it:
conda create --name Tf2_Py37 python=3.7.6 -y
conda activate Tf2_Py37
Install TensorFlow 2:
pip install --upgrade pip
pip install tensorflow
Install the basic software packages to be used during development:
conda install -c anaconda numpy
conda install -c conda-forge matplotlib
conda install -c conda-forge opencv
Download tf-pose-estimation:
git clone https://github.com/gsethi2409/tf-pose-estimation.git
Go to the tf-pose-estimation folder and install the requirements:
cd tf-pose-estimation/
pip install -r requirements.txt
Next, install SWIG, an interface compiler that connects programs written in C and C++ with scripting languages such as Python. It works by reading the declarations found in C/C++ header files and using them to generate the wrapper code that scripting languages need in order to access the underlying C/C++ code.
conda install swig
Using SWIG, build a C++ library for post-processing.
cd tf_pose/pafprocess
swig -python -c++ pafprocess.i && python3 setup.py build_ext --inplace
Now install the tf-slim library, a lightweight library for defining, training, and evaluating complex models in TensorFlow.
pip install git+https://github.com/adrianc-a/tf-slim.git@remove_contrib
That's all! Now it is time for a quick test. Return to the tf-pose-estimation home directory.
If you followed the steps in order, you are currently inside tf_pose/pafprocess and the command below takes you back; otherwise, change to the tf-pose-estimation root directory with the appropriate command.
cd ../..
There is a Python script, run.py, in the tf-pose-estimation directory. Let's run it with the following parameters:
model=mobilenet_thin
resize=432x368 (size of the image during preprocessing)
image=./images/ski.jpg (sample image in the images directory)
python run.py --model=mobilenet_thin --resize=432x368 --image=./images/ski.jpg
Note that nothing may seem to happen for several seconds, but after about a minute the terminal should display something similar to the figure below:
More importantly, however, the image will appear on a separate OpenCV window:
Great! These pictures prove that everything was installed correctly and works well! We will cover the output in more detail in the next section.
However, to quickly explain the meaning of these four images: the upper-left one ("Result") is the detected pose skeleton (in this case, from ski.jpg) drawn over the original image; the upper-right image is a "heat map" showing the detected parts (S); and both bottom images show the part associations (L). "Result" is obtained by connecting S and L.
The next test is a live video:
If only one camera is installed on your computer, use: camera=0
python run_webcam.py --model=mobilenet_thin --resize=432x368 --camera=1
If all goes well, a window will appear with live video, like the screenshot below:
Part 2-in-depth study of pose estimation in images
In this section, we will take a deeper look at our TensorFlow pose estimation implementation. I suggest you follow along and try to reproduce the Jupyter Notebook 10_Pose_Estimation_Images, which can be downloaded from the GitHub project: https://github.com/Mjrovai/TF2_Pose_Estimation/blob/master/10_Pose_Estimation_Images.ipynb
For reference, this project was developed on a MacPro (2.9 GHz Quad-Core i7, 16 GB 2133 MHz RAM).
Import the libraries:
import sys
import time
import logging
import numpy as np
import matplotlib.pyplot as plt
import cv2
from tf_pose import common
from tf_pose.estimator import TfPoseEstimator
from tf_pose.networks import get_graph_path, model_wh

Model definition and TfPoseEstimator creation
You can use the models located in the models/graph subdirectory, such as mobilenet_v2_large or cmu (the VGG pre-trained model).
For cmu, the *.pb files are not downloaded during installation because they are large. To use this model, run the bash script download.sh located in the /cmu subdirectory.
This project uses mobilenet_thin (MobileNet V1), so all images should be resized to 432x368.
Parameters:
model = 'mobilenet_thin'
resize = '432x368'
w, h = model_wh(resize)
Create an estimator:
e = TfPoseEstimator(get_graph_path(model), target_size=(w, h))
For ease of analysis, let's load a simple image of a human body. OpenCV is used to read the image. The image is stored as RGB, but internally OpenCV works with BGR. Displaying an image with OpenCV is not a problem, because it converts from BGR to RGB before showing the image in its own window (as happened with ski.jpg in the previous section).
Once the image is to be shown in a Jupyter cell, Matplotlib will be used instead of OpenCV. Therefore, the image needs to be converted before being displayed, as follows:
image_path = './images/human.png'
image = cv2.imread(image_path)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
plt.imshow(image)
plt.grid()
Notice that the shape of this image is 567x567. When OpenCV reads the image, it automatically converts it into an array in which each value goes from 0 to 255, where 0 means "black" and 255 means "white".
Once the image is an array, it is easy to use shape to verify its size:
image.shape
The result will be (567, 567, 3), where the shape is (height, width, color channels).
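As a quick sanity check (a small sketch; the exact numbers depend on the image), you can also confirm the data type and value range mentioned above:

print(image.dtype, image.min(), image.max())   # typically: uint8 0 255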
Although you can use OpenCV to read images, we will use the function read_imgfile(image_path) from the tf_pose.common library to avoid any problems with the color channels.
image = common.read_imgfile(image_path, None, None)
Once we have the image as an array, we can apply the estimator's inference method (estimator e), taking the image array as input:
humans = e.inference(image, resize_to_default=(w > 0 and h > 0), upsample_size=4.0)
After running the above command, let's examine the array e.heatMat. Its shape is (184, 216, 19), where the 19 values per pixel are the probabilities that that particular pixel belongs to one of the 18 joints (0 to 17) plus one extra class (18: none). For example, when checking the upper-left pixel, "none" should come out on top:
You can verify the last value of this array
This is the maximum value; understandably, with a 99.6% probability, this pixel does not belong to any of the 18 joints.
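A small sketch of how this check can be done for the top-left pixel (the exact probabilities depend on the image):

print(e.heatMat.shape)                   # e.g. (184, 216, 19)
pixel = e.heatMat[0][0]                  # the 19 values for the top-left pixel
print(pixel.argmax(), pixel.max())       # expected: index 18 ("none") with a value close to 1.0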
Let's try to find the base of the neck (the midpoint between the shoulders). In the original picture it is located at about the middle of the width (0.5 * w = 108) and about 20% of the height, measured from the top (0.2 * h = 37). So, let's examine this particular pixel:
It is easy to see that the maximum value occurs at position 1, with value 0.7059... (which can also be obtained with e.heatMat[37][108].max()), meaning that this particular pixel has a 70% probability of being a "neck". The figure below shows all 18 COCO keypoints (or "body joints"), confirming that "1" corresponds to the "base of the neck".
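As a sanity check, the sketch below recomputes that pixel position and maps channel 1 back to a joint name. It assumes the CocoPart enum exposed by tf_pose.common in ildoonet's implementation; if your version names it differently, adjust accordingly.

from tf_pose.common import CocoPart      # assumed to exist in this tf_pose version

row, col = int(0.2 * e.heatMat.shape[0]), int(0.5 * e.heatMat.shape[1])   # roughly (37, 108)
probs = e.heatMat[row][col]
best = int(np.argmax(probs))
print(best, probs[best])                 # expected: 1 and roughly 0.70
print(CocoPart(best))                    # expected: CocoPart.Neck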
You can plot, for each pixel, a color that represents its maximum value. As a result, a heat map showing the keypoints magically appears:
max_prob = np.amax(e.heatMat[:, :, :-1], axis=2)
plt.imshow(max_prob)
plt.grid()
We now draw key points on the adjusted original image:
plt.figure(figsize=(15, 8))
bgimg = cv2.cvtColor(image.astype(np.uint8), cv2.COLOR_BGR2RGB)
bgimg = cv2.resize(bgimg, (e.heatMat.shape[1], e.heatMat.shape[0]), interpolation=cv2.INTER_AREA)
plt.imshow(bgimg, alpha=0.5)
plt.imshow(max_prob, alpha=0.5)
plt.colorbar()
plt.grid()
As a result, you can see the keypoints over the image: as the color bar indicates, the more yellow a region is, the higher the probability.
To get L, the most likely connections (or "bones") between the keypoints (or "joints"), we can use the array e.pafMat. Its shape is (184, 216, 38), where the 38 (2 x 19) values per pixel relate to the probability that the pixel lies on a connection with one of the 18 specific joints plus none, split into horizontal (x) and vertical (y) components.
The function for drawing the above chart is in Notebook.
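As a rough approximation of that plot, the PAF channels can be collapsed in the same way the heatmap was collapsed above. The sketch below assumes the even channels hold horizontal (x) components and the odd channels vertical (y) components, which may differ in your version of the library:

paf = e.pafMat
max_paf_x = np.amax(np.abs(paf[:, :, 0::2]), axis=2)   # strongest horizontal component per pixel
max_paf_y = np.amax(np.abs(paf[:, :, 1::2]), axis=2)   # strongest vertical component per pixel
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
ax1.imshow(max_paf_x)
ax1.set_title('PAF x components')
ax2.imshow(max_paf_y)
ax2.set_title('PAF y components')
plt.show()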
Draw the skeleton using the draw_humans method
The result of the e.inference() method is stored in the list humans, and the skeleton can be drawn with the draw_humans method:
image = TfPoseEstimator.draw_humans(image, humans, imgcopy=False)
The result is as follows:
If necessary, you can just draw the skeleton as follows (let's rerun all the code to review):
image = common.read_imgfile(image_path, None, None)
humans = e.inference(image, resize_to_default=(w > 0 and h > 0), upsample_size=4.0)
black_background = np.zeros(image.shape)
skeleton = TfPoseEstimator.draw_humans(black_background, humans, imgcopy=False)
plt.figure(figsize=(15, 8))
plt.imshow(skeleton)
plt.grid()
plt.axis('off')
Get the keypoint (joint) coordinates
Pose estimation can be used in a range of applications such as robotics, gaming, or medicine. For this reason, it can be useful to obtain the physical keypoint coordinates from the image so that other applications can use them.
Looking at the humans list produced by e.inference(), you can verify that it is a list with a single element, a string. In this string, each keypoint appears with its relative coordinates and the associated probability. For example, for the human image used so far, we have:
BodyPart:0-(0.49, 0.09) score=0.79
BodyPart:1-(0.49, 0.20) score=0.75
...
BodyPart:17-(0.53, 0.09) score=0.73
We can extract from this list an array (of size 18) containing the actual coordinates, scaled to the shape of the original image:
keypoints = str(humans[0]).split('BodyPart:')[1:]
keypoints_list = []
for k in keypoints:
    # each entry looks like "0-(0.49, 0.09) score=0.79"
    coords = k.split('-(')[1].split(')')[0]
    x, y = map(float, coords.split(', '))
    keypoints_list.append((x, y))
keypts_array = np.array(keypoints_list)
keypts_array = keypts_array * (image.shape[1], image.shape[0])
keypts_array = keypts_array.astype(int)
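If the string parsing above feels fragile, an alternative sketch is to read the coordinates directly from the Human objects returned by e.inference(). This assumes each Human exposes a body_parts dictionary whose values carry x, y, and score attributes, as in ildoonet's implementation:

keypts = {}
for idx, part in humans[0].body_parts.items():
    # x and y are relative (0..1); scale them to the original image size
    keypts[idx] = (int(part.x * image.shape[1]), int(part.y * image.shape[0]), part.score)
print(keypts)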
Let's draw this array over the original image (the array index is the keypoint number). The result is as follows:
plt.figure(figsize=(10, 10))
plt.axis([0, image.shape[1], 0, image.shape[0]])
plt.scatter(*zip(*keypts_array), s=200, color='orange', alpha=0.6)
img = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
plt.imshow(img)
ax = plt.gca()
ax.set_ylim(ax.get_ylim()[::-1])
ax.xaxis.tick_top()
plt.grid()
for i, txt in enumerate(keypts_array):
    ax.annotate(i, (keypts_array[i][0] - 5, keypts_array[i][1] + 5))
Creating functions to quickly reproduce the study for generic images:
The Notebook wraps all the code developed so far into functions. For example, let's look at another image:
image_path = '../images/einstein_oxford.jpg'
img, hum = get_human_pose(image_path)
keypoints = show_keypoints(img, hum, color='orange')
img, hum = get_human_pose(image_path, showBG=False)
keypoints = show_keypoints(img, hum, color='white', showBG=False)
Multi-person image
So far, we have only studied images containing one person. Since the algorithm was developed to capture all joints (S) and PAFs (L) from the image at once and then find the most likely connections between them, the code that produces the result is exactly the same; only the result ("humans") changes: its length will match the number of people detected in the image.
For example, let's use an image of five people:
image_path = './images/ski.jpg'
img, hum = get_human_pose(image_path)
plot_img(img, axis=False)
The algorithm found all the S and L associated with these five people. The result is very good!
From reading the image file to drawing the result, the whole process took less than 0.5 seconds, regardless of the number of people found in the image.
Let's make it harder and look at a picture in which the dancing people are more "mixed" together:
image_path = '../images/figure-836178_1920.jpg'
img, hum = get_human_pose(image_path)
plot_img(img, axis=False)
The result also looks very good. Let's draw only the keypoints, with a different color for each person:
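The plot below uses two arrays, keypoints_1 and keypoints_2, one per detected person; they are built in the Notebook. A minimal sketch of how they could be obtained, again assuming the body_parts attribute of each Human object, is:

def person_keypoints(human, img):
    # collect the (x, y) pixel coordinates of one person's detected keypoints
    return np.array([(int(p.x * img.shape[1]), int(p.y * img.shape[0]))
                     for p in human.body_parts.values()])

keypoints_1 = person_keypoints(hum[0], img)
keypoints_2 = person_keypoints(hum[1], img)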
plt.figure(figsize=(10, 10))
plt.axis([0, img.shape[1], 0, img.shape[0]])
plt.scatter(*zip(*keypoints_1), s=200, color='orange', alpha=0.6)
plt.scatter(*zip(*keypoints_2), s=200, color='yellow', alpha=0.6)
ax = plt.gca()
ax.set_ylim(ax.get_ylim()[::-1])
ax.xaxis.tick_top()
plt.title('Keypoints of all humans detected\n')
plt.grid()
Part III: pose estimation in video and real-time cameras
The process of estimating poses in a video is the same as what we did with images, because a video can be treated as a sequence of images (frames). It is recommended that you follow along with the Jupyter Notebook 20_Pose_Estimation_Video (https://github.com/Mjrovai/TF2_Pose_Estimation/blob/master/20_Pose_Estimation_Video.ipynb) as described in this section.
OpenCV does a great job of dealing with video.
So let's take a .mp4 video and capture its frame using OpenCV:
video_path = '../videos/dance.mp4'
cap = cv2.VideoCapture(video_path)
Now let's create a loop to capture each frame. For each frame, we will apply e.inference() and draw the skeleton from the result, just as we did with images. Finally, the code stops video playback when a key (for example, "q") is pressed.
The following is the necessary code:
fps_time = 0
showBG = True   # set to False to draw the skeleton on a black background
while True:
    ret_val, image = cap.read()
    humans = e.inference(image, resize_to_default=(w > 0 and h > 0), upsample_size=4.0)
    if not showBG:
        image = np.zeros(image.shape)
    image = TfPoseEstimator.draw_humans(image, humans, imgcopy=False)
    cv2.putText(image, "FPS: %f" % (1.0 / (time.time() - fps_time)), (10, 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    cv2.imshow('tf-pose-estimation result', image)
    fps_time = time.time()
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()
The result is good but a little slow. The video, originally recorded at 30 frames per second, plays back in "slow motion", at about 3 frames per second.
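If smoother playback matters more than analysing every frame, one workaround (not part of the original tutorial) is to run inference only on every Nth frame and reuse the last skeleton in between, for example:

cap = cv2.VideoCapture(video_path)
frame_skip = 5          # run inference only on every 5th frame (tunable)
frame_idx = 0
humans = []
while True:
    ret_val, image = cap.read()
    if not ret_val:
        break
    if frame_idx % frame_skip == 0:
        humans = e.inference(image, resize_to_default=(w > 0 and h > 0), upsample_size=4.0)
    image = TfPoseEstimator.draw_humans(image, humans, imgcopy=False)
    cv2.imshow('tf-pose-estimation result', image)
    frame_idx += 1
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()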
Testing with a real-time camera
It is recommended that you follow along with the Jupyter Notebook 30_Pose_Estimation_Camera (https://github.com/Mjrovai/TF2_Pose_Estimation/blob/master/30_Pose_Estimation_Camera.ipynb) as described in this section.
The code required to run a live camera is almost the same as the code used for video, except that the OpenCV cv2.VideoCapture() method receives an integer identifying the camera to use as its input parameter. For example, the internal camera is "0" and an external camera is "1". In addition, the camera should be set to capture frames at the "432x368" resolution used by the model.
Parameter initialization:
camera = 1
resize = '432x368'        # resize the image before processing
resize_out_ratio = 4.0    # resize the heatmaps before post-processing
model = 'mobilenet_thin'
show_process = False
tensorrt = False          # for TensorRT processing

cam = cv2.VideoCapture(camera)
cam.set(3, w)
cam.set(4, h)
The loop part of the code should be very similar to the one used in the video:
fps_time = 0
while True:
    ret_val, image = cam.read()
    humans = e.inference(image, resize_to_default=(w > 0 and h > 0), upsample_size=resize_out_ratio)
    image = TfPoseEstimator.draw_humans(image, humans, imgcopy=False)
    cv2.putText(image, "FPS: %f" % (1.0 / (time.time() - fps_time)), (10, 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
    cv2.imshow('tf-pose-estimation result', image)
    fps_time = time.time()
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cam.release()
cv2.destroyAllWindows()
Similarly, when the algorithm is running, the standard 30 FPS video capture drops to about 10% of that rate.
After reading the above, have you mastered real-time multi-person 2D pose estimation based on TensorFlow 2.x? Thank you for reading!