
Training a Neural Network to Detect Gestures with OpenCV in Python


Gesture → Prediction → Action

You can find the code in the GitHub project repository here:

https://github.com/athena15/project_kojak

Or view the final presentation slides here:

https://docs.google.com/presentation/d/1UY3uWE5sUjKRfV7u9DXqY0Cwk6sDNSalZoI2hbSD1o8/edit#slide=id.g49b784d7df_0_2488

Inspiration

Imagine you're hosting a birthday party. Everyone is having a great time, and the music is turned all the way up. This is exactly the situation where smart speakers like the Tmall Genie or Xiaomi's Xiao AI fall down: they most likely can't hear your voice over the noise, and you can't find the remote anywhere. But what if you could simply raise a hand mid-conversation, make a gesture, and have your smart home device recognize it, turn off the music, and turn up the lights on the birthday girl's face? That's kind of romantic, and kind of cool.

Background

I've been curious about gesture detection for a long time. I remember when the first Microsoft Kinect came out: I could play games and control the screen with a wave of my hand. Then devices like Google Home and Amazon Alexa were released, and gesture detection seemed to fall out of favor as voice control took over. Still, with the launch of video devices like the Facebook Portal and Amazon Echo Show, I wanted to see whether it was possible to build a neural network that could recognize my gestures in real time and control my smart home devices!

Data and my early models

Excited by the idea, I moved quickly. I started with the Gesture Recognition database on Kaggle.com and explored the data. It consists of 20,000 labeled gestures, like the ones shown below.

Strange-looking images, but richly labeled

The first problem I ran into when reading the images was that they were black and white. That means the NumPy arrays have only one channel instead of three (i.e., each array has shape (224, 224, 1)). As a result, I couldn't use these images with the VGG-16 pre-trained model, which requires three-channel RGB images. This was solved by using np.stack on the image list, X_data:
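A minimal sketch of that conversion (the array shapes and sample count here are illustrative):

```python
import numpy as np

# Suppose X_data holds the grayscale images: shape (n_samples, 224, 224).
X_data = np.random.rand(100, 224, 224)

# Stack the single channel three times along a new last axis, giving
# (n_samples, 224, 224, 3) -- the RGB shape VGG-16 expects.
X_data = np.stack((X_data,) * 3, axis=-1)
print(X_data.shape)  # (100, 224, 224, 3)
```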

Once I overcame this obstacle, I set out to build a model using a train-test split that completely held out the photos of 2 of the 10 people in the dataset. After re-running a model based on the VGG-16 architecture, it achieved an overall F1 score of 0.74. This is quite good, considering that random guessing across 10 classes would yield only about 10% accuracy.

However, training a model to recognize images from a homogeneous dataset is one thing; training it to recognize images it has never seen before is another. I tried adjusting the lighting of my photos and shooting against a dark background, mimicking the photos the model had been trained on.

I also tried image augmentation: flips, skews, rotations, and more. Although these augmented images did better than before, the results were still unpredictable and, in my opinion, unacceptable. I needed to rethink the problem and come up with a creative way to make the project work.
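For reference, augmentation like this is only a few lines with Keras; the parameter values below are illustrative assumptions, not the article's originals:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Randomly flip, skew, and rotate training images on the fly.
datagen = ImageDataGenerator(
    rotation_range=20,     # rotate up to 20 degrees
    shear_range=0.2,       # skew
    zoom_range=0.2,        # zoom in/out
    horizontal_flip=True,  # mirror left/right
)
# datagen.flow(X_train, y_train, batch_size=32) then yields augmented batches.
```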

Takeaway: train your model on images that are as close as possible to the ones it is likely to see in the real world.

Rethinking the problem

I decided to try something new. There was a clear disconnect between the strange look of the training data and the images my model might see in real life, so I decided to try building my own dataset.

I had been using OpenCV, an open source computer vision library, and I needed to engineer a solution that would grab an image from the webcam feed, then resize and transform it into a NumPy array that my model could understand. The method I used to transform the data is as follows:
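The original code block wasn't preserved in this copy; a minimal sketch of such a transformation, assuming the model takes 224x224 RGB input, might look like this:

```python
import cv2
import numpy as np

def transform_frame(frame):
    """Resize a captured frame and shape it for the model (a sketch)."""
    img = cv2.resize(frame, (224, 224))   # match the model's input size
    img = img.astype("float32") / 255.0   # scale pixel values to [0, 1]
    return np.expand_dims(img, axis=0)    # add a batch dimension: (1, 224, 224, 3)
```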

In short, once you have the camera up and running, you can grab a frame, transform it, and get a prediction from the model:
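Again as a sketch, assuming `model` is the trained Keras model and `transform_frame` is the helper above:

```python
import cv2

cap = cv2.VideoCapture(0)                  # open the default webcam
while True:
    ret, frame = cap.read()                # grab a frame
    if not ret:
        break
    prediction = model.predict(transform_frame(frame))
    print(prediction.argmax())             # index of the most likely gesture
    cv2.imshow("feed", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press 'q' to quit
        break
cap.release()
cv2.destroyAllWindows()
```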

Getting the pipeline from the webcam to my model working was a huge success. I started thinking about what the ideal image to feed into the model would look like. One obvious obstacle: it is hard to distinguish the region of interest (in our case, a hand) from the background.

Extracting the gesture

The approach I used will be familiar to anyone who has worked with Photoshop: background subtraction. Essentially, if you take a photo of the scene before your hand enters it, you can create a "mask" that removes everything in the new image except your hand.

Background masking and binary image thresholding

Once I had subtracted the background from my images, I applied a binary threshold to make the target gesture completely white and the background completely black. I chose this approach for two reasons: it made the outline of the hand crisp and clear, and it made the model easier to generalize across users with different skin tones. This created the silhouette-like photos that my final model was trained on.
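One way to sketch this in OpenCV (the filenames and the threshold value of 25 are illustrative assumptions):

```python
import cv2

# Grayscale shot of the empty scene, and a frame with the hand in it.
background = cv2.imread("background.jpg", cv2.IMREAD_GRAYSCALE)
frame = cv2.imread("hand.jpg", cv2.IMREAD_GRAYSCALE)

# Background subtraction: pixels that changed are (mostly) the hand.
diff = cv2.absdiff(frame, background)

# Binary threshold: hand becomes white (255), background black (0).
_, silhouette = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
cv2.imwrite("silhouette.jpg", silhouette)
```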

Building a new dataset

Now that I could accurately detect my hand in images, I decided to try something new. My old model didn't generalize well, and my ultimate goal was to build a model that could recognize my gestures in real time, so I decided to build my own dataset!

I chose to focus on five gestures:

I strategically chose 4 gestures that are also included in the Kaggle dataset, so that I could cross-validate my model against those images later.

From there, I built the dataset by setting up my webcam and creating a click binding in OpenCV to capture and save images with unique filenames. I tried to vary the position and size of the gesture within the frame so that my model would generalize better. Soon I had a dataset of 550 silhouette images for each gesture. Yes, you read that correctly: I took over 2,700 pictures.
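A sketch of such a capture loop, assuming a key press (here 'c') triggers the save; the gesture name and filename scheme are illustrative:

```python
import cv2
import time

gesture = "palm"                            # label for this capture session
cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    cv2.imshow("capture", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord("c"):                     # press 'c' to capture
        filename = f"{gesture}_{int(time.time() * 1000)}.jpg"
        cv2.imwrite(filename, frame)        # unique, timestamped filename
    elif key == ord("q"):                   # press 'q' to finish
        break
cap.release()
cv2.destroyAllWindows()
```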

Training a new model

I then built a convolutional neural network using Keras and TensorFlow. I started from the excellent VGG-16 pre-trained model and added 4 dense layers plus a dropout layer on top.
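A sketch of that architecture; the layer widths and dropout rate are assumptions, not the article's exact values:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout, Flatten

# VGG-16 as a frozen base, with 4 dense layers and dropout on top.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                   # keep the pre-trained weights fixed

x = Flatten()(base.output)
x = Dense(512, activation="relu")(x)
x = Dense(256, activation="relu")(x)
x = Dropout(0.5)(x)
x = Dense(128, activation="relu")(x)
x = Dense(64, activation="relu")(x)
out = Dense(5, activation="softmax")(x)  # output layer: 5 gesture classes

model = Model(base.input, out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```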

I then took the unusual step of cross-validating my model on the original Kaggle dataset I had tried before. This was key: if my new model couldn't generalize to images of other people's hands that it hadn't been trained on, it would be no better than my original model.

To do this, I applied the same transformations to each Kaggle image that I had applied to my training data: background subtraction and binary thresholding. This gave them a "look" familiar to my model.

Transformed Kaggle dataset gestures: L, OK, Palm

Results

The model's performance exceeded my expectations. It correctly classified almost every gesture in the test set, ultimately achieving an F1 score of 98 percent and an accuracy score of 98 percent. That's good news!

As any experienced researcher knows, a model that performs well in the lab but poorly in real life is of little value. My initial model had suffered exactly that failure; this one, happily, performed well on real-time gestures.

Smart home integration

Before testing my model on live gestures, I should add that I have always been a smart home enthusiast, and my vision was to control my Sonos (wireless hi-fi) speakers and Philips Hue lights with gestures. For easy access to the Philips Hue and Sonos APIs, I used the phue and SoCo libraries, respectively. Both are very simple to use, as shown below:
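A minimal sketch with phue; the bridge IP address and light name are assumptions:

```python
from phue import Bridge

b = Bridge("192.168.1.10")               # IP of the Hue bridge (assumed)
b.connect()                              # press the bridge's link button on first run
b.set_light("Living Room", "on", True)   # turn a light on
b.set_light("Living Room", "bri", 254)   # set brightness to maximum
```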

Controlling Sonos with SoCo, which works over a web API, was arguably even easier:
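Again a sketch; the speaker's IP address is an assumption:

```python
import soco

speaker = soco.SoCo("192.168.1.20")  # IP of the Sonos speaker (assumed)
speaker.volume = 40                  # set the volume
speaker.pause()                      # pause playback
speaker.play()                       # resume playback
```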

I then created bindings between different gestures and different actions on my smart home devices:
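A sketch of such bindings; the gesture-to-action mapping is illustrative (the article doesn't preserve which gesture triggered which action), and `b` and `speaker` are the phue and SoCo objects from above:

```python
# Map predicted gesture labels to smart-home actions (mapping is illustrative).
actions = {
    "palm": lambda: speaker.pause(),                         # stop the music
    "okay": lambda: speaker.play(),                          # resume the music
    "L":    lambda: b.set_light("Living Room", "bri", 254),  # lights up
}

def act_on(gesture):
    action = actions.get(gesture)
    if action:
        action()  # dispatch on the predicted label
```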

When I finally tested the model in real time, I was very pleased with the results. It accurately predicted my gestures most of the time, and I was able to use them to control my lights and music. For a demonstration, see the original post linked below.

Source: towardsdatascience.com/training-a-neural-network-to-detect-gestures-with-opencv-in-python-e09b0a12bdf1
