This article explains how to use TensorFlow to build a facial mask recognition system, walking through data acquisition, preprocessing, model design, and real-time testing from a practical perspective. We hope you come away with something useful after reading it.
The capabilities of the TensorFlow and OpenCV libraries make it possible to create automated solutions that not only maximize efficiency and ensure compliance, but can also save lives.
We see computer vision and image recognition technology used constantly in our daily lives: unlocking an iPhone with facial recognition, clearing an airport security check, or having a toll booth camera capture an image of your car as you drive through. Image classification is what allows machines to make sense of what they see. Whether in those applications or in a mask recognition system, technology should be integrated into our daily lives for social good. Our goal is to do exactly that: create a mask recognition system and explain how image classification works clearly enough that the project can be applied and replicated in real-life practice. For the curious, this is how we used TensorFlow to create a facial recognition system that detects the boundaries of your face and predicts, in real time, whether you are wearing a mask.
Data acquisition
First, we need to collect images for the training and test datasets. We wanted to create our own dataset containing images of people wearing masks and images of people not wearing masks. We used the Selenium and BeautifulSoup libraries in Python to automate a web browser and parse royalty-free images on Shutterstock.com. We created a script that asks the user to enter the type of image they want to scrape (a photo or any type of image) and the search query for that image. Additionally, the user can specify how many pages to scrape. A simplified sketch of this step follows.
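The snippet below is a minimal, illustrative sketch of this scraping step, not our exact script: it assumes Chrome with chromedriver is installed, and the Shutterstock URL pattern, CSS markup, and output folder are assumptions that may need adjusting.

```python
# Hypothetical sketch of the scraping step (Selenium + BeautifulSoup + requests).
import os
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

query = input("Search query (e.g. 'person wearing face mask'): ")
pages = int(input("Number of pages to scrape: "))
os.makedirs("images", exist_ok=True)

driver = webdriver.Chrome()
for page in range(1, pages + 1):
    # The URL pattern below is an assumption about how search pages are addressed.
    driver.get(f"https://www.shutterstock.com/search/{query.replace(' ', '-')}?page={page}")
    soup = BeautifulSoup(driver.page_source, "html.parser")
    # Preview thumbnails are assumed to sit in plain <img> tags; the markup may differ.
    for i, img in enumerate(soup.find_all("img")):
        src = img.get("src")
        if src and src.startswith("http"):
            with open(f"images/{query}_{page}_{i}.jpg", "wb") as f:
                f.write(requests.get(src).content)
driver.quit()
```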
Data preprocessing
After scraping 606 images of people with masks and 665 images of people without masks, we created a training set and a test set with the following split: 80% of the masked images, selected at random, went into the training set, and the remaining 20% went into the test set. The same process was repeated for the images of people without masks. A sketch of this split is shown below.
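Here is a minimal sketch of the 80/20 split, assuming the scraped images live in flat "mask/" and "no_mask/" folders; the folder names, output paths, and random seed are illustrative assumptions.

```python
# Hypothetical 80/20 split of each class into train/ and test/ directories.
import os
import random
import shutil

def split_class(src_dir, train_dir, test_dir, train_frac=0.8, seed=42):
    files = sorted(os.listdir(src_dir))
    random.Random(seed).shuffle(files)        # reproducible random selection
    cut = int(len(files) * train_frac)
    for dst, names in ((train_dir, files[:cut]), (test_dir, files[cut:])):
        os.makedirs(dst, exist_ok=True)
        for name in names:
            shutil.copy(os.path.join(src_dir, name), os.path.join(dst, name))

split_class("mask", "dataset/train/mask", "dataset/test/mask")
split_class("no_mask", "dataset/train/no_mask", "dataset/test/no_mask")
```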
With only a small number of images at our disposal, we can still build a powerful image classifier by using ImageDataGenerator in Keras, which generates batches of tensor image data (multidimensional arrays) with real-time data augmentation to increase the size and diversity of the dataset. Data augmentation creates modified copies of the original images (for example, by applying filters or flipping them), increasing the amount of data without collecting new images, and it also helps the model learn to handle differences in position, image quality, and appearance.
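Below is a minimal sketch of how ImageDataGenerator can be set up for this step; the specific augmentation parameters, batch size, and 150x150 target size are assumptions rather than our exact settings.

```python
# Hypothetical augmentation setup with Keras' ImageDataGenerator.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,        # normalize pixel values to [0, 1]
    rotation_range=20,        # random rotations
    zoom_range=0.2,           # random zoom
    horizontal_flip=True)     # mirror images to vary appearance

test_datagen = ImageDataGenerator(rescale=1.0 / 255)

# One-hot ("categorical") labels to match the two-neuron output described later.
train_generator = train_datagen.flow_from_directory(
    "dataset/train", target_size=(150, 150), batch_size=10, class_mode="categorical")
test_generator = test_datagen.flow_from_directory(
    "dataset/test", target_size=(150, 150), batch_size=10, class_mode="categorical")
```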
Building the model
In machine learning, we often find the best solution to a real-world problem by modeling it on a framework from the natural world. With image classification, we essentially reproduce the function of the human eye by training an artificial neural network, a series of algorithms modeled on biological neural networks. For this project, we used the classic building blocks of convolutional neural networks (CNNs) to design the model. We decided to build a CNN because CNNs are popular in image recognition for their high accuracy. A CNN follows a hierarchical, funnel-like structure that ends in a fully connected layer, where all neurons are connected to each other and the classification probabilities are computed.
The convolutional layer is the first layer and accepts the input images we train on. Its main purpose is to extract important features from the input images. Convolutional layers can capture local features because they constrain the receptive fields of the hidden layers to be local. Each image is treated as a matrix of pixel values, where each cell of the matrix contains 3 channels (red, green and blue) giving the color intensity. For each image in our dataset, the convolutional layer applies 100 different filters (kernels) of size 3 x 3, sliding each filter across the image with a stride of 1 pixel, as shown in the following animation.
(Animation: a 3x3 filter slides across a 5x5 input image one pixel at a time, producing a new 3x3 feature map.)
The output of the convolutional layer is called a feature map. Each element of the feature map is the sum of the element-wise product of the filter and the patch of pixel values it currently covers, computed before the filter slides over by 1 pixel. Each convolution filter applied produces its own unique feature map. A tiny numeric sketch of one such step follows.
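The example below shows one convolution step on made-up numbers (the patch and kernel values are purely illustrative): a single feature-map element is the sum of the element-wise product of a 3x3 pixel patch and a 3x3 filter.

```python
# One convolution step on a single 3x3 patch.
import numpy as np

patch = np.array([[1, 0, 2],
                  [3, 1, 0],
                  [0, 2, 1]])
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])
# Element-wise product, then sum: (1 - 2) + 3 + (-1) = 1
print((patch * kernel).sum())  # -> 1
```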
The following image gives a good visual sense of how each feature map is derived. Because we apply 100 filters in the convolutional layer, each image in the dataset produces 100 feature maps.
So why 100 filters? The more filters we use, the more image features we extract, and the better the network becomes at recognizing patterns in unseen images.
To create these feature maps, we use the ReLU (Rectified Linear Unit) activation function, which introduces nonlinearity into the model. ReLU not only mitigates the vanishing gradient problem but also lets the model learn faster and perform better. The output is called the "rectified feature map" and is the input to our pooling step.
The next layer, pooling, reduces the dimensionality of each feature map while preserving the important information. We chose MaxPooling with a pool size of 3x3, which selects the maximum value in each window of each feature map. The result is a downsampled feature map that highlights the most prominent features in each patch.
After the pooling layer, we repeat the process with a second convolutional layer + ReLU and a second pooling layer, and then feed the output of the second pooling layer into a Flatten layer. The Flatten layer converts the rectified feature maps of the second pooling layer into a one-dimensional array (a single long feature vector) connected to the final classification layers.
After the Flatten layer, a Dropout layer randomly ignores selected neurons during training to avoid overfitting and reduce interdependencies between neurons in the fully connected layers. It forces the network to learn more robust features. Our model uses a dropout probability of 0.5 as the layer's hyperparameter.
Finally, we decided to use three dense (fully connected) layers of 50, 35, and finally 2 neurons. The dense layers output the probabilities for the binary classification, with no mask = 1 and mask = 0.
Combined with the Adam optimizer, an extension of stochastic gradient descent (SGD) that uses momentum and an adaptive learning rate to converge faster, and binary cross-entropy as the loss function, which takes the mean of the negative log of the predicted probabilities, this model achieved our best validation accuracy. We ran the model for up to 5 epochs, stopping once validation accuracy plateaued from epoch to epoch. The following code shows how such a CNN model can be constructed in Python.
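This is a minimal reconstruction from the layer descriptions above, not the exact original script: the 150x150 input size is carried over from the earlier generator sketch, and the softmax activation on the final layer is an added assumption so the two-neuron output yields probabilities.

```python
# Reconstructed sketch of the CNN described above:
# two Conv2D + ReLU + MaxPooling blocks, Flatten, Dropout(0.5), Dense 50/35/2.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dropout, Dense

model = Sequential([
    Conv2D(100, (3, 3), activation="relu", input_shape=(150, 150, 3)),
    MaxPooling2D(pool_size=(3, 3)),
    Conv2D(100, (3, 3), activation="relu"),
    MaxPooling2D(pool_size=(3, 3)),
    Flatten(),
    Dropout(0.5),
    Dense(50),
    Dense(35),
    Dense(2, activation="softmax"),   # mask = 0, no mask = 1 (softmax is an assumption)
])

model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])

# Uses the generators from the augmentation sketch above.
history = model.fit(train_generator, epochs=5, validation_data=test_generator)
```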
Testing our models in real time
To test the model in practice, we used the VideoCapture function from the cv2 library. A cascade classifier provided by OpenCV detects faces in the live video via detectMultiScale. A while loop keeps capturing frames from the mirrored live video, and the model then determines in real time whether a mask is being worn. Based on the model's prediction, the binary classification result is displayed by overlaying either a green rectangle around the face (indicating that the person on camera is wearing a mask) or a red rectangle (indicating that the person on camera is not wearing a mask). A minimal sketch of this loop follows.
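The loop below is an illustrative sketch, assuming the `model` and 150x150 preprocessing from the earlier sketches; the Haar cascade file, class-index mapping, and colors are assumptions consistent with the description above.

```python
# Hypothetical real-time test loop with OpenCV.
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
# Assumes class indices mask = 0, no mask = 1 (alphabetical order from flow_from_directory).
labels = {0: ("Mask", (0, 255, 0)), 1: ("No mask", (0, 0, 255))}  # BGR colors

cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    frame = cv2.flip(frame, 1)  # mirror the live video
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
        face = cv2.resize(frame[y:y + h, x:x + w], (150, 150)) / 255.0
        pred = model.predict(np.expand_dims(face, axis=0))[0]
        text, color = labels[int(np.argmax(pred))]
        cv2.rectangle(frame, (x, y), (x + w, y + h), color, 2)
        cv2.putText(frame, text, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.8, color, 2)
    cv2.imshow("Mask detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```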
Results
With our chosen model, we reached a validation accuracy of up to 61%. Validation accuracy varies somewhat from run to run; this is mainly because the dropout layer selects neurons at random, so each time the code is run from scratch, dropout affects the training of the model slightly differently.
Although a validation accuracy of up to 61% is certainly better than random chance, the results were not perfect when tested in real time with OpenCV. The model produced false positives and false negatives often enough to notice. Examining the results, our model seems quite sensitive in practice to facial expression, orientation, and head position. We think this is a byproduct of our relatively small but highly diverse dataset: the model was trained on a small number of images and struggled to find patterns common to the entire dataset.
We also tested other models, varying the composition and number of layers. Notable variants included a model with an extra convolution and pooling layer, a model with activation functions (ReLU and softmax) in the dense layers, and a model with fewer dense layers. For the models with extra convolution and pooling layers or with activations in the dense layers, we did not see a consistent enough difference in validation accuracy to justify the extra computational cost and complexity. For the model with fewer dense layers, we did see a significant drop in validation accuracy, with the best run reaching only about 54%.
As the saying goes: "Most of the work in deep learning involves processing data using Python scripts and then fine-tuning the architecture and hyperparameters of the deep network to get a viable model." -- François Chollet, creator of Keras
Conclusions and further improvements
After experimenting with different uniquely designed CNN models, our final model construction yielded optimal performance and accuracy.
Although the model cannot perfectly predict whether a mask is being worn, we believe that with more training data we can achieve better results and accurately determine mask usage across different mask types, facial expressions, head positions, and other factors. In addition, when using web-scraping tools, it is best to purge the dataset of outlier images, since scraping inherently picks up many unrelated images; this prevents noisy images from hurting the model's ability to learn accurate classifications during training.
There are also many ways to test the model using transfer learning, in which we build on well-known pre-trained architectures such as ResNet, VGG, LeNet, AlexNet, and even Xception. We could then plot the overlapping training and validation accuracy and loss curves of these transfer-learning models against the number of epochs to compare which model gives the best results. A sketch of one such approach follows.
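The sketch below illustrates one possible transfer-learning setup using a pre-trained Xception base from keras.applications; the frozen base, pooling head, and 5-epoch run are illustrative choices, and it reuses the generators defined earlier.

```python
# Hypothetical transfer-learning variant with a frozen Xception base.
from tensorflow.keras.applications import Xception
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense

base = Xception(weights="imagenet", include_top=False, input_shape=(150, 150, 3))
base.trainable = False  # keep the pre-trained convolutional weights fixed

tl_model = Sequential([
    base,
    GlobalAveragePooling2D(),
    Dense(2, activation="softmax"),
])
tl_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
tl_history = tl_model.fit(train_generator, epochs=5, validation_data=test_generator)
```

The training and validation accuracy/loss stored in `tl_history.history` could then be plotted against the curves of the baseline CNN to compare models epoch by epoch.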
We believe that our framework and model can be replicated and implemented in all sectors of public life and can inspire the infinite possibilities of image classification through machine learning using TensorFlow and OpenCV libraries.