
How to use Python to create a simple speech recognition engine


This article explains how to use Python to create a simple speech recognition engine, with detailed analysis and working examples, in the hope of helping anyone tackling this problem find a simpler and easier approach.

Speech recognition is the ability of a machine or program to recognize spoken words and phrases and convert them into a machine-readable format. Simple implementations of these algorithms typically have a limited vocabulary and may recognize only single words or short phrases. More complex systems, such as Google Cloud Speech-to-Text and Amazon Transcribe, have a wide vocabulary and handle dialects, background noise, and slang.

Brief introduction

Speech is just a series of sound waves, produced when our vocal cords vibrate the surrounding air. These sound waves are recorded by a microphone and converted into an electrical signal. Advanced signal processing techniques then process the signal, separating syllables and words. Thanks to remarkable advances in deep learning, computers can also learn to understand speech from experience.

Speech recognition works through acoustic and language modeling. Acoustic modeling represents the relationship between the linguistic units of speech and the audio signal; language modeling matches sounds to word sequences to help distinguish words that sound alike. Deep learning models based on recurrent layers are commonly used to recognize temporal patterns in speech and improve the accuracy of the system. Other methods can also be used, such as Hidden Markov Models (the earliest speech recognition systems used this approach). In this article, I will only discuss the acoustic model.

Signal processing

There are several ways to convert audio waves into elements that an algorithm can process. The one we will use in this tutorial is to record the height of the sound wave at equally spaced points in time:

We take thousands of readings per second and record a number representing the height of the sound wave at that instant. This is essentially what an uncompressed .wav audio file stores. CD-quality audio is sampled at 44.1 kHz (44,100 readings per second), but for speech recognition a sampling rate of 16 kHz (16,000 samples per second) is enough to cover the frequency range of human speech.

In this way, the audio is represented by a vector of numbers, where each number is the amplitude of the sound wave at intervals of 1/16,000 of a second. The process is conceptually similar to image preprocessing, where a picture becomes a grid of pixel values.

Thanks to the sampling theorem (formulated by Vladimir Kotelnikov in 1933 and commonly attributed to Nyquist and Shannon), we know that as long as the sampling rate is at least twice the highest frequency we want to record, we can mathematically reconstruct the original sound wave perfectly from the sampled points.
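As a quick illustration, here is a minimal sketch (using a hypothetical file path, not part of the original tutorial code) of what this sampled representation looks like when loaded with librosa:

import librosa

# Hypothetical path to any 1-second .wav clip from the dataset
samples, sample_rate = librosa.load('./train/audio/yes/example.wav', sr=16000)

print(sample_rate)    # 16000 readings per second
print(samples.shape)  # about 16000 amplitude values for a 1-second clip
print(samples[:5])    # the first few sound-wave heights
# By the sampling theorem, 16 kHz sampling faithfully captures frequencies up to 8 kHz.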

Python libraries

To accomplish this task, I used the Anaconda environment (Python 3.7) and the following Python libraries:

IPython (v 7.10.2)

Keras (v 2.2.4)

librosa (v 0.7.2)

SciPy (v 1.1.0)

scikit-learn (v 0.20.1)

sounddevice (v 0.3.14)

TensorFlow (v 1.13.1)

tensorflow-gpu (v 1.13.1)

NumPy (v 1.17.2)

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import Session
import os
import librosa
import IPython.display as ipd
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import wavfile
import warnings

# Let TensorFlow allocate GPU memory incrementally instead of grabbing it all at once
config = ConfigProto()
config.gpu_options.allow_growth = True
sess = Session(config=config)

warnings.filterwarnings("ignore")

1. Data set

In this experiment we use the Speech Commands dataset provided by TensorFlow. It includes 65,000 one-second-long utterances of 30 short words, spoken by thousands of different people. We will build a speech recognition system that understands simple voice commands. You can download the dataset from Kaggle (https://www.kaggle.com/c/tensorflow-speech-recognition-challenge).
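Assuming the archive has been extracted to ./train/audio/ with one sub-folder per command word (the layout this tutorial expects), a quick optional sketch like the following can be used to inspect the labels and the number of recordings per word:

import os

train_audio_path = './train/audio/'
labels = os.listdir(train_audio_path)  # one sub-folder per command word
print(len(labels), "command words:", labels)

# Count the .wav recordings available for a few of the command words
for label in labels[:5]:
    waves = [f for f in os.listdir(os.path.join(train_audio_path, label)) if f.endswith('.wav')]
    print(label, len(waves))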

2. Preprocessing the audio waves

In the dataset we are using, some recordings last less than 1 second and the sampling rate is higher than we need. So let's read the sound waves and apply two preprocessing steps to address this:

Resampling

Removing recordings shorter than 1 second

Let's define these preprocessing steps in the following Python code snippet:

train_audio_path = './train/audio/'
labels = os.listdir(train_audio_path)  # one sub-folder per command word

all_wave = []
all_label = []
for label in labels:
    print(label)
    waves = [f for f in os.listdir(train_audio_path + '/' + label) if f.endswith('.wav')]
    for wav in waves:
        samples, sample_rate = librosa.load(train_audio_path + '/' + label + '/' + wav, sr=16000)
        samples = librosa.resample(samples, sample_rate, 8000)
        if len(samples) == 8000:
            all_wave.append(samples)
            all_label.append(label)

As can be seen above, the sampling rate of the signal is 16,000 Hz. We resample it to 8,000 Hz: by the sampling theorem, an 8 kHz rate still captures frequencies up to 4 kHz, which covers most of the frequencies relevant to speech.

The second step is to process our labels: we convert the output labels to integer encoding, then convert the integer-encoded labels to one-hot vectors, because this is a multi-class classification problem:

from sklearn.preprocessing import LabelEncoder
from keras.utils import np_utils

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(all_label)
classes = list(label_encoder.classes_)
y = np_utils.to_categorical(y, num_classes=len(labels))

The last preprocessing step is to reshape the 2D array into a 3D array, because the input to Conv1D must be 3D:

all_wave = np.array(all_wave).reshape(-1, 8000, 1)

3. Create training and validation sets

To train our deep learning model, we need to split the data into two sets (training and validation). For this experiment, I used 80% of the data to train the model and validated it on the remaining 20%:

from sklearn.model_selection import train_test_split

x_train, x_valid, y_train, y_valid = train_test_split(np.array(all_wave), np.array(y),
                                                      stratify=y, test_size=0.2)

4. Machine learning model architecture

I use Conv1D and GRU layers to model the network for speech recognition. Conv1D is a convolutional neural network that convolves along a single dimension, while the GRU is designed to mitigate the vanishing gradient problem of standard recurrent neural networks. A GRU can also be seen as a variant of the LSTM, because the two designs are similar and, in some cases, produce equally good results.
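To make the shapes concrete, here is a minimal sketch (a toy network, not the model used in this article) showing how a Conv1D layer convolves along the time axis of an 8,000-sample waveform and how a GRU then summarizes the resulting sequence into a single vector:

from keras.layers import Conv1D, GRU, Input
from keras.models import Model

inp = Input(shape=(8000, 1))                # 8000 time steps, 1 channel
c = Conv1D(8, 13, activation='relu')(inp)   # -> (7988, 8): convolution along time only
g = GRU(16, return_sequences=False)(c)      # -> (16,): one vector summarizing the sequence
Model(inp, g).summary()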

The model is inspired by two well-known speech recognition approaches, the DeepSpeech and Wav2Letter++ algorithms. The following code defines the proposed model using Keras:

from keras.layers import Bidirectional, BatchNormalization, CuDNNGRU, TimeDistributed
from keras.layers import Dense, Dropout, Flatten, Conv1D, Input, MaxPooling1D
from keras.models import Model
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras import backend as K

K.clear_session()

inputs = Input(shape=(8000, 1))
x = BatchNormalization(axis=-1, momentum=0.99, epsilon=1e-3, center=True, scale=True)(inputs)

# First Conv1D layer
x = Conv1D(8, 13, padding='valid', activation='relu', strides=1)(x)
x = MaxPooling1D(3)(x)
x = Dropout(0.3)(x)

# Second Conv1D layer
x = Conv1D(16, 11, padding='valid', activation='relu', strides=1)(x)
x = MaxPooling1D(3)(x)
x = Dropout(0.3)(x)

# Third Conv1D layer
x = Conv1D(32, 9, padding='valid', activation='relu', strides=1)(x)
x = MaxPooling1D(3)(x)
x = Dropout(0.3)(x)

x = BatchNormalization(axis=-1, momentum=0.99, epsilon=1e-3, center=True, scale=True)(x)

x = Bidirectional(CuDNNGRU(128, return_sequences=True), merge_mode='sum')(x)
x = Bidirectional(CuDNNGRU(128, return_sequences=True), merge_mode='sum')(x)
x = Bidirectional(CuDNNGRU(128, return_sequences=False), merge_mode='sum')(x)
x = BatchNormalization(axis=-1, momentum=0.99, epsilon=1e-3, center=True, scale=True)(x)

# Flatten layer
# x = Flatten()(x)

# Dense Layer 1
x = Dense(256, activation='relu')(x)
outputs = Dense(len(labels), activation="softmax")(x)

model = Model(inputs, outputs)
model.summary()

Note: if you train this model on CPU only, replace the CuDNNGRU layers with GRU.
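For example, a CPU-only variant of the recurrent block might look like this (my assumption: the plain Keras GRU layer is used as a slower drop-in replacement, and the rest of the model stays the same):

from keras.layers import Bidirectional, GRU

# Inside the model definition above, replace each Bidirectional(CuDNNGRU(...)) line like this:
x = Bidirectional(GRU(128, return_sequences=True), merge_mode='sum')(x)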

The next step is to define the loss function as categorical cross-entropy, since this is a multi-class classification problem:

model.compile(loss='categorical_crossentropy', optimizer='nadam', metrics=['accuracy'])

Early stopping and model checkpointing are callbacks that stop training the neural network at the right time and save the best model after each epoch:

early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10, min_delta=0.0001)
checkpoint = ModelCheckpoint('speech3text_model.hdf5', monitor='val_acc', verbose=1, save_best_only=True, mode='max')

Let's train the machine learning model with a batch size of 32 and evaluate its performance on the held-out set:

hist = model.fit(x=x_train, y=y_train, epochs=100, callbacks=[early_stop, checkpoint],
                 batch_size=32, validation_data=(x_valid, y_valid))

Running this command trains the network, printing the loss and accuracy after each epoch.

5. Visualization

I will rely on a visualization to understand how the model's performance evolves over the course of training:

from matplotlib import pyplot

pyplot.plot(hist.history['loss'], label='train')
pyplot.plot(hist.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()

6. Prediction

In this step, we will load the best weights and define a function that takes audio and converts it to text:

from keras.models import load_model

model = load_model('speech3text_model.hdf5')

def s2t_predict(audio, shape_num=8000):
    prob = model.predict(audio.reshape(1, shape_num, 1))
    index = np.argmax(prob[0])
    return classes[index]

Predict the validation data:

import random

index = random.randint(0, len(x_valid) - 1)
samples = x_valid[index].ravel()
print("Audio:", classes[np.argmax(y_valid[index])])
ipd.Audio(samples, rate=8000)
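As a short follow-up sketch, we can compare the model's prediction for this clip with the true label printed above (s2t_predict is the function defined in the previous step):

print("Predicted:", s2t_predict(samples))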

Here is a script that records a one-second voice command from the microphone. You can record your own voice commands and test them on the machine learning model:

import sounddevice as sd
import soundfile as sf

samplerate = 16000
duration = 1  # seconds
filename = 'yes.wav'

print("start")
mydata = sd.rec(int(samplerate * duration), samplerate=samplerate, channels=1, blocking=True)
print("end")
sd.wait()
sf.write(filename, mydata, samplerate)

Finally, we create a script to read the saved voice commands and convert them to text:

# reading the voice command
test, test_rate = librosa.load('./test/left.wav', sr=16000)
test_sample = librosa.resample(test, test_rate, 8000)
print(test_sample.shape)
ipd.Audio(test_sample, rate=8000)

# converting the voice command to text
s2t_predict(test_sample)
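Similarly, the command recorded by the sounddevice script above ('yes.wav', captured at 16 kHz in the current directory) can be run through the same pipeline; a minimal sketch:

import librosa

record, record_rate = librosa.load('./yes.wav', sr=16000)
record_sample = librosa.resample(record, record_rate, 8000)
print(s2t_predict(record_sample))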

Conclusion

Speech recognition technology has become a part of our daily life, but it is still limited to relatively simple commands. With the development of technology, researchers will be able to create more intelligent systems that can understand conversational voice.

That is how to use Python to create a simple speech recognition engine. I hope the content above has been of some help to you.
