Speaker Identification System Thesis Example | Topics and Well Written Essays

SPEAKER IDENTIFICATION SYSTEM: SUMMARY OF INDIVIDUAL CHAPTERS

CHAPTER 1 - INTRODUCTION

This chapter introduces the concepts of speaker recognition and identification. The initial discussion highlights the uniqueness, applications and advantages of speaker recognition and speaker identification as methods of authentication, and compares them with other authentication methods. The concepts of speech processing, speech recognition and speaker recognition are also introduced. The section on speech processing details the characteristics of the speech signal: the different qualities of speech, the variations in its acoustic properties, and the conversion stages that transform the human speech signal into a digital speech signal. This digital speech is of main interest for the speaker authentication system. A spectrogram of a speech signal is shown to characterize its signal energy. Speech recognition is described with a block diagram that details the stages of speech processing, word detection, pattern matching and so on. The basic classification of speaker recognition systems is then described with a diagram, and a clear demarcation is drawn between speaker identification and speaker verification systems. These basic concepts lay the foundation for the discussions in later chapters.

CHAPTER 2 - SPEAKER IDENTIFICATION

This chapter deals with speaker identification in detail. First, the human speech production mechanism is described. Next, the types of speaker identification systems are explained according to their dependence on the text information; the two main types discussed are text-independent and text-dependent speaker identification. After a short discussion of speech models, the speaker identification system is described in detail with a block diagram.
The stages of pre-emphasis filtering, analog-to-digital conversion, frame blocking, windowing and autocorrelation analysis are discussed in detail. Pre-emphasis filtering is treated both for a frame-by-frame speech signal sequence and for the entire speech signal. The analog-to-digital converters used in practice are highlighted with their specifications. The theory behind frame blocking is explained with the aid of a diagram. The characteristics and performance parameters of various windowing methods are shown. Finally, the use of autocorrelation analysis for extracting the harmonic and formant properties of the speech signal is emphasized.

CHAPTER 3 - FEATURE EXTRACTION

This chapter describes methods for selecting and estimating appropriate features of the speech signal. Linear prediction coefficients (LPC), linear prediction cepstral coefficients (LPCC), Mel filter bank cepstral coefficients (MFCC), Bark filter bank cepstral coefficients (BFCC) and uniform filter bank cepstral coefficients (UFCC) are dealt with in detail. It is shown that the linear prediction coefficients give information about the formant frequencies and bandwidths of the speech signal. Nonetheless, LPCC is a more suitable alternative to LPC. The cepstral coefficients other than the zeroth coefficient represent the features of the speech signal. In the Mel filter bank, cepstral coefficients are calculated on the mel scale using triangular filters. This frequency mapping is covered in the chapter, and the cepstral coefficients are computed according to the given equations. The chapter also outlines the advantages of MFCC when applied to GMMs and speaker identification systems. Another feature extraction method, BFCC, is also discussed; its performance is similar to that of MFCC. Finally, UFCC is discussed, which has lower performance than MFCC and BFCC.
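As a rough illustration of the mel-scale mapping and triangular filters mentioned for Chapter 3, the following Python sketch builds a mel filter bank and derives cepstral coefficients from the power spectrum of one frame. The mel formula used here (2595 log10(1 + f/700)), the choice of 20 filters and a 512-point FFT, and all function names are assumptions made for illustration; the summary does not state which variant or parameters the thesis uses. Only the 16 kHz sampling rate and the exclusion of the zeroth coefficient come from the summary itself.

```python
import numpy as np

def hz_to_mel(f_hz):
    # One common mel-scale formula (an assumption, not taken from the thesis).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, n_fft=512, sample_rate=16000):
    # Filter edges are equally spaced on the mel scale from 0 Hz to Nyquist.
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)    # 0 at lo, 1 at mid
        falling = (hi - bin_freqs) / (hi - mid)   # 1 at mid, 0 at hi
        # Triangle = lower of the two slopes, clipped at zero outside [lo, hi].
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fbank

def cepstral_coefficients(power_spectrum, fbank, n_coeffs=12):
    # Log filter-bank energies followed by a DCT-II.
    log_energies = np.log(fbank @ power_spectrum + 1e-10)
    n = len(log_energies)
    k = np.arange(1, n_coeffs + 1)[:, None]   # starts at 1: zeroth term skipped
    j = np.arange(n)[None, :]
    dct_basis = np.cos(np.pi * k * (j + 0.5) / n)
    return dct_basis @ log_energies
```

Replacing the mel mapping with a Bark mapping or with a plain linear spacing of the filter edges would give the BFCC and UFCC analogues of the same pipeline, which is why the three methods can share one front end.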
UFCC is still suitable for speaker identification because it gives uniform resolution at all frequencies. The chapter thus gives a broad overview of the various methods of speech feature extraction.

CHAPTER 4 - SPEAKER MODELLING

The discussion of speaker modeling in this chapter begins with a brief introduction to human voice production and its uniqueness. The different speaker models, namely template models and stochastic models, are explained. The Gaussian Mixture Model (GMM) and the Vector Quantization (VQ) model are analyzed in depth. It is shown that a GMM with around 32 mixture components performs well. These mixture components influence the amplitude of the speaker's reference signal. It is shown that the GMMs describe the feature vectors through conditional probability density functions, which depend on the speaker's voice characteristics. The best match is chosen by the maximum likelihood estimate method. A template model, the vector quantization (VQ) method, is also explained in detail. The VQ codebook method is discussed with diagrams showing the clustering and partitioning sequence based on the Euclidean distance measure. In VQ, the best match is the one obtained under minimum distortion. The remainder of the chapter covers the K-means algorithm and the Expectation Maximization (EM) algorithm, which are used in the best-match detection procedure.

CHAPTER 5 - THE SYSTEM PERFORMANCE AND RESULTS

In this chapter the best approach for speaker identification is chosen and evaluated on a selected speech database, the TIMIT database. This database has a vast collection of speaker utterances, each 3 seconds long. The system is evaluated in two stages, a training phase and a testing phase. During training, eight utterances are combined to form a single speech frame.
This speech is subjected to signal processing and to feature extraction methods such as LPCC, MFCC, BFCC and UFCC. The thesis uses the GMM for speaker modeling, with one model per speaker. Testing is carried out for different feature orders, different numbers of mixture components and different Signal-to-Noise Ratios (SNR). The performance parameter used for evaluation is the percentage of correct identifications. Performance is checked for various combinations of order, mixture components and so on. The best results are obtained for lengthy utterances and with the use of an additional vector, Gp, from the Levinson-Durbin algorithm. The best result, 99.20 %, is obtained for MFCC and Gp taken together.

CONCLUSION

In this thesis, text-independent speaker identification is performed using Gaussian Mixture Models (GMM). The discussion covers the various concepts in speech processing, speech recognition, speaker recognition and speaker modeling. For each stage of the speaker identification process, the related methods are discussed. More emphasis is given to feature extraction and speaker modeling, which are essential in speaker identification and are generally done through pattern matching. Text-independent speaker identification is chosen over the text-dependent method because it is the more sophisticated approach. For such a system, mainly stochastic models are used for speaker modeling. The stochastic Gaussian Mixture Model is chosen here because of its high rate of success in speaker identification. One Gaussian Mixture Model represents each speaker in the training set. These models are obtained with the k-means algorithm for clustering and the Expectation Maximization (EM) algorithm for maximizing the likelihood.
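The combination of k-means clustering and EM described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes diagonal covariance matrices, a fixed number of iterations, a variance floor, and hypothetical function names (`kmeans_centers`, `train_gmm`), none of which are specified in the summary. K-means supplies the initial component means, and EM then alternates between computing component responsibilities and re-estimating weights, means and variances.

```python
import numpy as np

def kmeans_centers(X, k, n_iter=20, seed=0):
    # Plain k-means with Euclidean distance; returns the k cluster centers.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # leave empty clusters unchanged
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def log_gauss_diag(X, means, variances):
    # Log density of every frame under every diagonal Gaussian, shape (N, K).
    diff = X[:, None, :] - means[None, :, :]
    return -0.5 * (np.log(2.0 * np.pi * variances).sum(axis=1)[None, :]
                   + (diff ** 2 / variances[None, :, :]).sum(axis=2))

def train_gmm(X, k, n_iter=50, var_floor=1e-3, seed=0):
    # X: (N, D) feature vectors of one speaker. Returns (weights, means, variances).
    means = kmeans_centers(X, k, seed=seed)
    weights = np.full(k, 1.0 / k)
    variances = np.tile(X.var(axis=0) + var_floor, (k, 1))
    for _ in range(n_iter):
        # E-step: responsibility of each component for each frame.
        log_p = np.log(weights)[None, :] + log_gauss_diag(X, means, variances)
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)
        # M-step: re-estimate the parameters from the responsibilities.
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ X ** 2) / nk[:, None] - means ** 2
        variances = np.maximum(variances, var_floor)
    return weights, means, variances
```

In a setup like the one summarized here, `train_gmm` would be called once per speaker on that speaker's training feature matrix, with `k` set to the number of mixture components under test (8, 16 or 32 in the reported experiments).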
The test results show that the Gaussian Mixture Model is superior to any other type of speaker modeling. The tests are based on the TIMIT (Texas Instruments / Massachusetts Institute of Technology) database, chosen specifically because of its wide accessibility and use. The database contains speech from around 630 speakers, with ten utterances per speaker, each lasting 3 seconds. The database also provides various environmental noises characterized at several Signal-to-Noise Ratios (SNR): 15, 20, 25 and 30 dB, as well as clean speech. The speech signals are sampled at 16 kHz, without session interval.

The experiments were conducted in two phases, training and testing. During the training phase eight utterances from each noise environment were used; during the testing phase the remaining two utterances were used. Training started by combining the eight utterances into one long speech signal of 24 seconds. After passing through the speech processing stages, this speech frame is subjected to feature extraction by Linear Prediction Cepstral Coefficients (LPCC), Mel filter bank Cepstral Coefficients (MFCC), Bark filter bank Cepstral Coefficients (BFCC) and Uniform filter bank Cepstral Coefficients (UFCC). The features extracted by the four methods are compared by computing histograms for each method; the normalized curves for these methods are also shown. After feature extraction, a GMM is used to model every speaker. In the testing phase the same steps are repeated up to feature extraction. Then, to identify the speaker, the EM algorithm is applied. The main performance criterion is the percentage of correct identifications relative to the total number of speakers.
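The decision rule and the performance criterion can be illustrated with a short Python sketch. It assumes that each speaker model is stored as a tuple of GMM weights, means and diagonal variances, and that the identified speaker is the one whose model gives the highest average log-likelihood for the test features, in line with the maximum likelihood matching mentioned for Chapter 4. The function names and the dictionary layout are hypothetical.

```python
import numpy as np

def gmm_avg_loglik(X, weights, means, variances):
    # Average per-frame log-likelihood of features X (N, D) under a
    # diagonal-covariance GMM with K components.
    diff = X[:, None, :] - means[None, :, :]
    log_comp = -0.5 * (np.log(2.0 * np.pi * variances).sum(axis=1)
                       + (diff ** 2 / variances).sum(axis=2))
    return np.logaddexp.reduce(np.log(weights) + log_comp, axis=1).mean()

def identify(X, models):
    # models: {speaker_name: (weights, means, variances)}.
    # Returns the name of the model with the highest likelihood for X.
    scores = {name: gmm_avg_loglik(X, *params) for name, params in models.items()}
    return max(scores, key=scores.get)

def identification_rate(predicted, actual):
    # Percentage of test cases in which the predicted speaker is correct.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)
```

For example, three correct decisions out of four test cases give `identification_rate(["a", "b", "a", "a"], ["a", "b", "b", "a"])` equal to 75.0.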
The experimental results for LPCC, MFCC, BFCC and UFCC are analyzed under different noise (SNR) levels: 15, 20, 25 and 30 dB, and clean speech. All four methods performed poorly at an SNR of 15 dB. At an SNR of 25 dB, the reported results are 50.32 % for MFCC, 47.54 % for BFCC, 61.19 % for UFCC and 61.98 % for LPCC. These results were obtained with an utterance length of 3.6 seconds, feature orders of 8, 10 and 12, and 8, 16 and 32 mixture components. They show that LPCC performs well compared to the other three methods. For the same set of parameters, the performance figure is higher when evaluated against the number of mixture components rather than the SNR. MFCC performs well compared to the others when tested with 8 coefficients, whereas BFCC performs well with 12 coefficients. Increasing the feature order above 32 does not show any further improvement in system performance.

When the speech frame length was increased to 6 seconds, all methods performed well at higher feature orders, for example 12 coefficients. At the lower feature order of 8 coefficients the results vary, and MFCC proves to be the better option. Across 8, 16 and 32 mixture components, system performance varies. BFCC and UFCC both reach 99.8413 % with 8 mixture components, UFCC proved better with 99.841 % for 16 mixtures, and BFCC reached 97.7778 % with 32 mixtures.

The thesis uses the Levinson-Durbin algorithm to compute Gp for each frame. This computation of Gp increases the energy content of the speech signal.
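The summary does not define Gp. The sketch below assumes it denotes the prediction gain of each frame, that is, the ratio of the frame energy to the residual energy left after linear prediction; that interpretation is an assumption, as are the pre-emphasis coefficient of 0.95, the Hamming window, the prediction order of 10 and all function names. What the code does show with certainty is the standard Levinson-Durbin recursion, which solves for the linear prediction coefficients from a frame's autocorrelation sequence and yields the residual energy as a by-product.

```python
import numpy as np

def pre_emphasis(x, alpha=0.95):
    # First-order high-pass filter y[n] = x[n] - alpha * x[n-1]; alpha is assumed.
    return np.append(x[0], x[1:] - alpha * x[:-1])

def autocorrelation(frame, order):
    # Autocorrelation r[0..order] of one frame.
    n = len(frame)
    return np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])

def levinson_durbin(r, order):
    # Returns a = [1, a1, ..., ap] for the error filter
    # e[n] = x[n] + sum_j a_j * x[n-j], and the final residual energy.
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)                 # residual energy shrinks each step
    return a, err

def frame_lpc(frame, order=10, alpha=0.95):
    # LPC coefficients and (assumed) prediction gain for one frame.
    windowed = pre_emphasis(frame, alpha) * np.hamming(len(frame))
    r = autocorrelation(windowed, order)
    if r[0] == 0.0:                          # silent frame: nothing to predict
        a = np.zeros(order + 1)
        a[0] = 1.0
        return a, 1.0
    a, err = levinson_durbin(r, order)
    return a, r[0] / err                     # gain = frame energy / residual energy
```

As a quick check, the autocorrelation sequence `[1.0, 0.9, 0.81]` of an ideal first-order process yields coefficients close to `[1, -0.9, 0]` and a residual energy close to 0.19. Under the stated assumption, the per-frame gain would be appended as one extra column to the cepstral feature matrix.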
When this Gp vector is included in the feature matrix, it improves the overall performance of the system in correctly identifying the speaker. In this analysis, MFCC with Gp gave 99.20 % correct identification. Next, UFCC with Gp gave 98.89 %, and finally BFCC with Gp gave 98.49 %. We can thus conclude that computing Gp with the Levinson-Durbin algorithm and including it as an additional feature vector alongside the other feature vectors improves system performance considerably. It has also been shown that the best speaker identification performance is obtained by increasing the length of the test utterances. In this thesis, then, a sophisticated method of speaker identification has been derived and tested, and the results shown are the best obtained with respect to correct speaker identification.
