Automatic Speaker Recognition Thesis Example

Summary of individual chapters

CHAPTER 1

This chapter introduces the field of Automatic Speaker Recognition (ASR). It explains the advantages and applications of speaker recognition as a method of authentication. It also compares speaker recognition with other authentication methods to show what makes it distinctive.

The chapter then outlines the two main stages of a speaker recognition system, enrollment and testing. It divides automatic speaker recognition into Automatic Speaker Identification (ASI) and Automatic Speaker Verification (ASV), describes both in detail, and compares them to bring out the advantages of each.

The chapter next discusses the two common error types, False Acceptance and False Rejection, and shows that both depend on the decision threshold. Raising the threshold lowers false acceptances but increases false rejections, and lowering it does the opposite. The Equal Error Rate (EER) is the operating point at which the two error rates are equal. Relating it to the threshold yields a performance index that is useful in the implementation and testing stages.

Finally, speaker recognition is classified into text-dependent and text-independent systems. This leads into the detailed discussion of Text Independent Speaker Identification.

CHAPTER 2: TEXT INDEPENDENT SPEAKER IDENTIFICATION SYSTEM

This chapter discusses the theory and methodology behind text-independent speaker identification systems. It begins with the human voice and the speech production mechanism. The human vocal tract is modeled as an acoustic tube, which links the physical shape of the vocal tract to the resonant properties of the tube.
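The acoustic-tube idea can be made concrete with a short worked example. This sketch assumes the simplest textbook case: a uniform, lossless tube that is closed at the glottis and open at the lips. Such a tube resonates at odd multiples of a quarter wavelength, F_n = (2n - 1)c / 4L. The tract length of 17.5 cm and the speed of sound of 350 m/s are illustrative assumptions, not values taken from the thesis.

```python
# Resonances of a uniform, lossless tube closed at one end (glottis) and
# open at the other (lips): a quarter-wavelength resonator.
SPEED_OF_SOUND = 350.0  # m/s, approximate value for warm, moist air (assumption)


def tube_formants(length_m: float, count: int = 3, c: float = SPEED_OF_SOUND) -> list[float]:
    """Return the first `count` resonant frequencies in Hz: F_n = (2n - 1) c / (4 L)."""
    if length_m <= 0:
        raise ValueError("tube length must be positive")
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]


if __name__ == "__main__":
    # 17.5 cm is a commonly quoted adult vocal-tract length (illustrative assumption).
    print([round(f) for f in tube_formants(0.175)])  # [500, 1500, 2500]
```

Real vocal tracts are not uniform. Varying the cross-sectional area along the tube shifts these resonances, which is why they carry information about both the sound being produced and the speaker.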
This acoustic-tube model simplifies the modeling and parameter extraction of the speech signal. Voiced and unvoiced sounds, plosives and related sound classes are described in detail.

The next section explains the purpose and process of extracting features from a speaker's speech signal. It analyzes the prominent feature extraction methods:

- Linear Prediction Cepstral Coefficients (LPCC)
- Mel Frequency Cepstral Coefficients (MFCC)
- Bark Frequency Cepstral Coefficients (BFCC)
- Uniform Frequency Cepstral Coefficients (UFCC)

A diagram illustrates the derivation of the LP coefficients by the Yule-Walker method. The chapter shows that LPCC combined with the Mahalanobis distance measure is preferred for better identification performance. The MFCC, BFCC and UFCC feature extractors are also explained in detail.

Under pattern matching, the chapter explains template models (Dynamic Time Warping, Vector Quantization) and stochastic models (Gaussian Mixture Models, GMM, and Hidden Markov Models, HMM). It also analyzes the use of neural networks for training and testing. The greatest emphasis is placed on the GMM, which gives a smooth approximation to arbitrarily shaped densities.

CHAPTER 3: THE DESCRIPTION AND PERFORMANCE OF THE SYSTEM

This chapter describes the implementation of the speaker identification system. It first describes the evaluation corpus and then carries out the three phases of training, testing and performance evaluation in detail.

The performance evaluation uses the TIMIT database, which contains utterances from 630 speakers. During the training phase, 24 seconds of speech were used. The feature extraction methods are LPCC, MFCC, BFCC and UFCC. The LPC coefficients are computed with the Levinson-Durbin method and then converted into cepstral coefficients. For MFCC, BFCC and UFCC, the entire utterance is converted into a sequence of feature vectors.
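The LPCC step described above has three parts: autocorrelation, a Levinson-Durbin solution of the Yule-Walker equations, and conversion to cepstral coefficients. The plain-Python sketch below illustrates it under these assumptions:

- The input is a single frame that has already been windowed.
- The predictor convention is x[n] ~ sum of a_k * x[n-k].
- The conversion uses the standard LPC-to-cepstrum recursion.

The function names are hypothetical. Taking the square root of the final prediction error as the gain is an assumption about how the Gp parameter mentioned later in this summary might be obtained.

```python
import math


def autocorrelation(frame: list[float], order: int) -> list[float]:
    """Unnormalized autocorrelation r[0..order] of one (already windowed) frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t - k] for t in range(k, n)) for k in range(order + 1)]


def levinson_durbin(r: list[float], order: int) -> tuple[list[float], float]:
    """Solve the Yule-Walker equations recursively.

    Returns predictor coefficients a_1..a_p for x[n] ~ sum_k a_k * x[n-k]
    and the final prediction-error energy E_p.
    """
    if r[0] <= 0.0:
        raise ValueError("frame has zero energy")
    a = [0.0] * (order + 1)  # a[0] is unused so that a[i] matches a_i
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error  # reflection (PARCOR) coefficient of stage i
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k  # prediction error never grows while |k| < 1
    return a[1:], error


def lpc_to_cepstrum(a: list[float], n_ceps: int) -> list[float]:
    """Cepstral coefficients c_1..c_n of the all-pole model 1 / (1 - sum_k a_k z^-k)."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] (the log-gain term) is not returned
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]


if __name__ == "__main__":
    frame = [0.9 ** t for t in range(200)]  # toy decaying signal, not real speech
    coeffs, err = levinson_durbin(autocorrelation(frame, 2), 2)
    gain = math.sqrt(err)  # hypothetical stand-in for the thesis's Gp parameter
    print(coeffs, lpc_to_cepstrum(coeffs, 12), gain)
```

In a real front end the frame would typically be a Hamming-windowed segment of a few tens of milliseconds. The prediction order would be chosen to match the desired number of cepstral coefficients, such as the 12 used in the thesis.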
The thesis models each speaker with a GMM trained by the Expectation Maximization (EM) algorithm, and finds the best match by computing likelihoods. The main performance parameter is the percentage of correct identification. The evaluation uses TIMIT speech signals at varying SNRs, with test utterances of 3 and 6 seconds and feature orders of 8, 10 and 12. It is shown that performance improves when the utterance length is increased from 3 seconds to 6 seconds. Combining the feature extractors with the Gp parameter also improves the rate of correct speaker identification.

CONCLUSION

This thesis describes Automatic Speaker Recognition as a mechanism by which a person is recognized from a spoken phrase. Such systems can be used either to identify a person or to verify a claimed identity. This leads to a discussion of speech production, speech processing, extraction of speaker-specific features from an utterance, speaker modeling and, finally, pattern matching to identify the speaker.

The main aim of the thesis is to implement and evaluate a text-independent speaker identification system. Such a system must identify the speaker even when the words spoken are new to it. It is therefore more demanding than a text-dependent system, which only has to recognize speakers from the phrases they used during training.

The thesis also explains the errors these systems can make, false acceptance and false rejection, and how the equal error rate of the two is used for performance evaluation. The system's corpus is the readily accessible TIMIT (Texas Instruments / Massachusetts Institute of Technology) database.
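The false acceptance / false rejection trade-off and the equal error rate can be made concrete with a threshold sweep. This sketch assumes that:

- Each verification trial yields a score, where a higher score means "more likely the claimed speaker".
- A claim is accepted when score >= threshold.
- The EER is approximated at the tested threshold where FAR and FRR are closest, reporting their mean.

The function names and the score lists are hypothetical toy values.

```python
def error_rates(genuine: list[float], impostor: list[float], threshold: float) -> tuple[float, float]:
    """Return (FAR, FRR) when a claim is accepted if score >= threshold."""
    far = sum(s >= threshold for s in impostor) / len(impostor)  # impostors let in
    frr = sum(s < threshold for s in genuine) / len(genuine)  # true speakers turned away
    return far, frr


def equal_error_rate(genuine: list[float], impostor: list[float]) -> tuple[float, float]:
    """Sweep every observed score as a threshold; return (threshold, EER estimate).

    The EER is approximated at the threshold where |FAR - FRR| is smallest,
    reporting the mean of the two rates there.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, t, (far + frr) / 2.0)
    _, threshold, eer = best
    return threshold, eer


if __name__ == "__main__":
    genuine = [0.9, 0.8, 0.7, 0.6, 0.3]  # toy scores for true-speaker trials
    impostor = [0.1, 0.2, 0.4, 0.5, 0.65]  # toy scores for impostor trials
    print(equal_error_rate(genuine, impostor))  # (0.6, 0.2)
```

Raising the threshold in this sweep moves FAR down and FRR up, which is the dependency described in Chapter 1. With many trials, the two curves cross much more finely than in this five-score toy example.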
Each speaker in the TIMIT database has ten utterances, each lasting roughly 3 seconds. The database is phonetically rich and comprises three sentence types:

- dialect sentences (SA)
- phonetically diverse sentences (SI)
- phonetically compact sentences (SX)

Clean-channel speech recorded in a sound-booth environment was chosen from this database to evaluate the system. The identification accuracy is shown by histograms for the various feature extractors and by the comparison tables in Chapter 3.

The thesis uses four prominent types of feature extractor: Linear Prediction Cepstral Coefficients (LPCC), Mel Frequency Cepstral Coefficients (MFCC), Bark Frequency Cepstral Coefficients (BFCC) and Uniform Frequency Cepstral Coefficients (UFCC). These methods are still being improved, and much further research is expected in this area. The present LPCC method works well in many cases of speaker feature extraction.

During the training phase, each speech utterance is passed through the feature extractors:

- For LPCC, the Levinson-Durbin algorithm computes the prediction coefficients, which are then converted into cepstral coefficients.
- For MFCC and BFCC, a bank of M triangular filters followed by a Discrete Cosine Transform (DCT) yields the cepstral coefficients.

The histograms make the relative performance of these feature extractors clear.

Each speaker is modeled by a stochastic Gaussian Mixture Model (GMM) with 32 mixture components, trained with the EM algorithm. The analysis in this thesis shows that the GMM is the most suitable model for speaker identification. This stochastic model represents each speaker by a probability density over the feature vectors extracted from that speaker's utterances.
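The triangular filter bank and DCT pipeline described above can be sketched for all three filter-bank variants by changing only the frequency warping. The sketch makes these assumptions:

- the common 2595 * log10(1 + f / 700) mel formula;
- Traunmüller's approximation for the Bark scale;
- a linear (uniform) scale for UFCC;
- 20 filters, a 256-point naive DFT and an unnormalized DCT-II.

This summary does not give the thesis's exact filter counts, band limits or UFCC definition, so the function names and settings are hypothetical.

```python
import math

# Frequency warpings: (Hz -> warped scale, warped scale -> Hz).
SCALES = {
    "mel": (lambda f: 2595.0 * math.log10(1.0 + f / 700.0),
            lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)),
    # Traunmueller's Bark approximation; other Bark formulas exist (assumption).
    "bark": (lambda f: 26.81 * f / (1960.0 + f) - 0.53,
             lambda z: 1960.0 * (z + 0.53) / (26.28 - z)),
    # Hypothetical "uniform" scale for UFCC: filters equally spaced in Hz.
    "uniform": (lambda f: f, lambda f: f),
}


def filterbank(n_filters, n_fft, sample_rate, scale="mel"):
    """Triangular filters whose edges are equally spaced on the chosen scale."""
    to_scale, to_hz = SCALES[scale]
    lo, hi = to_scale(0.0), to_scale(sample_rate / 2.0)
    edges = [to_hz(lo + (hi - lo) * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = sample_rate / n_fft
    bank = []
    for m in range(1, n_filters + 1):
        left, centre, right = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for b in range(n_fft // 2 + 1):
            f = b * bin_hz
            if left < f <= centre:
                row.append((f - left) / (centre - left))  # rising slope
            elif centre < f < right:
                row.append((right - f) / (right - centre))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def power_spectrum(frame, n_fft):
    """|DFT|^2 for bins 0..n_fft/2 using a naive O(N^2) DFT (fine for a sketch)."""
    x = list(frame[:n_fft]) + [0.0] * max(0, n_fft - len(frame))
    spectrum = []
    for k in range(n_fft // 2 + 1):
        re = sum(v * math.cos(2 * math.pi * k * n / n_fft) for n, v in enumerate(x))
        im = -sum(v * math.sin(2 * math.pi * k * n / n_fft) for n, v in enumerate(x))
        spectrum.append(re * re + im * im)
    return spectrum


def cepstral_coefficients(frame, sample_rate, n_filters=20, n_ceps=12, n_fft=256, scale="mel"):
    """Filter-bank log energies followed by a DCT-II; c_0 is dropped."""
    spectrum = power_spectrum(frame, n_fft)
    bank = filterbank(n_filters, n_fft, sample_rate, scale)
    log_energies = [math.log(max(sum(w * p for w, p in zip(row, spectrum)), 1e-12)) for row in bank]
    return [
        sum(e * math.cos(math.pi * q * (m + 0.5) / n_filters) for m, e in enumerate(log_energies))
        for q in range(1, n_ceps + 1)
    ]


if __name__ == "__main__":
    fs = 8000
    frame = [math.sin(2 * math.pi * 440 * n / fs) for n in range(200)]  # toy 25 ms tone
    for name in SCALES:
        print(name, [round(v, 2) for v in cepstral_coefficients(frame, fs, scale=name)])
```

Only the warping differs between the three variants. This makes the comparison in the thesis a comparison of how filter resolution is distributed across frequency.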
To identify a speaker, the feature vectors of a test utterance are evaluated under each speaker's GMM density, and the model that assigns the highest likelihood identifies the speaker. The model parameters are estimated with the Expectation Maximization (EM) algorithm. EM iteratively computes a new model from the previous one until convergence, and no iteration decreases the likelihood of the training feature vectors. The procedure therefore converges to a (locally) maximum-likelihood estimate.

System performance is measured as the percentage of correct speaker identification: the number of correctly identified segments divided by the total number of segments. The analysis used 32 mixtures and 24 seconds of training speech under different Signal-to-Noise Ratio (SNR) conditions. It shows that LPCC performs best (61.98 %) at higher SNRs, while UFCC performs better (4.71 %) at lower SNRs. Under clean speech conditions LPCC also proved to be the better extractor: when 10 % of the TIMIT speakers were evaluated with LPCC, the identification rate was 100 %.

The best performance was obtained with a model order of 16 and a feature order of 13 for MFCC, BFCC and UFCC combined with Gp. Of these 13 features, 12 are coefficients from the feature extractor and one is the parameter Gp. Gp is derived by the Levinson-Durbin algorithm, as described for the LPC computation. Among these combinations, MFCC + Gp gave the best result, 99.20 %. Evaluating the system with different numbers of coefficients shows that MFCC performs well with a small number of coefficients, such as 8, while UFCC does better with a larger number, such as 12.
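The speaker modeling, EM training, likelihood-based identification and percentage-of-correct-identification measure described above can be sketched with a small diagonal-covariance GMM in plain Python. This is a minimal illustration, not the thesis's implementation. It assumes:

- diagonal covariances;
- means initialized from randomly chosen training vectors;
- a variance floor for numerical safety;
- a stopping rule that ends EM when the average log-likelihood improves by less than a tolerance.

The class and function names are hypothetical. The thesis uses 32 mixture components per speaker; the usage below uses 2 components on toy 2-D data.

```python
import math
import random

LOG_2PI = math.log(2.0 * math.pi)


def log_gauss(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(LOG_2PI + math.log(v) + (xi - m) ** 2 / v for xi, m, v in zip(x, mean, var))


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


class DiagonalGMM:
    """Hypothetical minimal speaker model: a diagonal-covariance GMM trained with EM."""

    def __init__(self, n_components, seed=0):
        self.k = n_components
        self.rng = random.Random(seed)

    def component_logs(self, x):
        # log(w_j) + log N(x; mu_j, Sigma_j) for every component j
        return [math.log(w) + log_gauss(x, m, v)
                for w, m, v in zip(self.weights, self.means, self.variances)]

    def fit(self, data, max_iter=50, tol=1e-4, var_floor=1e-3):
        dim = len(data[0])
        self.weights = [1.0 / self.k] * self.k
        self.means = [list(v) for v in self.rng.sample(data, self.k)]
        self.variances = [[1.0] * dim for _ in range(self.k)]
        previous = -math.inf
        for _ in range(max_iter):
            # E-step: component posteriors and the current average log-likelihood
            resp, total = [], 0.0
            for x in data:
                logs = self.component_logs(x)
                norm = log_sum_exp(logs)
                total += norm
                resp.append([math.exp(v - norm) for v in logs])
            if total / len(data) - previous < tol:
                break  # improvement below tolerance: treat as converged
            previous = total / len(data)
            # M-step: re-estimate weights, means and floored variances
            for j in range(self.k):
                nj = sum(r[j] for r in resp) + 1e-10
                mean = [sum(r[j] * x[d] for r, x in zip(resp, data)) / nj for d in range(dim)]
                self.weights[j] = nj / len(data)
                self.means[j] = mean
                self.variances[j] = [
                    max(var_floor, sum(r[j] * (x[d] - mean[d]) ** 2 for r, x in zip(resp, data)) / nj)
                    for d in range(dim)
                ]
        return self

    def score(self, frames):
        """Average per-frame log-likelihood of a test segment."""
        return sum(log_sum_exp(self.component_logs(x)) for x in frames) / len(frames)


def identify(models, frames):
    """Name of the speaker model with the highest likelihood for the segment."""
    return max(models, key=lambda name: models[name].score(frames))


def percent_correct(models, segments):
    """segments: (true_speaker, frames) pairs; correct segments / total segments * 100."""
    correct = sum(identify(models, frames) == speaker for speaker, frames in segments)
    return 100.0 * correct / len(segments)


if __name__ == "__main__":
    rng = random.Random(1)

    def draw(centre, n):  # toy 2-D "feature vectors" around a speaker-specific centre
        return [[rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)] for _ in range(n)]

    models = {"spk_a": DiagonalGMM(2).fit(draw(0.0, 200)),
              "spk_b": DiagonalGMM(2).fit(draw(5.0, 200))}
    tests = [("spk_a", draw(0.0, 20)) for _ in range(5)] + [("spk_b", draw(5.0, 20)) for _ in range(5)]
    print(percent_correct(models, tests))  # 100.0 on this well-separated toy data
```

Each speaker's model is trained only on that speaker's vectors. At test time, log-likelihoods are averaged over the frames of a segment, so a longer segment pools more evidence before the decision. This is consistent with the improvement the thesis reports for 6-second utterances.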
Identification performance is also tested for utterances longer than 3 seconds. With 6-second utterances, performance improves to 96.83 % for MFCC, 97.14 % for UFCC and 97.78 % for BFCC. It can therefore be concluded that computing Gp with the Levinson-Durbin algorithm and combining it with the feature vectors greatly improves the rate of correct speaker identification. Longer utterances further increase the percentage of correct identification.
