Summary of individual chapters: CHAPTER 1: This chapter introduces the field of Automatic Speaker Recognition (ASR). The advantages and applications of this method of authentication are explained in detail, and a comparison with other authentication methods highlights what makes the automatic speaker recognition system unique. A brief note on the two main stages of an automatic speaker recognition system, namely the enrollment and testing stages, is given. Next, automatic speaker recognition is classified into Automatic Speaker Identification (ASI) and Automatic Speaker Verification (ASV).
The concepts of Automatic Speaker Identification and Automatic Speaker Verification are then dealt with in detail, and the two methods are compared to highlight the advantages of each. The common errors, False Acceptance and False Rejection, are discussed, along with their dependence on the decision threshold. The relation between the Equal Error Rate (EER) and the threshold is outlined to derive a system performance index that is useful in the system implementation and testing stages.
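For context on the EER discussion: the equal error rate is the operating point at which the false acceptance rate (FAR) equals the false rejection rate (FRR) as the decision threshold is swept. A minimal illustrative sketch, not the thesis's implementation, of estimating the EER from genuine and impostor trial scores:

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER by sweeping the decision threshold and finding
    where the false acceptance rate and false rejection rate are closest.

    genuine_scores:  match scores for true-speaker trials (higher = better match)
    impostor_scores: match scores for impostor trials
    """
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_gap, eer, eer_threshold = 2.0, None, None
    for t in thresholds:
        far = np.mean(impostor_scores >= t)  # impostors wrongly accepted
        frr = np.mean(genuine_scores < t)    # true speakers wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer, eer_threshold = gap, (far + frr) / 2.0, t
    return eer, eer_threshold
```

Raising the threshold lowers the FAR but raises the FRR; the EER summarizes that trade-off in a single number, which is why it serves as a convenient performance index.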
Finally, another classification of automatic speaker recognition, into Text Dependent and Text Independent classes, is described. With this introduction, we now begin the detailed discussion of the Text Independent Speaker Identification System. CHAPTER 2 TEXT INDEPENDENT SPEAKER IDENTIFICATION SYSTEM: This chapter discusses the theory and methodology behind Text Independent Speaker Identification Systems. It starts with a discussion of the human voice and the speech production mechanism. Modeling the human vocal tract as an acoustic tube brings out the correlation between the physical nature of the vocal tract and the resonant properties of the tube.
This eases the modeling and parameter extraction of the speech signal. A detailed description of voiced and unvoiced sounds, plosives, etc. is given. The next section of the chapter explains the purpose and process of feature extraction from a speaker's speech signal. The prominent feature extraction methods, Linear Prediction Cepstral Coefficients (LPCC), Mel Frequency Cepstral Coefficients (MFCC), Bark Frequency Cepstral Coefficients (BFCC) and Uniform Frequency Cepstral Coefficients (UFCC), are analyzed.
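For reference, the mel scale underlying MFCC warps the frequency axis so that it is roughly linear below about 1 kHz and logarithmic above. A common formulation of the mapping (an assumption here; the thesis may use a variant) is:

```python
import math

def hz_to_mel(f_hz):
    # Common mel-scale formula: approximately linear below 1 kHz,
    # logarithmic above, mirroring perceived pitch spacing.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse mapping, used when placing triangular filterbank edges.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

BFCC and UFCC differ from MFCC essentially in this warping: BFCC uses the Bark scale and UFCC spaces the filters uniformly in frequency.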
The derivation of the LP coefficients by the Yule-Walker method is shown with the aid of a diagram. It is shown that, for better performance of the speaker identification system, LPCC with the Mahalanobis distance measure is preferred. Apart from LPCC, the MFCC, BFCC and UFCC feature extractors are also explained in detail. Under the discussion of Pattern Matching, the template models (Dynamic Time Warping, Vector Quantization) and stochastic models (GMM, HMM) are explained.
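The Yule-Walker derivation mentioned above reduces to solving a Toeplitz system of autocorrelation equations, which is conventionally done with the Levinson-Durbin recursion. A self-contained sketch, illustrative only and not the thesis's code:

```python
import numpy as np

def lpc_levinson(x, order):
    """LP coefficients via the Levinson-Durbin recursion applied to the
    Yule-Walker (autocorrelation) equations. Returns (a, err), where
    a[0] = 1 and x[n] is predicted as -sum_{k>=1} a[k] * x[n-k]."""
    n = len(x)
    # Autocorrelation lags r[0..order] of the (windowed) frame.
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                  # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)            # prediction error shrinks each step
    return a, err
```

The recursion exploits the Toeplitz structure to solve the system in O(p^2) operations rather than O(p^3) for general matrix inversion, which is why it is the standard route from autocorrelations to LP coefficients.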
The concept of Neural Networks for training and testing on speech has also been analyzed. More emphasis is given to the GMM, which gives a smooth approximation to arbitrarily shaped densities. CHAPTER 3 THE DESCRIPTION AND PERFORMANCE OF THE SYSTEM: This chapter describes the implementation of the speaker identification system. The three phases of training, testing and performance evaluation are carried out in detail. The corpus used for evaluation is explained early in the chapter: evaluation is done on the TIMIT database, which contains utterances from around 630 speakers. During the training phase, utterances of 24 seconds were taken.
The feature extraction methods include LPCC, MFCC, BFCC and UFCC. The LPC coefficients are computed using the Levinson-Durbin method and are later converted into cepstral coefficients. In MFCC, BFCC and UFCC, the entire utterance is converted into feature vectors. The thesis uses the GMM and the EM algorithm to model a speaker. The best match is obtained by likelihood calculation. The main performance parameter is the percentage of correct identification. The evaluation is done on TIMIT speech signals with varying SNRs.
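To illustrate the likelihood calculation: a test utterance's feature vectors are scored against each enrolled speaker's GMM, and the speaker with the highest total log-likelihood is declared the best match. The sketch below assumes diagonal-covariance GMMs and hypothetical model parameters; it is not the thesis's implementation:

```python
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Total log-likelihood of feature frames under a diagonal-covariance GMM.

    frames:    (T, D) feature vectors;  weights: (M,) mixture weights
    means:     (M, D) component means;  variances: (M, D) diagonal covariances
    """
    T, D = frames.shape
    diff = frames[:, None, :] - means[None, :, :]                      # (T, M, D)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    exponent = -0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2)  # (T, M)
    log_comp = np.log(weights)[None, :] + log_norm[None, :] + exponent
    # Log-sum-exp over mixture components (numerically stable), then sum frames.
    m = log_comp.max(axis=1, keepdims=True)
    frame_ll = m[:, 0] + np.log(np.sum(np.exp(log_comp - m), axis=1))
    return frame_ll.sum()

def identify(frames, speaker_models):
    """Return the speaker whose GMM assigns the highest likelihood.

    speaker_models: {speaker_id: (weights, means, variances)}
    """
    return max(speaker_models,
               key=lambda s: gmm_loglik(frames, *speaker_models[s]))
```

In practice each speaker's weights, means and variances come from EM training on the enrollment utterances; summing per-frame log-likelihoods assumes the frames are independent given the model, the usual simplification in GMM-based speaker identification.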
The utterances were of 3 and 6 seconds duration, and the feature orders were 8, 10 and 12. It is shown that performance improves when the length of the utterance is increased from 3 to 6 seconds. The combined effect of the feature extractors and the Gp vector also gives better performance in identifying the speakers correctly.