Speech and Speaker Recognition - Assignment Example


Name: xxxxxxxxxxx
Institution: xxxxxxxxxxx
Course: xxxxxxxxxxx
Title: Speech and Speaker Recognition
Tutor: xxxxxxxxxxx
2010

Speech and Speaker Recognition

Introduction

Speech is the most vital form of transferring information, and a great deal of effort has therefore gone into dealing with hindrances to its production. Speeches vary according to their structure. The work of dedicated researchers has led to various discoveries and innovative ways of improving the quality of communication. In this work, speech errors, methods of analysis and examples are all used to bring out the true nature of speech and to provide a stepping stone from which evidence can be gathered about speech-related issues.

Dysarthric speech recognition

Dysarthria is a family of neurogenic speech disorders that affects almost every speech subsystem, including the laryngeal, velopharyngeal and articulatory subsystems. Dysarthric problems result from the disruption of muscular control caused by lesions of the peripheral or central nervous system, which interferes with the transmission of the messages needed for effective motor control of movement; because of these complications it is also grouped among the neuromotor disorders (Enderby, 1983). Dysarthric speech varies in intelligibility as well as in rate of production. Although there are many aspects of dysarthric complications that could be considered, this part deals with approaches and analysis.

Approaches

Various speech recognition approaches are employed in dealing with dysarthric speech. They include the automatic speech recognition system, which is essential in the assessment of dysarthric speech, and the Sy and Horowitz model for determining the link between judgments from naïve listeners and the dynamic time warping response. Discrete hidden Markov models are inaccurate and therefore not suitable for dysarthric assessment (Pidal et al.). Dysarthria assessment using an intelligibility metric has also been considered by Carmichael and Green. The performance of the continuous speech recognition system correlates with the Frenchay Dysarthria Assessment, hence its application here (Luettin, 1997).

Assessment

Dysarthria-related problems mostly take the form of acoustic variations from normal speech. A speech recognition system can be applied for the assessment if the relationship between this variation and the recognition performance of the system is captured. The main issue is to obtain the acoustic variation for the assessment (Chin-Hui & Soong, 1996).

The systems used

Continuous speech recognition system

This system is developed for the assessment of dysarthric speech. The problems encountered when testing dysarthric speech include the variability among different dysarthric speakers and negative accuracy caused by large numbers of insertions. Although the number of phonemes is correctly recognized, the system has the drawback that its response does not always correlate with the dysarthric speech. The extent and location of the damage can affect the prolongation of the phonemes, as the duration analysis of normal and dysarthric speech below shows.

1. Accuracy = 100 − 100 × (deletions + substitutions + insertions) / (total number of phonemes) per cent
2. Duration d = 1 / (1 − a_ii), where a_ii is the self-transition probability of the i-th HMM state

Durational analysis

The duration of each phoneme for each speaker is computed from the time-aligned phonetic transcription.
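As a concrete illustration of these two quantities, the minimal Python sketch below computes the phoneme accuracy from counts of deletions, substitutions and insertions, the expected state duration from an HMM self-transition probability, and per-phoneme durations from a time-aligned transcription. The alignment format and the example numbers are assumptions made for illustration, not values from the Nemours database.

```python
# Minimal sketch (not from the original study): phoneme accuracy and
# duration statistics from a time-aligned phonetic transcription.
# The alignment format (start, end, label) and the numbers are illustrative.

def phoneme_accuracy(n_phonemes, deletions, substitutions, insertions):
    """Accuracy = 100 - 100 * (D + S + I) / N; it can go negative when
    insertions are numerous, as reported for some dysarthric speakers."""
    return 100.0 - 100.0 * (deletions + substitutions + insertions) / n_phonemes

def expected_state_duration(a_ii):
    """Expected duration (in frames) of an HMM state with
    self-transition probability a_ii: d = 1 / (1 - a_ii)."""
    return 1.0 / (1.0 - a_ii)

def phoneme_durations(alignment):
    """alignment: list of (start_sec, end_sec, phoneme_label) tuples."""
    durations = {}
    for start, end, label in alignment:
        durations.setdefault(label, []).append(end - start)
    return durations

if __name__ == "__main__":
    # Hypothetical alignment for one utterance.
    alignment = [(0.00, 0.12, "s"), (0.12, 0.31, "iy"), (0.31, 0.55, "p")]
    print(phoneme_accuracy(n_phonemes=100, deletions=8, substitutions=12, insertions=25))
    print(expected_state_duration(a_ii=0.9))  # 10 frames
    print({p: sum(d) / len(d) for p, d in phoneme_durations(alignment).items()})
```

Note that the accuracy measure can become negative when insertions are numerous, which is exactly the behaviour reported above for dysarthric test speech.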
The phoneme durations of dysarthric speakers are consistently higher than those of normal speakers, indicating an elongation of phoneme duration in dysarthric speech. The duration therefore provides information about the intelligibility of the dysarthric speakers. The normalized variance of phoneme duration ranges from 3 to 20 times greater than that of normal speech. The table below highlights the variance and the intelligibility scores from the Frenchay Dysarthria Assessment (FDA). Column 1 of Table 1 shows the severity ratings of the dysarthric speakers, while the mark * indicates speakers whose variance and intelligibility scores are not correlated.

Table 1: Dysarthric speakers with their word and sentence intelligibility scores, as found in the FDA provided with the Nemours database, and the normalized variances of phoneme duration

Group      Speaker   Word   Sentence   Variance
Mild       mh        8      4          2.7
Mild       bb        4      8          3.6
Mild       fb        -      -          4.1
Mild       ll        4      4          7.7*
Severe     bk        0      0          20.7
Severe     sc        1      1          11.5
Severe     bv        0      2          7.8*
Severe     jf        4      3          17.7
Moderate   rk        4      1          5.3
Moderate   rl        4      3          8.3

Isolated-style phoneme recognition system

The decision metric applied here is the acoustic likelihood of the phonemes for the given models. Its performance correlates well with intelligibility. All the diagrams and experiments illustrate how dysarthric people face diverse challenges; as a result, the disease that affects a person's speech has received a lot of attention from researchers and medical experts. Various methods have been employed to understand the disease so as to provide permanent solutions for patients. The performance of the speech recognition system has greatly helped in understanding velopharyngeal dysfunction.

Performance analysis

This approach suggests that an articulator becomes less active as performance drops. For example, it has been observed that all the articulators of speaker 'bk' are adversely affected, so speech therapy must attend to all the articulators in order to succeed. The articulators of speaker 'mh' are functioning properly, while those of 'rk' are all working moderately with the exception of the bilabial articulator, which proved to be severely affected, hence the need for this speaker to pay attention to lip movement. However, for some speakers the speech recognition system may not correlate with the FDA scores. All these analyses show that articulation may be used to locate exactly where disarticulation occurs in a speaker.

Distributed speech recognition

The European Telecommunications Standards Institute (ETSI) is among the bodies concerned with communication standards and has been heavily involved in research on distributed speech recognition (DSR). Distributed speech recognition is very important in the mobile and telecommunications industry; the most important concerns are front-end processing at the client and tonal-language recognition. There has been tremendous interest in applying automatic speech recognition in communication networks. Bandwidth restrictions and communication errors have been a major setback for any network, especially wireless networks. Distributed speech recognition has eliminated the limitations of bandwidth; the only major remaining issue is the transmission errors that are still present.
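To make the bandwidth argument concrete, the following sketch mimics the DSR split described above: the client extracts a compact feature vector per frame and transmits quantised features instead of raw audio. The log spectral energies used here are only a stand-in for the standardized ETSI front end, and the frame size, feature dimension and quantiser are assumptions for illustration.

```python
# Minimal sketch of the DSR idea (not the ETSI standard front end):
# the client sends quantised per-frame features instead of raw audio,
# which is what removes the bandwidth limitation described above.
import numpy as np

FS = 8000          # sample rate (Hz), assumed
FRAME = 200        # 25 ms frames at 8 kHz
N_FEATS = 13       # features per frame, MFCC-like dimensionality

def client_features(signal):
    """Very rough stand-in for a front end: log spectral energies per frame."""
    n_frames = len(signal) // FRAME
    frames = signal[: n_frames * FRAME].reshape(n_frames, FRAME)
    spectrum = np.abs(np.fft.rfft(frames * np.hanning(FRAME), axis=1))
    # keep the first N_FEATS log-energy bins as the "feature vector"
    return np.log(spectrum[:, :N_FEATS] + 1e-8)

def quantise(features, bits=8):
    """Uniform scalar quantisation of each feature to `bits` bits."""
    lo, hi = features.min(), features.max()
    levels = 2 ** bits - 1
    codes = np.round((features - lo) / (hi - lo) * levels).astype(np.uint8)
    return codes, (lo, hi)

if __name__ == "__main__":
    one_second = np.random.randn(FS)              # stand-in for one second of speech
    feats = client_features(one_second)
    codes, _ = quantise(feats)
    print("raw audio bits/s :", FS * 16)          # 16-bit PCM
    print("feature bits/s   :", codes.size * 8)   # what the client transmits
```

Even with this crude front end, the transmitted feature stream needs only a few kilobits per second, compared with over a hundred kilobits per second for the raw waveform; that saving is what motivates DSR, while its sensitivity to transmission errors remains the open problem noted above.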
Standards have been formed to ensure interoperation between feature extraction on a client device and a compatible recogniser on a remote server; they include those of ETSI, 3GPP and the IETF. They are helpful for commercial speech and multimodal services over mobile networks (Mohan, 1991). In distributed speech recognition the ASR system is distributed between the server and the client. Feature extraction is always performed at the client, and the features are compressed and transmitted to the server over a dedicated channel, where they are decoded and fed into the back end of the ASR system. Research has shown that the performance of DSR is better than that of network speech recognition (NSR), because in the latter the speech is processed for the best perceptual quality, which need not produce optimal recognition performance.

Several schemes have been proposed for compressing the ASR features. One disadvantage, however, is the need to modify the current mobile communication infrastructure for DSR, for instance the need for another dedicated channel to transmit the compressed bitstream of MFCC features. Digalakis et al. evaluated uniform and non-uniform scalar quantisers as well as product-code vector quantisers for MFCC compression at rates between 1.2 and 10.3 kbps, using a greedy bit allocation algorithm (a sketch of this procedure is given at the end of this section). A bit was added to each component in turn and the word error rate evaluated; the component giving the best improvement in recognition performance was selected and allocated the bit, and the procedure continued until all the bits had been allocated. The conclusion was that split vector quantisers were more economical with bits, since they needed fewer bits than scalar quantisers to achieve the same word error rate. The PDF-optimised non-uniform scalar quantisers also gave better results than the uniform scalar quantisers, implying that the PDF of the MFCCs is not uniform. Better performance was also obtained with PDF-optimised scalar quantisation under non-uniform bit allocation than under uniform bit allocation. They concluded that 2 kbps was the bitrate needed for split vector quantisation to achieve the performance of unquantised recognition.

Therefore, based on the evidence from various researchers, distributed speech recognition is an important element of the communication industry, and for this reason much research is still being done to optimise its use. Emphasis should be placed on these studies so as to come up with better methods of dealing with speech-related issues and to be able to handle them properly.
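The greedy bit-allocation procedure described above can be sketched as follows. Evaluating the true word error rate requires a full recogniser, so a standard high-rate quantisation-distortion proxy stands in for it here; the per-coefficient variances and the bit budget are invented for illustration and are not values from Digalakis et al.

```python
# Sketch of greedy bit allocation (illustrative only): repeatedly give one
# more bit to the component whose extra bit improves the objective most.
# A quantisation-distortion proxy replaces the word-error-rate evaluation
# used in the study described above; variances and bit budget are invented.
import numpy as np

def distortion(variance, bits):
    """High-rate approximation: distortion ~ variance * 2^(-2*bits)."""
    return variance * 2.0 ** (-2 * bits)

def greedy_bit_allocation(variances, total_bits):
    bits = np.zeros(len(variances), dtype=int)
    for _ in range(total_bits):
        # improvement obtained by adding one bit to each component
        gain = [distortion(v, b) - distortion(v, b + 1)
                for v, b in zip(variances, bits)]
        best = int(np.argmax(gain))    # component that benefits most
        bits[best] += 1
    return bits

if __name__ == "__main__":
    mfcc_variances = [9.0, 4.0, 2.0, 1.0, 0.5]   # hypothetical per-coefficient variances
    print(greedy_bit_allocation(mfcc_variances, total_bits=12))
```

Each iteration spends one bit on the coefficient whose extra bit reduces the objective most, mirroring the procedure of allocating bits one at a time to the component that gives the best improvement in recognition performance.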
Syllables in speaker identification and verification

Speech sounds are classified into two main categories, voiced and unvoiced. Voiced sounds, for instance /iy/ (as in "see"), are periodic with a harmonic structure that is not present in unvoiced sounds such as /s/, which are noise-like and aperiodic. Voiced sounds are produced when the vocal folds of the larynx are set vibrating by the air expelled from the lungs. This produces a glottal wave with a fundamental frequency f0 and harmonics at multiples of the fundamental frequency, which then passes through the vocal tract, an acoustic tube that begins at the larynx and terminates at the lips (Schroeder, 1998). The changing shape of this tube creates resonances and anti-resonances that emphasize and de-emphasize parts of the spectrum respectively. Formants occur where the resonances of the vocal tract place emphasis on the spectrum. Various quasi-periodic sounds may be formed as the articulators (teeth, lips, jaw and tongue) change position. Unvoiced sounds receive no vocal cord vibration; instead the articulators constrict the vocal tract and air passes through it very fast, producing a noise-like sound (Peacocke & Daryl, 2004). Speech production is considered to be composed of a source component and a filter component. The source component represents the excitation (the periodic voicing of voiced sounds and the aperiodic noise of unvoiced sounds), while the vocal tract, with its resonances and anti-resonances, acts as the filter that emphasizes parts of the spectrum (Fried-Oken, 2000).

Spontaneous speech recognition

Even though a high level of accuracy can be attained in speech recognition when written text is read aloud using the latest speech recognition technology, accuracy is very poor for spontaneous, freely spoken speech. This is because the acoustic and linguistic models for speech recognition are built from read speech or written language, whereas spontaneous speech differs from written language both linguistically and acoustically. Building a large corpus of spontaneous speech is important because knowledge of the structure of spontaneous speech is quite limited. A paradigm shift from recognition to understanding of speech is crucial, in which the content of the speaker's message is extracted rather than every spoken word being transcribed. These perspectives show how intensive research projects have been used to raise the level of technology in spontaneous speech recognition and understanding (Jian-lai, Tian & Chang, 2000).

Since there is little data for language modeling, we built the language model from every available data source. The resources used are mostly the data transcripts from the Fisher training set, Switchboard and HUB4. These data were all normalized with identical processes; spellings were normalized and uniform hyphenation was also considered. The interpolation weights were determined by minimizing perplexity (a sketch of this estimation follows the table). The table below shows the perplexity results on the test data selected for the track; the WER shown here is for the baseline acoustic model.

LM corpora                  Perplexity   WER    RWER
SWB + HUB4 (LM1)            103.235      56.0   2.20
Fisher Part 2               73.09        49.8   1.97
Fisher Part 2 + LM1         72.60        49.5   1.92
Fisher Part 1 & 2 (LM2)     70.11        48.4   1.95
LM2 + LM1 (LM3)             69.85        48.0   1.90

Table 1: Perplexity, WER and repetition WER (RWER) for different language models (%)
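The perplexity-minimizing interpolation weights mentioned above can be estimated with a simple EM-style procedure, sketched below. The per-word probabilities are invented placeholders; in practice they would come from the Switchboard, HUB4 and Fisher language models scored on a held-out transcript.

```python
# Sketch (illustrative): estimate linear-interpolation weights for several
# language models by minimising held-out perplexity with EM-style updates.
# The per-word probabilities are invented; real ones would come from the
# component LMs evaluated on a held-out word sequence.
import math

def interpolation_weights(component_probs, iters=50):
    """component_probs: one list of per-word probabilities per LM,
    all scored on the same held-out word sequence."""
    n_models = len(component_probs)
    n_words = len(component_probs[0])
    weights = [1.0 / n_models] * n_models
    for _ in range(iters):
        # E-step: posterior responsibility of each LM for each word
        counts = [0.0] * n_models
        for w in range(n_words):
            mix = sum(weights[m] * component_probs[m][w] for m in range(n_models))
            for m in range(n_models):
                counts[m] += weights[m] * component_probs[m][w] / mix
        # M-step: renormalise responsibilities into new weights
        weights = [c / n_words for c in counts]
    return weights

def perplexity(weights, component_probs):
    n_words = len(component_probs[0])
    log_prob = sum(math.log(sum(w * p[i] for w, p in zip(weights, component_probs)))
                   for i in range(n_words))
    return math.exp(-log_prob / n_words)

if __name__ == "__main__":
    # Hypothetical per-word probabilities from two LMs on 5 held-out words.
    lm_a = [0.02, 0.001, 0.03, 0.004, 0.01]
    lm_b = [0.005, 0.02, 0.01, 0.03, 0.002]
    w = interpolation_weights([lm_a, lm_b])
    print(w, perplexity(w, [lm_a, lm_b]))
```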
Acoustic model adaptation

The acoustic models, phonetic-decision-tree clustered triphone models with three states, were trained using the Sonic recognition toolkit. The data at the leaves of the decision tree were modelled with Gaussian distributions via a BIC-based procedure and trained using multiple iterations of the EM algorithm; this is our baseline model. The acoustic models are further improved by performing cepstral variance normalization (CVN) and vocal tract length normalization (VTLN). In addition to speaker-independent acoustic models, we also built speaker-adaptive models (SAT). The training was done via a constrained maximum likelihood linear regression (CMLLR) transform on the feature space for each training speaker. The transform is applied to both the means and the variances of the system parameters. After the speakers were transformed, a new canonical acoustic model was computed. In Table 2, (m1) denotes the acoustic model obtained after CVN and VTLN with the language model trained on Switchboard, HUB4 and the Fisher transcripts. The SAT + (m1) model represents speaker adaptation performed on the cepstral-variance and vocal-tract-length normalized acoustic model.

Model                     WER    RWER
Baseline + LM3            48.0   1.90
CVN + VTLN + LM3 (m1)     46.0   1.80
SAT + LM3                 42.8   1.67
SAT + (m1)                42.1   1.66

Table 2: WER and RWER on test data for different acoustic models (%)

Recommendation

Speech is the backbone of information transfer, and the issues discussed above as hindrances should therefore be clearly addressed. The illnesses that affect speech production should be studied, researched and treated in order to avoid the complications they impose on people.

Conclusion

Speech is an essential way of transmitting information, and it therefore affects both the people who give information and those who receive it. Much research has been carried out to determine the causes and effects of speech problems, and the results have been gathered successfully.

Bibliography

Enderby, P. M., 1983, Frenchay Dysarthria Assessment, College Hill Press, London.
Mohan, S., 1991, Source and Channel Coding, Kluwer Academic, USA.
Fried-Oken, M., 2000, Speaking Up and Spelling It Out, Paul H. Brookes Publishing Co., Baltimore.
Peacocke, R. D. & Daryl, H. G., 2004, An Introduction to Speech and Speaker Recognition, Bell-Northern Research, Austria.
Schroeder, M. R., 1998, Speech and Speaker Recognition, Karger Publishers, Sydney.
Chin-Hui, L. & Soong, F. K., 1996, Automatic Speech and Speaker Recognition: Advanced Topics, Springer, U.K.
Luettin, J., 1997, Visual Speech and Speaker Recognition, University of Sheffield, Sheffield.
Jian-lai, Z., Tian, Y. & Chang, E., 2000, Tone Articulation Modeling for Mandarin Spontaneous Speech, McGraw Hill, Sydney.