The goal of this dissertation is to investigate methods for word prominence detection in speech. In human communication, prosodic cues such as word prominence play an important role: we emphasize words to mark them as important and to indicate the informational focus of a sentence. Current speech recognition systems do not use this information, which makes them less intuitive and more error-prone. In this thesis, a system that distinguishes prominent from non-prominent words is presented. Several feature choices in the audio and video domains are investigated, and several classifiers with different characteristics are examined. One aspect evaluated here is the use of context information at the feature level as well as at the classifier level. It is shown that the neighboring words carry a substantial amount of information, so the whole sequence should be used for classification. The study is especially concerned with the performance difference between systems trained in a speaker-dependent and in a speaker-independent manner. To overcome the variation across a pool of speakers and the resulting performance loss, a new adaptation method is presented. Common speaker adaptation methods used in speech processing are designed for classifiers based on Gaussian Mixture Models and Hidden Markov Models. This thesis shows that, for word prominence detection, a discriminative classifier such as the Support Vector Machine performs best, but such classifiers have not yet been combined adequately with common speaker adaptation methods. Therefore, a new method based on Support Vector Machines with a Radial Basis Function kernel is presented and evaluated, together with two extensions. Ultimately, the thesis shows that this method can significantly improve performance for speaker-independent classification when only a small amount of speaker-specific data is available.
Andrea Schnall
