Mel Frequency Cepstral Coefficients (MFCC) have been the standard for voice parameterization over the last ten years. Such parameterization scheme works fine in speech recognition given its reduced size and its effectiveness to capture the most essential features of a speech signal.
For synthesis on the other hand, MFCC do not seem to create a convincing voice model on the long term. The author and colleagues investigated such aspect through Mean Opinion Score (MOS) tests, were the listeners had to grade from 0 to 5 phrases produced on the first place by a human speaker followed by a synthesized version of the same phrase using MFCC as speech parameterization (Franco, et. Al. 2016). The results were above the average in both intelligibility and naturalness (3.44 and 3.07 respectively) but it proves that room for improvement exists.
A new study took place (Franco, Herrera 2016). For this study, a different parameterization was included and compared to the previous study where MFCC were used. The chosen parameterization was Linear Spectral Pair (LSP), the theory behind it can be found in a paper by McLoghlin (2008). The authors decided to test it for two main reasons: First, LSP is based on the Linear Predictive Coding (LPC) voice parameterization which models the human vocal tract as a filter. The spectra obtained based on vocal tract models tend to resemble natural speech remarkably. The second reason to use it, comes from its high compatibility with the synthesizer used by the authors which is based on the work of Tokuda and colleagues: Hidden Markov Models as Text to Speech synthesis HTS. That system is pre-programmed with a general parametrization which can be decomposed in both MFCC and LSP (Tokuda, et. al. 1994).
The results from that study averaged 3.4 in naturalness and 3.6 in intelligibility, the numbers are close to those obtained with the MFCC parameterization but the variance found in the grades given by the listeners in the LSP study is much less than the variance found in the MFCC study. This shows that the opinions among the LSP study listeners are much more consistent, therefore the quality of the synthesis is better in general.
Another conducted test was to input the synthesized speech in a Recognition system for forensics application. It was then compared with the original speaker whose voice the system was based on. It showed 0.0072 between original speech and synthesized speech. The closer the distance is to zero, the closer is the synthesis to an ideal case where no difference would appear between artificial and natural speech.
In terms of size LSP speech parameterization files are smaller than MFCC parameterization files this reduction can be important in terms of data transferring and data storing economization.
The authors consider LSP speech parameterization as a new standard in future studies in Laboratorio de Tecnologías del Habla FI UNAM.
References
Franco, C., Herrera Camacho, A. and Del Rio Avila, F. “Conference Proceedings: Speech Synthesis of Central Mexico Spanish using Hidden Markov Models” Athens: ATINER´S Conference paper Series, No: COM2016-2071, 3-12, 2016, Athens Institute for Education and Research
Franco Galván, Carlos Angel. Herrera Camacho, José Abel. Del Río Ávila, Fernando. “Conference Proceedings: Síntesis de Voz en Español hablado en el Centro de México Utilizando MFCC´s y LSP´s” IEEE ROC&C , 2016
McLoughlin, I. V. (2008). Line spectral pairs. Signal processing, 88(3), 448-467.
Tokuda, K., Kobayashi, T., Masuko, T., & Imai, S. (1994, September). Mel-generalized cepstral analysis-a unified approach to speech spectral estimation. In ICSLP (Vol. 94, pp. 18-22).