Speech Synthesis using HMM

Mel Frequency Cepstral Coefficients (MFCC) have been the standard for voice parameterization over the last ten years. This parameterization scheme works well in speech recognition thanks to its compact size and its effectiveness at capturing the most essential features of a speech signal.

For synthesis, on the other hand, MFCC do not seem to produce a convincing voice model in the long term. The author and colleagues investigated this through Mean Opinion Score (MOS) tests, where listeners graded, on a scale from 0 to 5, phrases produced first by a human speaker and then a synthesized version of the same phrase that used MFCC as the speech parameterization (Franco et al. 2016). The results were above average in both intelligibility and naturalness (3.44 and 3.07 respectively), but they show that there is room for improvement.

A new study followed (Franco, Herrera 2016). It introduced a different parameterization and compared it with the MFCC-based one from the previous study. The chosen parameterization was Line Spectral Pairs (LSP); the theory behind it can be found in a paper by McLoughlin (2008). The authors decided to test it for two main reasons. First, LSP is based on the Linear Predictive Coding (LPC) voice parameterization, which models the human vocal tract as a filter, and spectra obtained from vocal tract models tend to resemble natural speech remarkably well. Second, it is highly compatible with the synthesizer used by the authors, which is based on the work of Tokuda and colleagues: the HMM-based text-to-speech synthesis system HTS. That system is pre-programmed with a general parameterization that can be reduced to either MFCC or LSP (Tokuda et al. 1994).
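To make the link between LPC and LSP concrete, here is a minimal sketch of the textbook conversion: the LPC polynomial A(z) is split into a symmetric polynomial P(z) and an antisymmetric polynomial Q(z), and the line spectral frequencies are the angles of their roots on the unit circle. This is only an illustration under stated assumptions, not the routine used inside HTS. The function name `lpc_to_lsf`, the grid-and-bisection root search, and the example coefficients are all hypothetical choices made for this sketch.

```python
import math


def lpc_to_lsf(a, grid=1024, iterations=60):
    """Convert LPC coefficients [1, a_1, ..., a_p] to line spectral
    frequencies in radians, sorted in ascending order."""
    p = len(a) - 1
    ext = list(a) + [0.0]
    # Symmetric polynomial P(z) and antisymmetric polynomial Q(z).
    p_coef = [x + y for x, y in zip(ext, reversed(ext))]
    q_coef = [x - y for x, y in zip(ext, reversed(ext))]
    mid = (p + 1) / 2.0

    # On the unit circle, P and Q reduce (up to a unit-modulus factor)
    # to these real-valued functions of the angle w.
    def p_real(w):
        return sum(c * math.cos(w * (k - mid)) for k, c in enumerate(p_coef))

    def q_real(w):
        return sum(c * math.sin(w * (k - mid)) for k, c in enumerate(q_coef))

    def zeros(f):
        # Scan the open interval (0, pi) so the trivial roots at
        # z = 1 and z = -1 are skipped, then refine by bisection.
        found = []
        ws = [math.pi * i / grid for i in range(1, grid)]
        for lo, hi in zip(ws, ws[1:]):
            f_lo, f_hi = f(lo), f(hi)
            if f_lo == 0.0:
                found.append(lo)
            elif f_lo * f_hi < 0.0:
                for _ in range(iterations):
                    m = 0.5 * (lo + hi)
                    f_m = f(m)
                    if f_lo * f_m <= 0.0:
                        hi = m
                    else:
                        lo, f_lo = m, f_m
                found.append(0.5 * (lo + hi))
        return found

    return sorted(zeros(p_real) + zeros(q_real))


# Hypothetical second-order example: A(z) = 1 - 0.9 z^-1 + 0.2 z^-2
lsf = lpc_to_lsf([1.0, -0.9, 0.2])
print(lsf)
```

A fixed grid can miss two frequencies that fall inside the same grid cell, so a production implementation would normally use a more robust root finder. The sketch is only meant to show that an order-p LPC filter yields p ordered frequencies between 0 and π.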

The results from that study averaged 3.4 in naturalness and 3.6 in intelligibility. These numbers are close to those obtained with the MFCC parameterization, but the variance of the listeners' grades in the LSP study is much lower than in the MFCC study. This shows that opinions among the LSP study listeners are much more consistent, which suggests that the quality of the synthesis is better in general.
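The role of variance here can be illustrated with a small sketch. The grades below are invented purely for illustration and are not data from either study. They are chosen so that two sets of listeners give the same mean score while one set agrees far more closely than the other. The helper name `summarize` is also hypothetical.

```python
from statistics import mean, variance


def summarize(scores):
    """Return the mean opinion score and the sample variance of the grades."""
    return mean(scores), variance(scores)


# Hypothetical listener grades on the 0 to 5 MOS scale (not data from the study).
spread_out = [1, 5, 2, 5, 4, 3, 5, 2]
consistent = [3, 4, 3, 4, 3, 4, 3, 3]

for name, scores in (("spread_out", spread_out), ("consistent", consistent)):
    mos, var = summarize(scores)
    print(f"{name}: MOS = {mos:.2f}, variance = {var:.2f}")
```

Both lists have the same mean, so the MOS alone cannot tell them apart. Only the variance shows that the second group of listeners is in much closer agreement.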

In another test, the synthesized speech was fed into a speaker recognition system for forensic applications and compared with the original speaker on whose voice the system was based. The system reported a distance of 0.0072 between the original and the synthesized speech. The closer the distance is to zero, the closer the synthesis is to the ideal case, in which there would be no difference between artificial and natural speech.
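The post does not say which distance the forensic system reports, so the following is only a generic sketch of the idea that a distance of zero means two voices are indistinguishable. It assumes, purely for illustration, that each voice is summarized by a feature vector and that the vectors are compared with cosine distance. The function name and the example vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical feature vectors (assumed values, not measurements).
original = [1.0, 2.0, 3.0]
synthetic = [1.1, 1.9, 3.0]

print(cosine_distance(original, original))   # identical vectors: distance near 0
print(cosine_distance(original, synthetic))  # similar vectors: small positive distance
```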

In terms of size, LSP speech parameterization files are smaller than MFCC parameterization files. This reduction can matter for data transfer and storage.

The authors consider LSP speech parameterization a new standard for future studies at the Laboratorio de Tecnologías del Habla, FI UNAM.

References

Franco, C., Herrera Camacho, A., & Del Rio Avila, F. (2016). Speech Synthesis of Central Mexico Spanish using Hidden Markov Models. ATINER's Conference Paper Series, No: COM2016-2071, 3-12. Athens: Athens Institute for Education and Research.

Franco Galván, C. A., Herrera Camacho, J. A., & Del Río Ávila, F. (2016). Síntesis de Voz en Español hablado en el Centro de México Utilizando MFCC's y LSP's. IEEE ROC&C.

McLoughlin, I. V. (2008). Line spectral pairs. Signal Processing, 88(3), 448-467.

Tokuda, K., Kobayashi, T., Masuko, T., & Imai, S. (1994, September). Mel-generalized cepstral analysis-a unified approach to speech spectral estimation. In ICSLP (Vol. 94, pp. 18-22).

On November 28, 2016, in Acapulco, Guerrero, we presented a talk that came out of our doctoral research. It covered the parameterization used in our Spanish speech synthesis system. The talk made a good impression on those who attended, and two possible collaborations between institutions came out of it. Our abstract is as follows:

The authors have developed an HMM-based synthesizer for the Spanish spoken in central Mexico. In this work, the LSP parameterization is adapted for this synthesizer. In addition, two speech parameterizations are compared: MFCC and LSP. Both schemes are implemented as part of the aforementioned synthesizer. Phrases were synthesized with each parameterization, and the corresponding MOS tests were carried out to assess their quality. Both MFCC and LSP exceed the average grade, but there is still room for improvement.

The full paper can be downloaded here: cm-02

