Digital Speech Synthesis

Introduction

You belong on this page if you understand sound-synthesis instruments and notelists, and if you wish to understand how MUSIC-N style software sound synthesis can emulate human vocal sounds.

The earliest example of computer-synthesized singing known to me is a 1961 rendition of a male human voice singing the chorus of Henry Dacre's 1892 “Daisy Bell” (popularly known as “A Bicycle Built for Two”). This early example was created at Bell Labs, and the results can be heard on YouTube. The example has an accompaniment generated by Max Mathews, no doubt using one of the MUSIC-N series of programs. However, MUSIC-N does not seem to have played a role in the actual speech synthesis.

The goal of this page sequence is to use the Sound engine to re-synthesize “Daisy Bell”. The objective is synthesis informed by acoustics, which means that wherever possible we want the assembly of sound-synthesis components (oscillators, noise generators, filters, etc.) to be explainable in terms of the vocal tract and its resonant cavities. However, this will be possible only up to a point: the later pages on fricatives and plosives rely upon sound spectra which are produced analytically, without reference to any physical model. This exercise intentionally excludes techniques such as sampling and linear predictive coding. Although these techniques produce vocal sounds which are much more realistic than those which will be achieved here, using them leaves the user with very little insight into the nature of a sound: its spectral peaks and valleys, and how the sound unfolds over time.

I undertook this exercise as a reality check on the Sound engine. It should have been straightforward; after all, computers have been doing speech synthesis at least since 1961. My major professor at UB, Lejaren Hiller, had always maintained that the MUSIC-N programs (from which the Sound engine descends) were justified to Bell Labs as speech-synthesis platforms. Plus I had a very promising resource in Dennis H. Klatt's 1980 article “Software for a cascade/parallel formant synthesizer” (J Acoust Soc Am 67, pp. 971-995), which is available on this site in PDF format. Klatt's article described solutions to pretty much all of the challenges posed by speech synthesis, and it even provided tables of specific parameter settings used to produce specific phonemes.

Whether or not Hiller was historically correct about MUSIC-N, I cannot say. However, I have personally concluded that MUSIC-N's note-parameter model is inadequate for speech synthesis, for several reasons:

  1. There can be any number of phonemes in a word, and the transitions between phonemes are rarely discrete. This means that the properties of phonemes (e.g. the peak frequencies of vocal resonances) cannot be described using discrete parameters. Rather, each property must be described as a sequence of segments, combining steady-state plateaus with transitional ramps.
  2. Some phonemes are pitched while some are aspirated. Still other phonemes combine pitch with aspiration. This suggests that sound-synthesis instruments should be able to respond to conditions which change over the duration of a note. However, making instruments conditional means that an instrument wastes a lot of time generating signals which it isn't actually going to use. Better if instruments could be designed modularly, with the modules being activated only when needed.
  3. The bank of resonances and antiresonances required to produce a phoneme varies, depending on whether the phoneme is a vowel, a nasal, a fricative, or whatever. This third reason resembles reason #2, except that the vocal sources act in parallel while the resonances and antiresonances act in cascade (that is, in sequence). The solution to reason #3 is the same as the solution to reason #2: modular instrument design.

In listing these reasons why MUSIC-N's note-parameter model is inadequate, I have also suggested ways of enhancing the model so that speech synthesis becomes feasible. How my Sound engine implements these enhancements is detailed elsewhere, but in a nutshell it involves voices, which provide scope for passing signals between instruments, and contours, which allow segment-by-segment description of control signals and whose information is accessible to any note, so long as the note's voice ID matches the contour's voice ID.
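
To make the idea of segment-by-segment description concrete, here is a minimal Java sketch of a contour assembled from plateaus and ramps. The class and method names are hypothetical and do not reflect the Sound engine's actual implementation, and interpolation within a ramp is shown as linear purely for simplicity.

// Hypothetical sketch of a contour built from segments. A plateau is a
// segment whose origin equals its goal; a ramp moves from origin to goal
// over the segment's duration.
import java.util.ArrayList;
import java.util.List;

final class ContourSketch {
    private record Segment(double origin, double goal, double duration) {}

    private final List<Segment> segments = new ArrayList<>();

    void addPlateau(double value, double duration) {
        segments.add(new Segment(value, value, duration));
    }

    void addRamp(double origin, double goal, double duration) {
        segments.add(new Segment(origin, goal, duration));
    }

    // Value at time t (seconds from contour start); linear within a segment.
    double valueAt(double t) {
        for (Segment s : segments) {
            if (t <= s.duration()) {
                return s.origin() + (s.goal() - s.origin()) * (t / s.duration());
            }
            t -= s.duration();
        }
        return segments.get(segments.size() - 1).goal();  // hold the final value
    }

    public static void main(String[] args) {
        // A first-formant trajectory for a hypothetical vowel-to-vowel transition.
        ContourSketch f1 = new ContourSketch();
        f1.addPlateau(730.0, 0.15);        // steady state of the first vowel
        f1.addRamp(730.0, 270.0, 0.08);    // transitional ramp
        f1.addPlateau(270.0, 0.20);        // steady state of the second vowel
        System.out.println(f1.valueAt(0.19));  // midway through the ramp: 500.0
    }
}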

Reality check indeed!

Not being affiliated with any academic institution, I have no ready access to a university library or to the archival services employed by academic journals. As such, my sources have been limited to information publicly available on the internet, and to old books I have kicking around. While Klatt's solution did not prove as helpful as I had hoped, I found enough information elsewhere to make up the deficit. The resource pages on Speech Acoustics developed by Robert Mannell and Felicity Cox have been particularly helpful to me, and should be considered suggested reading even though the pages are written in Australian. Another useful overview is available in the series of lecture slides prepared by James Kirby, evidently for presentation in Hanoi. I have also had recourse to two period books: J.L. Flanagan's 1977 Speech Analysis Synthesis and Perception and a 1973 collection edited by Minifie, Hixon, and Williams, Normal Aspects of Speech, Hearing, and Language, particularly Minifie's contribution on “Speech Acoustics” (pp. 235-284).

Overview

We start with melody and lyrics. Sheet music for “Daisy Bell” is available on the web, for example at www.free-scores.com. We are concerned just with the chorus, ignoring both the verses and the accompaniment.

The ultimate rendition of “Daisy Bell” will be realized over several iterations, with new speech-synthesis techniques being introduced as needed.

When I began this exercise, I initially coded the note lists by hand. As complexity increased owing to the addition of new phoneme categories, manual preparation became increasingly tedious and less practical. The tedium reflected back on earlier iterations whenever changes in policy (e.g. how notes should be articulated) or in instrument design forced re-coding of the earlier listings. During the preparation of Iteration #5 (which employs separate notes to shape different spectral features of fricative noise sounds) the manual coding got to be too much. I had already developed Java procedures to try out individual words, but from this point on I undertook to write procedures which generated note lists for entire phrases, with the iteration number as a parameter. By making use of start statements to offset note starting-times, it was possible to generate note lists which could be tested individually, then pasted into larger iteration lists.
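
The sketch below suggests one way such phrase-level generation can be organized in Java. The emitted text is purely a placeholder: the start-statement syntax shown is an assumption, and a real word fragment would contain the full stack of note statements. Only the structural point is intended, namely that each word fragment is self-contained and a start statement positions it within the phrase.

// Illustrative sketch only; the emitted syntax is hypothetical. Each word is
// generated as a self-contained fragment, and a start statement offsets that
// fragment to its place in the phrase, so fragments can be auditioned alone
// or pasted together into a larger iteration list.
import java.util.List;

final class PhraseSketch {
    // Placeholder for a per-word generator; a real one would emit the full
    // note stack (source, power capture, formant filters, envelope restore).
    static String wordFragment(String word, double duration) {
        return "# notes for \"" + word + "\" lasting " + duration + " seconds\n";
    }

    static String phrase(List<String> words, List<Double> durations) {
        StringBuilder sb = new StringBuilder();
        double offset = 0.0;
        for (int i = 0; i < words.size(); i++) {
            sb.append("start ").append(offset).append('\n');  // hypothetical syntax
            sb.append(wordFragment(words.get(i), durations.get(i)));
            offset += durations.get(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(phrase(List.of("Daisy", "Daisy"), List.of(0.9, 0.9)));
    }
}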

A Speech-Synthesis Orchestra

The orchestra used for the present exercise is implemented in the file SpeechOrch.xml. If you have access to the Sound application, and you intend to replicate the sound-synthesis runs on your own, you should download SpeechOrch.xml into your working directory.

The speech-synthesis orchestra extends many ideas about modular instrument design developed in the page on Synthesizing Noise Sounds. The orchestra defines two voices, with a separate stack of instruments operating within each voice. Both instrument stacks implement versions of the source-filter model of speech production. Voice #1 is for vowels and vowel-like consonants, while voice #2 is for fricatives and other mostly noisy sounds. In either case, it takes several notes to generate a single sound, and information is passed between notes through two intra-voice signals. Signal G1 transmits the audio signal while signal G2 transmits the power envelope.

The orchestra is monophonic, which means that although vowel-like and noisy sounds are produced along very different synthesis streams, they are both heard as coming from the same physical location. Keep this in mind should you choose to adapt this orchestra for stereophonic synthesis: The voice entity as implemented within the Sound engine is a construct affecting the scope of contours and signals. It may often equate to a musical “voice”, but it doesn't have to.

If you have access to the Sound application, and want to know intimate details about the speech-synthesis orchestra, you can always view SpeechOrch.xml in Sound's Orchestra Editor.

Note-List Heading

orch /Users/charlesames/Scratch/SpeechOrch.xml
set rate 44100
set bits 16
set norm 1
Listing 1: Note-list header for “Daisy Bell”.

Listing 1 presents the note-list header shared by all “Daisy Bell” iterations. The first text line in the header is the orch statement, which indicates that the listed notes will be synthesized using SpeechOrch.xml. You'll need to adapt this statement to reflect your own working directory. Of the remaining statements, set rate sets the sampling rate to 44100, which is the standard for audio CDs and permits synthesis of frequencies across the full range of human hearing, while set bits sets the ultimate sound quantization level to 16 bits. The set norm statement causes two passes through the data: the first pass saves all samples with 32-bit accuracy in a temporary file, while the second pass rescales these samples to optimize the signal-to-noise ratio.
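
The sketch below illustrates the general idea behind such a normalization pass; it is not the Sound engine's actual code. The peak magnitude of the high-precision samples is found first, and every sample is then rescaled so that the peak just reaches full scale before 16-bit quantization.

// Conceptual two-pass normalization sketch (not the Sound engine's code):
// pass 1 finds the peak magnitude of the high-precision samples; pass 2
// rescales so the peak maps to full scale, then quantizes to 16 bits.
final class NormalizeSketch {
    static short[] normalizeTo16Bit(float[] samples) {
        float peak = 0.0f;
        for (float s : samples) {                    // pass 1: find the peak
            peak = Math.max(peak, Math.abs(s));
        }
        float scale = (peak > 0.0f) ? 32767.0f / peak : 0.0f;
        short[] out = new short[samples.length];
        for (int i = 0; i < samples.length; i++) {   // pass 2: rescale and quantize
            out[i] = (short) Math.round(samples[i] * scale);
        }
        return out;
    }

    public static void main(String[] args) {
        short[] q = normalizeTo16Bit(new float[] {0.01f, -0.25f, 0.125f});
        System.out.println(q[1]);  // -32767: the loudest sample hits full scale
    }
}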

Contours

SpeechOrch.xml defines five contours and declares all five accessible both to voice #1 and to voice #2. The five contours are:

  1. Amplitude, permitting values from 32 to 16384.
  2. Frequency, permitting values from 16 to 1378 (Nyquist limit divided by 16).
  3. Formant 1, permitting values from 100 to 3500.
  4. Formant 2, permitting values from 100 to 3500.
  5. Formant 3, permitting values from 100 to 3500.

All five contours have the exponential calculation mode, which means that transitions from origins to goals proceed along equal-ratio curves.
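
Assuming the conventional meaning of an exponential transition, the value covers equal ratios of the origin-to-goal interval in equal spans of time, so that a frequency glide passes through the geometric mean at its halfway point. The following sketch (with hypothetical method names) shows the calculation:

// Exponential (equal-ratio) interpolation between an origin and a goal.
// At fraction u of the transition (0 <= u <= 1), the value is
// origin * (goal / origin)^u, so equal time steps multiply the value
// by equal ratios rather than adding equal differences.
final class ExponentialRampSketch {
    static double exponential(double origin, double goal, double u) {
        return origin * Math.pow(goal / origin, u);
    }

    public static void main(String[] args) {
        // Halfway through a frequency glide from 100 Hz to 400 Hz:
        System.out.println(exponential(100.0, 400.0, 0.5));  // 200.0, the geometric mean
    }
}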

To avoid foldover it is necessary to ensure that no harmonic of any tone exceeds the Nyquist limit. The upper limit for Contour #2: Frequency is calculated to accommodate a pulse waveform whose uppermost harmonic is 16 times the frequency of the waveform's fundamental: (44100 ÷ 2) ÷ 16 ≈ 1378 Hz. Frequencies in “Daisy Bell” range from 146.8 Hz. (D3) to 293.7 Hz. (D4), all well below this calculated limit.

Additional due diligence would verify that the harmonics of any tone used will cover the highest formant regions. A table of vowel formant frequencies will be provided for Iteration #2, but for now it is sufficient to know that the highest F3 value listed in that table is 3079 Hz. The lowest pitch in “Daisy Bell” is 146.8 Hz. (D3), whose highest harmonic lies at 16 × 146.8 ≈ 2349 Hz., short of that 3079 Hz. figure.

Voice #1: Instrument Stack for Vowels and Vowel-like Consonants

The stack for voice #1 has four categories of swappable component. It takes at least four notes to make a sound using the voice #1 stack:

Thus to produce a vowel sound, you would use a quartet of note statements. The first note in the quartet will invoke instrument #101 to generate a pitched tone. The second note will invoke instrument #119 to capture the RMS power envelope. The third note will invoke instrument #121 to apply the resonant characteristics of the vocal tract to the pitched tone. The fourth note will invoke instrument #199 to restore the envelope captured by instrument #119.

Instrument #122 always works in conjunction with instrument #121. Thus synthesizing the word “manor” would require a stack of six notes invoking instrument numbers 101, 119, 121, 122, 122, and 199 respectively. The notes for instruments 101, 119, 121, and 199 would start and end simultaneously, lasting for the entire duration of the word. The first note for instrument 122 would last for the duration of the m sound, while the second note for instrument 122 would last for the duration of the n sound.
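
To make these timing relationships concrete, the following sketch lays out the six-note stack for “manor” as plain (instrument, start, duration) data. The helper names and durations are hypothetical, and this is not the Sound application's note-list syntax; it merely mirrors the scheduling described above.

// Hypothetical data sketch (not actual note-list syntax): the six notes for
// "manor" as (instrument, start, duration) triples. Instruments 101, 119,
// 121, and 199 span the whole word; the two instrument 122 notes cover the
// m and n segments respectively.
import java.util.List;

final class ManorStackSketch {
    record NoteSpec(int instrument, double start, double duration) {}

    static List<NoteSpec> manorStack(double wordStart, double wordDuration,
                                     double mDuration, double nStart, double nDuration) {
        return List.of(
            new NoteSpec(101, wordStart, wordDuration),  // pitched source
            new NoteSpec(119, wordStart, wordDuration),  // RMS power capture
            new NoteSpec(121, wordStart, wordDuration),  // vocal-tract resonances
            new NoteSpec(122, wordStart, mDuration),     // note for the m sound
            new NoteSpec(122, nStart, nDuration),        // note for the n sound
            new NoteSpec(199, wordStart, wordDuration)   // envelope restoration
        );
    }

    public static void main(String[] args) {
        // Illustrative durations only.
        manorStack(0.0, 0.6, 0.1, 0.3, 0.1).forEach(System.out::println);
    }
}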

Voice #2: Instrument Stack for Noises

The stack for voice #2 also has four categories of swappable component. In this case the design is more directly influenced by NoiseOrch.xml. It again takes four or (usually) more notes to make a sound using the voice #2 stack:

Next topic: Melody

© Charles Ames Page created: 2014-02-20 Last updated: 2015-07-12