Digital Speech Synthesis

Introduction

You belong on this page if you understand sound-synthesis instruments and notelists, and if you wish to understand how MUSIC-N style software sound synthesis can emulate human vocal sounds.

The earliest example of computer-synthesized singing known to me is a 1961 rendition of a male human voice singing the chorus of Henry Dacre's 1892 “Daisy Bell” (popularly known as “A Bicycle Built for Two”). This early example was created at Bell Labs, and the results can be heard on YouTube. The example has an accompaniment generated by Max Mathews, no doubt using one of the MUSIC-N series of programs. However, MUSIC-N does not seem to have played a role in the actual speech synthesis.

The goal of this page sequence is to use the Sound engine to re-synthesize “Daisy Bell”. The objective is synthesis informed by acoustics, which means that wherever possible we want the assembly of sound-synthesis components (oscillators, noise generators, filters, etc.) to be explainable in terms of the vocal tract and its resonant cavities. However, this will be possible only up to a point: the later pages on fricatives and plosives rely upon sound spectra which are produced analytically, without reference to any physical model. This exercise intentionally excludes techniques such as sampling and linear predictive coding. Although these techniques produce vocal sounds which are much more realistic than those which will be achieved here, using them leaves the user with very little insight into the nature of a sound: its spectral peaks and valleys, and how the sound unfolds over time.

I undertook this exercise as a reality check on the Sound engine. It should have been straightforward; after all, computers have been doing speech synthesis at least since 1961. My major professor at UB, Lejaren Hiller, had always maintained that the MUSIC-N programs (from which the Sound engine descends) were justified to Bell Labs as speech-synthesis platforms. Plus I had a very promising resource in Dennis H. Klatt's 1980 article “Software for a cascade/parallel formant synthesizer” (J Acoust Soc Am 67, pp. 971-995), which is available on this site in PDF format. Klatt's article described solutions to pretty much all of the challenges posed by speech synthesis, and it even provided tables of specific parameter settings used to produce specific phonemes.

Whether or not Hiller was historically correct about MUSIC-N, I cannot say. However, I have personally concluded that MUSIC-N's note-parameter model is inadequate for speech synthesis, for several reasons:

  1. There can be any number of phonemes in a word, and the transitions between phonemes are rarely discrete. This means that the properties of phonemes (e.g. the peak frequencies of vocal resonances) cannot be described using discrete parameters. Rather, each property must be described as a sequence of segments, combining steady-state plateaus with transitional ramps.
  2. Some phonemes are pitched while some are aspirated. Still other phonemes combine pitch with aspiration. This suggests that sound-synthesis instruments should be able to respond to conditions which change over the duration of a note. However, making instruments conditional means that an instrument wastes a lot of time generating signals which it isn't actually going to use. Better if instruments could be designed modularly, with the modules being activated only when needed.
  3. The bank of resonances and antiresonances required to produce a phoneme varies, depending on whether the phoneme is a vowel, a nasal, a fricative, or whatever. This third reason resembles reason #2, except that the vocal sources act in parallel while the resonances and antiresonances act in cascade (that is, in sequence). The solution to reason #3 is the same as the solution to reason #2: modular instrument design.

In listing these reasons why MUSIC-N's note-parameter model is inadequate, I have also suggested ways of enhancing the model so that speech synthesis becomes feasible. How my Sound engine implements these enhancements is detailed elsewhere, but in a nutshell it involves voices, which provide scope for passing signals between instruments, and contours, which allow segment-by-segment description of control signals and whose information is accessible to any note, so long as the note's voice ID matches the contour's voice ID.
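
To make the idea of segment-by-segment description concrete, here is a minimal Java sketch of a contour assembled from plateaus and ramps. The class and method names are hypothetical and do not reflect the Sound engine's actual implementation, and interpolation within a ramp is shown as linear purely for simplicity.

// Hypothetical sketch of a contour built from segments. A plateau is a
// segment whose origin equals its goal; a ramp moves from origin to goal
// over the segment's duration.
import java.util.ArrayList;
import java.util.List;

final class ContourSketch {
    private record Segment(double origin, double goal, double duration) {}

    private final List<Segment> segments = new ArrayList<>();

    void addPlateau(double value, double duration) {
        segments.add(new Segment(value, value, duration));
    }

    void addRamp(double origin, double goal, double duration) {
        segments.add(new Segment(origin, goal, duration));
    }

    // Value at time t (seconds from contour start); linear within a segment.
    double valueAt(double t) {
        for (Segment s : segments) {
            if (t <= s.duration()) {
                return s.origin() + (s.goal() - s.origin()) * (t / s.duration());
            }
            t -= s.duration();
        }
        return segments.get(segments.size() - 1).goal();  // hold the final value
    }

    public static void main(String[] args) {
        // A first-formant trajectory for a hypothetical vowel-to-vowel transition.
        ContourSketch f1 = new ContourSketch();
        f1.addPlateau(730.0, 0.15);        // steady state of the first vowel
        f1.addRamp(730.0, 270.0, 0.08);    // transitional ramp
        f1.addPlateau(270.0, 0.20);        // steady state of the second vowel
        System.out.println(f1.valueAt(0.19));  // midway through the ramp: 500.0
    }
}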

Reality check indeed!

Not being affiliated with any academic institution, I have no ready access to a university library or to the archival services employed by academic journals. As such, my sources have been limited to information publicly available on the internet, and to old books I have kicking around. While Klatt's solution did not prove as helpful as I had hoped, I found enough information elsewhere to make up the deficit. The resource pages on Speech Acoustics developed by Robert Mannell and Felicity Cox have been particularly helpful to me, and should be considered suggested reading even though the pages are written in Australian. Another useful overview is available in the series of lecture slides prepared by James Kirby, evidently for presentation in Hanoi. I have also had recourse to two period books: J.L. Flanagan's 1977 Speech Analysis Synthesis and Perception and a 1973 collection edited by Minifie, Hixon, and Williams, Normal Aspects of Speech, Hearing, and Language, particularly Minifie's contribution on “Speech Acoustics” (pp. 235-284).

Overview

We start with melody and lyrics. Sheet music for “Daisy Bell” is available on the web, for example at www.free-scores.com. We are concerned just with the chorus, ignoring both the verses and the accompaniment.

The ultimate rendition of “Daisy Bell” will be realized over several iterations, with new speech-synthesis techniques being introduced as needed.

When I began this exercise, I initially coded the note lists by hand. As complexity increased owing to the addition of new phoneme categories, manual preparation became increasingly tedious and less practical. The tedium reflected back on earlier iterations whenever changes in policy (e.g. how notes should be articulated) or in instrument design forced re-coding of the earlier listings. During the preparation of Iteration #5 (which employs separate notes to shape different spectral features of fricative noise sounds) the manual coding got to be too much. I had already developed Java procedures to try out individual words, but from this point on I undertook to write procedures which generated note lists for entire phrases, with the iteration number as a parameter. By making use of start statements to offset note starting-times, it was possible to generate note lists which could be tested individually, then pasted into larger iteration lists.
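
The sketch below suggests one way such phrase-level generation can be organized in Java. The emitted text is purely a placeholder: the start-statement syntax shown is an assumption, and a real word fragment would contain the full stack of note statements. Only the structural point is intended, namely that each word fragment is self-contained and a start statement positions it within the phrase.

// Illustrative sketch only; the emitted syntax is hypothetical. Each word is
// generated as a self-contained fragment, and a start statement offsets that
// fragment to its place in the phrase, so fragments can be auditioned alone
// or pasted together into a larger iteration list.
import java.util.List;

final class PhraseSketch {
    // Placeholder for a per-word generator; a real one would emit the full
    // note stack (source, power capture, formant filters, envelope restore).
    static String wordFragment(String word, double duration) {
        return "# notes for \"" + word + "\" lasting " + duration + " seconds\n";
    }

    static String phrase(List<String> words, List<Double> durations) {
        StringBuilder sb = new StringBuilder();
        double offset = 0.0;
        for (int i = 0; i < words.size(); i++) {
            sb.append("start ").append(offset).append('\n');  // hypothetical syntax
            sb.append(wordFragment(words.get(i), durations.get(i)));
            offset += durations.get(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(phrase(List.of("Daisy", "Daisy"), List.of(0.9, 0.9)));
    }
}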

A Speech-Synthesis Orchestra

The orchestra used for the present exercise is implemented in the file SpeechOrch.xml. If you have access to the Sound application, and you intend to replicate the sound-synthesis runs on your own, you should download SpeechOrch.xml into your working directory.

The speech-synthesis orchestra extends many ideas about modular instrument design developed in the page on Synthesizing Noise Sounds. The orchestra defines two voices, with a separate stack of instruments operating within each voice. Both instrument stacks implement versions of the source-filter model of speech production. Voice #1 is for vowels and vowel-like consonants, while voice #2 is for fricatives and other mostly noisy sounds. In either case, it takes several notes to generate a single sound, and information is passed between notes through two intra-voice signals. Signal G1 transmits the audio signal while signal G2 transmits the power envelope.

The orchestra is monophonic, which means that although vowel-like and noisy sounds are produced along very different synthesis streams, they are both heard as coming from the same physical location. Keep this in mind should you choose to adapt this orchestra for stereophonic synthesis: The voice entity as implemented within the Sound engine is a construct affecting the scope of contours and signals. It may often equate to a musical “voice”, but it doesn't have to.

If you have access to the Sound application, and want to know intimate details about the speech-synthesis orchestra, you can always view SpeechOrch.xml in Sound's Orchestra Editor.

Note-List Heading

orch /Users/charlesames/Scratch/SpeechOrch.xml
set rate 44100
set bits 16
set norm 1
Listing 1: Note-list header for “Daisy Bell”.

Listing 1 presents the note-list header shared by all “Daisy Bell” iterations. The first text line in the header is the orch statement, which indicates that the listed notes will be synthesized using SpeechOrch.xml. You'll need to adapt this statement to reflect your own working directory. Of the remaining statements, set rate sets the sampling rate to 44100, which is the standard for audio CDs and permits synthesis of frequencies across the full range of human hearing, while set bits sets the ultimate sound quantization level to 16 bits. The set norm statement causes two passes through the data: the first pass saves all samples with 32-bit accuracy in a temporary file, while the second pass rescales these samples to optimize the signal-to-noise ratio.
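
The sketch below illustrates the general idea behind such a normalization pass; it is not the Sound engine's actual code. The peak magnitude of the high-precision samples is found first, and every sample is then rescaled so that the peak just reaches full scale before 16-bit quantization.

// Conceptual two-pass normalization sketch (not the Sound engine's code):
// pass 1 finds the peak magnitude of the high-precision samples; pass 2
// rescales so the peak maps to full scale, then quantizes to 16 bits.
final class NormalizeSketch {
    static short[] normalizeTo16Bit(float[] samples) {
        float peak = 0.0f;
        for (float s : samples) {                    // pass 1: find the peak
            peak = Math.max(peak, Math.abs(s));
        }
        float scale = (peak > 0.0f) ? 32767.0f / peak : 0.0f;
        short[] out = new short[samples.length];
        for (int i = 0; i < samples.length; i++) {   // pass 2: rescale and quantize
            out[i] = (short) Math.round(samples[i] * scale);
        }
        return out;
    }

    public static void main(String[] args) {
        short[] q = normalizeTo16Bit(new float[] {0.01f, -0.25f, 0.125f});
        System.out.println(q[1]);  // -32767: the loudest sample hits full scale
    }
}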

Contours

SpeechOrch.xml defines five contours and declares all five accessible both to voice #1 and to voice #2. The five contours are:

  1. Amplitude, permitting values from 32 to 16384.
  2. Frequency, permitting values from 16 to 1378 (Nyquist limit divided by 16).
  3. Formant 1, permitting values from 100 to 3500.
  4. Formant 2, permitting values from 100 to 3500.
  5. Formant 3, permitting values from 100 to 3500.

All five contours have the exponential calculation mode, which means that transitions from origins to goals proceed along equal-ratio curves.
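
Assuming the conventional meaning of an exponential transition, the value covers equal ratios of the origin-to-goal interval in equal spans of time, so that a frequency glide passes through the geometric mean at its halfway point. The following sketch (with hypothetical method names) shows the calculation:

// Exponential (equal-ratio) interpolation between an origin and a goal.
// At fraction u of the transition (0 <= u <= 1), the value is
// origin * (goal / origin)^u, so equal time steps multiply the value
// by equal ratios rather than adding equal differences.
final class ExponentialRampSketch {
    static double exponential(double origin, double goal, double u) {
        return origin * Math.pow(goal / origin, u);
    }

    public static void main(String[] args) {
        // Halfway through a frequency glide from 100 Hz to 400 Hz:
        System.out.println(exponential(100.0, 400.0, 0.5));  // 200.0, the geometric mean
    }
}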

To avoid foldover it is necessary to ensure that no harmonic of any tone exceeds the Nyquist limit. The upper limit for Contour #2: Frequency is calculated to accommodate a pulse waveform whose uppermost harmonic is 16 times the frequency of the waveform's fundamental: (44100 ÷ 2) ÷ 16 ≈ 1378 Hz. Frequencies in “Daisy Bell” range from 146.8 Hz. (D3) to 293.7 Hz. (D4), all well below this calculated limit.

Additional due diligence would verify that the harmonics of any tone used will cover the highest formant regions. A table of vowel formant frequencies will be provided for Iteration #2, but for now it is sufficient to know that the highest F3 value listed in that table is 3079 Hz. The lowest pitch in “Daisy Bell” is 146.8 Hz. (D3), whose highest harmonic lies at 16 × 146.8 ≈ 2349 Hz., short of that 3079 Hz. figure.

Voice #1: Instrument Stack for Vowels and Vowel-like Consonants

The stack for voice #1 has four categories of swappable component. It takes at least four notes to make a sound using the voice #1 stack:

Thus to produce a vowel sound, you would use a quartet of note statements. The first note in the quartet will invoke instrument #101 to generate a pitched tone. The second note will invoke instrument #119 to capture the RMS power envelope. The third note will invoke instrument #121 to apply the resonant characteristics of the vocal tract to the pitched tone. The fourth note will invoke instrument #199 to restore the envelope captured by instrument #119.

Instrument #122 always works in conjunction with instrument #121. Thus synthesizing the word “manor” would require a stack of six notes invoking instrument numbers 101, 119, 121, 122, 122, and 199 respectively. The notes for instruments 101, 119, 121, and 199 would start and end simultaneously, lasting for the entire duration of the word. The first note for instrument 122 would last for the duration of the m sound, while the second note for instrument 122 would last for the duration of the n sound.
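
To make these timing relationships concrete, the following sketch lays out the six-note stack for “manor” as plain (instrument, start, duration) data. The helper names and durations are hypothetical, and this is not the Sound application's note-list syntax; it merely mirrors the scheduling described above.

// Hypothetical data sketch (not actual note-list syntax): the six notes for
// "manor" as (instrument, start, duration) triples. Instruments 101, 119,
// 121, and 199 span the whole word; the two instrument 122 notes cover the
// m and n segments respectively.
import java.util.List;

final class ManorStackSketch {
    record NoteSpec(int instrument, double start, double duration) {}

    static List<NoteSpec> manorStack(double wordStart, double wordDuration,
                                     double mDuration, double nStart, double nDuration) {
        return List.of(
            new NoteSpec(101, wordStart, wordDuration),  // pitched source
            new NoteSpec(119, wordStart, wordDuration),  // RMS power capture
            new NoteSpec(121, wordStart, wordDuration),  // vocal-tract resonances
            new NoteSpec(122, wordStart, mDuration),     // note for the m sound
            new NoteSpec(122, nStart, nDuration),        // note for the n sound
            new NoteSpec(199, wordStart, wordDuration)   // envelope restoration
        );
    }

    public static void main(String[] args) {
        // Illustrative durations only.
        manorStack(0.0, 0.6, 0.1, 0.3, 0.1).forEach(System.out::println);
    }
}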

Voice #2: Instrument Stack for Noises

The stack for voice #2 also has four categories of swappable component. In this case the design is more directly influenced by NoiseOrch.xml. It again takes four or (usually) more notes to make a sound using the voice #2 stack:

Next topic: Melody

© Charles Ames Page created: 2014-02-20 Last updated: 2015-07-12