Sound Engine Basics


MUSIC-N was a series of digital-sound-synthesis programs developed during the 1960's by Max Mathews with colleagues Joan E. Miller, F. R. Moore, John R. Pierce, and J. C. Risset at Bell Telephone Laboratories. The fifth and last of the Bell Labs programs was described in a famous book, The Technology of Computer Music. The MUSIC-N programs sought to emulate “analog” components of the classical electronic music studio, e.g. oscillators, noise generators, envelope generators, and filters. Ultimately the digital technology represented by MUSIC-N doomed analog technology to obsolescence.

To many people, non-real-time software synthesis has itself been rendered obsolete by real-time digital synthesis hardware. However to someone with low-grade performance skills (such as myself), whether or not a sound has been generated in real time is not particularly important. Software synthesis presents many opportunities. Foremost is the ability to shape sound on its most elemental level. Even if one never wishes to realize compositions, this ability gives one direct involvement with sound that Helmholtz would have killed for. Beyond this, software synthesis allows one to sculpt sounds of arbitrary complexity, without limits on the number of simultaneous tones and without concern that calculations execute quickly enough to keep up with digital-to-analog conversion.

I personally worked with Stanford's music system during a summer workshop in 1976 and with Barry Vercoe's Music11 (a precursor to CSound) during a summer workshop at M.I.T. in 1978. MUSIC-V was in use at the State University of New York at Buffalo (UB) when I began graduate studies there in 1977. After the M.I.T. workshop I tried to persuade Hiller to adopt Music11 but Vercoe wanted to charge license fees and Hiller would have none of that. I subsequently began working up my own Sound program with some of the capabilities I had witnessed elsewhere; for example, dynamic memory allocation, constants and variables in instrument definitions, and greater unit diversity. This program was actually generating wave files when a flood came through and destroyed UB's primitive digital-to-analog conversion facility. The Sound engine presently deployed on this site recasts that original program for the web.

Things have changed since the 1970's. UB's CDC 6600, designed by Seymour Cray and a “supercomputer” of its day, is today dwarfed in both speed and memory capacity by common desktop units. Those same desktops have sound cards with high-speed, high-resolution, digital-to-analog and analog-to-digital conversion. Proprietary audio files have given way to standardized WAV and MP3 formats. Specialized component-description “languages” are now superseded by XML. And of course the internet, which was barely getting started in those days, is now ubiquitous.


The text that follows first sets the context by explaining the distinction between additive synthesis and subtractive synthesis. It then gets down to brass tacks by describing the WAV format used to contain an audio signal. The text proceeds to walk through how a sound-synthesis engine would use the simplest instrument and a one-note score (i.e. notelist) to generate an audio signal. Essential to this is the operation of the digital oscillator. Returning to the engine, the text explains how audio-rate calculations combine signals from instruments and buffer up results for file output. It then describes note initialization, a feature of the post-Bell-Labs generation of sound-synthesis packages. Finally, the text explains voices, contours, and ramps, which are unique features (so far as I know) of my own Sound engine.

Sound-Synthesis Models

The years following World War II combined a new experimental spirit — responding in part to years of cultural suppression in Europe — with an enthusiasm for the wondrous new technologies that had been so stimulated by the war effort. These impulses spawned two approaches to what we now know as electro-acoustic music:

Mixing and splicing are pervasive in electro-acoustic music today, though what was once done using tape segments is now done using digital wave files. What Schaeffer and Henry did to incorporate environmental sounds has evolved into the general technique of sampling. The idea of recreating instrumental sounds by separately recording individual tones and playing them back on cue has been around since the mellotron, a 1963 instrument which employed a different tape loop for every tone on its keyboard. Digitizing this concept produced sampling synthesizers; these evidently existed as early as 1967 but did not really take off until the mid 1980's.

A particular technique of tape music has been the use of tape loops to create an echo effect. With the advent of digital delay, what was originally an expensive, noise-prone, and inflexible effect became the basis of a new frontier in digital sound synthesis and digital signal processing. An early application of digital delay was artificial reverberation, as described in a famous article by Manfred R. Schroeder. However it was also realized that if one processed an impulse (e.g. a noise source) through a delay line with a very short loop time, then the delay line would resonate harmonically like the tube of a clarinet or a string on a violin. Such is the basis of the plucked string technique developed by Karplus and Strong sometime before 1981.

Additive synthesis builds toward complexity by adding sine waves together, each with a particular amplitude. In rudimentary additive synthesis, amplitudes are static; that is, they remain fixed over the duration of a sound. When the sine waves are tuned according to the harmonic series, the result is a periodic waveform. Static waveforms are the subject of my page on Oscillators and Waveforms. In advanced additive synthesis, the amplitude of each harmonic changes dynamically over time. More on additive synthesis may be found in Wikipedia.
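The rudimentary (static) case described above can be sketched in a few lines of Python. This is only an illustration: the function name additive_tone and its arguments are hypothetical, and the list of amplitudes is assumed to give one fixed amplitude per harmonic, starting with the fundamental.

```python
import math

def additive_tone(fundamental, amplitudes, rate, count):
    """Sum sine waves tuned to the harmonic series of `fundamental`.

    amplitudes[0] scales the fundamental, amplitudes[1] the second
    harmonic, and so on. The amplitudes are static (fixed for the
    whole tone), which is the rudimentary case.
    """
    out = []
    for n in range(count):
        phase = 2 * math.pi * fundamental * n / rate
        out.append(sum(a * math.sin(h * phase)
                       for h, a in enumerate(amplitudes, start=1)))
    return out

# A 100 Hz tone with three harmonics at a 10,000 Hz sampling rate.
tone = additive_tone(100, [1.0, 0.5, 0.25], 10000, 300)
```

Because every component is a whole-number multiple of 100 Hz, the result repeats every 100 samples, which is the periodic waveform the paragraph refers to.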

Subtractive synthesis begins with broad-spectrum sounds like noises and pulse waves. It then employs filters to selectively enhance or degrade specific regions of the sound spectrum. In the early days of electronic music, “subtractive synthesis” specifically referred to ways of filtering the output from a noise generator, and this is the topic of my page on Synthesizing Noise Sounds. However, researchers in speech synthesis during the same time, notably Gunnar Fant, concluded that a source-filter model contributed strongly to the understanding of speech sounds. Acousticians have since applied the source-filter model directly to the production of musical tones; for example to how the resonant body of stringed instruments affects the tones of these instruments, or how the shape of the air column (cylindrical in trumpets or conical in horns) affects the production of brass tones. The Wikipedia article on subtractive synthesis unfortunately provides less information than I have given here.

The distinction between additive and subtractive methods has long been useful as a way of organizing sound synthesis techniques in the minds of new students. These have never been doctrines championed by one side or another. Yet in comparing the summer workshop I attended at Stanford in 1976 with the workshop I attended at M.I.T. in 1978, the difference in emphasis fell out along this distinction. I should say before I launch into this narrative that the contrast probably had less to do with institutional philosophies and more to do with the fact that the Stanford workshop targeted beginners, while the M.I.T. workshop targeted experienced teachers.

Additive Synthesis at Stanford in 1976

Stanford's 1976 summer workshop on Computer-Generated Sound lasted only four weeks, but provided the most exhilarating experience of my entire life. Fellow attendees at Stanford included Larry Polansky, Neil Rolnick, and Beverly Grigsby.

At Stanford the presenters assumed we had no previous experience with computer sound synthesis (certainly true in my case) and therefore started us out with the basics. Presenters included F. Richard Moore on the workings of MUSIC-N-style sound synthesis, Leland Smith on score preparation, John M. Grey (a Ph.D. candidate in Stanford's psychology department) on psycho-acoustics, and John Chowning on advanced topics. Loren Rush was also on staff; I don't remember him formally presenting but he did give me an encouraging compliment on occasion.

Dick Moore explained the fundamental sound-generating units with particular emphasis on the digital oscillator. Meanwhile Leland Smith explained how to create simple note lists (they called them “score files”). To try things out, we were initially given an instrument named SIMP which consisted of a single oscillator generating a sine wave. Having no envelope, notes generated by SIMP started and ended with obnoxious clicks. This problem was shortly remedied with a second instrument design where one oscillator generated a one-cycle ADSR envelope which drove the amplitude of a second audio-frequency oscillator. This second design was employed by four instruments whose sole difference was the waveform sampled by the audio-frequency oscillator. TOOT sampled a sine wave. CLAR sampled a square wave. BUZZ sampled a pulse wave. BRIT sampled a sawtooth wave. These four sounds provided the basis for an extended tape composition by Beverly Grigsby. Moore and Smith must have told us about the random noise generator because I used it for my own project, a piece for synthesized percussion. I don't remember any mention of digital filtering during these sessions.

Of Grey's lectures on psycho-acoustics I remember a specific focus on the studies performed by Reiner Plomp and Willem Levelt. Much of Plomp and Levelt's work concerned the perception of sine-tone interactions within and outside the Critical Band, and this concern with sine waves is very consistent with the additive outlook.

I believe it was John Grey who played us the example of instrument tones morphing between four instruments. Here my memory fails me. I remember two of the instruments definitely being trumpet and violin. A third might have been the oboe. I don't remember the fourth at all. In any case, this demonstration showed off the additive model in its full dynamic glory, since it was achieved by obtaining amplitude envelopes for each harmonic of each instrumental tone, and then using interpolation techniques to transform each envelope of one instrument into the corresponding envelope for a second instrument.

The advanced topics presented by John Chowning included FM synthesis and the synthesis of moving sounds. Details about FM synthesis may be found in Chowning's excellent article. All I want to point out here is that Chowning's method provides a shortcut to producing tones with dynamically evolving spectra, meaning that the amplitude of each harmonic changes constantly over the duration of a note. In my mind, this aligns FM synthesis firmly on the additive side of the additive-subtractive dichotomy. As for Chowning's synthesis of moving sounds, I had already witnessed that a few months earlier. On that previous occasion, the California Institute of the Arts hosted a one-day computer-music symposium. Chowning was there, as were other presenters whose names I have long since forgotten. These other presenters were up first. They proudly demonstrated a system that could direct each single note to any one of eight (even 32, they speculated) speakers located around the room. Chowning came up next, and using 4 speakers, produced a single continuous sound that whizzed around the room, enhanced even by a Doppler effect.

Subtractive Synthesis at M.I.T. in 1978

After Stanford, I wasn't greatly convinced there was much more to learn about computer sound synthesis when I attended the 1978 summer workshop at M.I.T. I'd only vaguely heard the name of Barry Vercoe, and the main attractions of the workshop were first to regain access to a top-rate sound-synthesis facility and second to take advantage of the second session, during which attendees could pursue compositional projects. Fellow attendees at M.I.T. included Larry Austin, William Benjamin, Robert Ceely, Alexander Brinkman, Paul Dworak, and Tod Machover.

I came out of the M.I.T. workshop very impressed with Vercoe, and particularly impressed with MUSIC11. MUSIC11 was the MUSIC-N variant developed by Vercoe for the PDP-11 minicomputer, which subsequently became the basis for Vercoe's famous CSOUND program. I also learned a whole lot about subtractive synthesis, and a bit also about artificial intelligence.

I do not remember the name of the professor Vercoe brought in to lecture us on psycho-acoustics. I do remember that the approach divided sound-production systems into an active excitation component and a passive resonating component. For further reading we were directed to Juan G. Roederer's Physics and Psychophysics of Music (New York: Springer, 1973, 161 pages) which is a good read and which now has run into its fourth (2008) edition. We learned that for stringed instruments such as the violin, the resonating system includes not just the instrument body but the strings themselves, which happen to encourage standing waves along the entire string length (the fundamental), half the string length (the second harmonic), one-third of the string length (the third harmonic) and so forth. We also learned about speech sounds, about how the glottis generates an overtone-rich buzz, and about how the various cavities of the vocal tract produce combinations of resonant peaks, each associated with a different vocal sound. We learned about diphthongs and glides, about how “you” is the retrograde of “we”.

MUSIC-N's FLT unit, described on pp. 76-77 of The Technology of Computer Music, produces resonant peaks much like those produced by the vocal cavities. Vercoe's MUSIC11 implemented an equivalent unit along with at least one other filter, a low-pass unit. The challenge with digital filtering is that the gain produced by a digital filter is not easily predictable. However, MUSIC11 addressed this challenge with units that could extract the power envelope from any signal, much as an electronic power supply converts AC current into DC current. With this innovation it no longer mattered how the resonances and anti-resonances of the filter bank aligned with the input spectrum. The amplitude of any signal could be normalized. Subtractive synthesis had become a practical proposition, not just with one filter, but with many filters linked in cascade! My page on Digital Filtering enumerates the digital filters implemented within the Sound engine. Sub-pages devoted to each filter show how the filter's frequency-response curve changes as the frequency and bandwidth parameters vary. An additional sub-page is devoted to the use of filters in cascade.

This is as far as the 1978 M.I.T. workshop went with specific filtering units, but Vercoe had one more subtractive-synthesis topic to present to us, and it was a doozy: linear predictive coding. I knew about vocoders, and during the electronic music course at Pomona College, John Steele Ritter had played us Rusty In Orchestraville. At M.I.T., Vercoe showed us how an LPC analysis of a spoken sentence, “They took the cross-town bus.”, could be superimposed over an instrumental recording to produce a cross-synthesis in which the instruments ‘spoke’ the words of the sentence.

Vercoe also had something to say about delay-line technologies. He described Manfred R. Schroeder's technique of artificial reverberation, explained the units Music11 provided to realize Schroeder's effects, and also demonstrated a chorus effect produced using a delay line with multiple random taps. Vercoe's presentation of the Karplus-Strong synthesis explained how a very short delay line could resonate much like a vibrating string. I also remember Vercoe suggesting that feeding a sustained noise source into such a delay line would serve much like drawing a bow across such a string. And I remember that some years later at an ICMC someone — I do not remember who except that he was not from Stanford — played me a recording of a sound synthesized in that way, a sound which very much resembled a sustained 'cello tone.

Wave Files

The WAV audio-file format was established by Microsoft and IBM during the 1990's. My Sound engine employs Java wavefile I/O by Evan X. Merz and the BigClip class by Andrew Thompson. Although the engine only supports WAV file formats, several internet sites offer free WAV to MP3 conversion; for example

At the heart of digital sound processing are two pieces of technology: the digital-to-analog converter for playing sounds and the analog-to-digital converter for recording sounds. These days you can find these components on the sound card of any personal computer. Audio signals are organized around four facts: sampling rate, sample quantization, channel count, and number of samples per channel.

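The duration of an audio signal, in seconds, may be calculated as:

$$\text{duration} = \frac{\text{samples per channel}}{\text{sampling rate}}$$

When the quantity in hand is the number of bytes per channel rather than the number of samples, the denominator also includes the bytes per sample: (bytes per channel) / ((bytes per sample)(sampling rate)).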
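As a quick check of the duration calculation, here is a minimal Python sketch using the standard-library wave module (not the Java wavefile classes the Sound engine uses). The helper names wav_duration_seconds and make_silent_wav are hypothetical; the second exists only to produce a file to measure.

```python
import io
import wave

def wav_duration_seconds(source):
    """Duration of a WAV file (a path or a binary file object) in seconds."""
    with wave.open(source, "rb") as wf:
        frames = wf.getnframes()   # number of samples per channel
        rate = wf.getframerate()   # sampling rate in Hz
    return frames / rate

def make_silent_wav(seconds, rate=44100, channels=2, bytes_per_sample=2):
    """Build an in-memory WAV file of silence (for demonstration only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(bytes_per_sample)
        wf.setframerate(rate)
        frame_count = int(seconds * rate)
        wf.writeframes(b"\x00" * (frame_count * channels * bytes_per_sample))
    buf.seek(0)
    return buf

print(wav_duration_seconds(make_silent_wav(1.5)))
```

Note that the channel count and the sample width affect the file size but not the duration, since getnframes already counts samples per channel.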

The Synthesis Engine

All MUSIC-N sound-synthesis engines work with two things, an orchestra and a note list (elsewhere known as a score). The note list contains statements of various types, the most prominent of which is the note statement. Within the Sound engine, each note statement begins with the word “note”, followed by a sequence of decimal parameters. Of these, parameters 1-6 have fixed purposes. For the moment we are specifically concerned with parameter #5 (onset time) and parameter #6 (duration).

Here is how a sound-synthesis engine works in a nutshell. It first sorts the note statements by order of time, instrument ID, and note ID. It obtains the end time T from the note list (specifically from an “end” statement, of which there must be exactly one), sets the current time t to 0, and sets note pointer Ni to the first note in the list. It sets up a currently-playing notes collection, Playing, which is initially empty. It then loops through the following steps:

  1. If t ≥ T then exit the loop; we're done!
  2. Purge all Playing notes with release times (P5+P6) not greater than t.
  3. Set the play-to time tnext to the next note onset time (P5) or to the minimum release time (P5+P6) of all Playing notes, whichever is smaller.
  4. While the onset time (P5) of Ni equals t, do the following:
    1. Initialize the note referenced by Ni. This phase of calculation will be examined more closely under the heading Note Initialization.
    2. Bring Ni into the Playing collection.
    3. Advance the note pointer Ni to Ni+1.
  5. Calculate wavefile samples from t up to (but not including) tnext. This phase of calculation will be examined more closely under the heading Audio-Rate Calculations.
  6. Set the current time t to the play-to time tnext.
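
The six steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Sound engine's actual code: the Note class and the schedule function are hypothetical names, sample calculation (step 5) is replaced by recording which notes sound during each span, and the play-to time is additionally capped at the end time T, which the steps do not state explicitly.

```python
from dataclasses import dataclass

@dataclass
class Note:
    note_id: int
    onset: float         # parameter #5
    duration: float      # parameter #6
    instrument: int = 1  # parameter #3

    @property
    def release(self):
        return self.onset + self.duration

def schedule(notes, end_time):
    """Return (t, t_next, ids of playing notes) for each span of calculation."""
    notes = sorted(notes, key=lambda n: (n.onset, n.instrument, n.note_id))
    t, i, playing, spans = 0.0, 0, [], []
    while t < end_time:                                  # step 1
        playing = [n for n in playing if n.release > t]  # step 2
        candidates = [n.release for n in playing]        # step 3
        if i < len(notes):
            candidates.append(notes[i].onset)
        t_next = min(candidates + [end_time])            # assumed cap at T
        while i < len(notes) and notes[i].onset == t:    # step 4
            playing.append(notes[i])  # note initialization would happen here
            i += 1
        if t_next > t:                                   # step 5 (stand-in)
            spans.append((t, t_next, [n.note_id for n in playing]))
        t = t_next                                       # step 6
    return spans

spans = schedule([Note(1, 0.0, 1.0), Note(2, 0.5, 1.0)], end_time=2.0)
```

When a note is due to start at the current time, step 3 yields a play-to time equal to t, so that pass through the loop calculates no samples; the note begins sounding on the following pass.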

Time Granularity

The previous explanation of the synthesis engine described how sample calculations divide into batches. Among other things, the criteria affecting batch sizes include differences between consecutive note events (start times and end times). Having a cluster of notes which start at nearly the same time could potentially create situations where the engine has to repeatedly cut short batches in order to queue in additional notes.

There is a threshold, however, where nuances of timing become meaningless and where consecutive events are heard to be simultaneous. This threshold is called the “time constant”. It is discussed by Winckel, 1967 pp. 51-55, but seems to have been known back in 1860. The time constant is around 1/18 second, or 55 msec. It represents a number of things, most prominently the transition point between what we perceive as time and what we perceive as frequency. Understand that the time constant is not a precise cutover; for example:

The Sound engine granularizes time into 10 msec increments. No matter how many digits a note statement employs after the decimal point, Sound will round the timing to the nearest 100th of a second. Dividing the CD standard sample rate of 44,100 by 100 gives 441. Sound has a default batch length of 1323 samples, which is three time grains.
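The arithmetic in this paragraph is easy to verify. The Python sketch below uses hypothetical function names, and its treatment of exact ties (Python's round goes to the nearest even value) is an assumption; the text only says timings are rounded to the nearest hundredth of a second.

```python
def quantize_time(seconds):
    """Round a time to the nearest 10 msec grain (hundredth of a second)."""
    return round(seconds * 100) / 100

def samples_per_grain(rate):
    """Number of samples in one 10 msec grain at the given sampling rate."""
    return rate // 100

def default_batch_length(rate, grains=3):
    """Batch length expressed as a whole number of time grains."""
    return grains * samples_per_grain(rate)

print(quantize_time(1.23456))        # 1.23
print(samples_per_grain(44100))      # 441
print(default_batch_length(44100))   # 1323
```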

Audio-Rate Calculation

Figure 1: Audio-rate Calculations.

The following text reprises the symbols t and tnext, which were defined previously under The Synthesis Engine.

As Figure 1 illustrates, the sound-synthesis engine breaks up the period from t to tnext into batches. The default batch length of 1323 samples (this number is configurable) is enough to process 30 msec of sound at a 44100 Hz sampling rate. For each batch, the engine iterates through the Playing collection. For each playing note, the engine extracts an instrument number from parameter #3, then sets the corresponding instrument to calculating samples. The instrument in turn delegates calculations to its units. Each type of unit performs calculations in its own peculiar way. The Operation of the Digital Oscillator is instructive.

Once all of the samples in the batch are complete, the results are written to file.
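
The batch-by-batch mixing just described might be sketched like this in Python. Everything here is a simplification for illustration: sine_note is a hypothetical stand-in for an instrument and its units, and a plain list stands in for the output wave file.

```python
import math

BATCH_LENGTH = 1323  # default batch length quoted in the text

def sine_note(amplitude, frequency):
    """Hypothetical stand-in for a playing note: returns a sample function."""
    def sample(n, rate):
        return amplitude * math.sin(2 * math.pi * frequency * n / rate)
    return sample

def render(playing, start, stop, rate, batch_length=BATCH_LENGTH):
    """Mix every playing note into samples start..stop-1, one batch at a time."""
    written = []  # stands in for the output file
    for batch_start in range(start, stop, batch_length):
        batch_stop = min(batch_start + batch_length, stop)
        batch = [0.0] * (batch_stop - batch_start)
        for note in playing:                 # iterate the Playing collection
            for k in range(len(batch)):
                batch[k] += note(batch_start + k, rate)
        written.extend(batch)                # "write" the finished batch
    return written

a = sine_note(0.5, 440)
b = sine_note(0.25, 660)
mixed = render([a, b], 0, 3000, 44100)
```

The final batch is cut short when the span does not divide evenly, which mirrors the way note events can truncate batches.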

The “Simplest” Sound-Synthesis Instrument

To understand further how sound-synthesis engines work it is necessary to understand instruments, note lists, and in particular how a digital oscillator works. Figure 2-1 and Figure 2-2 depict an orchestra containing what The Technology of Computer Music calls the “simplest instrument”. This specific orchestra will help further explain how samples are calculated, but first a word about orchestra structure generally.

Orchestras contain waveforms and instruments, among other things. You can learn more about waveforms on the Waveform Reference page. Instruments are collections of units, the specific types of which are documented on the Unit Reference page. Each unit has a collection of arguments, and each argument references a data conduit. The varieties of data conduit are documented on the Data Conduit Reference page.

Figure 2-1: The “simplest orchestra”, viewed through my Orchestra Editor. Figure 2-2: Expansion of Instrument #1.

Rather than using the predefined 512-sample sine wave listed as Waveform #1, it will be more instructive to explain things using the 16-sample wave presented in Figure 3 below. This orchestra contains one instrument, as detailed in Figure 2-2. Instrument #1 contains two units: Unit #1 is an Oscillator while Unit #2 is a single-channel Output unit.

While explicit declarations were not required by MUSIC-V, declarations become indispensable in a graphical user interface where arguments are selected using drop downs. Three types of data conduit.

Here is how the instrument shown in Figure 2-2 operates: The Oscillator unit generates a tone 1323 samples at a time, storing this result in Signal #1. The tone's zero-to-peak amplitude is controlled by Parameter #7, while the tone's frequency is controlled by Parameter #8. The tone's waveform is indicated by the constant value 1, which directs the oscillator to employ Waveform #1.

The Output unit mixes Signal #1 in with what the other notes are contributing to channel #0.

A One-Note Score

orch /Users/charlesames/Scratch/SimplestOrch.xml
set rate 10000
set bits 16
set norm 0
note 1 0 1 0 0.00 0.00 1 32000 440
end 1
Listing 1: Simple note-list example.

To make Orchestra #1 produce a sound, we will provide it with the note list presented in Listing 1.

The various statements have the following effects:

Operation of the Digital Oscillator

Here is how Oscillator units generate tones. It all centers around a waveform and a stored double-precision Position variable. Suppose our oscillator employs the waveform shown in Figure 3. The waveform is simply an array of floating-point numbers, where the value in the final position repeats the value at position 0. The Position variable may range from 0 up to — but not including — the waveform length (16).

Figure 3: A sixteen-sample representation of a sine wave.

Now suppose we have a sampling rate of 10,000 and wish to generate a tone with a frequency of 440 Hz. To produce this frequency, the Position variable must advance by a sampling increment which is calculated as follows:

$$\text{sampling increment} = \frac{\text{frequency} \times \text{waveform length}}{\text{sampling rate}} = \frac{440 \times 16}{10000} = 0.704$$

Table 1 details the calculations a digital oscillator would make using the waveform shown in Figure 3 with a sampling increment of 0.704. Results from this table will shortly be presented graphically, but for the moment you should notice that the Position column starts at zero, that each successive entry increments the previous entry by 0.704, and that whenever the Position threatens to fall outside the range from 0.00 (inclusive) to 16.00 (exclusive), the oscillator wraps it back into range. This happens for samples 23 and 46.

The numbers in the Index column are obtained by truncating the Position value to the next lower integer. The resulting Index variable is used to look up values in the waveform array.


Table 1: Digital oscillator calculations using the waveform shown in Figure 3 with a sampling increment of 0.704.
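
The Position, Index, and Residue columns can be regenerated with a short Python sketch. The waveform here is assumed to be a sine wave sampled at sixteen equally spaced points plus a final entry repeating position 0, which is my reading of Figure 3; the function names are hypothetical.

```python
import math

WAVE_LEN = 16
# Sixteen sine samples, plus a final entry repeating the value at position 0.
WAVEFORM = [math.sin(2 * math.pi * k / WAVE_LEN) for k in range(WAVE_LEN)]
WAVEFORM.append(WAVEFORM[0])

def sampling_increment(frequency, rate, length=WAVE_LEN):
    return frequency * length / rate

def oscillator_rows(frequency, rate, count):
    """Return (sample, position, index, residue, value) rows."""
    increment = sampling_increment(frequency, rate)
    position = 0.0
    rows = []
    for n in range(count):
        index = int(position)           # truncate to the next lower integer
        residue = position - index
        rows.append((n, position, index, residue, WAVEFORM[index]))
        position += increment
        if position >= WAVE_LEN:        # wrap back into range
            position -= WAVE_LEN
    return rows

for n, position, index, residue, value in oscillator_rows(440, 10000, 47):
    print(f"{n:3d} {position:7.3f} {index:3d} {residue:6.3f} {value:7.4f}")
```

Counting samples from zero, the position wraps at samples 23 and 46, as noted above.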

There are two ways of implementing digital oscillators, non-interpolating and interpolating. Non-interpolating oscillators are quicker because they use the index directly to reference a waveform entry. Figure 4 shows the oscillator signal generated using the non-interpolating method. The shape is especially blocky (adding noise to the signal) because the waveform length is so short. My Sound engine does not actually offer a non-interpolating option, but it is very instructional to see how the waveform spreads out when the sampling increment is less than unity.

Figure 4: Graphic display of non-interpolating oscillator output using the Value(Index) column from Table 1.

The Residue column in Table 1 is calculated by subtracting the current Index from the current Position. The interpolating method for tone generation produces better signal quality than the non-interpolating method because this residue is used to interpolate between consecutive waveform entries. Interpolation happens at a cost of extra processing cycles, but in today's world of hardware floating-point accelerators it matters nowhere near as much as it used to.

$$\text{Interpolation} = \text{Value}(\text{Index}) + \text{Residue} \times \bigl(\text{Value}(\text{Index}+1) - \text{Value}(\text{Index})\bigr)$$


Figure 5 shows the oscillator signal generated using the interpolating method.

Figure 5: Graphic display of interpolating oscillator output using the Interpolation column from Table 1.
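
To compare the two methods numerically, here is a self-contained Python sketch of both oscillator types. It assumes the same sixteen-sample sine table (with the final entry repeating position 0) and linear interpolation between consecutive entries; the function name oscillate and its arguments are hypothetical.

```python
import math

WAVE_LEN = 16
WAVEFORM = [math.sin(2 * math.pi * k / WAVE_LEN) for k in range(WAVE_LEN)]
WAVEFORM.append(WAVEFORM[0])  # guard entry so Index+1 is always valid

def oscillate(frequency, rate, count, amplitude=1.0, interpolate=True):
    increment = frequency * WAVE_LEN / rate
    position = 0.0
    out = []
    for _ in range(count):
        index = int(position)
        value = WAVEFORM[index]
        if interpolate:
            residue = position - index
            value += residue * (WAVEFORM[index + 1] - value)
        out.append(amplitude * value)
        position = (position + increment) % WAVE_LEN
    return out

def max_error(signal, frequency, rate):
    """Largest deviation from an ideal sine wave of the same frequency."""
    return max(abs(s - math.sin(2 * math.pi * frequency * n / rate))
               for n, s in enumerate(signal))

blocky = oscillate(440, 10000, 100, interpolate=False)
smooth = oscillate(440, 10000, 100, interpolate=True)
print(max_error(blocky, 440, 10000), max_error(smooth, 440, 10000))
```

With so short a table, the non-interpolating output strays from a true sine wave by a large fraction of full scale, while the interpolating output stays within about two percent.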

Note Initialization

MUSIC-V did not do note initialization; however, note initialization became universal in programs of the post-Bell-Labs generation, including Barry Vercoe's Music360, Vercoe's Music11, and the music program I used at Stanford in 1976. Vercoe told me that this second generation of sound-synthesis programs was more influenced by MUSIC-IV than MUSIC-V, and that many MUSIC-IV features were dropped in MUSIC-V in order to permit single instruments to play simultaneous notes. I am not familiar with MUSIC-IV, but it is possible that note initialization was one of these dropped features.

The previous heading on the Sound Synthesis Engine placed the note-initialization phase within the overall process of sample calculation; however details of the initialization phase were left for later. I now intend to fulfill this promise. Not having researched what other sound-synthesis engines do during their note-initialization phases, I can only tell you what my Sound engine does. It seems safe to assume that products like Csound and SuperCollider do similar things.

The first thing the engine does in the note-initialization phase is use the instrument id in note parameter #3 to dereference an instrument. The engine next iterates through the instrument's units in unit-ID order, performing one (or neither) of two actions:

For an example cookie, consider an Oscillator. Refer back to the earlier section, Operation of the Digital Oscillator. At a bare minimum, each oscillator cookie needs to store what the earlier section called the Position variable. However there are a number of other quantities which do not necessarily need to be recalculated for every sample. Significant among these quantities is the sampling increment, which only needs to be recalculated when the input frequency changes.
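
A per-note oscillator cookie of the kind described above might look like the following Python sketch. The class and attribute names are hypothetical, and the recalculations counter exists only to show that the sampling increment is recomputed when the frequency changes and not otherwise.

```python
class OscillatorCookie:
    """Per-note oscillator state (hypothetical names, for illustration)."""

    def __init__(self, rate, wave_len):
        self.rate = rate
        self.wave_len = wave_len
        self.position = 0.0      # the Position variable
        self.frequency = None    # last frequency seen
        self.increment = 0.0     # cached sampling increment
        self.recalculations = 0  # demonstration only

    def increment_for(self, frequency):
        """Recalculate the sampling increment only on a frequency change."""
        if frequency != self.frequency:
            self.frequency = frequency
            self.increment = frequency * self.wave_len / self.rate
            self.recalculations += 1
        return self.increment

    def advance(self, frequency):
        """Return the current position, then step to the next one."""
        current = self.position
        self.position = (current + self.increment_for(frequency)) % self.wave_len
        return current

cookie = OscillatorCookie(rate=10000, wave_len=16)
for _ in range(5):
    cookie.advance(440)
cookie.advance(880)
print(cookie.recalculations)
```

Because each note owns its own cookie, a single instrument can sound several simultaneous notes without their oscillator positions interfering.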

Voices, Contours, Ramps, and Voice-Level Signals

Four entities (voices, contours, ramps, and voice-level signals) are implemented by my recent Sound engine but had no equivalents in the sound-synthesis packages I was familiar with during the 1970's. Voices and contours carry through features of my Ashton score-transcription utility, the earliest versions of which were generating note lists in the spring of 1978. This utility organized notes into voices, and it also employed contours to control gradually evolving score-attributes such as tempo and dynamics.

Voices provide a scope for the Sound engine which occupies an intermediate level between the local scope of note parameters and the global scope of system variables or of waveforms. In that sense, Sound voices are analogous to MIDI channels, where you can have many simultaneous notes that all share the same control information. However MIDI files transmit single control values at specific moments in time. By contrast, Sound contours are described by ramp statements in the note list. Each ramp has a start time, a duration, an origin, and a goal. Each contour's ramps, placed end-to-end, fill out the entire wavefile duration.
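
To make the ramp idea concrete, here is a small Python sketch that evaluates a contour at an arbitrary time. It assumes each ramp moves linearly from its origin to its goal, which the description above suggests but does not state; the Ramp class and contour_value function are hypothetical names, not the Sound engine's API.

```python
from dataclasses import dataclass

@dataclass
class Ramp:
    start: float     # start time in seconds
    duration: float  # duration in seconds
    origin: float    # value at the start of the ramp
    goal: float      # value at the end of the ramp

def contour_value(ramps, t):
    """Value at time t of a contour made of end-to-end ramps."""
    for ramp in ramps:
        if ramp.start <= t < ramp.start + ramp.duration:
            fraction = (t - ramp.start) / ramp.duration
            # Linear interpolation is an assumption of this sketch.
            return ramp.origin + fraction * (ramp.goal - ramp.origin)
    return ramps[-1].goal  # at or beyond the end, hold the final goal

# A contour that rises over two seconds, then falls halfway back.
dynamics = [Ramp(0.0, 2.0, 0.0, 1.0), Ramp(2.0, 2.0, 1.0, 0.5)]
print(contour_value(dynamics, 1.0), contour_value(dynamics, 3.0))
```

Every note in a voice could consult the same contour, which is what distinguishes this intermediate scope from per-note parameters.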

After introducing voices into Sound, it was a logical next step to allow notes from the same voice to pass signals between one another. Voice-level signals enable conditional branching within the sound-synthesis system. For example, suppose you wish to generate a group of tones and noises and then process this specific group of sources through some sort of resonator. You can do that by mixing all the source outputs into a voice-level signal, then have the resonator pick up this same signal to work its magic.

© Charles Ames Page created: 2014-02-20 Last updated: 2017-08-15