The Voice in Western Art Music after 1950

(A work in progress)

Thomas Patteson, University of Pennsylvania


The goal of this study is to present a brief survey of treatments of the voice in art music of the period c. 1950-1975.  The impetus for this project was a seminar led by Charles Bernstein in the spring of 2007 entitled "Sound of Poetry, Poetry of Sound," the focus of which was the role of the aural dimension in poetic experience, particularly in the 20th century.  Much 20th-century music can be understood in parallel with poetic tendencies as a deliberate attempt to foreground sonic elements traditionally thought to be peripheral to musical meaning.  This may seem counterintuitive at first glance, as music, unlike poetry, is generally not divided into two dimensions of "sound and sense."  But in music, too, there is a conventional distinction between elements of sound that are primary (pitch structures such as melody and harmony above all) and those that are secondary or peripheral (timbre, texture, and other parameters that are resistant to analytical quantification).  The turn toward voice in 20th-century music thus ironically coincides with a rejection of the metaphor of "music as language," just as contemporary poetics attempts to break free of the strictures of language for the sake of "the word as such."

Indeed, the distinction between poetry and music becomes arbitrary at best with regard to such new genres as "sound poetry" and "speech music."  Though one should be wary of setting up overly rigid and schematic distinctions, especially in a field characterized precisely by a pronounced tendency toward generic hybridism and intermediality, I have grouped the examples under headings in order to facilitate the understanding of aesthetic commonalities between the various works.  Needless to say, these groupings are provisional and non-exclusive.

Through language to music

Alvin Lucier's I Am Sitting in a Room (1969) represents an attempt to establish a continuum between common speech and musical sound.  The text on which the piece is based explains the method of the work:

"I am sitting in a room different from the one you are in now.  I am recording the sound of my speaking voice, and I am going to play it back into the room again and again, until the resonant frequencies of the room reinforce themselves, so that any semblance of my speech, with perhaps the exception of rhythm, is destroyed.  What you will hear, then, are the natural resonant frequencies of the room articulated by speech.  I regard this activity not so much as a demonstration of a physical fact, but more as a way to smooth out any irregularities my speech might have."

With each iteration of this procedure, the speech-character of Lucier's voice is effaced to a greater extent, until at the end of the piece even the vaguest contours of the text are unrecognizable.  All that remains is a slowly undulating harmonic field reminiscent of the sound of wind blowing through an aeolian harp.  (As this piece is 45 minutes long, the example included here presents only three iterations of the text so as to make audible in condensed form the aural transformation at work in this composition.  To appreciate the gradual transition from articulate speech to musical sound, the piece must of course be heard in its entirety.)
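The mechanism Lucier describes can be illustrated with a deliberately crude numerical sketch.  It assumes that the room acts as a fixed gain on each frequency band and that every playback applies that gain once more; the three bands, their starting levels, and the gain values are hypothetical numbers chosen only for illustration, not measurements of any actual room or of Lucier's recording.

```python
def replay(spectrum, room_gain, passes):
    """Apply the room's per-band gain `passes` times, rescaling after each pass.

    `spectrum` and `room_gain` map a band's centre frequency (Hz) to a level.
    Rescaling to a peak of 1.0 stands in for setting the playback volume.
    """
    current = dict(spectrum)
    for _ in range(passes):
        # One playback into the room: every band is coloured by the room again.
        current = {f: level * room_gain[f] for f, level in current.items()}
        peak = max(current.values())
        current = {f: level / peak for f, level in current.items()}
    return current


# Hypothetical values: three bands of equal energy in the original speech.
speech = {200: 1.0, 450: 1.0, 900: 1.0}
# Hypothetical room: 450 Hz is taken to be the strongest resonance.
room = {200: 0.8, 450: 1.0, 900: 0.6}

after_1 = replay(speech, room, 1)
after_20 = replay(speech, room, 20)
```

Under these assumptions the band at the assumed resonance keeps its full level, while the others shrink geometrically with each pass; after twenty passes they have all but vanished, which is the sense in which the room's resonant frequencies "reinforce themselves."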

Luciano Berio's Thema - Omaggio a Joyce (1958) likewise subjects a recorded speaking voice to gradual obliteration.  The first two minutes of the piece present a straightforward reading from James Joyce's Ulysses, allowing the listener to focus on the alliterative and onomatopoetic musicality of Joyce's prose.  In the middle section, the narrative continuity is abruptly halted; the voice, now shattered into acoustic shards, seems to do battle with its own distorted echoes.  The unaltered speaking voice returns at the end, lending the piece a certain degree of conventional formal closure.  But while the transition from voice to ambient sound in I Am Sitting in a Room comes across as natural (if a bit uncanny), here the relationship between voice and electronic noise is adversarial, even minatory.  The short audio example given here is taken from the midpoint of the piece (starting at 3'40").

Another, more subtle point of mediation between language and music is found in the exploitation of basic linguistic sounds (primarily vowels) as the components of musical construction.  An example of this can be heard in György Ligeti's Lux Aeterna for 16 vocalists (1966).  The first line of the text reads "Lux aeterna luceat eis, Domine"--"Let everlasting light shine upon them, Lord."  Ligeti highlights the timbral differentiation brought about by the first occurrence of the vowel (o) on "Domine": as the other voices drop out, morendo (dying away), the basses enter in a straining falsetto, creating a striking change of texture and tone color (this takes place at measure 37 in the score).  The phonemic content of the text is thus accorded the role of structural articulation; Ligeti treats the phonemes as "musical elements, as sonic qualities of varying darkness and brightness, amalgamates them to sound clusters and uses them to create contrasts according to purely musical precepts." (Knief 28)  The first three minutes of the piece are given here; the entry of the basses occurs at 2'28".

Remarkably, this technique has a parallel in the 12th-century genre of sacred vocal music known as organum.  This genre was characterized by a texture comprising a lower voice, which sings the notes of a Gregorian chant in long, drone-like rhythmic values, and from one to three higher voices, which sing a quickly moving, florid version of the same sacred text.  Because the lower voice (called the "tenor" from the Latin "tenere," "to hold") holds its notes for so long-- sometimes upwards of 30 seconds, depending on the performers' tempo-- the vowel sound it sings has time to suffuse the musical texture with its particular timbre.  As in Ligeti's Lux Aeterna, the change from one vowel sound to another can be a significant musical event: it is as if the entire sonic space is suddenly cast in a different light.  In this example from the four-part organum "Viderunt omnes" (c. 1198) by the composer Perotin, the change of vowel is audibly salient for two reasons: first, the preceding two vowels (i) and (e) of "Vide-" are comparatively bright, having a very high second formant (that is, range of amplified frequencies), while the vowel (u) is quite dark (its second formant is roughly four times lower in frequency than that of (i)).  Second, the change of vowel from (e) to (u) coincides with a rise in pitch in the tenor, further strengthening the sense of contrast at this moment.

Karlheinz Stockhausen's composition Stimmung (1968) is an extremely thorough exploration of a variety of para-linguistic vocal sounds, with vowels receiving the most extensive treatment.  The entire piece is based on a static harmonic projection of six partials (the second, third, fourth, fifth, seventh, and ninth) of a single low note (B-flat at 57 Hertz).  (Partials are the frequencies arrayed in fixed numerical ratios above a first partial or fundamental.  First partial=n, second partial=2n, etc.)  The composition consists of a number of "models," or brief segments of musical notation, that are assembled according to the accompanying "form scheme."  Each of these models contains a series of vocal sounds notated in the symbols of the International Phonetic Alphabet, along with precise directions for rhythmic articulation.  These untexted vocal sounds, which form the primary sonic material of the composition, are interspersed with fragments of poetry and exclamations of "magical names" of deities from various world cultures.  The result is a bizarre, hypnotic pseudo-ritual; because the piece is harmonically static, projecting the same essential harmony for the entirety of its approximately one-hour duration, the listener's attention is redirected from conventional expectations of formal development to microscopic vocal modulations and subtle, almost glacial transformations of the musical texture.
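The definition of partials given above (first partial=n, second partial=2n, etc.) can be made concrete with a few lines of arithmetic.  The sketch below simply multiplies the 57 Hertz fundamental by successive whole numbers; the function name and the choice to list the first nine partials are illustrative assumptions, not anything taken from Stockhausen's score.

```python
FUNDAMENTAL_HZ = 57.0  # the low B-flat named in the text


def partial(number, fundamental=FUNDAMENTAL_HZ):
    """Return the frequency in Hertz of the given partial (1 = fundamental)."""
    if number < 1:
        raise ValueError("partials are numbered from 1")
    return number * fundamental


# Frequencies of the first nine partials above the 57 Hertz fundamental.
partials = {number: partial(number) for number in range(1, 10)}

for number, frequency in partials.items():
    print(f"partial {number}: {frequency:.0f} Hz")
```

Because every partial is a whole-number multiple of the same fundamental, the pitches fuse into a single static sonority rather than suggesting harmonic motion.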

Two tape pieces by Steve Reich, It's Gonna Rain (1965) and Come Out (1966), also deal with the liminal space between coherent speech and noise created by electronic manipulation.  These pieces are examples of what became known as "process music," in which a certain basic musical effect determines the course of the composition.  The processes in It's Gonna Rain and Come Out are based on the effect produced when two recordings gradually go out of phase with each other; like I Am Sitting in a Room, these pieces allow the listener to hear the gradual traversing of the boundary between speech and sound.
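The phasing effect lends itself to a simple calculation.  The sketch below assumes two copies of the same loop, one of which plays very slightly faster than the other; the loop length and the speed ratio are hypothetical values chosen for illustration, not figures taken from Reich's tapes.

```python
def phase_offset(elapsed_s, loop_s, speed_ratio):
    """How far (in seconds of loop material) the faster copy has pulled ahead.

    The lead grows linearly with time and wraps around once it reaches
    the length of the loop, at which point the copies are in unison again.
    """
    return (elapsed_s * (speed_ratio - 1.0)) % loop_s


def realignment_time(loop_s, speed_ratio):
    """Seconds until the two copies first return to unison."""
    if speed_ratio == 1.0:
        raise ValueError("copies at identical speed never drift apart")
    return loop_s / abs(speed_ratio - 1.0)


# Hypothetical values: a 2-second loop, one copy running 1% fast.
LOOP_S = 2.0
SPEED_RATIO = 1.01

lead_after_50s = phase_offset(50.0, LOOP_S, SPEED_RATIO)
full_cycle_s = realignment_time(LOOP_S, SPEED_RATIO)
```

With these assumed values the faster copy is about half a second ahead after fifty seconds and the two copies take about 200 seconds to come back into unison.  The smaller the difference in speed, the more slowly the listener is led from echo-like doubling through rhythmic canon into a blur of sound.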

The voice without language

That the voice is nothing but a vessel of the word is a tenet of Western logocentrism that is well represented in the history of music.  But the wordless voice is ambiguous: it can represent not only the incursion of a prelinguistic irrationality, but also the liberating transcendence of language experienced in religious ecstasy: as Saint Augustine writes, "And to whom does this jubilation pertain, if not to the ineffable God?" (Dolar 49)  The decoupling of voice and language has accordingly been viewed by many 20th-century composers as a project of emancipating the sonic potentialities of voice from the narrow strictures of linguistic propriety.  Instead of being cherished as an intimate expression of subjective presence, the voice without language is glorified as a highly sophisticated synthesizer or an ideal musical instrument.

The sound-world of Stimmung, with its extended vocalic dilations, has a little-known predecessor in Philippe Carson's 1962 tape composition Phonologie, which is based on the manipulation and simultaneous overlay of the recorded vowel sounds of a single male vocalist.  Through the juxtaposition of different pitch levels and vowel sounds, Carson creates a broad, slow-moving polyphony of vocalic timbres.  Here, to an even greater extent than in Stimmung, the voice is stripped down to a state of raw physicality: no longer a medium of linguistic expression, it is now simply a sounding phenomenon, like any other in nature.  It is, to quote Roland Barthes, "the articulation of the body, of the tongue, not that of meaning, of language." (Barthes 413)  While Ligeti exploits the timbral potential inherent in the sonic substance of language in order to effect musical contrast, and Stockhausen uses the material of language as building blocks of a complex musical architecture, Carson's approach is arguably the most radical, presenting the voice as a prelinguistic sound object.

A similar reduction of the voice to a raw, nonlinguistic sound source occurs in Pierre Henry's tape piece Vocalises (part of his 1952 work Le microphone bien tempéré).  The title of the composition, referring to a vocal exercise based on solfège (do, re, mi, etc.) or nonsense syllables, provides a hint about its underlying concept.  Employing the Phonogène, an early sampling device that allowed the composer to adjust the playback speed, and thus the pitch and duration, of taped sounds, Henry subjects a simple "ah" sound to extreme sonic manipulations.  The characteristic formant structure of the voice is apparent in many of the sounds reproduced within the mid-range of the frequency spectrum, but these recognizable vocal utterances are accompanied by quick, high-pitched screeches and low, guttural growls, all of which are derived from the same basic Urklang.  In contrast to Lucier and Berio, who span the continuum between vocal and nonvocal sound gradually over a broad stretch of time, Henry condenses his sonic material into virtual simultaneity, creating a violent conjunction of the disparate.
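The link between playback speed, pitch, and duration that the Phonogène exploits can be stated in a few lines.  The sketch below assumes an idealized tape played at a constant speed factor; the starting frequency and duration are hypothetical values, and nothing here models the actual mechanics of the device.

```python
import math


def varispeed(frequency_hz, duration_s, speed):
    """Return (frequency, duration) of a taped sound replayed at `speed`.

    A speed of 2.0 means twice the recording speed: every frequency is
    doubled and the sound lasts half as long.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return frequency_hz * speed, duration_s / speed


def interval_in_semitones(speed):
    """Pitch shift caused by the speed change, in equal-tempered semitones."""
    return 12 * math.log2(speed)


# Hypothetical source: an "ah" at 220 Hz lasting 2 seconds.
doubled = varispeed(220.0, 2.0, 2.0)   # an octave higher, half as long
halved = varispeed(220.0, 2.0, 0.5)    # an octave lower, twice as long
```

Since pitch and duration cannot be changed independently by this means, a single recorded vowel replayed at widely differing speeds yields both brief, high-pitched sounds and long, low ones, which fits the range of material the paragraph above describes.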

A more traditionally "musical" but no less aesthetically effective deployment of the voice as a (mostly) wordless musical instrument takes place in Morton Feldman's Three Voices for Joan La Barbara (1982).  Here the composer uses a minimalist, loop-based compositional approach to create a timbrally homogeneous yet rich web of vocality marked by the occasional coagulation of discrete phonemes and sometimes whole words.

Speech as music

"Language can approach music, and music can approach language, to the extent that the boundaries between sound and meaning are annihilated." (Karlheinz Stockhausen, quoted in Knief 21)

At the other extreme from the desemanticization of language is the use of more or less unadorned speech as musical material.  Instead of the reappropriation of the vocal apparatus for the sake of nonlinguistic sound production, here it is a question of uncovering the latent musicality inherent in speech itself.  One might think that the use of prosaic speech would be motivated by a concern for the comprehensibility of the text in question.  But this seems rarely, if ever, to be the case: first, because the texts set in this manner are usually extremely elliptical in terms of literary meaning, and second, because the text is presented "polyphonically," with many voices speaking at once in a complex sonorous overlay.  Thus, if this approach to the voice is on its face less radical than those which annul completely the language-character of the voice, it is perhaps more potentially alienating for an audience, precisely because it achieves its effects through language.  If desemanticization circumvents language-character altogether in pursuit of a pure, nonlinguistic vocality, speech-as-music raises the expectation of communicative utterance only to frustrate it by yoking language to a translinguistic ("musical") aesthetic purpose.

What singles out the voice against the vast ocean of sounds and noises, what defines the voice as special among the infinite array of acoustic phenomena, is its inner relationship with meaning.  The voice is something which points toward meaning, it is as if there is an arrow in it which raises the expectation of meaning; the voice is an opening toward meaning.  […]  But if the voice is thus the quasi-natural bearer of the production of meaning, it also proves to be strangely recalcitrant to it.  If we speak in order to “make sense,” to signify, to convey something, then the voice is the material support of bringing about meaning, yet it does not contribute to it itself….  The voice is the instrument, the vehicle, the medium, and the meaning is the goal.  This gives rise to a spontaneous opposition where the voice appears as materiality opposed to the ideality of meaning.  The ideality of meaning can emerge only through the materiality of the means, but the means does not seem to contribute to meaning.  […] If we speak in order to say something then the voice is precisely that which cannot be said.  (Dolar 15-16)

In Robert Ashley's In Sara, Mencken, Christ and Beethoven There Were Men and Women (1972), based on a 1944 book of the same name by John Barton Wolgamot, the author's text is presented in a rapid, almost uninflected reading over a background of recorded and artificially produced noises.  This straight reading of the text allows its obsessively repetitious nature to come to the fore: the text consists of 128 "stanzas," each being identical save for four "variable" contents which are different in each iteration (although many are repeated over the course of the text).  The most striking feature of the text is the seemingly endless recitation of proper names that dominates each stanza.  The prosaic delivery of the text by a solitary speaking voice suggests that information is being transmitted, but everything belies this expectation: the incessant repetition of this "festival of names" empties it of what little semantic content it might have had, and reduces it, over the 40-minute course of the composition, to a procession of linguistic noises.  The audio sample here presents the first two minutes of the piece.  An in-depth report on the origins of the piece by Keith Waldrop can be found on UbuWeb.

A similar technique is at work in Kenneth Gaburo's Lingua II - Maledetto (1967-69), although here the vocal presentation is subject to much greater variation.  One speaker maintains a consistently sober speaking voice while the other voices interject with extremely varied types of vocal production, such as plain speech, shouts, singing, and laughter.  The piece thus encourages hearing in two concurrent "tracks"-- speaking voice and paralinguistic commentary-- and is arguably comparable to the traditional "melody and accompaniment" musical texture.

Total vocality

While some composers sought to exploit one extreme of the language-nonlanguage polarity, others were more interested in attempting to encompass within single works the entire spectrum of vocal sound production.  The work of Luciano Berio, in particular, is representative of this tendency.  Berio's composition A-Ronne for 8 vocalists (1974-75) provides an excellent study of his synthetic approach to the voice, though his better known works Visage (1961) and Sequenza III (1965) are also exemplary in this regard.  In A-Ronne, amidst a welter of wheezing, shouting, whistling, and singing, the underlying multilingual poem by Edoardo Sanguineti serves as a sort of perceptual anchor.  Berio employs the full arsenal of vocal effects, but seldom all at once; instead he deftly modulates between varying sonorous zones (speech, song, noise, etc.).  When different types of vocal sounds are layered, the result is usually a controlled polyphony, as in this example, where a haunting, Bach-like chorale is juxtaposed against an incessant, monotone speaking voice.  Elsewhere the layering of discrete vocal sounds creates an auditory overload bordering on the carnivalesque: in this passage, whistling, popping, susurration, squeaking, humming, speaking and stuttering combine to form a dense sonic texture.



Dolar, Mladen.  A Voice and Nothing More.  Cambridge, Mass.: MIT Press, 2006.

Knief, Tibor.  "Typen der Entsprachlichung in der neuen Musik," in Über Musik und Sprache, ed. Rudolf Stephan.  Mainz: B. Schott's Söhne, 1974.

Barthes, Roland.  A Barthes Reader, ed. Susan Sontag. New York: Hill and Wang, 1998.