MACHINE LISTENING: WaveNet and approaching media materialism through rhythmanalysis
“The blue lagoon is a 1980 American romance and adventure film directed by Randal Kleiser.” With this clever reference lifted from the Internet Movie Database we are introduced to the voice of WaveNet. A “generative model of raw audio waveforms,” the WaveNet algorithm is outlined in a paper published just this September by DeepMind, a machine learning subsidiary of Google (van den Oord). It is a significant step forward in the synthesis of human-sounding voices by computers, an endeavor which is both paradigmatic of artificial intelligence research and a mainstay in popular culture, from HAL in the film 2001: A Space Odyssey to current voiced consumer products like Apple’s Siri. According to DeepMind’s own testing, WaveNet narrows the gap between current state-of-the-art text-to-speech systems and actual human speech by over 50% in subjective quality tests. It sounds very good, and no doubt we will be hearing much more of it.
My purpose in this text, however, is not to explore a genealogy of computer speech. Rather, this text is about “machine listening.” That term comprises both a philosophical question (can machines listen? and its corollary, what is listening?) and the sub-field of computer science concerned with the extraction of meaningful information from audio data. The timely emergence of WaveNet is compelling on both fronts, and I am going to proceed with the hypothesis that WaveNet is, perhaps more than anything else, a listening machine.
To this end, the second set of examples of synthesized speech provided by DeepMind is the more intriguing. Having been trained to speak, WaveNet nonetheless must be told what to say (hence the IMDb quote, etc.). If it isn’t told, however, it still generates “speech” that is “a kind of babbling, where real words are interspersed with made-up word-like sounds” (van den Oord). Listening to these, I’m struck first by the idea that this is the perfect answer to the classic campfire philosophy question, “what is the sound of my native language?” When we understand the words, the sub-semiotic character of a language is, perhaps, obscured. This babbling seems like a familiar tongue, or at least one somewhat related to English; maybe Icelandic? Secondly, to my ear, this set of examples sounds more realistic than the first. I’m hearing ennui in these voices, a measured cadence punctuated by breaths just as expressive as the “words,” a performance with the unmistakable hallmarks of an overwrought poetry reading. The Turing test has been mis-designed: it’s not the semantics that make this voice a “who” rather than an “it.”
The inclusion of aspirations and a more musical sense of timbre, rhythm, and inflection in WaveNet is a function of the acoustic level at which it operates. Previous techniques of text-to-speech, as DeepMind explains, are parametric or concatenative. The former is purely synthetic, attempting to explicitly model the physical characteristics of human voices with oscillators; the latter relies on a database of sound snippets recorded by human speakers that are pieced together to form the desired sentences. Both strategies proceed from assumptions about how speech is organized; for example, they take the phoneme as speech’s basic unit rather than sound itself. Where WaveNet is different is that it begins with so-called “raw” audio, that is, unprocessed digital recordings of human speech, to the tune of 44 hours’ worth from 109 different speakers (van den Oord). This data is fed into a convolutional, “deep” neural network, an algorithm designed to infer its own higher-order structures from elementary inputs. Subsequently, WaveNet generates speech one audio sample at a time, 22 thousand of which add up to a single second of sound in the form of a digital waveform. An intriguing aspect of the result is that WaveNet models not only the incidental aspects of speech in the training examples, but the very acoustics of the rooms in which they were recorded.
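To make the sample-by-sample procedure concrete, here is a minimal sketch in Python. It is not DeepMind's network: the function `predict_next` is a hypothetical stand-in, a two-tap resonator that happens to reproduce a sine tone, where WaveNet would instead consult its trained convolutional layers and draw each sample from a predicted distribution. The 220 Hz pitch and the two-sample context are assumptions chosen for illustration; only the figure of 22 thousand samples per second comes from the text above.

```python
import math

SAMPLE_RATE = 22_000  # samples per second, the figure quoted in the essay
TONE_HZ = 220.0       # assumed pitch for the toy signal (not from the source)


def predict_next(history, tone_hz=TONE_HZ):
    """Hypothetical stand-in for the trained model.

    Predicts the next sample from the two most recent ones using the
    sine recurrence x[n] = 2*cos(w)*x[n-1] - x[n-2].
    """
    w = 2 * math.pi * tone_hz / SAMPLE_RATE
    return 2 * math.cos(w) * history[-1] - history[-2]


def generate(seconds, tone_hz=TONE_HZ):
    """Build a waveform one sample at a time, each from its predecessors."""
    total = int(round(seconds * SAMPLE_RATE))
    w = 2 * math.pi * tone_hz / SAMPLE_RATE
    samples = [0.0, math.sin(w)]  # two seed samples to start the recurrence
    while len(samples) < total:
        samples.append(predict_next(samples))
    return samples[:total]


one_second = generate(1.0)
print(len(one_second))  # 22000 individual predictions for one second of sound
```

The point of the sketch is only the shape of the loop: every new sample is conditioned on what has already been emitted, so the generator is in effect listening to its own output as it speaks.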
WaveNet’s use of raw audio invokes what media theorist Wolfgang Ernst dubs “acoustic knowledge” (Ernst 179). For him, such knowledge is a matter of media rather than cultural interpretation, embodied in the material processes by which sound is recorded on a phonographic disc. As he puts it, “these are physically real (in the sense of indexical) traces of past articulation, sonic signals that differ from the indirect, arbitrary evidence symbolically expressed in literature and musical notation” (Ernst 173). It is the “physically real frequency” (Ernst 173) that matters, the signal over semantics. Ernst makes clear the implications for listening: “Cultural tradition, or the so-called collective memory, does not lead to a reconstruction of the actual sonic eventality; we have to switch our attention to the laws of technological media in order to be in a position to reenact past sound experience” (Ernst 176). Ernst’s “media archaeology” is thus concerned with the “event” as a confluence of dynamical processes, albeit one inscribed in material artifacts.
To provide my own example: in a tape recording from the late 1940s of my grandmother speaking, she has a distinct Pennsylvania Dutch accent. This was something of a revelation when I first heard it some 60 years later, having known her as an elderly woman with no such inflection. Her description of those years to me was to some extent limited by its telling; it required machine temporality, rather than human, to reveal the dialect that was inevitably missing from her own narrative. The sonographic resonance was something different from the hermeneutic empathy of her stories. To me, they are equally touching; Ernst would privilege the former.
And yet analog recording media are not without their own acoustic inflections—the hiss and pops of tape or record are an added valence to the sonic events they reproduce. There is a “style” to media, a dialect in this addition. For Ernst, this indicates how the medium is inseparable from the recording. It also mitigates the insinuation that a technical signal, in its physical realness, is somehow objective or unmediated. Rather, material contingencies comprise the character of such listening machines. Further, that a phonograph is an imperfect listener grants it some affective agency; its status as a listener is in fact predicated on having experienced in recording a change that is expressed in playback.
Such is the nature of sound. As Brandon LaBelle puts it, “Sound is intrinsically and unignorably relational: it emanates, propagates, communicates, vibrates, and agitates; it leaves a body and enters others; it binds and unhinges, harmonizes and traumatizes; it sends the body moving” (LaBelle ix). Sound leaves an impression. How we experience it and how we respond to it with our own particular bodies is conditioned by both physiology and past experience that marks us as listeners, whether non-biological or of a race, class, culture, species. Listening to something cannot just be a matter of source + receiver; it is a material entanglement of these two together.
From this perspective, Ernst’s fascination with technical apparatuses is unnecessarily circumscribed. In the effort to assert acoustic knowledge over symbolic meaning, he sidesteps the material nature of human listening. It’s revealing when he writes that “Instead of applying musicological hermeneutics, the media archaeologist suppresses the passion to hallucinate ‘life’ when he listens to recorded voices” (Ernst 60). Such a call for “unpassioned listening” (Ernst 25) might be an attempt at empathizing with the machines, but it is at odds with the interrelationality of listening and oddly replays the detached ocularity—the cold gaze—of colonial naturalism. Can we ask instead if there are physical processes of which that “life” so characteristic of human listening is comprised?
This leads us to the history of research on human perception of which artificial intelligence like WaveNet is progeny. Jonathan Sterne recounts how beginning in the 1920s, institutes like Bell Labs realized that “Sound was not simply something out there in the world that human and animal ears happened to encounter and faithfully reproduce; nor were human ears solipsistically creating sound through the simple fact of hearing” (Sterne 98). Instead, “Hearing was itself a medium and could therefore be understood in terms analogous to the media that were being built to address it” (Sterne 99). This demonstrates the perspective of Ernst, and Friedrich Kittler before him, that the invention of that medium, the phonograph, predetermined such a revelation. Regardless, the cochlea of the human ear and its psychoacoustic properties made possible what Sterne calls “perceptual coding” (Sterne 2) that capitalizes on the difference between human and machine listening. If, depending on conditions, the human can perceive only a fraction of the frequencies audible to the machine, and if the machine is able to digitally encode only those frequencies, there is a surplus of bandwidth that remains. Multiple streams of acoustic data can therefore be processed simultaneously, or, in particular, sent down a telephone line at the same time (all current telephony infrastructure does this). The difference in our listening capacities thus produces a poly-temporal relation.
This complicates a simplistic notion of acoustic knowledge as a direct signal. The machine, here, is no less comprised of processes that are physically real, but there exists a material semiotics in the digital encoding performed by its internal processor. Ernst excludes this from cultural symbolism as it operates on a machinic level “below the sensual thresholds of sight and sound,” a level “that is not directly accessible [to] human sense because of its sheer electronic and calculating speed” (Ernst 60). But digital logic contains within it an adaptation to human sense that mediates between our differing temporalities. Computers typically sample audio at 44.1 kHz, a number chosen to match the standard threshold of human hearing, but far below the capacity of contemporary digital processors (such as the 3 GHz computer I am typing this on). From the perspective of Sterne’s perceptual researchers, that threshold is a sensible choice if one wants to treat hearing as a medium. Already, then, the human body reverberates in the digital acoustic impression.
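The arithmetic behind these figures can be laid out in a few lines of Python. This is a sketch resting on the sampling theorem, by which a digital recording can only represent frequencies below half its sampling rate; the function names are my own, and the 22 kHz rate is the one this essay attributes to WaveNet.

```python
def nyquist_limit_hz(sample_rate_hz):
    """Highest frequency a given sampling rate can represent (half the rate)."""
    return sample_rate_hz / 2


def cycles_per_sample(clock_hz, sample_rate_hz):
    """How many processor clock ticks elapse between two audio samples."""
    return clock_hz / sample_rate_hz


CD_RATE_HZ = 44_100             # CD-quality sampling rate
WAVENET_RATE_HZ = 22_000        # rate attributed to WaveNet in this essay
CPU_CLOCK_HZ = 3_000_000_000    # the 3 GHz machine mentioned above

print(nyquist_limit_hz(CD_RATE_HZ))       # 22050.0
print(nyquist_limit_hz(WAVENET_RATE_HZ))  # 11000.0
print(round(cycles_per_sample(CPU_CLOCK_HZ, CD_RATE_HZ)))  # 68027
```

Roughly sixty-eight thousand clock ticks pass between one audio sample and the next: the machine idles, so to speak, at the tempo of the human ear.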
However, we’re not much closer to our hallucinations. Sterne dubs the perceptual model “hypodermic” (Sterne 74) in that it assumes hearing is akin to the transmission of a message straight to the cochlea that might as well bypass the body—the audio signal is presumably “decoded” by some cognitive function thereafter. Ernst’s divide between technicity and cultural knowledge is, perhaps, similar, stuck within an idea of source + receiver. Consider, though, a problem I’ve recently come up against—the frame rate of virtual reality systems. For decades, film was made and shown at 24 frames-per-second. Though much slower than the ear, this rate was similarly determined by a perceptual limit beyond which a sequence of images appears convincingly continuous. But an audience sitting still in a theater looking at a stationary projection is a different story than one moving around with screens glued to their faces. As it turns out, anything less than 60fps in VR is stomach-churning—not only does the gastro-intestinal system then make it into machine rhythms, but it shows how the temporality of human senses is not so easily isolated from its embodied material-cultural situation.
Recent cognitive science research has shed further light on how that might work. “Neural resonance theory,” championed by Edward Large, observes (via fMRI) that electrical oscillations between neurons in the brain entrain to the rhythmic stimulation of the body by music or other behaviors. Once adapted, these endogenous oscillations can be maintained independently. Are these not our hallucinations? If Large is correct, the brain’s primary purpose might be that of a complex oscillator constantly adapting to its environment, not via some internally coded representation, but as a physical coupling of brain to world via the body. The song that pops into your head, the voice that you recognize, the familiar acoustic quality of a habitual space: these experiences are forms of acoustic knowledge that are not limited to technical inscription by the machine, but which are no less material as they resonate within your own physiology.
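Entrainment of this kind can be illustrated with a generic textbook device, an adaptive phase oscillator. The sketch below is not Large's model, which is considerably richer; it is a minimal illustration under assumed values (a 2 Hz pulse, an oscillator that begins at 1.8 Hz, and arbitrary coupling and learning constants). The oscillator is pulled toward the stimulus while it listens, and its own frequency drifts to match, so that it would carry on at the learned tempo once the stimulus falls silent.

```python
import math


def entrain(stim_hz=2.0, natural_hz=1.8, coupling=3.0, learn_rate=2.0,
            listen_s=20.0, dt=0.001):
    """Return the oscillator's own frequency (Hz) after listening.

    theta is the oscillator's phase and omega its intrinsic angular
    frequency. Both are nudged by the sine of the phase gap to the stimulus.
    All parameter values are illustrative assumptions.
    """
    omega_stim = 2 * math.pi * stim_hz
    omega = 2 * math.pi * natural_hz
    theta = 0.0
    for step in range(int(round(listen_s / dt))):
        gap = math.sin(omega_stim * step * dt - theta)
        theta += (omega + coupling * gap) * dt  # phase is pulled toward the beat
        omega += learn_rate * gap * dt          # frequency adapts and is retained
    return omega / (2 * math.pi)


learned_hz = entrain()
deaf_hz = entrain(coupling=0.0, learn_rate=0.0)  # no coupling: nothing changes
print(round(learned_hz, 3), round(deaf_hz, 3))
```

What the oscillator keeps is not a recording of the stimulus but a changed disposition, a frequency it now sustains on its own, which is the sense of "maintained independently" at stake here.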
This would not be news to Henri Lefebvre. Ernst’s dispassion contrasts with Lefebvre’s warm-bloodedness, in which “the living body has (in general) always been present: a constant reference. The theory of rhythms is founded on the experience and knowledge of the body; the concepts derive from this consciousness and this knowledge, simultaneously banal and full of surprises” (Lefebvre 67). Rhythm, here, might be compared to acoustic knowledge as it is a form of material memory, but it encompasses a greater sense of both contingency and potentiality. Lefebvre’s “rhythmanalysis” is also concerned, like Ernst’s media archaeology, with the event: “Everywhere there is interaction between a place, a time and an expenditure of energy, there is rhythm” (Lefebvre xv). However, for Lefebvre “We know that a rhythm is slow or lively only in relation to other rhythms (often our own: those of our walking, our breathing, our heart)” (Lefebvre 10). Furthermore, these rhythms are not spontaneous or self-contained but are the result of a process of external influences. This he labels “dressage,” or training, the acculturation of an individual to a socially produced articulation of time (Lefebvre 39). This could be described as inscription, but it realizes the necessity of its own continual reperformance.
We know by now that the meaning of speech is not just a matter of semantics. As Deleuze and Guattari put it, “Because style is not an individual psychological creation but an assemblage of enunciation, it unavoidably produces a language within a language” (Deleuze 97). This second-order language, this style, this rhythm, is what is important to the rhythmanalyst, and what she can offer to the media archaeologist. For it brings an enunciation into play with the listening that conditions it. Ernst’s strict division of the semantic versus the technical requires us to repress the very reverberations that make acoustic knowledge significant, the chain of embodied entrainments in which both we and the machine are co-implicated. And yet, conversely, the pulse of the machine is absent in Lefebvre’s thinking, and can only be supplied by a close attention to technical means. To my ear, something like WaveNet requires their interanimation.
WaveNet is a listening machine. Like a phonograph, it processes raw audio, and reproduces raw audio in return. It operates beneath a human conception of what speech “is” and captures instead the acoustic knowledge that actually composes it. That we recognize the quality of that audio as important to a “realistic” voice shows that humans, too, possess a means of acoustic knowledge beyond the semantic: a sense of rhythm. WaveNet also functions as an algorithm for perceptual coding concerned with these very features: what’s retained from those 44 hours in the 10-second snippet is a sense of the embodied human enunciation. The mechanism through which WaveNet “learns,” training a deep convolutional neural network (van den Oord), is in fact an entrainment to these rhythms. Starting as a blank slate (like children shipwrecked alone in a lagoon, natch), with the introduction of each human recording it learns how to predict the sequence of audio samples relative to a given text. With each recording it hears, it changes. This is what makes it a listener, and a better one than a phonograph that can only receive a single sonic impression.
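The claim that such a machine changes with each recording it hears can be sketched with a far humbler learner than a deep network. What follows is an assumption-laden toy, not WaveNet's training procedure: a two-weight linear predictor, adjusted by the classic least-mean-squares rule, that starts as a blank slate and learns to guess the next sample of a few synthetic "recordings" (sine tones of my own invention, differing only in phase).

```python
import math


def tone(cycle_len=6, length=600, phase=0.0):
    """A synthetic 'recording': a sine tone repeating every cycle_len samples."""
    return [math.sin(2 * math.pi * n / cycle_len + phase) for n in range(length)]


def train_predictor(recordings, step_size=0.2):
    """Learn to predict each sample from the two before it (LMS rule).

    Returns the final weights and the absolute prediction error at every
    step, so the change wrought by listening can be inspected.
    """
    weights = [0.0, 0.0]  # the blank slate
    errors = []
    for recording in recordings:
        for n in range(2, len(recording)):
            past = (recording[n - 1], recording[n - 2])
            guess = weights[0] * past[0] + weights[1] * past[1]
            error = recording[n] - guess
            # Each sample heard nudges the weights toward a better guess.
            weights[0] += step_size * error * past[0]
            weights[1] += step_size * error * past[1]
            errors.append(abs(error))
    return weights, errors


recordings = [tone(phase=p) for p in (0.0, 0.7, 1.9)]
weights, errors = train_predictor(recordings)
print([round(w, 3) for w in weights])
print(sum(errors[:100]) > sum(errors[-100:]))
```

The weights settle near 1 and -1, the values that reproduce this particular tone, and the prediction error falls as listening proceeds. The learner ends up shaped by what it has heard, which is all that is meant here by entrainment.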
We know from Large that the quality of internal oscillation in human physiology is conditioned by the environment; rhythmanalysis demonstrates that how you listen and how you walk, have sex, or use a computer are not materially separable. Likewise, WaveNet introduces its own inflections that are intrinsic to its material situation: algorithm, hardware, Google engineers. Its speech is a negotiation between human resonance and this embodied machine temporality. Lefebvre muses how “If one could ‘know’ from outside the beatings of the heart of … a person …, one would learn much about the exact meaning of his words” (Lefebvre 4). Beating at nonhuman rates, WaveNet both listens and speaks differently. What is it that we hear, then, in the melodrama of its babblings? Though its phonetic poetry is at first hearing benign, it raises the question of what qualities of enunciation it might normalize: who are the voices it listens to? To which listeners does it appeal? And how will interacting with WaveNet voices shape human ears, as they inevitably will?
This testing was conducted via online crowdsourcing. The anonymous, underpaid, typically non-US human labor involved in training contemporary AI systems is an intriguingly problematic method beyond the scope possible here.
Alan Turing proposed a test that predicated a machine’s ability to think on its ability to imitate a human. This was to be done via teletype; only written language is ever exchanged.
An adult human can typically hear up to 22 kHz; a sampling rate of twice this frequency is required to accurately reproduce the waveform (CD-quality audio is 44.1 kHz). WaveNet operates at 22 kHz, meaning it’s limited to frequencies below 11 kHz. It’s not hi-fi from an audiophile perspective, but that’s still pretty good.
Deleuze, Gilles and Felix Guattari. A Thousand Plateaus: Capitalism and Schizophrenia, trans. Brian Massumi. Minneapolis: University of Minnesota Press, 1987.
Ernst, Wolfgang. Digital Memory and the Archive. Minneapolis: University of Minnesota Press, 2013.
LaBelle, Brandon. Background Noise: Perspectives on Sound Art. London: Continuum, 2006.
Large, Edward, et al. “Neural networks for beat perception in musical rhythm” in Frontiers in Systems Neuroscience, 2015; 9: 159. <http://dx.doi.org/10.3389/fnsys.2015.00159>
Lefebvre, Henri. Rhythmanalysis: Space, Time, and Everyday Life. London: Continuum, 2004.
Sterne, Jonathan. MP3: The Meaning of a Format. Durham: Duke University Press, 2012.
van den Oord, Aäron, et al., “WaveNet: A Generative Model for Raw Audio,” presented at the 9th ISCA Speech Synthesis Workshop, published September 19, 2016, blog post <https://deepmind.com/blog/wavenet-generative-model-raw-audio/> accessed September 25, 2016.