Acoustic Ecology of an AI System
Amina is a designer and researcher currently undertaking a Techne-funded PhD in the School of Communication at the Royal College of Art, in partnership with IBM, around the themes of Artificial Intelligence and Voice. She is also a trained singer who has performed internationally with a number of choirs for over 20 years, and regularly for artists' projects. Amina is interested in the social, cultural and ethical implications of emerging technology. She employs voice as a medium, exploiting vocal potential to devise stories about alternate arrangements for society via design, technology and politics.
Acoustic Ecology of an AI System is a work-in-progress that involved creating experimental acoustics for synthesised AI voices to explore how we could use sound to locate non-human voices in fictional space, time, environment and architecture. The intention is not to create a more streamlined, efficient product, or to resolve usability issues, but to present alternative, poetic ways of perceiving these entities with and through a more holistic concept of what this form of communication entails. Berardi (2018) says that to employ poetry is to explore ‘beyond the limit of conventional meaning, and simultaneously [expose] a revelation of a possible sphere of experience not yet experienced’ (Berardi, p.21). In this project poetry is used as a method to evade the specifics of data and instead ignite the imagination to distort present conditions, highlighting alternative possibilities for synthesised voices and AI vocal communication.
‘A young woman from Colorado; the youngest daughter of a research librarian and physics professor who has a B.A. in history from Northwestern, an elite research university in the United States; and as a child, won US$100,000 on Jeopardy Kids Edition, a televised trivia game. She used to work as a personal assistant to a very popular late-night TV satirical pundit and enjoys kayaking.’ Here, James Giangola, a lead conversation and personality designer for Google Assistant, describes how the assistant was imagined during its design process (West, Kraut & Chew, 2019).
This quote shows that even before an AI assistant starts its learning process it is far from being conceived as impartial or neutral, and yet, when we interact with one, this is what we are led to perceive. ‘Whichever voice assistant you choose, nearly all of them give us the feeling of interacting with a woman’ (Krejci, 2018). The gendering of synthesised voices has been the topic of much debate in research and the media since the recent proliferation of AI-enabled voice assistants in the home, in the workplace and as a means of accessing services. Conversely, we could avoid the uncanny valley (Mori, 2012) altogether. Artist and programmer Nicole He suggests that, even as speech synthesis technology advances, we preserve the recognizably robotic text-to-speech “voice” as an important artistic aesthetic, so that we are able to distinguish between entities (He, 2019).
Today voice-enabled AI assistants are intended to sound as natural as possible and are designed to emulate human communication, complete with prosody, pauses and punctuation (Leviathan, 2018). However, the AI synthesised voices that we interact with are sonically flat and acoustically unassuming, especially compared to interactions with the human voice. We are given no clues as to where this acousmatic voice is emanating from or what environment it is located in.
In search of a fresh perspective, if we consider the aesthetics of synthesised voices, as highlighted by He (2019), we can contemplate the voice as a sonic object. Here a creative space emerges to manipulate the audible material to present different poetic possibilities. Don Ihde (2007) says ‘The tendency to miss the sonorous quality of speech is related to the tendency to forget backgrounds and to abstractly believe that one can attend to a thing-in-itself. This peculiar and often highly functional background does, however, present itself in dramaturgical forms of speech such as those found in rhetoric, poetry and chanting, and the actor’s voice. In such cases even while there continues to be a “showing through” the spoken language, the embodiment of that language in sound is more keenly noticeable’ (Ihde, p.138). He continues, ‘By opening the word to a wider and deeper context, the word becomes “poetic” in the sense of a bringing-into-being of a meaning; but at the same time it is a bringing-into-being of a meaning that I almost “knew all the time.”’ (Ihde, p.165).
By experimenting with applying effects to synthesised voices we can mold their sonic, sonorous, material qualities. In this project, the intention was to reattach disembodied acousmatic voices to space and time in order to provide alternative poetic perspectives on these interfaces, and also to critique and contemplate the use of synthesised voices and how we design and implement them.
This project also draws on the discipline of Acoustic Ecology, originated by R. Murray Schafer (1977), who suggests that we try to hear the acoustic environment as a musical composition and, further, that we own responsibility for its composition (Schafer, p. 205). The practice relies heavily on field recording and on soundscapes as composition. Today acoustic ecology is heavily concerned with environmental and broader ecological issues and even supports scientific research into our changing environment, for example by monitoring species decline due to urbanization.
Insight into what lies beneath AI’s sleek, shiny surface was beautifully and diligently documented in Anatomy of an AI System by Kate Crawford and Vladan Joler. Crawford (2018) says ‘Put simply: each small moment of convenience – be it answering a question, turning on a light, or playing a song – requires a vast planetary network, fueled by the extraction of non-renewable materials, labour, and data.’
Particularly fascinating, yet deeply worrying, are the people hidden within the system, illustrated in the diagram. Some of these people are referred to as mechanical turks or crowd workers: people hired by remotely located businesses to perform discrete on-demand tasks that computers are currently unable to do. Employers post jobs known as Human Intelligence Tasks (HITs), which involve work such as identifying specific content in an image or video, writing product descriptions, or answering questions. In other words, these are tasks that we perceive the technology, or AI, to be fulfilling. It is also imperative to understand the precarious and/or dangerous working conditions that some of the people within the system face, for example in noisy and dusty underground mines or huge smelting factories where dangerous fumes and vapours reside. These are aspects of the technology that go unseen, unrepresented and, most likely, unperceived by the general public.
Google’s WaveNet is a deep neural network for generating raw audio; trained on recordings of real speech, it has enabled the creation of relatively realistic-sounding, human-like voices. However, when the network is trained without the text sequence, it still generates speech, but now it has to make up what to say. This results in a kind of babbling, where real words are interspersed with made-up, word-like sounds (Van den Oord & Dieleman, 2016). Clips of WaveNet’s babble were harvested as source audio material for this work, framing the experiments as a speculative intervention that interrupts the existing sound design process.
The babble clips were treated with acoustic modelling software plug-ins that allow an environment or space to be sketched visually in 3D and then, through a digital process, generate geometrical acoustics to be applied to the source audio. Anatomy of an AI System was used as a guide map to prototype the different environments to be modelled and to create a sonic presentation of Crawford and Joler’s schematic: Acoustic Ecology of an AI System.
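To give a rough sense of what applying geometrical acoustics to a source clip can mean, the following is a minimal sketch in Python. It is not the plug-in software used in the project, which is not named here. It assumes a simple box-shaped room and uses the image-source method limited to first-order reflections: the source is mirrored across each of the six walls, and each mirrored copy contributes a delayed, quieter arrival at the listener. The function names, room dimensions, positions, sample rate and wall reflection value are all hypothetical choices made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second; assumed value for air at room temperature


def first_order_arrivals(room, source, listener, sample_rate=16000, reflection=0.7):
    """Return (delay_in_samples, amplitude) pairs for the direct path and six wall reflections."""
    images = [(tuple(source), 1.0)]  # the direct sound loses nothing to a wall
    for axis in range(3):
        for wall in (0.0, room[axis]):
            image = list(source)
            image[axis] = 2.0 * wall - source[axis]  # mirror the source across this wall
            images.append((tuple(image), reflection))

    arrivals = []
    for position, gain in images:
        distance = math.dist(position, listener)
        delay = round(distance / SPEED_OF_SOUND * sample_rate)
        arrivals.append((delay, gain / max(distance, 1e-6)))  # 1/distance spreading loss
    return arrivals


def place_in_room(dry, arrivals):
    """Mix delayed, attenuated copies of the dry signal (a sparse convolution), then peak-normalise."""
    wet = [0.0] * (len(dry) + max(delay for delay, _ in arrivals))
    for delay, amplitude in arrivals:
        for i, sample in enumerate(dry):
            wet[delay + i] += amplitude * sample
    peak = max(abs(s) for s in wet)
    return [s / peak for s in wet] if peak > 0 else wet


# Hypothetical example: a single click stands in for a babble clip,
# placed in an assumed 5 m x 4 m x 3 m room.
dry = [1.0] + [0.0] * 99
arrivals = first_order_arrivals(
    room=(5.0, 4.0, 3.0), source=(1.0, 1.0, 1.0), listener=(4.0, 3.0, 2.0)
)
wet = place_in_room(dry, arrivals)
```

Changing the assumed room dimensions or the reflection value changes the timing and strength of the echoes, which is the sense in which a modelled space can be heard in the treated voice. A fuller model would add higher-order reflections, frequency-dependent absorption and more complex geometry.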
Berardi, F. (2018). Breathing: Chaos and Poetry. Semiotext(e) Intervention Series. California.
Crawford, K. & Joler, V. (2018) Anatomy of an AI System. [Online] https://anatomyof.ai/ [Accessed: 26th March 2020]
He, N. (2019). Robots Shouldn't Sound Human: The Aesthetics of the Computer Voice in Art and Games. Amaze Berlin Conference. [Online] https://amazeberlin2019.sched.com/event/NE3y/nicole-he-robots-shouldnt-sound-human-the-aesthetics-of-the-computer-voice-in-art-and-games [Accessed: 1st May 2019]
Ihde, D. (2007, 2nd Edition). Listening and Voice: A Phenomenology of Sound. Ohio University Press, Athens, Ohio, USA.
Krejci, J. (2018). Giving It Her Voice. Form Magazine No. 279 Sept / Oct 2018. pp. 46-53.
West, M., Kraut, R., Chew, H.E. (2019). The rise of gendered AI and its troubling repercussions. I’d Blush if I Could. Equals Global Partnership, UNESCO. pp. 85-135.
Leviathan, Y. (2018). Google Duplex: An AI System for Accomplishing Real-World Tasks Over the Phone. [Online]. Google AI Blog. https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html [Accessed: 26th March 2020]
Mori, M. (2012). The Uncanny Valley. Translated by MacDorman, K.F., Kageki, N. IEEE Robotics & Automation Magazine. June, 2013. P. 99-100.
Schafer, R. M. (1977). The Tuning of the World. A.A. Knopf, New York.
Van den Oord, A. & Dieleman, S. (2016). WaveNet: A generative model for raw audio [Online]. DeepMind Blog. https://deepmind.com/blog/article/wavenet-generative-model-raw-audio [Accessed: 26th March 2020]