Rough experiments in feature audiation
July 05, 2021
Feature visualization shows us that neural networks trained on images have an intricate model of the world that we can peek into using a collection of interpretability tools and techniques. When applied to computer vision networks like VGG or ResNet, we get a dizzying variety of visual patterns, textures, and abstract artifacts.
Because of some inherent difficulties, similar techniques aren't often transferred to models of other kinds of data. For example, text is less straightforward: BERT is a "bad dreamer." And it appears that feature visualization for audio has its own set of challenges. Given that audio models perform very well and are increasingly used in practice, it could be valuable to adapt existing visualization techniques to the class of audio models and 📐 see what happens 📐.
A challenge in working with audio is that model architectures vary widely. Generative models may be autoregressive (using convolution or recurrence), diffusion-based, or built on transposed convolution, while encoder models may be mostly recurrent or based on convolution and attention. Given this variety, it is hard to imagine a single approach that applies meaningfully to every type of model.
One way to solve this problem is to design and train a neural network specifically for interpretability or artistic reasons. An earlier paper inspired by DeepDream explored modifying the style of music using features from a model trained on people's musical tastes.
Another thought is that we could reasonably expect architectures to standardize over time — especially towards large, unsupervised models. And these are also the types of neural networks that interpretability would be best suited for, because their representations could be especially surprising or informative. Many recent models for raw audio are quite large and benefit from pre-training without labeled data. For example, WaveNet is a powerful generative network for speech and music and there is work on interpreting it (1, 2). Another unsupervised model is Wav2Vec. These models learn generally useful representations of audio, which makes them an interesting challenge for interpretability.
Because the artifacts generated by these experiments will be audio samples rather than images, I like to call the process "feature audiation" rather than feature visualization, to avoid confusion. My goal is to show that feature audiation is possible for a modern audio network using various techniques from feature visualization, and hopefully this will encourage future work to find more refined approaches than my rough experiments.
Modeling Speech With Wav2Vec2
While both music and speech are sources of training data, I decided to use a model trained on speech only. I also picked an encoder model rather than a generative or autoregressive one, because it is more straightforward to adapt to feature visualization.
One of the most recent architectures for speech processing is Facebook's Wav2Vec2. Wav2Vec2 starts with a handful of 1D convolutional layers followed by a transformer encoder. It was trained on 53,000 hours of unlabeled speech, and its weights are made available via fairseq or HuggingFace.
We can think of Wav2Vec2 as a general algorithm for converting a waveform of speech audio into a stream of lower-frequency contextual representations. In the case of Wav2Vec2-base, the input is downsampled by a factor of 320 and each representation has 768 dimensions. Thus, because the model is trained on audio sampled at 16 kHz, its output is a sequence of 768-dimensional vectors at a rate of 50 Hz (every second of audio has 50 corresponding vectors, or one vector for every 20 milliseconds). It was trained similarly to a masked language model, where some inputs to the transformer are masked out and need to be predicted in the context of the remaining inputs. But unlike language, speech doesn't come in discrete tokens, so there's an additional learning objective that encourages diverse discrete codes representing each section of audio.
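The 50 Hz figure can be checked with a little arithmetic. The sketch below assumes the commonly published Wav2Vec2-base feature-encoder configuration (seven unpadded 1D convolutions with the kernel sizes and strides listed); those values are not given in this post, so treat them as an assumption.

```python
# Assumed Wav2Vec2-base feature-encoder configuration (not stated in this post):
# seven unpadded 1D convolutions with these kernel sizes and strides.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # Hz, as stated in the text


def total_stride(strides=CONV_STRIDES):
    """Number of input samples between consecutive output vectors."""
    product = 1
    for stride in strides:
        product *= stride
    return product


def num_frames(num_samples, kernels=CONV_KERNELS, strides=CONV_STRIDES):
    """Output length after applying each unpadded convolution in turn."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        if length < kernel:
            return 0  # input too short to produce any output vector
        length = (length - kernel) // stride + 1
    return length


hop = total_stride()                 # samples per output vector
frame_rate = SAMPLE_RATE / hop       # output vectors per second
one_second = num_frames(SAMPLE_RATE) # vectors produced for one second of audio
```

Under these assumptions the hop is 320 samples, which gives the 50 Hz rate. One second of audio yields 49 vectors rather than exactly 50, because unpadded convolutions lose a little context at the edges.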
These representations are useful for various tasks. For example, one may fine-tune Wav2Vec2 for speech recognition by mapping its output to characters of an alphabet or a dictionary of phonemes, via Connectionist Temporal Classification. And because the representations are not supervised by transcripts, the vectors also likely contain paralinguistic information, which could be used for tasks like emotion or speaker recognition. It's likely that somehow, the activation patterns of Wav2Vec2 collectively represent things like language, affect, and speaker identity — and this should motivate more efforts in interpretability.
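To make the CTC idea concrete, here is a minimal sketch of greedy CTC decoding: each 20 ms frame gets one predicted symbol, repeated symbols are merged, and a special blank symbol is dropped. The vocabulary, the blank index, and the function name are hypothetical and chosen only for illustration; a real fine-tuned model defines its own.

```python
BLANK = 0  # hypothetical index of the CTC blank symbol


def ctc_greedy_decode(frame_ids, id_to_char, blank=BLANK):
    """Collapse per-frame predictions: merge repeats, then drop blanks."""
    chars = []
    previous = None
    for idx in frame_ids:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank.
        if idx != previous and idx != blank:
            chars.append(id_to_char[idx])
        previous = idx
    return "".join(chars)


# Toy vocabulary and a toy sequence of per-frame argmax predictions.
vocab = {1: "c", 2: "a", 3: "t"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]
decoded = ctc_greedy_decode(frames, vocab)
```

The blank is what lets the same character appear twice in a row: the frames `[1, 0, 1]` decode to two separate characters, while `[1, 1]` decodes to one.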
Technique for optimizing neurons
Importantly, images and audio can both be converted to the frequency domain via a Fourier transform. In images, it's been found that generating samples from the (power-normalized) frequency domain tends to produce better results than optimizing directly from the spatial domain. I found that this is also true for audio. It is also helpful to "augment" the samples with different kinds of transformations, making them more robust. We get improved results by representing inputs through the following pipeline:
- Random frequencies: sample random, normally distributed coefficients in the frequency domain.
- Normalize: multiply each frequency coefficient by the reciprocal of its corresponding frequency. (See
- Convert to time domain: apply the inverse Fourier transform to produce a signal in the time domain.
- Augment: apply data augmentation to generated inputs: small left-to-right translation and elementwise Gaussian noise. (See
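The four steps above can be sketched with NumPy. This is a rough illustration, not the code used for these experiments: the function names, the clamp at the zero-frequency bin, the peak normalization, the circular shift, and the default shift and noise sizes are all assumptions. In an actual optimization loop the spectrum would be the trainable parameter in an autograd framework rather than a fixed NumPy array.

```python
import numpy as np


def random_spectrum(num_samples, rng):
    """Step 1: complex Gaussian coefficients for the one-sided spectrum."""
    n_bins = num_samples // 2 + 1
    return rng.normal(size=n_bins) + 1j * rng.normal(size=n_bins)


def normalize_spectrum(spectrum, num_samples, sample_rate=16_000):
    """Step 2: scale each coefficient by the reciprocal of its frequency."""
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / sample_rate)
    # Assumption: clamp the 0 Hz bin to the lowest nonzero frequency
    # to avoid dividing by zero.
    return spectrum / np.maximum(freqs, freqs[1])


def to_waveform(spectrum, num_samples):
    """Step 3: inverse real FFT, then peak-normalize (an added convenience)."""
    wave = np.fft.irfft(spectrum, n=num_samples)
    return wave / (np.max(np.abs(wave)) + 1e-8)


def augment(wave, rng, max_shift=160, noise_std=0.01):
    """Step 4: small random time shift (circular here) plus Gaussian noise."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(wave, shift) + rng.normal(scale=noise_std, size=wave.shape)


rng = np.random.default_rng(0)
num_samples = 16_000  # one second at 16 kHz
spectrum = random_spectrum(num_samples, rng)
wave = to_waveform(normalize_spectrum(spectrum, num_samples), num_samples)
batch = np.stack([augment(wave, rng) for _ in range(4)])  # 4 augmented copies
```

Each optimization step would feed a freshly augmented batch to the model, so the generated sound has to activate the target regardless of small shifts and noise.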
Neuron and Channel Audiation
Using this technique, we can activate individual neurons with generated samples. Neuron audiation is when we maximally activate a neuron while discouraging activations of its neighbors.
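One way to write such an objective is sketched below. It assumes activations arrive as a `(time, units)` array and reads "neighbors" as the other units in the same time frame; that reading, the penalty form, and the `neighbor_weight` value are assumptions for illustration, not the exact objective used here.

```python
import numpy as np


def neuron_objective(activations, time_index, unit, neighbor_weight=0.1):
    """Score to maximize: target activation minus a penalty on its neighbors.

    activations: array of shape (time, units) from one layer.
    """
    frame = activations[time_index]
    target = frame[unit]
    # Assumption: "neighbors" are all other units at the same time step.
    neighbors = np.delete(frame, unit)
    return target - neighbor_weight * np.mean(np.abs(neighbors))


# Toy activations: 2 time steps, 3 units.
acts = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
score = neuron_objective(acts, time_index=1, unit=2, neighbor_weight=0.5)
```

Gradient ascent on this score with respect to the input spectrum would push the chosen unit up at one time step while keeping the rest of that frame quiet.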
It was easier to focus on the transformer module of Wav2Vec2, so I start with the early layers of the transformer. The audiations here sound faintly speech-like. Maybe they represent phonemes or facets of phonemes.
The transformer in Wav2Vec2-base has 12 layers. These are some samples taken from the latter half of the transformer.
A closely related technique is channel audiation, where a particular channel is maximized across the whole spatial axis (or the time axis, in the case of audio). In images, this technique generates abstract but well-structured textures. In this early experiment, we get chaotic sounds with some repeating features. There's something eerie or comical about the humanness of some of these "audio textures."
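As a minimal sketch, a channel objective can be written as the average of one unit's activation over every time step, rather than its value at a single position. The `(time, units)` layout and the use of a plain mean are assumptions.

```python
import numpy as np


def channel_objective(activations, unit):
    """Score to maximize: mean activation of one unit over all time steps.

    activations: array of shape (time, units) from one layer.
    """
    return float(np.mean(activations[:, unit]))


# Toy activations: 2 time steps, 3 units.
acts = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
score = channel_objective(acts, unit=1)
```

Because every time step contributes equally, the optimizer is rewarded for filling the whole clip with whatever the unit responds to, which may be why the results sound like repeating textures.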
There seem to be many things to explore regarding interpretability for audio models. For example, we could:
- Retrieve dataset examples of neuron activations
- Improve signal-to-noise ratio via regularizers
- Identify paralinguistic features in Wav2Vec2
- Generate multiple samples for a single target, optimizing for diversity
- Long-term consistency: attempt to generate complete words
Thanks to Seung-won Park for guidance on modern speech models, Nick Moran for sharing ideas about feature visualization on transformer networks, and Peter Lake for discussing his similar experiments with WaveNet (repo.)
I'm also very grateful to the authors of Distill.pub for providing an overview and contribution to feature visualization techniques and Facebook AI Research for releasing their models.