Generation of Sequential Data

This machine learning (ML) project was part of my bachelor's thesis. The main tasks were to analyze and understand the MusicVAE architecture and to re-implement it. The thesis and the final presentation are available for reading. The code belongs to Fraunhofer IDMT, so I cannot disclose it.

During the project I learned to use PyTorch, PyTorch Lightning, and TensorBoard for development, debugging, and training in a larger machine learning project. Furthermore, the project required carefully analyzing a complex code base written in a legacy framework (TensorFlow 1). In particular, I had to work out details of the exact structure (vector sizes, etc.) of the MusicVAE architecture.

Figure 1: MusicVAE. Adapted with changes from https://magenta.withgoogle.com/music-vae

MusicVAE is a generative model: a variational auto-encoder (VAE) serves as the generative framework, and its encoder and decoder are parameterized by long short-term memory (LSTM) networks so that the model can represent sequential data. In particular, it is designed to replicate sequential data with some form of hierarchical structure, a property shared by, e.g., note sequences. In my thesis the data were sequences of one-hot vectors representing MIDI notes, but the main concepts transfer to other applications as well.
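To illustrate the data representation, the following sketch encodes a monophonic MIDI pitch sequence as one-hot vectors. The pitch range and vocabulary here are hypothetical; the actual MusicVAE encoding also includes events such as note-off and rest, so this is only a simplified illustration.

```python
# Sketch: encoding a monophonic MIDI note sequence as one-hot vectors.
# The pitch range below is a hypothetical choice, not the thesis's exact one.

PITCH_MIN, PITCH_MAX = 48, 84        # assumed playable range (C3..C6)
VOCAB_SIZE = PITCH_MAX - PITCH_MIN + 1

def one_hot(pitch: int) -> list[int]:
    """Map a MIDI pitch to a one-hot vector over the vocabulary."""
    vec = [0] * VOCAB_SIZE
    vec[pitch - PITCH_MIN] = 1
    return vec

def encode_sequence(pitches: list[int]) -> list[list[int]]:
    """Encode a pitch sequence as a list of one-hot vectors (one per step)."""
    return [one_hot(p) for p in pitches]

sequence = encode_sequence([60, 62, 64, 65])  # C4, D4, E4, F4
print(len(sequence), len(sequence[0]))        # 4 time steps, 37-dim vectors
```

Each time step of such a sequence is then fed to the LSTM encoder, and the decoder emits one such vector per step.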

Figure 2: Variational auto-encoder (VAE) architecture as the generative framework.

Figure 3: Long short-term memory (LSTM) architecture for learning to represent sequential data. Adapted with changes from https://colah.github.io/.

Thesis Abstract

The abstract of the thesis can be read below.

Automatic music generation (AMG) systems are computer-based models that produce signals that can be interpreted as music, such as waveforms or note sequences. Apart from purely commercial motivations, like the creation of a potentially unlimited amount of music, such systems are also interesting from a creative point of view, since they could be used for support during the music composition process, or even be understood as a new type of instrument. In this work, the flat variant of the AMG system MusicVAE is re-implemented and trained. It is used for unconditioned generation of monophonic note sequences of two bars in length. MusicVAE is a variational auto-encoder. It maps a given note sequence to a multivariate Gaussian distribution in a lower-dimensional space (called latent space). From that distribution, it samples a vector and decodes it to another note sequence. The model can be trained so that the reconstructed sequence is as similar as possible to the input sequence. By randomly sampling and decoding vectors from the latent space after training, new pieces of music can be generated. In this work, a set of generated note sequences is compared to a held-out test set using selected objective criteria and a subjective analysis. It is shown that the trained model struggles to reproduce the rhythmic structure of the training data and that generated note sequences often contain large jumps in pitch and single non-diatonic notes. Despite such outlying notes, most note sequences contain coherent and diatonic melodies. In particular, interesting rhythms and melodies with multiple high, short notes followed by few longer notes are generated. In future work, ways to condition the latent space could be examined to find possibilities of controlling the generation process.
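The mapping to a multivariate Gaussian and the sampling step described above can be sketched numerically. The snippet below shows, under illustrative variable names, the closed-form KL-divergence term that pulls the encoder's diagonal Gaussian toward the standard-normal prior, and the reparameterization trick used to sample from it in a differentiable way; it is a minimal sketch of the standard VAE machinery, not the thesis code.

```python
import math

def kl_to_standard_normal(mu: list[float], sigma: list[float]) -> float:
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(
        s * s + m * m - 1.0 - 2.0 * math.log(s)
        for m, s in zip(mu, sigma)
    )

def reparameterize(mu: list[float], sigma: list[float], eps: list[float]) -> list[float]:
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    so the sampling step stays differentiable w.r.t. mu and sigma."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

# When the encoder output already matches the prior, the KL term vanishes:
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0
```

During training this KL term is added to a reconstruction loss; after training, new sequences are generated by sampling eps directly from N(0, I) and decoding the resulting latent vector.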

Excerpts to Listen to

The excerpts mentioned in the presentation can be downloaded here.