Abstract

Humans are able to learn and compose complex, yet beautiful, pieces of music, as seen for example in the highly intricate works of J.S. Bach. However, how our brain is able to store and produce these very long temporal sequences is still an open question. Long short-term memory (LSTM) artificial neural networks have been shown to be efficient in sequence learning tasks thanks to their inherent ability to bridge long time lags between input events and their target signals. Here, I investigate the possibility of training LSTM networks to learn and reproduce musical sequences, and thereby to better understand some of the mechanisms neural networks deploy to learn and compose structures over long time scales.

Learning music with LSTM networks first requires representing musical sequences in these networks. The musical representation developed for this work is inspired by the tonotopic representation of sounds in the auditory system. I show that LSTM networks are able to learn the note-to-note transitions of the monophonic and polyphonic versions of a simple song, using a network architecture in which both the input and the output of the LSTM network are musical notes in the developed representation. However, this architecture fails to learn longer and more complex musical sequences (e.g. the J.S. Bach cello suites).

To solve this problem, I introduce the separation-of-time-scales model, which consists of two connected LSTM networks operating on different time scales. On the one hand, trained slow-time-scale LSTM networks produce transitions between unique identifiers of musical patterns, which resemble a compressed memory of the pattern, akin to a neural memory; this provides the long-time structure of the music. On the other hand, trained fast-time-scale LSTM networks produce the note-to-note transitions of each musical pattern. The latter receive the identifiers from the slow-time-scale networks as additional inputs, akin to feed-forward input from memory regions of the brain; these unique identifiers bias the fast-time-scale LSTM networks toward producing the corresponding musical pattern. The most efficient pattern identifiers are found to be a representation of how similar patterns are to one another. Finally, when unlearned pattern identifiers are given to trained fast-time-scale networks, novel musical patterns are created from the learned production rules.

I show that the introduction of a separation of time scales greatly improves the capacity of LSTM networks to learn a larger body of musical sequences. Finally, I demonstrate that previously unseen input biases can be used to induce the network to generate new musical sequences that are akin to, but not identical to, known patterns. This presents a possible first step towards the generalization of previously learnt musical knowledge to the creation and composition of new music by artificial neural networks.
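As an illustration of the separation-of-time-scales architecture described above, the following is a minimal sketch, assuming PyTorch as the framework (the abstract does not specify one); the class names, dimensions, and the concatenation-based biasing are illustrative assumptions, not the exact implementation used in the thesis.

```python
# Hypothetical sketch: a slow-time-scale LSTM produces pattern identifiers,
# and a fast-time-scale LSTM produces note-to-note transitions biased by them.
import torch
import torch.nn as nn


class SlowTimeScaleLSTM(nn.Module):
    """Produces transitions between pattern identifiers (long time scale)."""

    def __init__(self, id_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(id_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, id_dim)

    def forward(self, identifier_seq):
        # identifier_seq: (batch, n_patterns, id_dim)
        h, _ = self.lstm(identifier_seq)
        return self.out(h)  # predicted next pattern identifiers


class FastTimeScaleLSTM(nn.Module):
    """Produces note-to-note transitions, biased by a pattern identifier."""

    def __init__(self, note_dim: int, id_dim: int, hidden_dim: int = 128):
        super().__init__()
        # The pattern identifier is appended to every note input, playing the
        # role of a feed-forward bias from a "memory" network.
        self.lstm = nn.LSTM(note_dim + id_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, note_dim)

    def forward(self, note_seq, identifier):
        # note_seq: (batch, n_notes, note_dim); identifier: (batch, id_dim)
        bias = identifier.unsqueeze(1).expand(-1, note_seq.size(1), -1)
        h, _ = self.lstm(torch.cat([note_seq, bias], dim=-1))
        return self.out(h)  # predicted next notes


# Illustrative dimensions only: a 24-dimensional note representation and
# 8-dimensional pattern identifiers.
slow = SlowTimeScaleLSTM(id_dim=8)
fast = FastTimeScaleLSTM(note_dim=24, id_dim=8)
ids = torch.randn(1, 5, 8)           # a sequence of 5 pattern identifiers
notes = torch.randn(1, 16, 24)       # 16 notes of one pattern
next_ids = slow(ids)                 # slow network: pattern-level structure
next_notes = fast(notes, ids[:, 0])  # fast network: notes of the first pattern
```

Feeding a trained fast-time-scale network an identifier it never saw during training corresponds to the "unseen input bias" experiment mentioned above, where novel patterns emerge from the learned production rules.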
