Multimodal deep representation learning and its application to audio and sheet music

M. Dorfer. Multimodal deep representation learning and its application to audio and sheet music. 10, 2018.

  • Matthias Dorfer

This thesis is about multimodal deep representation learning and its application to audio and sheet music. Multimodal deep learning in general can be described as learning task-specific representations from two or more input modalities at the same time. What kind of representations a model learns depends mainly on the given training data and the task that is addressed, including its respective optimization target. In the first part of my thesis, the data at hand are images of sheet music and their corresponding music audio. Three different machine learning paradigms are employed to address Music Information Retrieval (MIR) problems involving audio and sheet music with multimodal convolutional neural networks. In particular, the thesis presents (1) supervised function approximation for score following directly in sheet music images, (2) multimodal joint embedding space learning for piece identification and offline audio-score alignment, and (3) deep reinforcement learning, again addressing the task of score following in sheet music. All three approaches have in common that they are built on top of multimodal neural networks that learn their behavior purely from observations presented during training. To train such networks, a suitable and sufficiently large dataset is required. As such data was not available when I started working on the thesis, I collected a free, large-scale, multimodal audio-sheet music dataset with complete and detailed alignment ground truth at the level of individual notes. In total, the dataset covers 1,129 pages of music, which is exactly the kind of data required to explore the potential of powerful machine learning models. The dataset, together with my experimental code, is made freely available to foster further research in this area.
With this new dataset I show that, with the right combination of appropriate data and methods, it is feasible to learn solutions for complex MIR-related problems entirely from scratch, without the need for musically informed, hand-designed features. In the second part of my thesis I take a step back from this concrete application and propose methodological extensions to neural networks in general, which are applicable beyond the domain of audio and sheet music. I revisit Canonical Correlation Analysis (CCA) and Linear Discriminant Analysis (LDA), two methods from multivariate statistics, and extend their core ideas so that they can be combined with deep neural networks. In the case of CCA, I show how to improve cross-modality retrieval via multimodal embedding space learning by back-propagating a ranking loss directly through the analytical projections of CCA. For LDA, I reformulate its central idea as an optimization target to train neural networks that produce discriminative, linearly separable latent representations useful for classification tasks such as object recognition. To summarize, this thesis extends the application domain of multimodal deep learning to audio and sheet music-related MIR problems, proposes a novel audio-sheet music dataset, and adds two general methodological contributions to the field of deep learning.
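
To make the phrase "analytical projections of CCA" concrete, the following is a minimal sketch of classical linear CCA in NumPy. It is not the thesis code: it only illustrates the closed-form projections, and it does not implement the ranking loss or the back-propagation through them. The function names (`inv_sqrt`, `linear_cca`), the regularization constant `reg`, and the synthetic two-view data are all assumptions made for illustration.

```python
import numpy as np


def inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V / np.sqrt(w)) @ V.T


def linear_cca(X, Y, reg=1e-6):
    """Closed-form linear CCA (illustrative helper, not from the thesis).

    X: (n, dx) and Y: (n, dy) hold paired observations of two views.
    Returns projection matrices A, B and the canonical correlations.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Regularized within-view covariances and the cross-covariance.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    # The singular values of T are the canonical correlations.
    T = Sxx_is @ Sxy @ Syy_is
    U, corr, Vt = np.linalg.svd(T, full_matrices=False)
    return Sxx_is @ U, Syy_is @ Vt.T, corr


# Synthetic two-view data sharing a 3-dimensional latent signal (assumption).
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 3))
X = Z @ rng.normal(size=(3, 5)) + 0.1 * rng.normal(size=(200, 5))
Y = Z @ rng.normal(size=(3, 4)) + 0.1 * rng.normal(size=(200, 4))

A, B, corr = linear_cca(X, Y)
Xp = (X - X.mean(axis=0)) @ A  # view 1 in the shared space
Yp = (Y - Y.mean(axis=0)) @ B  # view 2 in the shared space
```

In a retrieval setting, `Xp` and `Yp` would play the role of the two modality embeddings, and candidates could be ranked by a distance such as cosine distance between projected samples. Because every step above (covariances, inverse square roots, SVD) is a differentiable matrix operation, a loss computed on the projected samples can in principle be back-propagated to networks producing `X` and `Y`, which is the general idea the abstract refers to.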