Data-Driven Music Theory: Curating and Investigating Large Corpora of Digitally Encoded Music Analyses

Rohrmeier, Martin Alois; Neuwirth, Markus Franz Josef; Hentschel, Johannes
2024-07-02
DOI: 10.5075/epfl-thesis-10276
https://infoscience.epfl.ch/handle/20.500.14299/208928

This dissertation on data-driven music theory is centered around curatorial practices concerning the creation, publication, and evaluation of large, expert-annotated symbolic datasets. With its primary interest in the harmony of European tonal music from its early beginnings until today, it lays the foundation for methodological advancements in music theory that put computational modeling at their core. By extending the analytical lens beyond the individual score to encompass hundreds of pieces simultaneously, the investigation of large corpora promises to broaden the scope of standard hermeneutic and inductive methods, applying them to more extensive segments of music history collectively. This approach helps deepen our understanding of overarching historical and stylistic trends and, simultaneously, enables empirical insights into general compositional principles and the specific ways individual works relate to them, epitomizing "scalable reading". A significant portion of this work documents the process of creating, publishing, and maintaining the Distant Listening Corpus (DLC), a corpus currently composed of 1220 digital scores, fully annotated by music theory experts. Its ~230 000 annotation labels exhaustively cover the keys, harmonies, cadences, and phrases of the annotated pieces. Because the annotations are primarily entered and stored in the score encodings, a parsing library for MuseScore files is developed to make all corpus data available in a homogeneous tabular format. This library also plays a pivotal role in the semi-automated workflow that constitutes the technical backbone of the corpus initiative and supports curators, annotators, and reviewers in their respective tasks.
The DLC is curated with a special focus on current and future best practices, including transparency and reusability of the corpus data. The corpus can be loaded, processed, and analyzed with the library DiMCAT, developed specifically for this purpose, which aims to facilitate reproducible research and integration with other symbolic datasets in the field. Another focus of this work lies on music-theoretically informed data models. To increase the interoperability of the harmonic analyses in the DLC, the corpus is accompanied by a unified chord model for Western harmony, which makes it possible to bridge harmonic corpora that originate from diverse theoretical traditions so that analyses can be reliably compared and integrated. The model thus marks a key step toward bringing together different harmonic datasets to enhance collective insights into music theory. The final goal is the empirical investigation of the curated corpus, leveraging the developed tools and models to extract new insights into the evolution and nuances of Western harmony. First, a study utilizing chord profiles and chord-tone profiles draws on methods from literary stylometry, exploring to what extent the chord annotations can serve as 'stylistic fingerprints'. A second study investigates the phrase annotations provided by the DLC through a reductive model, aiming to give an overview of the anatomy of the musical phrase. By intertwining these three objectives (corpus curation, data modeling, and empirical investigation), the dissertation embodies a comprehensive approach to digital musicology, aspiring to set a precedent for future data-driven music theory studies.

Language: en
Keywords: music theory; computational musicology; digital humanities; corpus research; harmony; annotation workflow; research data management
Type: doctoral thesis