Standardizing classical music metadata

David P. Anderson
16 October 2023

As described earlier, music apps deal with various item types:

  • Compositions (possibly hierarchical: movements within compositions, pieces within suites, etc.)
  • Arrangements
  • Instruments and instrumentation
  • People (composers, performers)
  • Ensembles (orchestras, quartets, etc.)
  • Recordings (of a composition, by a person and/or ensemble)
  • Performance venues (concert halls, houses)
  • Concerts

Each of these item types has various attributes, and the item types are linked (e.g. recording to composition and performer, composition to composer, etc.). Together, this information is called 'metadata'.

Metadata for classical music is more complex than, say, for popular music. Many music apps (e.g. Spotify) were designed for popular music, and ignore the complexity of classical music, with disappointing results; see an essay on NPR.

There is currently no standard for musical metadata. CDs and MP3 files have a provision for metadata, but no standardization of it. Each music app has its own scheme.

Standardizing classical music metadata would have a number of advantages. For example, music discovery features could aggregate data from multiple apps to estimate individual taste more accurately. They could combine a person's listening on YouTube, their score browsing and downloading on IMSLP, and their concert attendance on Groupmuse.
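As a rough illustration of this kind of aggregation, the sketch below sums activity across three apps once they all refer to compositions by a shared ID. The ID format ("cmi:comp:...") and the weights are assumptions made up for the example; nothing here reflects a real API of YouTube, IMSLP, or Groupmuse.

```python
from collections import Counter

# Hypothetical per-app activity logs: (shared composition ID, weight).
# Both the IDs and the weights are invented for illustration.
youtube_listens = [("cmi:comp:1001", 1.0), ("cmi:comp:1002", 1.0), ("cmi:comp:1001", 1.0)]
imslp_downloads = [("cmi:comp:1001", 2.0)]
groupmuse_concerts = [("cmi:comp:1003", 3.0)]


def aggregate_taste(*sources):
    """Sum activity weights per shared composition ID across all apps."""
    scores = Counter()
    for source in sources:
        for composition_id, weight in source:
            scores[composition_id] += weight
    return scores


taste = aggregate_taste(youtube_listens, imslp_downloads, groupmuse_concerts)
print(taste.most_common())
```

Without a shared ID, each app's records would first have to be matched by title and composer strings, which is where most of the errors would come from.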

I propose that an organization - the Classical Music Index (CMI) - be formed to standardize classical music metadata. The CMI would be an independent non-profit organization. It would be funded and run by a consortium of music apps.

Currently the broadest and deepest source of classical music metadata is IMSLP. It contains data about published composers (including obscure composers whose works are out of print) as well as contemporary unpublished composers, who can upload their scores to IMSLP and enter the metadata. The IMSLP data has some problems:

  • It has been entered by volunteers, and is not clean; e.g. names are spelled in different ways.
  • IMSLP is based on MediaWiki. The metadata lives in wiki pages rather than in a structured database: there is no schema and no normalization.

However it is by far the best current source. So I propose that the CMI start with a cleaned and structured version of IMSLP data.
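One small part of the cleaning step, grouping names that are spelled in different ways, might look like the sketch below. The matching rule (strip accents, case, and punctuation; reorder "Family, Given") is my own assumption for illustration, not a description of how IMSLP data is actually structured. It would not catch transliteration variants, so a human would still need to review the groups.

```python
import unicodedata
from collections import defaultdict


def name_key(name):
    """Return a crude matching key: no accents, case, or punctuation."""
    # Reorder "Family, Given" to "Given Family".
    if "," in name:
        family, given = name.split(",", 1)
        name = f"{given} {family}"
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Keep letters only, lowercase, and collapse whitespace.
    kept = "".join(c if c.isalpha() else " " for c in plain)
    return " ".join(kept.lower().split())


def group_variants(names):
    """Group spellings that share the same matching key."""
    groups = defaultdict(list)
    for n in names:
        groups[name_key(n)].append(n)
    return dict(groups)


names = ["Antonín Dvořák", "Antonin Dvorak", "Dvořák, Antonín", "Claude Debussy"]
groups = group_variants(names)
for key, variants in groups.items():
    print(key, variants)
```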

The first task is to define a schema. I think the underlying data should be a SQL database. The data could be exported in other forms (JSON, XML).
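To make the idea of "SQL underneath, JSON for export" concrete, here is a minimal sketch using SQLite from Python. The two tables and their column names are assumptions chosen for the example; they are not the schema proposed in this essay, which would cover many more item types.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be converted to a dict

# Illustrative tables only: people, and compositions that may nest.
conn.executescript("""
CREATE TABLE person (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE composition (
    id          INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    composer_id INTEGER NOT NULL REFERENCES person(id),
    parent_id   INTEGER REFERENCES composition(id)  -- movement -> parent work
);
""")
conn.execute("INSERT INTO person (id, name) VALUES (1, 'Johannes Brahms')")
conn.execute(
    "INSERT INTO composition (id, title, composer_id, parent_id) "
    "VALUES (1, 'Piano Quintet, Op. 34', 1, NULL)"
)
conn.execute(
    "INSERT INTO composition (id, title, composer_id, parent_id) "
    "VALUES (2, 'I. Allegro non troppo', 1, 1)"
)


def export_compositions(conn):
    """Export compositions, with composer names resolved, as JSON text."""
    rows = conn.execute("""
        SELECT c.id, c.title, p.name AS composer, c.parent_id
        FROM composition c JOIN person p ON p.id = c.composer_id
        ORDER BY c.id
    """)
    return json.dumps([dict(r) for r in rows], indent=2)


exported = export_compositions(conn)
print(exported)
```

The self-referencing parent_id column is one simple way to represent the hierarchy mentioned earlier (movements within compositions, pieces within suites).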

I created such a SQL schema, which could be used as a starting point for CMI. Here's a simplified view:

We should also look at existing related schemas and borrow ideas from them. For example, has (bloated but possibly useful) schemas for creative works and people.

Anyone (not just consortium members) would be able to add content to CMI. In particular, all musicians would be able to create 'person' entries for themselves, composers would be able to add info on their compositions, and so on. This creates some issues:

  • Collisions: two people might try to add the same item. There might be two composers with the same name.
  • Conflict: people might disagree on, say, the genre of a piece.
  • Spam and vandalism: people might create entries containing spam or obscenities.

I think the only way to address these issues is via curation: CMI staff (or volunteers) would review new content and changes.
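A curation workflow could be as simple as a queue of pending submissions, with likely collisions flagged for the reviewer. The sketch below is one possible shape under that assumption. The class and field names are hypothetical, and the duplicate check (case-insensitive exact name match) is deliberately naive. A flagged submission can still be approved, since two different composers may really share a name.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    item_type: str
    name: str
    submitted_by: str
    status: str = "pending"  # pending -> approved / rejected
    possible_duplicates: list = field(default_factory=list)


class ReviewQueue:
    def __init__(self, existing_names):
        # Map a case-insensitive key to the name already in the index.
        self.existing = {n.casefold(): n for n in existing_names}
        self.pending = []

    def submit(self, submission):
        """Queue a submission, flagging a possible collision for review."""
        match = self.existing.get(submission.name.casefold())
        if match is not None:
            submission.possible_duplicates.append(match)
        self.pending.append(submission)
        return submission

    def review(self, submission, approve):
        """Record the reviewer's decision and remove it from the queue."""
        submission.status = "approved" if approve else "rejected"
        self.pending.remove(submission)
        if approve:
            self.existing[submission.name.casefold()] = submission.name


queue = ReviewQueue(["John Adams"])
sub = queue.submit(Submission("person", "john adams", "user1"))
print(sub.possible_duplicates)  # the reviewer sees the existing entry
queue.review(sub, approve=False)
print(sub.status)
```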

One approach would be to use Git. The CMI data would be a collection of JSON files in a Git repository (possibly on GitHub). Each proposed change would be a pull request that could be reviewed and merged.
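If the data were kept as JSON files in Git, it would help to store one item per file and serialize it deterministically, so that a pull request shows only the fields that actually changed. The sketch below assumes a hypothetical layout of item_type/item_id.json; the layout, the ID format, and the field names are illustrative only.

```python
import json
import tempfile
from pathlib import Path


def item_path(repo_root, item_type, item_id):
    """Hypothetical layout: <repo>/<item_type>/<item_id>.json"""
    return Path(repo_root) / item_type / f"{item_id}.json"


def write_item(repo_root, item_type, item_id, data):
    """Write one item with sorted keys and fixed indentation for clean diffs."""
    path = item_path(repo_root, item_type, item_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    text = json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
    path.write_text(text, encoding="utf-8")
    return path


# A temporary directory stands in for a checked-out repository.
repo = tempfile.mkdtemp()
written = write_item(repo, "person", "p000123", {"name": "Clara Schumann", "born": 1819})
print(written.read_text(encoding="utf-8"))
```

Sorting keys means two contributors who enter the same fields in a different order produce identical files, which keeps reviews focused on content.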

This would introduce some complexity for apps. Each app has an existing database, whose schema is presumably a subset of CMI's. Apps would need to add fields indicating whether an item has a corresponding CMI item and, if so, what its ID is. They would also need a process for submitting new items to CMI. We'd need to streamline this process and make using CMI cost-effective for apps.
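On the app side, the extra bookkeeping might amount to two fields per item, as in the sketch below. The field names (cmi_id, cmi_submitted) and the ID string are hypothetical, and a real app would store them as columns in its own database rather than in Python objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocalComposition:
    local_id: int
    title: str
    cmi_id: Optional[str] = None   # None: no known CMI counterpart yet
    cmi_submitted: bool = False    # True: proposed to CMI, awaiting review


def items_to_submit(items):
    """Items with no CMI counterpart that have not been proposed yet."""
    return [i for i in items if i.cmi_id is None and not i.cmi_submitted]


def link(item, cmi_id):
    """Record the CMI ID once a match is found or a submission is accepted."""
    item.cmi_id = cmi_id
    item.cmi_submitted = False


catalog = [
    LocalComposition(1, "Piano Quintet, Op. 34", cmi_id="cmi:comp:1001"),
    LocalComposition(2, "Untitled new work"),
    LocalComposition(3, "String Trio (draft)", cmi_submitted=True),
]
to_send = items_to_submit(catalog)
print([i.title for i in to_send])
```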