Standardizing classical music metadata

David P. Anderson
1 January 2024

Sub-essay: A database schema for classical music

Describing classical music in a general way involves various item types:

  • Compositions (possibly nested: movements within compositions, pieces within suites, etc.)
  • Arrangements
  • Instruments and instrumentation
  • People (composers, performers, lyricists)
  • Ensembles (orchestras, string quartets, etc.)
  • Recordings
  • Performance venues (concert halls, opera houses)
  • Concerts
  • Locations (cities, states, countries, continents)

Each item has various attributes, including links to items of other types (e.g. recording to composition and performer, composition to composer, etc.). This information is called 'metadata'.

Metadata for classical music is more complex than, say, for popular music. Many music platforms (e.g. Spotify) were designed for popular music, and ignore many aspects of classical music (such as multi-movement works). This limits the search mechanisms that these platforms can offer; see an essay on NPR.

As far as I can tell, current music platforms (IMSLP, YouTube, Spotify, MuseScore, etc.) don't share a standard for musical metadata; each platform has its own schema. CDs and MP3 files have a provision for metadata, but it's limited and not standardized.

There would be a number of advantages in standardizing classical music metadata. For example:

  • It would make it easier for music platforms to provide discovery features based on metadata: e.g., to let users search for recordings of works by 18th-century French female composers.
  • It would provide a basis for music discovery features that aggregate data from multiple platforms to more accurately estimate individual taste. For example, we could combine information about a person's listening on YouTube, their score browsing and downloading on IMSLP, and their concert attendance on Groupmuse.

I propose forming an entity - the Classical Music Index (CMI) - to create and maintain a database of classical music metadata. The CMI would be an independent non-profit organization, funded by a consortium of music platforms and other music-related organizations.

Goals

Expressive power

CMI would provide Web-based interfaces and APIs for querying the database. It should allow queries like

Show arrangements, for piano 4 hands, of string quartets composed by 19th-century French women.

... and similarly complex queries. Such queries should run in a reasonable amount of time (seconds, not hours).
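To make this concrete, here's a toy sketch (via Python's sqlite3) of how such a query might look against a hypothetical minimal schema. The tables, columns, and sample data are invented for illustration; a real CMI schema would be much richer (nested works, structured instrumentation, etc.).

```python
import sqlite3

# Invented minimal schema, for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person(id INTEGER PRIMARY KEY, name TEXT,
        sex TEXT, country TEXT, birth_year INTEGER);
    CREATE TABLE work(id INTEGER PRIMARY KEY, title TEXT, form TEXT,
        composer_id INTEGER REFERENCES person(id));
    CREATE TABLE arrangement(id INTEGER PRIMARY KEY,
        work_id INTEGER REFERENCES work(id), instrumentation TEXT);
""")
# Sample data; the work and arrangement are invented.
con.execute("INSERT INTO person VALUES (1, 'Louise Farrenc', 'F', 'France', 1804)")
con.execute("INSERT INTO work VALUES (1, 'Quartet (example)', 'string quartet', 1)")
con.execute("INSERT INTO arrangement VALUES (1, 1, 'piano 4 hands')")

# Arrangements, for piano 4 hands, of string quartets composed by
# 19th-century French women (using birth year as a rough proxy).
rows = con.execute("""
    SELECT w.title, a.instrumentation
    FROM arrangement a
    JOIN work w ON a.work_id = w.id
    JOIN person p ON w.composer_id = p.id
    WHERE a.instrumentation = 'piano 4 hands'
      AND w.form = 'string quartet'
      AND p.sex = 'F' AND p.country = 'France'
      AND p.birth_year BETWEEN 1800 AND 1899
""").fetchall()
```

With suitable indexes on the join and filter columns, this kind of query runs in well under a second even on a large database.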

Open editing

Anyone (not just consortium members) should be able to add and edit items; see 'DB updates and access control' below.

Data models

The relational model

In the 'relational model' there are object types (tables); each object (a row) has a fixed set of attributes (columns), some of which can be references to other objects.

The semantic network model

Much of the work in metadata is based on the 'semantic network model', in which data consists of (subject, predicate, object) triples. This model was first used (in the 1960s) in AI to represent knowledge. It was later (1990s) used as the basis for the 'semantic web'. In this context there are W3C standards for

  • Object types: Web Ontology Language (OWL).
  • Representation of triples: Resource Description Framework (RDF).
  • A query language, SPARQL.
... and something called DBpedia is trying to convert the information in Wikipedia into this form.
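As a toy illustration of the model (no real RDF or SPARQL machinery; all names are invented), here are some facts as (subject, predicate, object) tuples, with a naive pattern matcher standing in for a SPARQL-style query:

```python
# Facts as (subject, predicate, object) triples; names are invented.
triples = [
    ("farrenc",  "type",     "Person"),
    ("farrenc",  "country",  "France"),
    ("quartet1", "type",     "Work"),
    ("quartet1", "composer", "farrenc"),
]

def match(pattern, facts):
    """Return triples matching an (s, p, o) pattern; None is a wildcard."""
    return [t for t in facts
            if all(pe is None or pe == te for pe, te in zip(pattern, t))]

# 'What did farrenc compose?' -- in SPARQL terms, roughly
# SELECT ?w WHERE { ?w :composer :farrenc }
works = match((None, "composer", "farrenc"), triples)
```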

You can convert a relational schema into a semantic network by making each attribute into a triple. In this way, two different relational schemas can be unified, sort of, if you can agree on predicate names. And the network model is more flexible, because you can add connections without changing the schema.
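A sketch of that conversion, assuming each row is identified by its table name and ID, and each column becomes a predicate (names hypothetical):

```python
# Sketch: one relational row becomes a set of triples, one per attribute.
def row_to_triples(table, row_id, attrs):
    subject = f"{table}/{row_id}"           # e.g. "work/1"
    return [(subject, col, val) for col, val in attrs.items()]

triples = row_to_triples("work", 1, {
    "title": "String Quartet",
    "composer": "person/7",                 # a foreign key becomes a link triple
})
```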

My take: the semantic network world is populated by academics, not engineers. Academics tend to make things abstract and complex, and to describe them opaquely. We end up with lots of research papers and few usable, scalable systems. As far as I can tell this direction is dying: its web sites are full of broken links and haven't been updated in years.

So I prefer to use the relational model.

Initializing the database

IMSLP

Currently the broadest and deepest source of classical music metadata is IMSLP, which has data about published composers (including obscure composers whose works are out of print) as well as contemporary unpublished composers, who can upload their scores to IMSLP and enter the metadata. The IMSLP data has some problems:

  • It was entered by volunteers, and is not clean; e.g. names are spelled in different ways.
  • IMSLP is based on Mediawiki. Until the recent addition of Clara, there was no underlying database, no schema, no normalization, and no general search capabilities.
  • Its contents mostly refer to scores that are no longer under copyright, or that have an open-access license. So it's missing most works from roughly the last 75 years.
  • It has no notion of permanent IDs.

However, it's the best current data source that I know of. So I propose that CMI start with a cleaned and structured version of IMSLP data. We can then merge other existing databases into this.

MusicBrainz

Another possible starting point is MusicBrainz (see below). This is a volunteer-based database of music metadata. They use a relational DB (PostgreSQL) and their schema is rich and well-documented. They have permanent IDs (MBIDs) for everything.

For our purposes, MusicBrainz has a couple of shortcomings: it focuses on recordings (not scores) and on popular music, not classical. Their database can't answer the '4-hands arrangements' query above.

However, assuming we start with IMSLP data, it would be good to import the MusicBrainz data, add items that aren't in IMSLP, and add MBIDs to the CMI tables where possible.

DB updates and access control

CMI would offer both web interfaces and APIs for adding and modifying items. These interfaces would encourage data consistency; for example, entering a person's name would use auto-complete. Anyone (not just consortium members) would be able to add content to CMI. In particular, musicians would be able to create 'person' entries for themselves, composers would be able to add info on their compositions, and so on. This creates some issues:

  • Collisions: two people might try to add the same item. Or there might be two composers with the same name.
  • Conflict: people might disagree on, say, the title or genre of a piece.
  • Spam and vandalism: people might create items with spam or obscenities.

I think that (as in Wikipedia) curation is the best way to address these issues: CMI staff (or volunteers) would review and 'vet' new items and edits. Unvetted items would not be visible.
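A minimal sketch of this curation model, with invented names: new items start out unvetted, and public queries see only vetted ones.

```python
import sqlite3

# Invented minimal schema: items carry a 'vetted' flag, default off.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE item(id INTEGER PRIMARY KEY, name TEXT, vetted INTEGER DEFAULT 0)")

def submit(name):
    """Anyone can submit; the item starts out invisible."""
    con.execute("INSERT INTO item(name) VALUES (?)", (name,))

def vet(item_id):
    """A curator approves an item, making it publicly visible."""
    con.execute("UPDATE item SET vetted = 1 WHERE id = ?", (item_id,))

def public_items():
    return [r[0] for r in con.execute("SELECT name FROM item WHERE vetted = 1")]

submit("Clara Schumann")   # legitimate entry
submit("sp4m entry")       # vandalism; never vetted, so never visible
vet(1)
```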

Platforms and other users would both read from and write to CMI through these interfaces.

This would introduce some complexity for platforms. Each platform has an existing database, whose schema presumably is a subset of CMI's. They'd need:

  • to add DB fields indicating whether an item has a corresponding CMI item, and if so what its ID is;
  • processes for submitting new items to be vetted by CMI, for integrating vetted items, and for resolving vetting conflicts.

We'd need to figure out how to streamline these processes so that using CMI is cost-effective for platforms.
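As a sketch of the platform side (all names and IDs invented): the platform's local table gains a nullable cmi_id column, items without one are queued for submission, and once CMI vets an item its permanent ID is recorded locally.

```python
import sqlite3

# Invented platform-side table: cmi_id is NULL until the item has
# been submitted to and vetted by CMI.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE local_work(id INTEGER PRIMARY KEY, title TEXT, cmi_id TEXT)")
con.executemany("INSERT INTO local_work VALUES (?, ?, ?)", [
    (1, "Nonet in E-flat", "cmi:w123"),  # already linked (ID invented)
    (2, "New commission", None),         # not yet in CMI
])

def pending_submissions():
    """Items that still need to be submitted to CMI for vetting."""
    return list(con.execute("SELECT id, title FROM local_work WHERE cmi_id IS NULL"))

def record_vetted(local_id, cmi_id):
    """Once CMI vets the item, store its permanent ID locally."""
    con.execute("UPDATE local_work SET cmi_id = ? WHERE id = ?", (cmi_id, local_id))
```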

Related work

Music libraries

The Music Library Association has links to various things related to music metadata. See Music Discovery Requirements and Metadata for music resources. Much of the following comes from those links.

Ontologies

The U.S. Library of Congress has ontologies (schemas) for things like person names, media types, and music notation forms.

FOAF is an RDF ontology for describing people.

The Getty research institute has 'vocabularies' for things like geographical names.

The Music Library Association site has:

  • Guidelines for describing the title, number, and key of a composition.
  • How to describe the format of a score.
  • MARC is a standard for digital encoding of library metadata, i.e. index cards. It's outdated but universally used. BIBFRAME is a proposed replacement.

Dublin Core is a schema for describing published works.

The Music Ontology is an RDF ontology for music. Example: the class 'MusicArtist' is described as "A person or a group of people (or a computer :-) ), whose musical creative work shows sensitivity and imagination". It's an academic project, and seems to be dead.

Databases

MusicBrainz

MusicBrainz is a volunteer-based effort to collect and export musical metadata. It offers a GUI app called Picard that lets volunteers add or edit metadata for CDs in a way that encourages DB consistency.

Their schema is relational (they use PostgreSQL) and well-documented. The entire database is exported as JSON or PostgreSQL.

There are some related projects:

  • ListenBrainz is a database of users' listening history. It can connect to Spotify and a number of other players to get listening data. Based on this data, it offers visualizations of one's taste, as well as recommendations based on 'collaborative filtering' (roughly, recommending what users with similar listening histories enjoyed). These are exported via an API.
  • CritiqueBrainz.org lets volunteers write reviews of albums (and other things, like books) and give them 1-5 star ratings. These are exported via an API. There are no personalized recommendations based on these ratings.
  • AcousticBrainz computes acoustic information (rhythm, key, 'danceability', etc.) about recordings, and exports it via a web API, similar to YouTube.

MusicBrainz comes close to fulfilling the goals of CMI and MPS. But there are some gaps:

  • It's centered around published recordings (CDs). Its schema can express works, but not scores. It can't express instrument combinations.
  • It's centered around popular music, and the album/band/song model.

Doremus

Doremus is a French academic project that is both a schema (an RDF ontology) and (I think) a database of some French archives. Its schema is detailed (you can say what specific instrument was used in a performance, and who tuned it) and correspondingly ornate; see this example. Their search interface doesn't work.

YouTube

YouTube offers a Web API for querying their database of videos. You can search on title, duration etc. It lacks music-specific features.

Movie/TV metadata

Movie and TV metadata is related to music metadata: there are works, people (cast, crew), ratings, etc. It's interesting to look at existing data sources.

  • IMDb (Internet Movie Database) started off (c. 1990) as a volunteer community project. They offered a download of their data as a CSV file. They were bought by Amazon, the data access disappeared, and the web site became much fancier. At some point they started licensing the data (including ratings) through both an API and bulk data download: see details.
  • TMDB (The Movie DB) is a volunteer community project, similar to what IMDb was before Amazon bought it.
  • TVDB is similar to TMDB but more commercial. They use volunteers to collect data, and 'moderators' to vet it. You can 'favorite' items and get recommendations. Their data is available through an API that's free for noncommercial use.
  • TVTime licenses (I think) one of the above data sources, and tells you when your favorite shows are on (or something like that). It has social features (comments) and a 'people also watched' function. It's slightly analogous to Concert Finder.