Musical taste and personalization

David P. Anderson
1 January 2024

Musical taste

When you hear or play a piece of music, you react in some way: love, hate, indifference, etc. The reaction depends on - among other things - your 'taste'. It seems to me that taste in classical music has these properties:

  • Taste has a wide dynamic range. For example, most classical music leaves me cold: listening to it may be pleasant, but my mind wanders to other things. But there are pieces (like the Schubert cello quintet) that produce a powerful emotional reaction.
  • Taste varies widely between individuals. Some people may love a piece while others hate it. The average rating of a piece may not be a good predictor of whether a particular individual will like it.
  • Taste has structure; there are correlations (between people, and between pieces) that can be used to predict how an individual will react to a piece they haven't heard.
  • A person's taste changes (perhaps slowly) over time.

I think of music as having a sort of multidimensional normal distribution.

  • At the center is music that most people like: Beethoven, Chopin, Mozart. This dominates the radio, concert halls, and music classes. I call it '1 sigma' music.
  • 2 sigma music: some people like it but others don't: Mahler, Couperin, Glass, Scriabin.
  • 3 sigma music: relatively few people like it, but those who do REALLY like it: Sorabji, Ligeti, Gesualdo. This music is rarely performed. You can find it (e.g. on YouTube) but it takes some work.

Most people start off listening to 1-sigma music. As time goes by, their ears expand: unusual harmonies and sonorities start to make sense. And after hearing the warhorses for the 10th time, they may get a bit tired of them. So they start listening to more 2-sigma music. Tastes start to differentiate.

This process continues. You discover 3-sigma music (by word of mouth or Internet exploration). Your taste moves into the fringes of the distribution, and becomes even more individual.

Aside: one hears phrases like "curated by experts", implying that there's an "ideal" musical taste. I think this is nonsense; but it's a necessary position for centralized sources like radio stations and large-venue concerts, since they can offer only one program.

If we can infer the taste of a user, we can provide discovery tools that are more efficient. We can show them items they'll probably like but might not otherwise have discovered.

Data sources for inferring taste

Suppose we want to infer the musical taste of an online user. What sources of data can we use?

Explicit ratings

This notion of rating is pretty standard on the web: Amazon lets you rate products and shows you average ratings; hotel-booking sites show you average ratings of hotels; and so on. I've found this to be hugely valuable.

Music platforms could collect ratings of all item types: compositions, recordings, performances, composers, etc.

To be useful for inferring taste, ratings need to be associated with a user; i.e. you have to be logged in to rate things. A platform could collect anonymous ratings, but they'd be useful only for showing rating statistics (e.g. average rating), not for anything personalized.

When requesting a rating, it's important to be clear about what's being rated. For example, people on Amazon sometimes give low ratings because their package arrived late, not because the item was bad.

When you listen to a recording, the following are potentially different:

  • How much you like the piece itself.
  • How much you like this particular performance.

A music rating system could separate these. And there are other factors. Maybe you like the 1st movement but not the 2nd; maybe you like the singer but not the pianist; maybe you don't like the sound quality. It would be a challenge (probably not worth it) to separate all of these.

Finer-granularity ratings have more predictive value. A rating of a performer has less value than separate ratings of several recordings by that performer.

From the data science perspective, the more ratings the better. When a person attends a concert, it would be nice to get their ratings of each piece on the program. That way - for example - we could give listeners better recommendations for future concerts, and we could help performers create programs that people will like more.

But there's the danger of "rating overload". Marketers have discovered the value of ratings. When I rent a car, I get an email asking me to rate the experience on various axes. This is tedious, so I ignore it.

So collecting ratings has to be done judiciously. Maybe after attending a concert I get an email asking me to rate the pieces and the performers. And maybe there's an incentive. This should be studied.

And we must keep in mind the problem of fake ratings. Composers might pad the ratings of their own compositions. Platforms could discourage this by 1) making it hard to create a new identity (require a Recaptcha, a response to an email, etc.) and 2) using AI to identify fake ratings.

Rating scales

There are various scales for ratings:

  • Star ratings: typically one to five stars (with one, not zero, being the lowest rating). Amazon uses this. Hotel-related sites often use zero to ten stars, and they let you rate various attributes (staff, comfort, value, location, etc.) separately.
  • Binary ratings: Netflix and YouTube (among others) moved from star ratings to binary "Like" / "Don't Like" ratings; apparently the number of ratings doubled as a result. The change is discussed in an interesting article.
  • Unary ratings: Features such as "Like", "Collect", "Follow" etc. are like binary without "Don't Like". They're better than nothing, but:
    • You can't say how much you like the item.
    • Not checking the button could mean either that you don't like the item, or that you didn't see the button, or that you just didn't bother. (Although if a user 'likes' a large fraction of the items they view, perhaps absence of Like has more significance.)

A system for analyzing ratings should be able to use data in any of these scales.
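As a sketch of what such a system might do, the following function maps each scale onto a common [0, 1] range so one analysis pipeline can consume all of them. The scale names and mappings are my own invention for illustration, not part of any existing platform:

```python
def normalize_rating(value, scale):
    """Map a rating on any supported scale to [0, 1].

    Hypothetical scale names:
      'stars5'  - 1..5 stars (Amazon-style; 1 is the lowest)
      'stars10' - 0..10 (hotel-site style)
      'binary'  - True = Like, False = Don't Like
      'unary'   - a bare Like; absence carries no (or weak) signal
    """
    if scale == 'stars5':
        return (value - 1) / 4.0      # 1 star -> 0.0, 5 stars -> 1.0
    if scale == 'stars10':
        return value / 10.0
    if scale == 'binary':
        return 1.0 if value else 0.0
    if scale == 'unary':
        return 1.0                    # only Likes are ever recorded
    raise ValueError(f"unknown scale: {scale}")
```

One open design question is how much weight a unary Like should get relative to a five-star rating; that would have to be tuned empirically.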

Implicit ratings

Taste can potentially be inferred from user interactions:

  • On YouTube: what videos you listen to; how long you listen to each video. If you listen to a video all the way through, you probably like it.
  • IMSLP: what you search for; what scores you download.

Comments and reviews

Sites like Amazon let users write reviews of items. This is a popular feature; people like to express their opinions to the world. I find reviews quite valuable in many cases.

Reviews may provide implicit ratings: Natural-language AI techniques can tell whether a review is good, bad, or indifferent.
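As a toy illustration of the idea, here's a crude lexicon-based sentiment scorer. The word lists are made up, and a real system would use an NLP model rather than word matching, but it shows the shape of review-to-implicit-rating conversion:

```python
# Hypothetical sentiment lexicons; a real system would use an NLP model.
POSITIVE = {"wonderful", "sublime", "moving", "brilliant", "love"}
NEGATIVE = {"dull", "boring", "harsh", "disappointing", "hate"}

def review_sentiment(text):
    """Classify a review as good (+1), bad (-1), or indifferent (0)
    by counting positive and negative words."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return (score > 0) - (score < 0)   # the sign of the score
```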

Sites may let users rate reviews; for example, Amazon has "Helpful" and "Report" buttons.

Reviews, and ratings of reviews, could be used in various ways. Reviews can be used for 'social linkage' discovery features: E.g. when I read a review of an item I can look at other reviews by the same person, perhaps of items I already know about. That gives me some idea of whether I trust the reviewer.

Objective attributes

We may be able to infer properties of compositions and recordings directly from their digital descriptions:

  • For recordings, we can compute acoustic properties of the waveform: rhythm, acoustic spectrum, energy, 'speechiness', harmony, rate of harmonic change, etc. Spotify does this as part of their recommendation system, and exports the 'feature vectors' via a web API.
  • For scores (in machine-readable form) we can compute tempo, harmonic structure, note density, and other statistics.

MPS (the Music Preference Service, described below) could potentially compute these attributes, and use them both for collaborative filtering and for similarity estimates.

Descriptive tags

All Music Guide (AMG) has a scheme where albums are assigned English 'tags'. There are two classes of tags: 'Moods' (Dreamy, Dark, Airy, Soothing, ...) describe the music itself; 'Themes' (Dance Party, Street Life, Protest, ...) describe the contexts in which one might most enjoy the music.

AMG's editors decide on the set of tags and the assignment of tags to albums. A system like this would be problematic if the public could assign tags; people might have very different ideas of what a tag means.

AMG has discovery tools that involve tags. If you click on a tag you get a list of albums with that tag. Their 'Advanced Search' feature lets you include tags in a query alongside other attributes like genre, release date, and artist name.

Tags could be used in other, less explicit, ways; for example, collaborative filtering (see below) could consider two people to have similar taste if they like music with similar sets of tags.

This feature is interesting because it leverages 'experts' to (potentially) aid in personalized discovery. Other platforms use 'experts' to create (non-personalized) 'curated playlists'. The implication is that the experts have 'ideal' musical tastes that the rest of us should aspire to.

Modeling and inferring individual musical taste

The above sources of data can be used as a basis for various discovery features.

Some of these are non-personalized; all users see the same thing. For example, an interface could say "people who like composition A also like B, C and D". This can be computed from ratings or usage data: find people who like A, and see what else they like.

Simple personalized features are possible:

  • Suppose objective attributes (e.g. acoustic features of recordings) are available. For a given user, we can look at the attributes of recordings they like, do a cluster analysis, and recommend recordings near the cluster centers.
  • If descriptive tags are available (as in AMG) we can look at the tags of items a user likes, and recommend items with similar tags.
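A minimal sketch of the first idea, simplified to a single cluster: take the centroid of the acoustic-feature vectors of recordings the user likes, and recommend the candidates nearest to it. (A full version would run a proper cluster analysis, e.g. k-means, and use several centers; the function and data names here are invented.)

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def recommend_by_features(liked, candidates, k=3):
    """Recommend the k candidate recordings whose feature vectors are
    closest (Euclidean distance) to the centroid of the user's liked
    recordings. Both arguments map recording id -> feature vector."""
    center = centroid(list(liked.values()))
    return sorted(candidates,
                  key=lambda r: math.dist(candidates[r], center))[:k]
```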

Usage-based collaborative filtering

Collaborative filtering is the general idea of using data about a large population of users to make predictions about one user.

A simple form of collaborative filtering is based on usage data, such as the recordings that users have listened to. Suppose a user U has listened to recordings {R1...Rn}. For recordings R not in this set, compute the number N(U,R) of pairs (Ri,U') where U' is a user who has listened both to R and to Ri ∈ {R1...Rn}. If N(U,R) is large, we know that people who listened to the same things as U also listened to R, and therefore (perhaps) U will like R.
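The pair-counting scheme above can be implemented directly. This sketch scores every recording the user hasn't heard; the data layout (user -> set of recordings played) is an assumption about how a platform might export its usage data:

```python
from collections import defaultdict

def cooccurrence_scores(listened, user):
    """Usage-based collaborative filtering, as described above.

    `listened` maps each user to the set of recordings they've played.
    For each recording R the given user hasn't heard, N(U, R) counts
    the pairs (Ri, U') where U' played both R and some Ri that the
    user has played."""
    mine = listened[user]
    scores = defaultdict(int)
    for other, theirs in listened.items():
        if other == user:
            continue
        overlap = len(theirs & mine)       # shared recordings Ri
        for r in theirs - mine:            # candidates R new to `user`
            scores[r] += overlap           # one pair per shared Ri
    return dict(scores)
```

Recordings with the highest scores are the recommendations. In practice the score should probably be normalized by popularity, or very popular recordings will dominate every user's list.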

Rating-based collaborative filtering

More sophisticated algorithms are possible if we have rating data (explicit or implicit). For example:

  • If two users have rated lots of the same items, compute the (Pearson) correlation between these rating vectors. This gives a number in [-1,1] estimating the similarity of the users' tastes. Then, to estimate a user U's rating of an item I, look at the users who are highly correlated with U and have rated I; take an average (weighted by correlation) of their ratings.
  • Same, but switch the roles of users and items: find the set of items J that are correlated with I and that U has rated; average U's ratings of these items. Note: there's complete symmetry between users and items, so every algorithm has a dual.
  • Linear model: each user and item is modeled as a vector of N numbers (N is on the order of 10; it corresponds to the number of independent components of musical taste). The predicted rating of I by U is the dot product of their vectors. Compute the vectors by using an optimization algorithm to minimize the error between actual and predicted ratings. Note: unlike the correlation algorithms, this can estimate a user's rating even if there is no rating connection U -> I' -> U' -> I.
  • AI models using neural nets. I suspect that these might outperform all the above.
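The first of these (user-user correlation) can be sketched in a few lines. This is a bare-bones version for illustration: the data layout is assumed, and a real system would add significance weighting for users with few common items.

```python
import math

def pearson(u, v, common):
    """Pearson correlation of two users' rating dicts over `common` items."""
    n = len(common)
    mu = sum(u[i] for i in common) / n
    mv = sum(v[i] for i in common) / n
    num = sum((u[i] - mu) * (v[i] - mv) for i in common)
    du = math.sqrt(sum((u[i] - mu) ** 2 for i in common))
    dv = math.sqrt(sum((v[i] - mv) ** 2 for i in common))
    return num / (du * dv) if du and dv else 0.0

def predict_rating(ratings, user, item):
    """Correlation-weighted average of other users' ratings of `item`.
    `ratings` maps user -> {item: rating}. Only positively correlated
    users contribute; returns None if no one qualifies."""
    num = den = 0.0
    for other, rv in ratings.items():
        if other == user or item not in rv:
            continue
        common = (ratings[user].keys() & rv.keys()) - {item}
        if len(common) < 2:
            continue
        w = pearson(ratings[user], rv, common)
        if w > 0:
            num += w * rv[item]
            den += w
    return num / den if den else None
```

The item-item dual swaps the roles of the two dicts; the code is otherwise identical.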

I know about collaborative filtering because in the 1990s I used it to develop web sites that recommended movies and music. I don't think collaborative filtering is widely used today. Netflix famously ran a prize competition to improve its rating-prediction algorithms, but I'm not sure they actually use the results.

The Music Preference Service

Music platforms can collect the various types of data listed above. In particular, they could collect ratings in various ways:

  • Listening platforms (YouTube, Spotify, Pandora, IMSLP) could request a rating for each listen.
  • Sites that offer scores (e.g. IMSLP) could let you rate compositions and composers.
  • Concert organizers (large venues, Groupmuse, private house concerts) could collect post-concert ratings, of the individual works and/or the concert as a whole.

The preference-prediction methods listed above work better with more data. They work best if all data, from all platforms and for all users, are pooled and analyzed together. If I rate a recording of Scarbo on YouTube and a performance of Scarbo on Groupmuse, these should ideally end up in the same database, marked as ratings of the same piece by the same person.

Also, collaborative filtering works best if ratings of different item types (compositions, performances etc.) are pooled and analyzed together; that way the system can exploit correlations between types.

So the best way to enable personalized discovery is to establish a 'Music Preference Service' (MPS): a consortium of music platforms that pool their data.

The platforms in the consortium would give MPS all data relevant to preference: explicit and implicit ratings, comments and reviews, etc. MPS would define a way to link user identities across platforms.

This requires that music platforms standardize on the types of items being rated, and the way items are identified. This would use the Classical Music Index.

BTW, in the very early days of Amazon, I proposed to Jeff Bezos that we use this approach to combine movie ratings (from my movie-recommendation web site) with his book ratings. He liked the idea but didn't pursue it; instead he offered me a job, which I declined.


The Music Preference Service would provide a web-based API.

    result = predict(user, item, purpose)

user identifies a user.

item identifies an item of some type (composition, recording, concert, venue, person).

purpose describes the user's proposed use of the item: to play it, to hear a streamed recording, to hear a live performance, etc.

result is the pair of the predicted rating (0..1) and the confidence in that prediction (0..1). Or it could be a confidence interval.

    results = top_predict(user, type, purpose, n)

Return a list of the n items of the given type for which predict(user, item, purpose) is greatest.

    results = top_predict_group(users, type, purpose, n)

Return a list of the n items of the given type for which the minimum (or average, or RMS) of predict(user, item, purpose) is greatest. This can be used, for example, to recommend compositions that a soloist and accompanist will both like, or to find concerts that everyone in a group will like.
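A sketch of how top_predict_group might combine per-user predictions. The `predict` argument stands in for the MPS predict() call above (here assumed to return just a rating in [0, 1], ignoring the confidence); everything else is hypothetical:

```python
def group_score(predictions, mode="min"):
    """Combine per-user predicted ratings into one group score,
    using one of the aggregation modes mentioned above."""
    if mode == "min":
        return min(predictions)
    if mode == "avg":
        return sum(predictions) / len(predictions)
    if mode == "rms":
        return (sum(p * p for p in predictions) / len(predictions)) ** 0.5
    raise ValueError(f"unknown mode: {mode}")

def top_predict_group(users, items, predict, mode="min", n=5):
    """Rank items by the combined predicted rating across `users`.
    `predict(user, item)` is assumed to return a rating in [0, 1]."""
    scored = [(group_score([predict(u, i) for u in users], mode), i)
              for i in items]
    scored.sort(reverse=True)
    return [item for score, item in scored[:n]]
```

The choice of mode matters: "min" favors items nobody in the group dislikes, while "avg" can recommend something one member loves and another hates.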
    result = similarity(item1, item2, mode)

Returns an estimate of the 'taste similarity' between a pair of items of the same type. If the type is 'person', there are two modes.

  • Listening similarity: two people are similar if they like listening to the same compositions, recordings, or composers.
  • Composition similarity: If one (or both) of the people is a composer, there's a second type of similarity, based on their compositions. For example, if A is a performer and B is a composer, this means that A is likely to like to hear (or play) B's compositions. This notion of similarity is relevant to People Discovery.

    results = top_similarity(item, mode, n)

Return a list of the n items whose similarity to the given item is greatest.

Data privacy

The data given to MPS by a platform would be private to that platform, and not shared with other platforms other than by its effect on predictions. The platforms would trust MPS to enforce this privacy. Therefore there must be an organizational 'firewall' between MPS and the platforms; for example there should be no common staff.

MPS needs to link user identities across platforms, but should not store data from which real-world user identities could be inferred. This could be done using hashed versions of identifiers like email address and phone number.
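One way this could be done is with a keyed hash (HMAC) rather than a plain hash: email addresses are guessable, so a plain SHA-256 of an address could be reversed by a dictionary attack, while a keyed hash can't be without the key. A sketch, with the secret key assumed to be held by MPS and never shared with platforms:

```python
import hashlib
import hmac

def linkage_key(email, secret):
    """Derive a cross-platform user key from an email address.

    The address is normalized (case, whitespace) so the same person
    hashes to the same key on every platform; MPS stores only the
    key, never the address. `secret` is an MPS-held key that makes
    dictionary attacks on the hashes infeasible."""
    normalized = email.strip().lower()
    return hmac.new(secret, normalized.encode(), hashlib.sha256).hexdigest()
```

The same construction works for phone numbers, given a normalization rule (e.g. digits only, with country code).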

Listening context

Ratings and predictions involving listening are complicated by the fact that context can make a big difference:

  • You might like a Scriabin piece in isolation, but not right after hearing several other Scriabin pieces.
  • You might like a Mahler symphony in isolation, but not if you heard it the previous day.
  • Your reaction to hearing a piece could depend a lot on your mood, who you're with, and so on.

It would be overkill to ask users what mood they're in when they rate something. But it may be possible to use contextual information to make better predictions:

  • Platforms could give the MPS data not just about a user's ratings, but their complete interaction history (what they viewed and clicked on, the concerts they attended, etc.).
  • Rating predictions could take this data into account (I don't have any specific algorithmic ideas, but it would fit naturally into AI-based schemes).
  • The rating prediction for a concert program could reflect the pieces as a sequence, not just the average of separate predictions.

Bootstrapping new items

How can the MPS bootstrap a new composition (perhaps by an unknown composer) or a new recording? It could recommend the item to random people and try to get some ratings.

Even if there are no ratings of the composition, we may know something about the composer. For example, if they've rated things themselves, we can guess that their composition is similar to the music they like.

We could use the objective attributes of the item (see above); these don't involve ratings.

We can encourage composers to create synthesized sound files of their compositions (easy if they use a score editor); this is likely to get ratings faster.

We want to avoid a situation where a new composition gets a few bad ratings and then no one ever sees it again. On the other hand, if a composition is truly bad, we don't want people to waste time on it.