Biased Lift for Related-Item Recommendation
Effectively computing non-personalized recommendations can be annoyingly subtle. If we do naïve things like sorting by average rating, we get a top-N list dominated by items that one user really liked. Sorting by overall popularity doesn’t have this problem; as soon as we want contextual related-product recommendations, however (e.g. “people who bought this also bought”), and don’t want those recommendations to be dominated by the most popular items overall, the problem comes roaring back.

Preliminaries

I’m going to use my usual notation in this blog post: we have a set of users $U$ who interact with items forming set $I$. For convenience, let $U_i \subseteq U$ be the set of users who have interacted with item $i$, so $|U_i|$ is the number of users who have interacted with it. We can directly sort by this value, in descending order, to find the most popular items (those items that have been interacted with by the most users). We can also frame overall popularity as the probability that a user will interact with that item (assuming, for the moment, that all users are equally likely or representative of future incoming users):

$$P[i] = \frac{|U_i|}{|U|}$$

For related-item recommendation, the classic score is lift: how much more likely are users to interact with both items $i$ and $j$ than we would expect if the two items were independent?

$$\operatorname{lift}(i, j) = \frac{P[i \wedge j]}{P[i]\, P[j]} \qquad \text{where } P[i \wedge j] = \frac{|U_i \cap U_j|}{|U|}$$

When we have explicit ratings, user $u$ gives item $i$ the rating $r_{ui}$, and the average rating for an item is $\bar{r}_i$. To recommend items, we compute some score $s(i)$ and pick the items with the highest score, or use these scores in some more sophisticated ranker such as a sampler or diversifier.

Sidebar: Bias

The bias-variance tradeoff is our key to solving this problem. To see how that might work, let’s go to a different model that we also taught in the MOOC and discussed in our old survey, and that I picked up in turn from FunkSVD, I think: the bias model.

If we want to compute the average rating for an item, the basic way to do this is with a simple mean:

$$\bar{r}_i = \frac{\sum_{u \in U_i} r_{ui}}{|U_i|}$$

The mean has a similar problem as lift: if one user really liked a movie and rated it 5 stars, it will have an average of 5, beating out movies that a lot of people rated and gave a respectable average of 4.7 over, say, 10K ratings. An easy fix for this problem is to introduce some damping or bias based on the global average rating $\mu$¹:

$$\hat{b}_i = \frac{\sum_{u \in U_i} r_{ui} + \gamma \mu}{|U_i| + \gamma}$$

There are at least two ways to think about this new damping term $\gamma$:

- It adds $\gamma$ virtual ratings to each item, each at the global average rating $\mu$, so an item starts out looking average until enough real ratings accumulate.
- It shrinks each item’s mean towards the global mean $\mu$; in Bayesian terms, it acts like a prior centered at $\mu$ whose strength is controlled by $\gamma$.

Either way we look at it, the result is to bias the mean, decreasing its variance (in the bias-variance sense) and decreasing its susceptibility to deriving high scores from little information.

The bias model provides another, more flexible mechanism to introduce this bias that has the benefit of also accounting for users’ differing use of the rating scale. We can learn global, user, and item biases and combine them to compute a personalized mean:

$$s(u, i) = b + b_i + b_u$$

In this version, we center the data at each step, computing the item and user biases from the residuals of the previous step: the item bias captures how much better or worse this item is than the global average rating across all items, and the user bias captures how much more positive or negative this user is than average. We further bias these biases towards 0 with bias damping factors $\gamma_i$ and $\gamma_u$:

$$b_i = \frac{\sum_{u \in U_i} (r_{ui} - b)}{|U_i| + \gamma_i} \qquad b_u = \frac{\sum_{i \in I_u} (r_{ui} - b - b_i)}{|I_u| + \gamma_u}$$

where $b$ is the global average rating and $I_u$ is the set of items user $u$ has rated. Since the item and user biases are computed from mean-centered data, 0 is now a neutral value, so we can apply the damping factor only in the denominator and obtain our biased biases that require more ratings to learn a high estimate of an item’s quality.

The bias model also yields a third interpretation of the damping or bias: while the precise parameter values might differ slightly due to the step-by-step computation above, the resulting model is very similar to what we would learn if we treated rating prediction as a ridge regression problem with user and item IDs as categorical variables. I sometimes do that, particularly when using the bias model as one component of a more sophisticated regularized scoring model.
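To make the damped mean concrete, here is a minimal sketch in Python. The layout (a pandas DataFrame `ratings` with `user`, `item`, and `rating` columns) and the names are illustrative assumptions, not code from any particular library:

```python
import pandas as pd

def damped_item_means(ratings: pd.DataFrame, gamma: float = 5.0) -> pd.Series:
    """Damped item means: (sum of ratings + gamma * global mean) / (count + gamma)."""
    mu = ratings["rating"].mean()  # global average rating
    stats = ratings.groupby("item")["rating"].agg(["sum", "count"])
    return (stats["sum"] + gamma * mu) / (stats["count"] + gamma)
```

With $\gamma = 5$ and a global mean of 3.5, an item with a single 5-star rating scores $(5 + 5 \cdot 3.5)/6 \approx 3.75$, while an item with 10K ratings averaging 4.7 stays at essentially 4.7.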
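The stepwise computation of the bias model can be sketched the same way, again assuming the same illustrative `ratings` frame:

```python
def bias_model(ratings: pd.DataFrame, gamma_i: float = 5.0, gamma_u: float = 5.0):
    """Fit global, item, and user biases step by step on residuals, with damping."""
    b = ratings["rating"].mean()                     # global bias
    r = ratings.assign(resid=ratings["rating"] - b)  # residuals after global bias
    item = r.groupby("item")["resid"].agg(["sum", "count"])
    b_i = item["sum"] / (item["count"] + gamma_i)    # damped item biases
    r["resid"] -= r["item"].map(b_i)                 # remove item bias from residuals
    user = r.groupby("user")["resid"].agg(["sum", "count"])
    b_u = user["sum"] / (user["count"] + gamma_u)    # damped user biases
    return b, b_i, b_u
```

The personalized mean for user $u$ and item $i$ is then `b + b_i[i] + b_u[u]`, treating a missing bias as 0.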
Biasing Lift

The point of all of this for our original problem is that we can decrease the problematic behavior of our scores with respect to low-information items by biasing the scores. We can implement this biasing by starting each item out with “virtual interactions” expressing neutral preference. The more virtual interactions we supply, the more empirical interactions are needed to support a high score.

This raises a question: how do we apply the same principle to probabilities and lift? What would it mean to start each item out with some number of “neutral” interactions for the purpose of lift?²

Working this out is a little less obvious than starting each item with some neutral ratings, both because we need to define “neutral” and because there will be some additional terms involved (the number of virtual interactions per item is not sufficient for the math to work out). Let’s define “neutral” as “independent”: our goal is for each item to start out with virtual interactions that are independent of any other item’s virtual interactions.

To compute probabilities, we also need to know the number of virtual users ($\nu$). We can set this directly, but it is more natural to derive it from $k$, the (average) number of interactions supplied by each virtual user. We’ll see at the end that we can actually ignore $\nu$ and $k$ and derive a sufficient approximation of biased lift with only $\gamma$, but we’ll use them for now. If each item has $\gamma$ virtual interactions, we need $\gamma |I|$ total virtual interactions. If each virtual user supplies $k$ virtual independent interactions, then the virtual user count is $\nu = \gamma |I| / k$.

With these in hand, we can define the prior probability of an item, or the probability that one of our virtual users will interact with it:

$$P_0[i] = \frac{\gamma}{\nu}$$

Since all virtual interactions are independent, we can define the prior joint probability of individual item pairs:

$$P_0[i \wedge j] = P_0[i]\, P_0[j] = \frac{\gamma^2}{\nu^2}$$

To define a biased probability $\tilde{P}[i]$ that incorporates our virtual clicks, an item with $|U_i|$ real clicks will have an additional $\gamma$ virtual clicks from a pool of $\nu$ virtual users. This yields:

$$\tilde{P}[i] = \frac{|U_i| + \gamma}{|U| + \nu}$$

Next, we will define the biased joint probability $\tilde{P}[i \wedge j]$. For this, we need to know the expected number of virtual co-occurrences a pair of items will have: of our $\nu$ virtual users, how many will have interacted with both items? This is $\nu P_0[i \wedge j] = \gamma^2 / \nu$, allowing us to compute the biased joint probability:

$$\tilde{P}[i \wedge j] = \frac{|U_i \cap U_j| + \gamma^2/\nu}{|U| + \nu}$$

We can plug these biased joint and unconditional probabilities into the lift formula to get biased lift:

$$\widetilde{\operatorname{lift}}(i, j) = \frac{\tilde{P}[i \wedge j]}{\tilde{P}[i]\, \tilde{P}[j]} = \frac{\left(|U_i \cap U_j| + \gamma^2/\nu\right)\left(|U| + \nu\right)}{\left(|U_i| + \gamma\right)\left(|U_j| + \gamma\right)} \approx \frac{|U_i \cap U_j|}{\left(|U_i| + \gamma\right)\left(|U_j| + \gamma\right)}$$

The last simplification is not a precise computation of the biased lift, but it can be used to identify the item pairs with the highest lift. We selected $\gamma$ to be a constant, and $\nu$ is fixed given the data set (since it is computed from the fixed $|I|$ and the constants $\gamma$ and $k$). Therefore, the factor $|U| + \nu$ and other terms involving only $\gamma$ and $\nu$ do not change the order of item pairs within a single data set. Further, $\gamma^2/\nu = \gamma k / |I|$ is very small for any realistically large catalog, so adding it in the numerator is adding a very small quantity, and it can be effectively ignored for approximation purposes.

Since the final ordering score no longer depends on $\nu$, we actually do not need to pick $k$ if we just want to use biased lift to select the most related recommendations for a reference item. We can select $\gamma$ (higher values increase the number of real co-occurrences needed to obtain a high biased lift score), and sort by biased scores computed directly from interaction counts:

$$s(j \mid i) = \frac{|U_i \cap U_j|}{\left(|U_i| + \gamma\right)\left(|U_j| + \gamma\right)}$$
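Here is a minimal sketch of that final scoring rule in Python, assuming an interaction DataFrame `clicks` with `user` and `item` columns; the names and layout are illustrative:

```python
import pandas as pd

def related_items(clicks: pd.DataFrame, ref_item, gamma: float = 10.0, n: int = 10) -> pd.Series:
    """Rank items related to ref_item by |U_i & U_j| / ((|U_i| + gamma) * (|U_j| + gamma))."""
    item_users = clicks.groupby("item")["user"].agg(set)  # U_i for each item i
    u_ref = item_users[ref_item]
    scores = {}
    for j, u_j in item_users.items():
        if j == ref_item:
            continue
        co = len(u_ref & u_j)  # real co-occurrence count |U_i ∩ U_j|
        if co > 0:
            scores[j] = co / ((len(u_ref) + gamma) * (len(u_j) + gamma))
    return pd.Series(scores).sort_values(ascending=False).head(n)
```

For example, with $\gamma = 10$ and a reference item with $|U_i| = 1000$, a candidate sharing its single reader scores $1/(1010 \cdot 11) \approx 9 \times 10^{-5}$, while a candidate sharing 10 of its 50 readers scores $10/(1010 \cdot 60) \approx 1.7 \times 10^{-4}$; unbiased lift would rank the singleton first.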
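And, to sanity-check the approximation, a sketch of the exact biased lift from the derivation above; the $\gamma$ and $k$ defaults are hypothetical:

```python
def biased_lift_exact(co: int, n_i: int, n_j: int, n_users: int, n_items: int,
                      gamma: float = 10.0, k: float = 100.0) -> float:
    """Exact biased lift, including the virtual user pool nu = gamma * |I| / k."""
    nu = gamma * n_items / k                      # virtual user count
    p_ij = (co + gamma**2 / nu) / (n_users + nu)  # biased joint probability
    p_i = (n_i + gamma) / (n_users + nu)          # biased marginal probabilities
    p_j = (n_j + gamma) / (n_users + nu)
    return p_ij / (p_i * p_j)

def biased_lift_approx(co: int, n_i: int, n_j: int, gamma: float = 10.0) -> float:
    """Order-preserving approximation: drops the constant factor and the tiny gamma^2/nu."""
    return co / ((n_i + gamma) * (n_j + gamma))
```

The two should produce (nearly) the same ordering: they differ by the constant factor $|U| + \nu$ and by the additive $\gamma^2/\nu$ in the numerator, which is negligible next to any real co-occurrence count.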
And it works! I don’t know how well it works yet, overall, but in my preliminary trials in class, it did exactly what I hoped, computing related-book recommendations that weren’t dominated by either most-popular-overall or unpopular-but-that-one-user-also-read-the-reference-book recommendations. I’ll probably add it to LensKit soon.

1. We called this the damped mean in early writing and teaching; biased mean is perhaps better, but may be confused with the biases in the bias model that show up a couple of paragraphs later.

2. I have to believe others have asked this question, given how old lift is as a metric, but I have been unable to find references to damping or biasing lift in this way. If someone knows one, please send it my way!