Biased Lift for Related-Item Recommendation
Effectively computing non-personalized recommendations can be annoyingly subtle. If we do naïve things like sorting by average rating, we get a top-N list dominated by items that one user really liked. Sorting by overall popularity doesn’t have this problem; as soon as we want contextual related-product recommendations, however (e.g. “people who bought this also bought”), and don’t want those recommendations to be dominated by the most popular items overall, the problem comes roaring back.

Preliminaries

I’m going to use my usual notation in this blog post: we have a set of users $U$ who interact with items forming set $I$. For convenience, let $U_i \subseteq U$ be the set of users who have interacted with item $i$, so $|U_i|$ is the number of users who have interacted with it. We can directly sort by this value, in descending order, to find the most popular items (those items that have been interacted with by the most users). We can also frame overall popularity as the probability that a user will interact with that item (assuming, for the moment, that all users are equally likely or representative of future incoming users):

$$P[i] = \frac{|U_i|}{|U|}$$

For related-item recommendation, the classic score is lift: how much more likely are users to interact with both items $i$ and $j$ than we would expect if the two items were independent?

$$\operatorname{lift}(i, j) = \frac{P[i \wedge j]}{P[i]\, P[j]} \qquad \text{where } P[i \wedge j] = \frac{|U_i \cap U_j|}{|U|}$$

When we have explicit ratings, user $u$ gives item $i$ the rating $r_{ui}$, and the average rating for an item is $\bar{r}_i$. To recommend items, we compute some score $s(i)$ and pick the items with the highest score, or use these scores in some more sophisticated ranker such as a sampler or diversifier.

Sidebar: Bias

The bias-variance tradeoff is our key to solving this problem. To see how that might work, let’s go to a different model that we also taught in the MOOC and discussed in our old survey, and that I picked up in turn from FunkSVD, I think: the bias model.

If we want to compute the average rating for an item, the basic way to do this is with a simple mean:

$$\bar{r}_i = \frac{\sum_{u \in U_i} r_{ui}}{|U_i|}$$

The mean has a similar problem as lift: if one user really liked a movie and rated it 5 stars, it will have an average of 5, beating out movies that a lot of people rated and gave a respectable average of 4.7 over, say, 10K ratings. An easy fix for this problem is to introduce some damping or bias based on the global average rating $\mu$¹:

$$\hat{b}_i = \frac{\sum_{u \in U_i} r_{ui} + \gamma \mu}{|U_i| + \gamma}$$

There are at least two ways to think about this new damping term $\gamma$:

- It adds $\gamma$ virtual ratings to each item, each at the global average rating $\mu$, so an item starts out looking average until enough real ratings accumulate.
- It shrinks each item’s mean towards the global mean $\mu$; in Bayesian terms, it acts like a prior centered at $\mu$ whose strength is controlled by $\gamma$.

Either way we look at it, the result is to bias the mean, decreasing its variance (in the bias-variance sense) and decreasing its susceptibility to deriving high scores from little information.

The bias model provides another, more flexible mechanism to introduce this bias that has the benefit of also accounting for users’ differing use of the rating scale. We can learn global, user, and item biases and combine them to compute a personalized mean:

$$s(u, i) = b + b_i + b_u$$

In this version, we center the data at each step, computing the item and user biases from the residuals of the previous step: the item bias captures how much better or worse this item is than the global average rating across all items, and the user bias captures how much more positive or negative this user is than average. We further bias these biases towards 0 with bias damping factors $\gamma_i$ and $\gamma_u$:

$$b_i = \frac{\sum_{u \in U_i} (r_{ui} - b)}{|U_i| + \gamma_i} \qquad b_u = \frac{\sum_{i \in I_u} (r_{ui} - b - b_i)}{|I_u| + \gamma_u}$$

where $b$ is the global average rating and $I_u$ is the set of items user $u$ has rated. Since the item and user biases are computed from mean-centered data, 0 is now a neutral value, so we can apply the damping factor only in the denominator and obtain our biased biases that require more ratings to learn a high estimate of an item’s quality.

The bias model also yields a third interpretation of the damping or bias: while the precise parameter values might differ slightly due to the step-by-step computation above, the resulting model is very similar to what we would learn if we treated rating prediction as a ridge regression problem with user and item IDs as categorical variables. I sometimes do that, particularly when using the bias model as one component of a more sophisticated regularized scoring model.
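To make the damped mean concrete, here is a minimal sketch in Python. The layout (a pandas DataFrame `ratings` with `user`, `item`, and `rating` columns) and the names are illustrative assumptions, not code from any particular library:

```python
import pandas as pd

def damped_item_means(ratings: pd.DataFrame, gamma: float = 5.0) -> pd.Series:
    """Damped item means: (sum of ratings + gamma * global mean) / (count + gamma)."""
    mu = ratings["rating"].mean()  # global average rating
    stats = ratings.groupby("item")["rating"].agg(["sum", "count"])
    return (stats["sum"] + gamma * mu) / (stats["count"] + gamma)
```

With $\gamma = 5$ and a global mean of 3.5, an item with a single 5-star rating scores $(5 + 5 \cdot 3.5)/6 \approx 3.75$, while an item with 10K ratings averaging 4.7 stays at essentially 4.7.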
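The stepwise computation of the bias model can be sketched the same way, again assuming the same illustrative `ratings` frame:

```python
def bias_model(ratings: pd.DataFrame, gamma_i: float = 5.0, gamma_u: float = 5.0):
    """Fit global, item, and user biases step by step on residuals, with damping."""
    b = ratings["rating"].mean()                     # global bias
    r = ratings.assign(resid=ratings["rating"] - b)  # residuals after global bias
    item = r.groupby("item")["resid"].agg(["sum", "count"])
    b_i = item["sum"] / (item["count"] + gamma_i)    # damped item biases
    r["resid"] -= r["item"].map(b_i)                 # remove item bias from residuals
    user = r.groupby("user")["resid"].agg(["sum", "count"])
    b_u = user["sum"] / (user["count"] + gamma_u)    # damped user biases
    return b, b_i, b_u
```

The personalized mean for user $u$ and item $i$ is then `b + b_i[i] + b_u[u]`, treating a missing bias as 0.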
Biasing Lift

The point of all of this for our original problem is that we can decrease the problematic behavior of our scores with respect to low-information items by biasing the scores. We can implement this biasing by starting each item out with “virtual interactions” expressing neutral preference. The more virtual interactions we supply, the more empirical interactions are needed to support a high score.

This raises a question: how do we apply the same principle to probabilities and lift? What would it mean to start each item out with some number of “neutral” interactions for the purpose of lift?²

Working this out is a little less obvious than starting each item with some neutral ratings, both because we need to define “neutral” and because there will be some additional terms involved (the number of virtual interactions per item is not sufficient for the math to work out). Let’s define “neutral” as “independent”: our goal is for each item to start out with virtual interactions that are independent of any other item’s virtual interactions.

To compute probabilities, we also need to know the number of virtual users ($\nu$). We can set this directly, but it is more natural to derive it from $k$, the (average) number of interactions supplied by each virtual user. We’ll see at the end that we can actually ignore $\nu$ and $k$ and derive a sufficient approximation of biased lift with only $\gamma$, but we’ll use them for now. If each item has $\gamma$ virtual interactions, we need $\gamma |I|$ total virtual interactions. If each virtual user supplies $k$ virtual independent interactions, then the virtual user count is $\nu = \gamma |I| / k$.

With these in hand, we can define the prior probability of an item, or the probability that one of our virtual users will interact with it:

$$P_0[i] = \frac{\gamma}{\nu}$$

Since all virtual interactions are independent, we can define the prior joint probability of individual item pairs:

$$P_0[i \wedge j] = P_0[i]\, P_0[j] = \frac{\gamma^2}{\nu^2}$$

To define a biased probability $\tilde{P}[i]$ that incorporates our virtual clicks, an item with $|U_i|$ real clicks will have an additional $\gamma$ virtual clicks from a pool of $\nu$ virtual users. This yields:

$$\tilde{P}[i] = \frac{|U_i| + \gamma}{|U| + \nu}$$

Next, we will define the biased joint probability $\tilde{P}[i \wedge j]$. For this, we need to know the expected number of virtual co-occurrences a pair of items will have: of our $\nu$ virtual users, how many will have interacted with both items? This is $\nu P_0[i \wedge j] = \gamma^2 / \nu$, allowing us to compute the biased joint probability:

$$\tilde{P}[i \wedge j] = \frac{|U_i \cap U_j| + \gamma^2/\nu}{|U| + \nu}$$

We can plug these biased joint and unconditional probabilities into the lift formula to get biased lift:

$$\widetilde{\operatorname{lift}}(i, j) = \frac{\tilde{P}[i \wedge j]}{\tilde{P}[i]\, \tilde{P}[j]} = \frac{\left(|U_i \cap U_j| + \gamma^2/\nu\right)\left(|U| + \nu\right)}{\left(|U_i| + \gamma\right)\left(|U_j| + \gamma\right)} \approx \frac{|U_i \cap U_j|}{\left(|U_i| + \gamma\right)\left(|U_j| + \gamma\right)}$$

The last simplification is not a precise computation of the biased lift, but it can be used to identify the item pairs with the highest lift. We selected $\gamma$ to be a constant, and $\nu$ is fixed given the data set (since it is computed from the fixed $|I|$ and the constants $\gamma$ and $k$). Therefore, the factor $|U| + \nu$ and other terms involving only $\gamma$ and $\nu$ do not change the order of item pairs within a single data set. Further, $\gamma^2/\nu = \gamma k / |I|$ is very small for any realistically large catalog, so adding it in the numerator is adding a very small quantity, and it can be effectively ignored for approximation purposes.

Since the final ordering score no longer depends on $\nu$, we actually do not need to pick $k$ if we just want to use biased lift to select the most related recommendations for a reference item. We can select $\gamma$ (higher values increase the number of real co-occurrences needed to obtain a high biased lift score), and sort by biased scores computed directly from interaction counts:

$$s(j \mid i) = \frac{|U_i \cap U_j|}{\left(|U_i| + \gamma\right)\left(|U_j| + \gamma\right)}$$
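Here is a minimal sketch of that final scoring rule in Python, assuming an interaction DataFrame `clicks` with `user` and `item` columns; the names and layout are illustrative:

```python
import pandas as pd

def related_items(clicks: pd.DataFrame, ref_item, gamma: float = 10.0, n: int = 10) -> pd.Series:
    """Rank items related to ref_item by |U_i & U_j| / ((|U_i| + gamma) * (|U_j| + gamma))."""
    item_users = clicks.groupby("item")["user"].agg(set)  # U_i for each item i
    u_ref = item_users[ref_item]
    scores = {}
    for j, u_j in item_users.items():
        if j == ref_item:
            continue
        co = len(u_ref & u_j)  # real co-occurrence count |U_i ∩ U_j|
        if co > 0:
            scores[j] = co / ((len(u_ref) + gamma) * (len(u_j) + gamma))
    return pd.Series(scores).sort_values(ascending=False).head(n)
```

For example, with $\gamma = 10$ and a reference item with $|U_i| = 1000$, a candidate sharing its single reader scores $1/(1010 \cdot 11) \approx 9 \times 10^{-5}$, while a candidate sharing 10 of its 50 readers scores $10/(1010 \cdot 60) \approx 1.7 \times 10^{-4}$; unbiased lift would rank the singleton first.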
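And, to sanity-check the approximation, a sketch of the exact biased lift from the derivation above; the $\gamma$ and $k$ defaults are hypothetical:

```python
def biased_lift_exact(co: int, n_i: int, n_j: int, n_users: int, n_items: int,
                      gamma: float = 10.0, k: float = 100.0) -> float:
    """Exact biased lift, including the virtual user pool nu = gamma * |I| / k."""
    nu = gamma * n_items / k                      # virtual user count
    p_ij = (co + gamma**2 / nu) / (n_users + nu)  # biased joint probability
    p_i = (n_i + gamma) / (n_users + nu)          # biased marginal probabilities
    p_j = (n_j + gamma) / (n_users + nu)
    return p_ij / (p_i * p_j)

def biased_lift_approx(co: int, n_i: int, n_j: int, gamma: float = 10.0) -> float:
    """Order-preserving approximation: drops the constant factor and the tiny gamma^2/nu."""
    return co / ((n_i + gamma) * (n_j + gamma))
```

The two should produce (nearly) the same ordering: they differ by the constant factor $|U| + \nu$ and by the additive $\gamma^2/\nu$ in the numerator, which is negligible next to any real co-occurrence count.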
And it works! I don’t know how well it works yet, overall, but in my preliminary trials in class, it did exactly what I hoped, computing related-book recommendations that weren’t dominated by either most-popular-overall or unpopular-but-that-one-user-also-read-the-reference-book recommendations. I’ll probably add it to LensKit soon.

1. We called this the damped mean in early writing and teaching; biased mean is perhaps better, but may be confused with the biases in the bias model that show up a couple of paragraphs later.

2. I have to believe others have asked this question, given how old lift is as a metric, but I have been unable to find references to damping or biasing lift in this way. If someone knows one, please send it my way!