Figure 1: Illustration of the two tasks for scientific article recommendation systems, where √ indicates “like”, × “dislike” and ? “unknown”.
2.2 Recommendation by Matrix Factorization
The traditional approach to recommendation is collaborative filtering (CF), where items are recommended to a user based on other users with similar patterns of selected items. (Note that collaborative filtering does not use the content of the items.) Most successful recommendation methods are latent factor models [17, 18, 13, 1, 22], which provide better recommendation results than neighborhood methods [11, 13]. In this paper, we focus on latent factor models.

Among latent factor methods, matrix factorization performs well [13]. In matrix factorization, we represent users and items in a shared latent low-dimensional space of dimension K: user i is represented by a latent vector $u_i \in \mathbb{R}^K$ and item j by a latent vector $v_j \in \mathbb{R}^K$. We form the prediction of whether user i will like item j with the inner product between their latent representations,

$$\hat{r}_{ij} = u_i^T v_j. \qquad (1)$$

Biases for different users and items can also be incorporated [13].

To use matrix factorization, we must compute the latent representations of the users and items given an observed matrix of ratings. The common approach is to minimize the regularized squared error loss with respect to $U = (u_i)_{i=1}^I$ and $V = (v_j)_{j=1}^J$,

$$\min_{U,V} \sum_{i,j} (r_{ij} - u_i^T v_j)^2 + \lambda_u \|u_i\|^2 + \lambda_v \|v_j\|^2, \qquad (2)$$

where $\lambda_u$ and $\lambda_v$ are regularization parameters.

This matrix factorization for collaborative filtering can be generalized as a probabilistic model [18]. In probabilistic matrix factorization (PMF), we assume the following generative process:

1. For each user i, draw the user latent vector $u_i \sim \mathcal{N}(0, \lambda_u^{-1} I_K)$.
2. For each item j, draw the item latent vector $v_j \sim \mathcal{N}(0, \lambda_v^{-1} I_K)$.
3. For each user-item pair (i, j), draw the response

$$r_{ij} \sim \mathcal{N}(u_i^T v_j, c_{ij}^{-1}), \qquad (3)$$

where $c_{ij}$ is the precision parameter for $r_{ij}$. (Note that $I_K$ is the K-dimensional identity matrix.) This is the interpretation of matrix factorization that we will build on. When $c_{ij} = 1$ for all i and j, the maximum a posteriori (MAP) estimate of PMF corresponds to the solution of Eq. 2.

Here, the precision parameter $c_{ij}$ serves as a confidence parameter for rating $r_{ij}$: if $c_{ij}$ is large, we trust $r_{ij}$ more. As mentioned above, $r_{ij} = 0$ can be interpreted in two ways: user i is either not interested in item j or is unaware of it. This is thus a "one-class collaborative filtering problem," similar to the TV program and news article recommendation problems studied in [12] and [16]. In that work, the authors introduce different confidence parameters $c_{ij}$ for different ratings $r_{ij}$. We use the same strategy, setting $c_{ij}$ to a higher value when $r_{ij} = 1$ than when $r_{ij} = 0$,

$$c_{ij} = \begin{cases} a, & \text{if } r_{ij} = 1, \\ b, & \text{if } r_{ij} = 0, \end{cases} \qquad (4)$$

where a and b are tuning parameters satisfying a > b > 0.

We fit a CF model by finding a locally optimal solution for the user variables U and item variables V, usually with an iterative algorithm [12]. We then use Eq. 1 to predict the ratings of the articles outside each user's library.

There are two main disadvantages to matrix factorization for recommendation. First, the learned latent space is not easy to interpret; second, as mentioned, matrix factorization only uses information from other users, so it cannot generalize to completely unrated items.
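To make the fitting step concrete, the following is a minimal NumPy sketch of one way to find a locally optimal U and V for the confidence-weighted PMF objective (MAP under Eqs. 3 and 4) by alternating least squares, in the spirit of the iterative algorithm of [12]. It is an illustrative sketch, not the authors' implementation; the function name `fit_weighted_mf`, its default values for a, b, the regularization weights, and the iteration count are our own assumptions.

```python
import numpy as np

def fit_weighted_mf(R, K=10, a=1.0, b=0.01, lam_u=0.1, lam_v=0.1, n_iters=20, seed=0):
    """Alternating least squares for confidence-weighted matrix factorization.

    R: binary (I x J) rating matrix, r_ij = 1 if user i has item j in the library.
    Confidence c_ij = a when r_ij = 1 and b when r_ij = 0 (Eq. 4), with a > b > 0.
    """
    rng = np.random.default_rng(seed)
    I, J = R.shape
    U = 0.1 * rng.standard_normal((I, K))   # user latent vectors u_i
    V = 0.1 * rng.standard_normal((J, K))   # item latent vectors v_j
    C = np.where(R == 1, a, b)              # confidence (precision) parameters c_ij

    for _ in range(n_iters):
        # Update each user vector: u_i = (V^T C_i V + lam_u I)^(-1) V^T C_i r_i
        for i in range(I):
            Ci = C[i]                                    # (J,) confidences for user i
            A = V.T @ (Ci[:, None] * V) + lam_u * np.eye(K)
            rhs = V.T @ (Ci * R[i])
            U[i] = np.linalg.solve(A, rhs)
        # Update each item vector: v_j = (U^T C_j U + lam_v I)^(-1) U^T C_j r_j
        for j in range(J):
            Cj = C[:, j]                                 # (I,) confidences for item j
            A = U.T @ (Cj[:, None] * U) + lam_v * np.eye(K)
            rhs = U.T @ (Cj * R[:, j])
            V[j] = np.linalg.solve(A, rhs)

    return U, V

# Predicted ratings via Eq. 1: R_hat = U @ V.T
```

With the fitted factors, the rows of `U @ V.T` give the predicted ratings used to rank articles outside each user's library.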
2.3 Probabilistic Topic Models
Topic modeling algorithms [5] are used to discover a set of "topics" from a large collection of documents, where a topic is a distribution over terms that is biased toward those associated with a single theme. Topic models provide an interpretable low-dimensional representation of the documents [8]. They have been used for tasks like corpus exploration, document classification, and information retrieval. Here we will exploit the discovered topic structure for recommendation.

The simplest topic model is latent Dirichlet allocation (LDA) [7]. Assume there are K topics $\beta = \beta_{1:K}$, each of which is a distribution over a fixed vocabulary. The generative process of LDA is as follows. For each article $w_j$ in the corpus,

1. Draw topic proportions $\theta_j \sim \text{Dirichlet}(\alpha)$.
2. For each word n,
   (a) Draw topic assignment $z_{jn} \sim \text{Mult}(\theta_j)$.
   (b) Draw word $w_{jn} \sim \text{Mult}(\beta_{z_{jn}})$.

This process reveals how the words of each document are assumed to come from a mixture of topics: the topic proportions are document specific, but the set of topics is shared by the corpus.
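As a concrete illustration of this generative process, the short sketch below samples synthetic documents given a fixed set of topics. It assumes the topics are supplied as a K x V matrix of term probabilities and that the Dirichlet prior is symmetric with parameter alpha; the function and variable names are ours for illustration only.

```python
import numpy as np

def generate_lda_corpus(beta, alpha, doc_lengths, seed=0):
    """Simulate the LDA generative process.

    beta: (K x V) matrix; row k is topic beta_k, a distribution over the vocabulary.
    alpha: symmetric Dirichlet parameter for the topic proportions theta_j.
    doc_lengths: number of words to draw for each article.
    Returns, per document, its topic proportions and the sampled word indices.
    """
    rng = np.random.default_rng(seed)
    K, V = beta.shape
    corpus = []
    for N_j in doc_lengths:
        theta_j = rng.dirichlet(alpha * np.ones(K))                 # 1. topic proportions
        z_jn = rng.choice(K, size=N_j, p=theta_j)                   # 2a. topic assignment per word
        w_jn = np.array([rng.choice(V, p=beta[z]) for z in z_jn])   # 2b. word drawn from beta_{z_jn}
        corpus.append((theta_j, w_jn))
    return corpus
```

The document-specific draw of theta_j and the shared matrix beta mirror the two levels of the model: proportions vary per article, while the topics themselves are common to the whole corpus.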