Chapter 12. Building a recommendation engine – Collective Intelligence in Action

Chapter 12. Building a recommendation engine

This chapter covers
  • Fundamentals for building a recommendation engine
  • A content-based approach for building a recommendation engine
  • A collaborative-based approach for building a recommendation engine
  • Real-world case studies of Amazon, Google News, and Netflix

In recent years, increasing amount of user interaction has provided applications with a large amount of information that can be converted into intelligence. This interaction may be in the form of rating an item, writing a blog entry, tagging an item, connecting with other users, or sharing items of interest with others. This increased interaction has led to the problem of information overload. What we need is a system that can recommend or present items to the user based on the user’s interests and interactions. This is where personalization and recommendation engines come in.

Recommendation engines aim to show items of interest to a user. Recommendation engines in essence are matching engines that take into account the context of where the items are being shown and to whom they’re being shown.

Recommendation engines are one of the best ways of utilizing collective intelligence in your application.

Netflix, the world’s largest online movie rental service, provides a good proofpoint of how important recommendation engines are to commerce. Netflix offers personalized recommendations to over 8.4 million subscribers with a catalog of more than 100,000 movie titles.[1] Netflix’s recommendation engine is very effective, with about 60 percent of Netflix members selecting their movies based on movie recommendations that have been tailored to their individual tastes. Later in this chapter (see section 12.4.3), we briefly review the approach Netflix took to build their recommendation system.

1 As of September 2008; see

In this chapter, we look at how to develop a recommendation engine. To do this, we use most of the concepts that we’ve developed in previous chapters. We begin with a brief introduction to various concepts related to building a recommendation engine. Next, continuing with our running example of using blog entries from Technorati, we build a recommendation engine using a content-analysis approach. For this, we first leverage Lucene and then use the text analytics framework developed in chapter 8. Next, we look at the various collaborative approaches to building recommendation engines. This will be followed by reviewing a few approaches—Amazon, Google News, and Netflix—to building recommendation engines in the industry. At the end of this chapter, you should be comfortable building a recommendation engine using both content-based and collaborative approaches.

12.1. Recommendation engine fundamentals

One of the best-known examples of a recommendation engine is Amazon provides personalized recommendations in a number of ways, one of which is shown in figure 12.1. Figure 12.1 shows the recommendations provided by Amazon when you look at the book Lucene in Action by Gospodnetic and Hatcher. Note that they recommend a number of books under the heading “Customers Who Bought

Figure 12.1. An example of the output of a recommendation engine at

This Item Also Bought.” This is an example of an item-based recommendation, where items related to a particular item are being recommended.

In this section, we introduce basic concepts related to building a recommendation engine. Let’s begin by taking a deeper look at the many forms of recommendation engines.

12.1.1. Introducing the recommendation engine

As shown in figure 12.2, a recommendation engine takes the following four inputs to make a recommendation to a user:

  • The user’s profile— age, gender, geographical location, net worth, and so on
  • Information about the various items available— content associated with the item
  • The interactions of the users— ratings, tagging, bookmarking, saving, emailing, browsing content
  • The context of where the items will be shown— the subcategory of items that are to be considered
Figure 12.2. The inputs and outputs of a recommendation engine

As we mentioned in section 2.2.1, a user is also an item. In social networking, other users are recommended based on the context of the application.

One of the easiest forms of a recommendation list is the “Top Item List,” where items that have been viewed, tagged, bought, or saved the most in a period of time are presented to the user. While promoting top products is useful, what we really want is to create a personalized list of recommendations for users.

Recommendation engines can help build the following types of features in your application:

  • Users who acted on this item also took action on these other items, where the acted on could be watched, purchased, viewed, saved, emailed, bookmarked, added to favorites, shared, created, and so on
  • Other users you may be interested in
  • Items related to this item
  • Recommended items

Here are some concrete examples of these use cases:

  • Users who watched this video and also watched these other videos
  • New items related to this particular article
  • Users who are similar to you
  • Products that you may be interested in

In recommendation systems, there’s always a conflict between exploitation and exploration. Exploitation is the process of recommending items that fall into the user’s sweet spot, based on things you already know about the user. Exploration is being presented with items that don’t fall into the user’s sweet spot, with the aim that you may find a new sweet spot that can be exploited later. Greedy recommenders, with little exploration, will recommend items that are similar to the ones that the user has rated in the past. In essence, the user will never be presented with items that are outside their current spot. A common approach to facilitating exploration is to not necessarily recommend just the top n items, but to add a few items selected at random from candidate items. It’s desirable to build in some diversity in the recommendation set provided to the user.

Next, let’s look at the two basic approaches to building recommendation engines.

12.1.2. Item-based and user-based analysis

There are two main approaches to building recommendation systems, based on whether the system searches for related items or related users.

In item-based analysis, items related to a particular item are determined. When a user likes a particular item, items related to that item are recommended. As shown in figure 12.3, if items A and C are highly similar and a user likes item A, then item C is recommended to the user. You may recall our discussion in section 2.2.3, where we looked at two approaches to finding similar items. First was content-based analysis, where the term vector associated with the content was used. The second was collaborative filtering, where user actions such as rating, bookmarking, and so forth are used to find similar items.

Figure 12.3. Item-based analysis: similar items are recommended

In user-based analysis, users similar to the user are first determined. As shown in figure 12.4, if a user likes item A, then the same item can be recommended to other users who are similar to user A. Similar users can be obtained by using profile-based information about the user—for example cluster the users based on their attributes, such as age, gender, geographic location, net worth, and so on. Alternatively, you can find similar users using a collaborative-based approach by analyzing the users’ actions.

Figure 12.4. User-based analysis: items liked by similar users are recommended

Later on in section 12.4, we look at examples of both: item-based analysis as used in Amazon and user-based analysis as used by Google News. Here are some tips that may help you decide which approach is most suitable for your application:

  • If your item list doesn’t change much, it’s useful to create an item-to-item correlation table using item-based analysis. This table can then be used in the recommendation engine.
  • If your item list changes frequently, for example for news-related items, it may be useful to find related users for recommendations.
  • If the recommended item is a user, there’s no option but to find related users.
  • The dimensionality of the item and user space can be helpful in deciding which approach may be easier to implement. For example, if you have millions of users and an order of magnitude fewer items, it may be easier to do item-based analysis. Whenever users are considered, you’ll deal with sparse matrices. For example, a typical user may have bought only a handful of items from the thousands or millions of items that are available in an application.
  • If there are only a small number of users, it may be worthwhile to bootstrap your application using item-based analysis. Furthermore, there’s no reason (other than perhaps time to implement and performance) why these two approaches can’t be combined.
  • It’s been shown empirically that item-based algorithms are computationally faster to implement than user-based algorithms and provide comparable or better results.

Both user- and item-based analysis require the computation of a similarity metric. Next, let’s look at how this is done.

12.1.3. Computing similarity using content-based and collaborative techniques

Regardless of whether you use item-based or user-based analysis, there are two main approaches for computing similarity, depending on whether you’re analyzing text or user actions. If you can represent the content associated with an item or a user in terms of a term vector, then taking the dot product of two normalized vectors provides a measure of how close two term vectors are. This corresponds to invoking the method for the TagMagnitudeVector class we introduced in section 8.2.2.

public double dotProduct(TagMagnitudeVector o)

In content-based analysis, recommending items simply amounts to finding items that have similar term vectors.

As described in section 2.4 and illustrated in table 12.1, a typical collaborative filtering algorithm represents each customer as an N-dimensional vector of items, where N is the number of distinct items in the system. Each cell value in the vector corresponds to either a positive number quantifying how well the user liked the item or a negative number if the user disliked the item. Similar to the inverse-document frequency, where commonly used words have less weight, a factor may be used to compensate for best-selling or highly rated items—the weight for this factor is inversely proportional to the number of entries that this item has in its column.

Table 12.1. Representing the user as an N-dimensional vector

Item 1

Item 2



Item N

User 1 2        
User 2 5 1      
.....         1
User m       2  

Note that this vector will be sparse for almost all users—a typical user would have acted on only a handful of the N items. In section 12.3.3, we look at one of the ways to deal with sparse matrices: reducing dimensionality using singular value decomposition (SVD).

We look at collaborative filtering algorithms in more detail in section 12.3. For now, you may recall from section 2.4 that there are three main approaches to computing the similarities while using collaborative filtering:

  1. Cosine-based similarity computation
  2. Correlation-based similarity computation (Pearson-r correlation)
  3. Adjusted cosine-based similarity computation

In section 12.3, we develop the infrastructure to compute these similarities. Next, let’s look at the advantages and disadvantages of content-based and collaborative techniques.

12.1.4. Comparison of content-based and collaborative techniques

The following are some advantages and disadvantages of collaborative and content-based techniques.

  • Collaborative-based techniques have the advantage that they treat an item as a black-box—they don’t use any information about the content itself. Unlike content-based techniques, the same infrastructure is applicable across domains and languages. So if you build a recommendation engine that works well in the English language, you can use the same infrastructure for your site in Chinese. For content such as images, music, and videos that might not have text associated with them, content-based analysis may not be an option, unless of course you allow users to tag the items (see chapter 3 for more on tagging).
  • In content-based analysis, the algorithm has no notion of the item’s quality—it’s all based on the term vector. There’s no notion of how good or bad a particular item is—the algorithm doesn’t know whether an article is well-written or poorly written. On the other hand, in collaborative-based approaches, you have usable quantitative information about the quality of the item. Web search engines provide a good analogy to this. Prior to Google, most search engines used content-based approaches for showing search results. Google, with its page rank, uses a collaborative approach—it uses how various sites have been linked together to compute a rank for each site.
  • Over a period of time, the results from content-based analysis don’t change much; text associated with the item may not change much. As time progresses, there may be some changes in the term vector due to changes in the inverse document frequency for the terms in the document, but on the whole things don’t change much for an item. Collaborative-based approaches rely on user interaction, and over a period of time user interaction on the item may change. For example, a video on a current topic may be rated highly. Over a period of time, as new content comes in and the video is no longer relevant to current issues, it may get lower ratings.
  • Collaborative-based systems rely on using the information provided by a user to find other related users and recommend items based on the ratings from similar users. In the absence of an adequate amount of data, these systems can perform poorly in their prediction capabilities. For a new user with little interaction history, there may not be enough information to find similar users using the user’s interaction history for a collaborative approach. Typically, to overcome this, user-profile information—age, gender, demographics, and so forth—is also used to find similar users.
  • Collaborative-based systems won’t recommend new items added to the system unless they’ve been rated by a substantial number of users.

Some recommendation systems use a hybrid approach, combining content-based and collaborative analysis. The combination could be in the form of implementing the two methods separately and then combining the results. Another way is to use the user’s profile to find similar users when a user hasn’t rated enough items. Recommendation systems may also leverage additional information that may affect the prediction, for example, the time of the year, or month, or week.

Now that we have a basic understanding of content-based and collaborative-based approaches, we take a more detailed look at how to implement these two approaches. In the next two sections, we first cover content-based analysis, followed by collaborative-based analysis.

12.2. Content-based analysis

Fortunately, most of the work that needs to be done to build a recommendation engine based on content analysis has already been done in prior chapters, more specifically in chapters 8, 9, and 11. In this section, we demonstrate finding related items using Lucene; using the text analytics framework developed in chapter 8; finding related items for a set of documents; and lastly how content-based results can be further personalized for a user.

12.2.1. Finding similar items using a search engine (Lucene)

In this section, we demonstrate how you can find similar items using a search engine such as Lucene. The basic approach is to compose a query, based on the content of the document you want to find related items for, and query the index for items that best match this query.

At this stage, it’s helpful to look at the output from some of the code that we write in this section. Listing 12.1 shows the output from one of the runs from the sample code we implement next (listing 12.2). In this example, 10 blog entries for the tag collective intelligence were retrieved from Technorati and indexed using Lucene. Then related blog entries for each blog entry were computed—these are shown as tabbed entries below the blog entry.

Listing 12.1. Sample output from related blogs
Social Media is Just the Next Logical Step in Technology
    our learning wiki
    The Dilemma of Personalized Library Services
Companies using the Power of WE - The Consumers!
    our learning wiki
    The Dilemma of Personalized Library Services
Pagerank isn' for humans
    our learning wiki
    A Tool to Attract More People to Your Cause
    Le web 2.0 affiche ses vraies valeurs : financières, s'ntend. A
l'xemple de Google et son
    The Dilemma of Personalized Library Services
    Programming Collective Intelligence: Building Smart Web 2.0
    Supporting occupation - Gordon Brown in Israel
Programming Collective Intelligence: Building Smart Web 2.0 Applications
    our learning wiki
    Le web 2.0 affiche ses vraies valeurs : financières, s'ntend. A
l'xemple de Google et son
    Pagerank isn' for humans
    The Dilemma of Personalized Library Services

First, let’s take a simple approach of hand-crafting this query. Listing 12.2 shows a simple method that uses the TermFreqVector associated with a document field to create a BooleanQuery.

Listing 12.2. Creating a BooleanQuery using the term vector in Lucene

We first obtain the TermFreqVector associated with the specified document and the field. Next, we create an instance of BooleanQuery, to which we add all the terms (or a subset of terms) from the term vector. Note that we specify the condition Boolean-Clause.Occur.SHOULD, which means that not all the terms are required. Lastly, since we combine the query with other Boolean queries, we set a boost factor for this query.

Let’s apply this approach to our running example of blog entries. We build on our example from chapter 11, where we indexed blog entries. Let’s find related blog entries for each blog entry in our Lucene index. Listing 12.3 shows the code for obtaining related blog entries for a given blog in our index.

Listing 12.3. Creating a composite BooleanQuery and retrieving blog entries

Since we want to weigh the title more than other text, we first create an instance of BooleanQuery using the title and apply a boost factor of 3. This is combined with a BooleanQuery representation for the complete text. Lastly, related blog entries are obtained from the index using the RetrievedBlogHitCollector class we implemented in section 11.3.7.

Finally, listing 12.4 shows the code for retrieving all the blog instances from the index and getting the related items for each blog entry.

Listing 12.4. Iterating over all documents in the index
    public void illustrateMoreLikeThisByQuery(Directory indexDirectory)
         throws Exception {
          IndexSearcher indexSearcher = new IndexSearcher(indexDirectory);
          for (int i = 0; i < indexSearcher.maxDoc(); i ++) {
             Document document = indexSearcher.doc(i);
             System.out.println(document.get("title") + " " +
             if (document != null) {
                List<RetrievedBlogEntry> relatedBlogs =
                 getRelatedBlogEntries( indexSearcher, i) ;
                for (RetrievedBlogEntry relatedBlog : relatedBlogs ) {
                   System.out.println("\t" + relatedBlog.getTitle() + +
                   " " + relatedBlog.getUrl());

Our rather simplistic approach doesn’t take into account the term frequency and inverse document frequencies associated with each term in the query. Fortunately, Lucene’s contrib/query package provides a fairly good version of this functionality. Download and compile the following Lucene similarity package:

Listing 12.5 shows the same functionality developed using the MoreLikeThis class.

Listing 12.5. Iterating over all documents in the index

MoreLikeThis is a fairly sophisticated class that creates a query to find items similar to a particular document. It has a number of settings, such as the minimum number of documents in which a term should appear, the minimum term frequency, the minimum and maximum length of words, stop words, and the maximum number of query terms. Using this class is fairly straightforward. We create an instance of More-LikeThis, set the analyzer, specify the fields to be used, and set the minimum document frequency to be 0, because we have few blog entries in our Lucene index.

Next, let’s look at building a content-based recommendation engine using the text analytics infrastructure we developed in chapter 8.

12.2.2. Building a content-based recommendation engine

You may recall the TagMagnitudeVector class from section 8.2.2 and listing 8.18. The TagMagnitudeVector class contains a list of tags with magnitudes, where the magnitude terms are normalized such that the sum of the squares of the magnitudes for all the terms is 1. Each term magnitude is obtained by computing the term frequency and the inverse document frequency for the terms. The similarity between two documents can be computed by taking their dot products. For example, for a TagMagnitudeVector, use the following method:

public double dotProduct(TagMagnitudeVector o) ;

Finding similar items for an item amounts to finding items that have the highest dot products with the given item’s term vector. Let’s illustrate how to build such a recommendation engine. First, we need to define a simple Java bean class RelevanceTextDataItem, which is a container for a TextDataItem along with a double relevance value. Listing 12.6 contains the implementation for RelevanceTextDataItem.

Listing 12.6. Implementation for RelevanceTextDataItem

Next, let’s define a class, ContentBasedBlogRecoEngine, which will retrieve blog entries from Technorati and then find relevant blog entries for each blog entry. Listing 12.7 shows the first part of the code for ContentBasedBlogRecoEngine. This listing shows the main steps in building a content-based recommendation engine—creating the dataset, finding relevant items, and printing it out.

Listing 12.7. The main steps for building a content-based recommendation engine

The method createLearningData() uses an instance of BlogDataSetCreatorImpl (see chapter 8) to create a List of TextDataItems. illustrateContentRecoEngine simply iterates over all blog entry instances to print out related items for each entry. Listing 12.8 shows the implementation for the method to find the related items for the blogs.

Listing 12.8. Getting relevant items in ContentBasedBlogRecoEngine

To find related blogs, we simply iterate over all blog instances, compute the similarity with the blog of interest, and then sort all relevant blog items based on the similarity. Listing 12.9 shows sample output from the code developed in this section. Here, you can see the related blog entries for a blog along with the level of relevance between the parent and the children blogs.

Listing 12.9. Related entries for a blog
Social Media is Just the Next Logical Step in Technology http://
   our learning wiki 0.01720982414882634
   The Dilemma of Personalized Library Services 0.01403890610073534
Companies using the Power of WE - The Consumers!
   our learning wiki 0.0396024241655864
   The Dilemma of Personalized Library Services 0.016152829610989242
Pagerank isn' for humans
   our learning wiki 0.04675732078317664
   A Tool to Attract More People to Your Cause
... 0.038080852316322855
   Le web 2.0 affiche ses vraies valeurs : financières, s'ntend. A
l'xemple de Google et son 0.029162920196574908
   The Dilemma of Personalized Library Services 0.0153460076524816
   Programming Collective Intelligence: Building Smart Web 2.0
Applications 0.01044583466318214
   Supporting occupation - Gordon Brown in Israel 0.006828125603489045
Programming Collective Intelligence: Building Smart Web 2.0
   our learning wiki 0.031064220852246284
   Le web 2.0 affiche ses vraies valeurs : financières, s'ntend. A
l'xemple de Google et son 0.027501197695123845
   Pagerank isn' for humans 0.01044583466318214
   The Dilemma of Personalized Library Services 0.0036025401391830475

So far we’ve looked at how to build a content-based recommendation engine using a search engine and using the text processing toolkit we built in chapter 8. A common use case is when we have a collection of items, such as a topic or subtopic, and we want to find items that are similar to this collection. We look at this next.

12.2.3. Related items for document clusters

In an application, it’s common to define topics or subtopics that have underlying documents associated with them. For example, if you’re building a site about data mining, you may have five subtopics: association rules, attribute importance, clustering, classification, and regression. For each of these subtopics, you may have filed a number of documents. You want to build a representation for each of these subtopics by using the content that’s been associated with each subtopic.

Fortunately, each document has an associated normalized term vector, and finding a composite representation from the document set simply amounts to adding all the term vectors for the documents. Listing 12.10 illustrates the code for combining the different term vectors.

Listing 12.10. Code to illustrate merging of documents
public void illustrateMergingOfDocuments(
    List<TagMagnitudeVector> tagMagnitudeVectors) {
    List<TagMagnitude> tagMagnitudes = Collections.emptyList();
    TagMagnitudeVector emptyTMV =
      new TagMagnitudeVectorImpl(tagMagnitudes);
    TagMagnitudeVector mergedTMV = emptyTMV.add(tagMagnitudeVectors);

Once we have a combined term vector representation for the subtopic, finding similar items is the same as in section 12.2.2. In section 11.6.3, we looked at trends in intelligent search where you can further personalize search results using metadata associated with the user. We briefly look at this next.

12.2.4. Personalizing content for a user

When there’s a tag vector representation for the user, it’s also possible to sort the candidate list of items using the term vector representation for the user. Typically, a candidate list of items is stored for each item of interest (around 20–30 items). Then, based on which user is visiting the content, the user’s term vector is used to bubble up the content that may be of most interest to the user. Make sure that the recommendation list isn’t completely homogenous and there’s enough diversity in the recommendation list to allow the user to explore new areas of interest.

With this overview of content-based recommendation systems, we’re now ready to take a more detailed look at collaborative filtering.

12.3. Collaborative filtering

In collaborative filtering, an item is considered as a black box—we don’t look at its content—and user interactions (such as rating, saving, purchasing) with the item are used to recommend an item of interest to the user. More formally, given a dataset of user actions (also called the user-item dataset), we want to recommend items to a user. As discussed in section 12.1.2, we either find similar users and recommend items liked by these users, or we find items similar to an item of interest. There are two main classes of collaborative filtering algorithms—memory-based and model-based.

In memory-based algorithms, the entire user-item database is used. The algorithm first finds a set of similar users and then makes a recommendation or set of recommendations (top n recommendations) by combining the preferences of the similar users. This approach is also known as nearest neighbor. Typically, the expected rating for an item is estimated by combining the ratings of similar users, using the degree of similarity as a measure to combine the ratings. One problem with using the weighted sum to predict the rating is the bias that different users may have in rating items. For example, one user may provide ratings that average 3 while another may provide ratings that are similar but which average 3.5. Therefore, it’s also common to predict the ratings for a user using the weighted sum of the deviations in ratings of similar users. In section 12.1.3, we looked at the three approaches to computing the similarities between users: cosine-based similarity computation, Pearson-r correlation, and the adjusted cosine-based similarity computation. In an application with a large number of users, it isn’t practical to compute similar users for a user in real-time. Therefore, it’s common to pre-compute this association either in a lookup table or by creating user clusters.

Model-based collaborative filtering algorithms create a model (see chapters 9 and 10) using the user-item data to predict ratings that a user is likely to give. Model-based algorithms try to model the user based on past ratings and then use the models to predict the ratings on items the user hasn’t visited or rated. Commonly used model-based approaches include latent semantic indexing (LSI), Bayesian clustering, probabilistic latent semantic indexing (PLSI), multiple multiplication factor model, Markov Decision process, and latent Dirichlet allocation.

12.3.1. k-nearest neighbor

In section 2.4, we worked through the process of collaborative filtering for generating both an item-to-item correlation matrix and a user-to-user similarity matrix. Collaborative filtering algorithms use the user-item matrix shown in table 12.1 as input. The matrix is transposed—each item becomes a row and users become columns—when related items are to be found.

There are really two steps in applying collaborative filtering. First, we need to determine similar items or users, and then make a prediction using the similar items. This approach is also commonly known as k-nearest neighbor (k-NN). The predicted value is typically a weighted sum of the ratings for the k neighboring items. To take care of different biases in ratings among the k neighbors, the weighted sum of the deviations in ratings from the user’s average rating value is also used. Let’s look at this computation via an example.

We illustrate the process of predicting ratings using the ratings of similar users by working through the example we introduced in section 2.4.1. Table 12.2 shows the user-item matrix associated with this example.

Table 12.2. Ratings data used in the example





John 3 4 2 3
Jane 2 2 4 8/3
Doe 1 3 5 3
Average 2 3 11/3 26/3

In section 2.4.1, we worked through three methods for computing the similarities. Using the Pearson-r correlation computation, we arrived at the user-to-user similarity matrix shown in table 12.3.

Table 12.3. Correlation matrix for the users




John 1 -0.866 -0.5
Jane -0.866 1 0.87
Doe -0.5 0.87 1

John is correlated with Jane by -0.866, while he’s correlated to Doe by -0.5. Next, let’s look at how we can predict John’s expected rating for Photo1 using the ratings of similar users Jane and Doe.

John’s expected rating for Photo1 = John’s average rating + w1 * (Jane’s rating –Jane’s average rating) + w2 * (Doe’s rating –Doe’s average rating)

= 3 + (-0.866/1.366) * (2 – 8/3) + (-0.5/1.366) * (1 - 3)

= 4.2

Similarly, the predicted rating for Photo2 from John is

= 3 + (-0.866/1.366) * (2 – 8/3) + (-0.5/1.366) * (3 - 3)

= 3.4

If you remember from our discussions in earlier chapters, the inverse document frequency (idf) is used in the term vector to emphasize terms that aren’t common across documents. A similar concept, inverse user frequency, is used to pre-process the user-item data. Items that are popular aren’t very good at capturing similarities between users and items. Therefore, if n is the total number of users and ni of them have rated a particular item, then the inverse user frequency is defined as log (n/ni). Entries in the user-item table are multiplied by their inverse user frequencies before the computation of the similarities.

For large-scale systems, it can be expensive to find k neighbors in real-time. Users are typically clustered offline so as to retrieve the k nearest neighbors in real time. KNN algorithms need to deal with sparse data, which can affect the quality of the solution. Furthermore, these algorithms can run into scalability issues, as their computation grows with both the number of users and the number of items being processed.

For this reason, large sites, such as Amazon, use item-based collaborative filtering. The example in section 2.4 illustrated the process of item-to-item collaborative filtering. In this process, the user-item table (table 12.1) is transposed to have items as rows and users as columns. In essence, every item has a vector associated with it, each user being a dimension. For every item, the k (a number typically between 10 and 30) closest items are obtained using a similarity function (vector dot product or correlation computation). Related items for an item are stored in the database. Once the related item table has been computed for each item, displaying related items to a user is a simple lookup. This approach scales independently of the number of users on the system.

Next, let’s look at how we can implement these algorithms.

12.3.2. Packages for implementing collaborative filtering

There are a number of different options in implementing k-nearest neighbor, which we go through in this section.

The collaborative filtering problem is similar to the text analysis problem—both of them deal with large-dimensional sparse matrices. In chapter 8, we developed the TagMagnitudeVector infrastructure for representing a term vector and computing the cosine similarity between two term vectors. The first option is to leverage this infrastructure. For data represented in the user-item table (see table 12.1), each row corresponding to a user is equivalent to a term vector for a document. This can be represented by a TagMagnitudeVector, and cosine-similarity can be computed by taking the dot product of two TagMagnitudeVector instances. Furthermore, if we wanted to compute the similarity by computing the Pearson-r correlation, we can extend the TagMagnitudeVector class so that all the terms would be normalized to have a zero mean and unit magnitude. Again, the correlation between two such normalized vectors would be their dot product.

The second option is to leverage WEKA (see section 7.2). The weka.core.neigh-boursearch package contains a number of classes that implement the nearest-neighbor search algorithms. Some of these classes are described in table 12.4 and shown in figure 12.5. All these search classes extend the abstract base class NearestNeigbourSearch.

Table 12.4. NearestNeighborSearch classes in WEKA



NearestNeighbourSearch Abstract class for nearest-neighbor search
BallTree Implements the BallTree/Metric Tree algorithm for nearest-neighbor search
CoverTree Implements the CoverTree data structure for nearest-neighbor search
KDTree Implements the KDTree search algorithm for nearest-neighbor search
LinearNNSearch Implements the brute force search algorithm for nearest-neighbor search
Figure 12.5. WEKA classes related to instance-based learning and nearest-neighbor search

In section 10.4, we looked at classifiers available in WEKA. The weka.classifiers.lazy class contains classification algorithms that are based on nearest-neighbor search algorithms. Some of these algorithms are shown in figure 12.5 and described in table 12.5.

Table 12.5. Classifiers in WEKA based on nearest-neighbor search



IB1 Nearest-neighbor classifier.
IBk K-nearest neighbors classifier.
KStar K* is an instance-based classifier that uses an entropy-based distance function.

Next, let’s walk through a simple example that illustrates what’s involved in implementing the k-nearest neighbor algorithm using WEKA. We split the implementation into two sections and use the dataset in table 12.2 as our example. The first part deals with creating the Instances to represent the data, while the second part deals with building a classifier and evaluating it. Listing 12.11 shows the implementation for creating the Instances dataset along with the main method that enumerates the three things this example will do: create the attributes, create the learning dataset, and illustrate the k-NN algorithm.

Listing 12.11. Creating the dataset for implementing k-NN

The implementation is fairly straightforward. We first create the attributes and then the set of instances:

FastVector attributes = eg.createAttributes();
Instances instances = eg.createLearningDataSet(attributes);

Note that we set the third attribute, item3, to be the predicted attribute:


Listing 12.12 shows the output generated by printing out the Instances:


Note that all the three attributes are mapped as numerical attributes.

Listing 12.12. The dataset created from the first part of code
@relation wekaCF

@attribute item1 numeric
@attribute item2 numeric
@attribute item3 numeric


Next, let’s look at the second part of the code, which deals with creating a classifier and querying the classifier for the expected values. This is shown in listing 12.13.

Listing 12.13. Making predictions using k-nearest neighbor

First, we create an instance of the k-nearest neighbor classifier. Since we have only three data points in our example, we set k to be 1:

IBk ibk = new IBk(1);

Next, we build the classifier (ibk.buildClassifier(instances);) and then evaluate each instance for the predicted value for item3. We only have three examples and each query instance is a perfect match for one of the three cases. Listing 12.14 shows the output for item3 for each of the three cases—as expected, the expected and predicted values matched perfectly.

Listing 12.14. The output predicted and expected values for our example
Expected=2.0 Predicted=2.0
Expected=4.0 Predicted=4.0
Expected=5.0 Predicted=5.0

The third alternative is to use one of a number of free packages that implement collaborative filtering in Java. We briefly go through three of the popular Java-based open source packages. Cofi ( is a project available under GPL and is led by Daniel Lemire. The music recommendation site Racofi ( uses the Cofi package. The Cofi package is fairly lightweight and easy to use. A number of memory-based collaborative filtering algorithms are implemented in this package.

Taste ( is another collaborative filtering package that’s fast and lightweight. It supports both item-based and user-based memory-based collaborative filtering algorithms. It’s fairly well-documented and easy to use.

Cofe ( is yet another collaborative filtering package that’s freely available. It’s well-documented and fairly easy to use.

Next, let’s look at a commonly used model-based collaborative filtering method known as latent semantic indexing (LSI), which reduces the dimensionality of the user-item matrix using a technique known as singular value decomposition (SVD).

12.3.3. Dimensionality reduction with latent semantic indexing

As noted in section 3.5.2, LSI has been used in content-based analysis to solve the problems of synonymy and polysemy. Consider a term-document matrix as shown in table 12.6, where each row corresponds to terms across documents, while a column represents a document. The ith column in this matrix corresponds to the term-vector representation for the ith document. In essence, to create this matrix, we first need to create the term vector for each document and then join all the term vectors vertically to create this matrix. Note that a cell value corresponds to the product of the term frequency and inverse document frequency for that term and document.

Table 12.6. Term-document matrix

Document 1

Document 2


Document n

Term 1        
Term 2        
Term m        

This matrix is similar to the user-item matrix in table 12.1. In LSI,[2] the dimensionality of the term-document matrix is reduced using the process of singular value decomposition (SVD). The following are a few motivations for doing such a transformation on the term-document matrix:


  • Necessary evil— If the term-document matrix is too large from a computational standpoint, an approximation may be more computationally friendly.
  • Filtering noise— If the term-document matrix is noisy, the transformed matrix may be a better approximation of the concepts.
  • Sparse matrix— If the term-document matrix is sparse, transforming the matrix may increase its density.

This is analogous to applying SVD to the term-document matrix. To understand this further, we need to first understand SVD and how it transforms the matrix.

An arbitrary matrix A of size m x n can be decomposed into three matrices using SVD. Let r be the rank of A. As shown in figure 12.6, the matrices are

  • U, an orthogonal[3] square matrix of size m x r.

    3 The product of the matrix and its transpose gives the identity matrix. See

  • S, a diagonal matrix of size r x r, with each diagonal value being the eigen value for the matrix. All values of S are positive and stored top to bottom in decreasing order of magnitude.
  • V is an orthogonal square matrix of size n x r.
Figure 12.6. Illustration of the dimensionality reduction

We can reduce the r x r dimensionality of the matrix S to use only the top k singular values. As shown in figure 12.6, the matrices U and Vt are reduced in dimensionality for this approximation of A. The lower dimensionality allows us to approximate using ksingular values.

Next, let’s work through the same example as in the previous section, but this time we apply dimensionality reduction to illustrate the SVD process.

12.3.4. Implementing dimensionality reduction

JAMA[4] is a widely used linear algebra package developed by MathWorks[5] and is intended to be the standard matrix class for Java. WEKA uses JAMA; the weka.core.matrix.Matrix class represents a matrix and has a number of methods, including one to compute the SVD of a matrix. In this section, we develop the Java code to implement dimensionality reduction using the example from section 12.3.1.



We develop SVDExample, a simple class that first creates a matrix representation for the data in table 12.2 and computes the SVD for the matrix. Then we approximate the matrix using reduced dimensionality. Listing 12.15 shows the first part of the code that deals with representing the data in a Matrix instance and computing its SVD.

Listing 12.15. Representing the data to illustrate dimensionality reduction

The example first creates a Matrix representation for the data. Note that the rank of the matrix is also printed in the method createUserItemMatrix(). The output from the first part of the program is shown in listing 12.16. Note that the rank for our matrix is 3.

Listing 12.16. Output from running the first part of the code
UserItem: Rank=3
 3 4 2
 2 2 4
 1 3 5
 0.55 -0.83 0.12
 0.54 0.23 -0.81
 0.64 0.51 0.57

UtU is orthogonal:
 1 0 0
 0 1 0
 0 0 1

 8.93 0   0
 0   2.71 0
 0   0   0.91

 0.38 0.58 0.72
 -0.55 -0.48 0.68
 -0.74 0.66 -0.14

VtV is orthogonal:
 1 0 0
 0 1 0
 0 0 1

As shown in listing 12.16, S contains three eigen values: 8.93, 2.71, and 0.91. Note that they appear in descending order in the diagonal of the matrix. Also, both U and V are orthogonal matrices. Next, let’s approximate the matrices by reducing the dimensionality of the matrix, the code for which is shown in listing 12.17.

Listing 12.17. Code illustrating dimensionality reduction

For a given value of k—in our example, this is shown for the values 1, 2, and 3—we compute the subset of the three matrices U, S, and V. Next, we approximate the original matrix using these reduced matrices, the output of which is shown in listing 12.18 for the three different k values.

Listing 12.18. Output from running the second part of the code
1 dimensional approx matrix:
  1.84 2.84 3.53
 1.81 2.79 3.47
 2.15 3.33 4.14

2 dimensional approx matrix:
  3.08 3.93 2.02
 1.45 2.48 3.9
 1.39 2.66 5.07

3 dimensional approx matrix:
 3 4 2
 2 2 4
 1 3 5

Note that the approximation using k=2 is fairly close to the original values of the initial matrix, and we get back the original matrix for k=3. Lastly, let’s look at a probabilistic model–based approach in collaborative filtering.

12.3.5. Probabilistic model–based approach

In chapter 10, we briefly introduced the Bayesian belief network—a directed acyclic graph where nodes correspond to random variables and directed arcs between nodes signify conditional probability between the parent and child nodes. In probabilistic collaborative filtering, each item corresponds to a node in the belief network. The states of a node correspond to the different rating values, with an additional state corresponding to the missing or “no vote” state. The learning algorithm searches for different network topologies corresponding to dependencies for each item. Once the network has been learned, each item will have parent nodes corresponding to items that are best predictors for the expected ratings.

In this section, we’ve covered various approaches to implementing collaborative filtering. Next, let’s look at some real-world examples of how this technology has been applied.

12.4. Real-world solutions

In this section, we cover three case studies of how large-scale recommendation systems have been developed at, Google News, and at Netflix. These three case studies should give you an insight into how large-scale (millions of users and millions of items) recommendation systems have been developed in the real world. We learn something new with each use case. First, we learn about Amazon’s item-to-item recommendation engine that scales to millions of users and millions of items. Next, we look at Google News personalization to learn about how to deal with noisy user input high item churn. Lastly, we look at Netflix and learn the approach taken by various researchers to improve the results of a recommendation engine, along with Netflix’s approach.

12.4.1. Amazon item-to-item recommendation

Perhaps one of the best-known examples of a recommendation system is at Amazon. com. Amazon reports high click-through and email advertising conversion rates using their recommendation engine, compared to untargeted content such as banner advertisements and top-seller lists. The content of this section is drawn from two sources: first from a paper[6] published on item-to-item collaborative filtering, and second from Greg Linden’s blog[7] entries. (Linden developed Amazon’s recommendation system.)

6 Linden, G., Smith, B., and York, J. 2003. “ Recommendations: Item-to-Item Collaborative Filtering.” IEEE Internet Computing 7, 1 (Jan. 2003), 76-80. DOI=


Figure 12.1 showed an example of the recommendations Amazon provides to a user browsing a book. For a given item (book), Amazon shows related items by analyzing customer purchasing patterns. Figure 12.7 shows another example of personalized recommendations based on items that the user has purchased in the past.

Figure 12.7. Screenshot of recommendations to a user at

Users can browse through the list of recommended items, and as shown in Figure 12.8, can rate an item and/or remove items from consideration from the recommendation engine.

Figure 12.8. To help the recommendation engine at Amazon, a user can rate an item and/or remove items from consideration.

Amazon makes recommendations based on information from multiple contexts, including

  • Short term information— Recommendations based on recent search terms and item browsing history, as shown in figure 12.9.
  • Items available in the shopping basket.
  • Purchasing history— Recommendations based on past purchasing history, as shown in figure 12.7. The system also uses available information about the user to send out targeted emails suggesting recommended items.
Figure 12.9. Recommendations based on browsing history

The recommendation system at Amazon has a number of challenges:

  • Large number of items and users— The system contains huge amounts of data with tens of millions of customers and millions of distinct catalog items.
  • Performance and scalability— Results need to be returned within half a second. There are a very large number of transactions—for example, on December 10, 2007,[8] which was one of Amazon’s strongest days, customers ordered more than 5.4 million items, or 62.5 items per second.


  • Limited information for new customers— New customers may have purchased or rated only a few items. Older customers may have purchased or rated a large number (thousands) of items.
  • Responsiveness to new information— The algorithm must respond immediately to new information from the customer.

As described in section 12.1.2, Amazon uses an item-based approach, where related items for an item are used for recommendations. This also means that all users get the same list of items as recommended items when they look at a particular item. Given the large number of users[9] and items, it was found that implementing a typical k-nearest neighbor search ran into severe performance and scalability issues. It was deemed that reducing the dimensionality of the search space—by clustering and partitioning items, or by discarding most popular and unpopular items—would degrade the recommendation quality. Again, from a performance and scalability point of view, a content-based approach for recommendations was not used—for users with thousands of purchases, you would need to combine the term vectors across all instances and query the catalog for related items.

9 Even at 10 million users and 1 million items

The item-to-item collaborative filtering algorithm implemented by Amazon scales independent of the number of users, and develops the expensive item-to-item table offline. During runtime computing, the list of recommended items is simply a table lookup. You can compute the item-to-item similarity table by iterating over the user-item purchase matrix (similar to table 12.1), but the following iterative algorithm provides a better utilization of memory and computing resources:

  • Assume that there are n items and a n x n matrix with items corresponding to rows and columns. Iterate the following steps across all items.
  • For a given item (parent item), iterate over all customers who have bought the particular item. For each item purchased (child item) by the customer, increment the count for that item—this is stored in the child item row and parent item column.

At the end of the iteration, to find related items for a the parent item, simply go to the parent item column and retrieve the items in the order of their counts.

A simple example helps illustrate the process. Table 12.7 contains the purchasing pattern for four customers in a catalog with four items. To keep things simple, we ignore modifications to the matrix to reflect the frequency of purchase for the four items (idf).

Table 12.7. Sample data for iterative item-to-item algorithm

Item 1

Item 2

Item 3

Item 4

User 1 1   1  
User 2   1   1
User 3 1 1 1  
User 4     1 1

Applying our algorithm should result in the item-to-tem matrix shown in Table 12.8. As shown in the table, the items most related to Item 1 are Item 3, followed by Item 2.

Table 12.8. Item-to-item matrix

Item 1

Item 2

Item 3

Item 4

Item 1 - 1 2  
Item 2 1 - 1 1
Item 3 2 1 - 1
Item 4   1   -

Once the item-to-item matrix has been computed offline, the recommended items are determined using the related items for each item and combining the scores across all instances.

The item-to-item recommendation algorithm used by Amazon is one of the simplest collaborative filtering algorithms, which works well and scales to huge amounts of data.

12.4.2. Google News personalization

Google News ( is a computer-generated news site that aggregates headlines from more than 4,500 news sites, clusters them into related stories, and displays it all in a personalized manner for a user. Figure 12.10 shows a typical news page for a logged-in user.

Figure 12.10. Google News with recommended stories using the user’s web history

Google provides personalized recommendations to users who opt in to the search history feature provided by various Google web sites. As a part of the search history, Google records search queries made by the user along with a list of news articles visited. When a user has adequate history of news items clicked, the site recommends news stories of interest to the user in two sections. The first is the Recommended for youremailaddress section in the center of the page, just below the Top Stories section. The second place is the Recommended link on the left side, just below the Top Stories link, where the user can get a larger list of recommended items. An example of this page is shown in figure 12.11.

Figure 12.11. Personalized news stories for a logged-in user

Google News is a good example of building a scalable recommendation system for a large number of users (several million unique visitors in a month) and a large number of items (several million new stories in a two-month period) with constant item churn. This is different from Amazon, where the rate of item churn is much smaller.

The content of this section is drawn from a paper[10] published on this topic and a talk[11] given by Mayur Datar from Google.

10 Das, A. S., Datar, M., Garg, A., and Rajaram, S. 2007. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th international Conference on World Wide Web (Banff, Alberta, Canada, May 08 - 12, 2007). WWW ‘07. ACM, New York, NY, 271-280. DOI=

11 Slides and video can be obtained from

Google decided to use collaborative filtering to build the recommendation system, mainly due to the availability of data from Google’s large user base, and the advantage that the same collaborative filtering approach could be applied to other application domains, countries, and languages. A content-based recommendation system perhaps may have worked just as well, but may have required language/section-specific tweaking. Google also wanted to leverage the same collaborative filtering technology for recommending images, videos, and music, where it’s more difficult to analyze the underlying content.

There are a number of things that are of interest to us in this case study:

  • High churn in items— One of the unique challenges for the news site has been the high rate of churn associated with the news stories—news stories change every 10–15 minutes. This is unlike our case study of Amazon, where the rate of churn isn’t as high. The rapid change in the item set makes it difficult to compute a model representation for item-to-item, since computing that would probably take a few hours, and the news would be obsolete by that time.
  • Noisy ratings— A user clicking on a story is used as a noisy positive vote for that story. A click is noisier than an explicit 1–5 star rating (see section 2.3) or treating a purchase as a positive vote. As discussed in sections 2.3.4 and 2.4.2, a click on a story corresponds to a rating of 1, while a non-click corresponds to a 0 rating.
  • Instant gratification— The system incorporates user clicks instantly in its recommendations, thus providing instant gratification to the user.
  • High performance requirements— Google has a high performance requirement for the recommendation system—typically the recommendation engine needs to provide a recommendation within a few hundred milliseconds.

The problem of recommending news stories to a user can be modeled into a standard collaborative filtering user-item table, as shown in table 12.1. Here, an item is a news story, while rows are users. The dimensionality of m and n is a few million. The matrix is sparse, where a typical user may have clicked on a few stories, while some users may have clicked on hundreds of stories. Google’s approach focuses on algorithms that are scalable and combines three different algorithms—one is a memory-based algorithm while the other two are model-based algorithms. The recommendation engine weighs the scores for each recommended item across the three approaches. The memory-based algorithm uses the related-item approach. Here we use the covisitation algorithm (similar to Amazon’s algorithm), where users are clustered together based on which pages they’ve visited. The other two algorithms, MinHash and PLSI, group users into clusters, and recommend stories clicked by one user to others in the same cluster.

User clustering is a batch process that should be done offline and run periodically to not affect the performance of the regular application. While the cluster story-counts are maintained in real-time, new users that haven’t been clustered in the previous clustering run are recommended stories based on a covisitation algorithm. Google leverages BigTable[12] and MapReduce to scale the algorithms to the large volume of data. Next, we briefly describe the three algorithms used.


Memory-Based Algorithm: Covisitation

The first algorithm, known as covisitation, uses an item-based approach to compute recommended news items. Similar to Amazon and covered in the previous section, the algorithm displays items that have been viewed by other users who have viewed a given item.

Conceptually, the algorithm is simple, and works as follows. Consider a window of time, typically a few hours. Two stories that have been visited by the same user during this period of time are said to have been covisited. Associated with each user is a list of items that have been visited by the user in the given window of time; we call this the user’s click history. For each item, the system keeps a running count of how many times this story has been covisited with each other story. If two stories have never been covisited then this count is 0. When a user clicks on a new item, the system uses the user’s click history and updates the covisitation count for this new item and each item in the user’s click history. Given an item, related items are thus other items that have the largest covisitation numbers associated with them. The algorithm also takes into account the temporal nature of the clicks, and discounts the count based on the age of the click and story. These counts are also normalized to a 0 and 1 scale. To generate a list of candidate items for a user, the system uses the user’s click history. For each of the items in the user’s click history, the list of covisited items along with their scores is retrieved. The scores for the items are added over all items in the user’s click history set, and the items with the highest summed weight are recommended to the user.

Model-Based Algorithms

Minwise Independent Permutation Hashing (MinHash) is a probabilistic clustering algorithm that assigns two users to the same cluster, with a probability proportional to the overlap in items they’ve clicked. A simple similarity function, also known as Jaccard coefficient, is used to compute the similarity between two users. The Jaccard coefficient, a number between 0 and 1, is the ratio of number of items that have been covisited by the two users divided by the total number of unique items that have been visited by either of the two users. A simple approach is to compute the similarities of the users with all the other users and recommend stories using these similarities to weigh the recommendations. However, given the large number of users, this approach isn’t scalable. Google therefore uses a sublinear time near-neighbor search technique called Locality Sensitive Hashing (LSH). The algorithm works as follows. The complete set of items are randomly arranged. A user is assigned a hash value equal to the index of the first item that the user has visited in this permutation. This process is repeated a number of times (10–20) in parallel. The hash value for a user is equal to the sum of the hash values assigned to the user in each of the runs. Users having the same hash values are assigned to the same cluster. For scalability and performance, the MinHash clustering algorithm is implemented as a MapReduce problem.

PLSI is a collaborative filtering algorithm developed by Hoffmann and is based on probabilistic latent semantic models. It models users and items as random variables, where the relationship between users and items is modeled as a mixture distribution.

Hidden variables are used to capture the relationships between the users and the items. The expectation-maximization algorithm is used to learn the parameters of the algorithm. For scalability, the algorithm was reformulated to fit into the Map-Reduce paradigm.

Experiments on the live Google site have shown better click-through rates for recommended stories than those for the top stories for the day. The large dimensionality for items and users along with a high degree of item churn makes this case study interesting.

Lastly, let’s look at Netflix.

12.4.3. Netflix and the BellKor Solution for the Netflix Prize

In November 2007, a little over a year after announcing a million-dollar prize for the winning entry in a contest to improve the prediction accuracy of movie ratings by 10 percent, Netflix awarded the first progress prize of $50,000 to BellKor,[13] a group of researchers at AT&T Labs. The BellKor (a.k.a. KorBell) team consisting of Yehuda Koren, Robert Bell, and Chris Volinsky, improved on the Netflix recommendation system by 8.43 percent, a little short of the 10 percent required to achieve the million dollar prize. Of course, this improvement is on a test dataset that Netflix provided as a part of the contest, and not integrated into Netflix’s recommendation engine yet. The contest in its first year attracted more than 27,000 contestants on more than 2,550 teams from 161 countries.


Netflix believes that any improvements in their recommendation engine will provide them with a competitive advantage. Opening up the process of finding a better recommendation engine with a public contest is yet another example of collective intelligence, where the collective efforts of other users is being used to improve the recommendation engine.

Before we go on to the BellKor solution, it’s useful to look at Netflix’s recommendation engine. This part of the section draws from a talk given by Jim Bennett, VP of recommendation systems, at the conference[14] in September 2006.


When a user signs up at Netflix, the user is urged to rate movies in an attempt to learn the user’s tastes. Netflix’s recommendation engine, Cinematch, uses an item-to-item algorithm (similar to Amazon) with a number of heuristics that aren’t disclosed to the public. The offline computation of the item-to-item matrix takes two days to train. To solve the problem of “cold startup” for new titles, items are set up manually; this manual setup is retired over a period of time. Figure 12.12 shows a typical recommendation screen that’s shown to a user. Here, movies similar to the movie “Babel” are being recommended to the user.

Figure 12.12. Movies related to a movie being recommended at Netflix

Figure 12.13 shows the home page for a user, which shows the user’s recommended movies.

Figure 12.13. Home page for a user at Netflix showing the user’s recommended movies

The current Netflix recommendation system uses about 2 billion movie ratings that have been collected from more than 10 million customers. However, Netflix estimates that by sometime around 2010 to 2012, it will

  • Have more than 10 billion ratings
  • Generate 10 million ratings a day
  • Make 5 billion recommendations a day
  • Have more than 20 million customers
Netflix Competition

The learning dataset for the competition consists of more than 100 million anonymous movie ratings, on a scale of one to five stars, made by 480,000 users on 17,770 movies. Note that the user-item dataset for this problem is sparsely populated, with nearly 99 percent of user-item entries being zero. The distribution of movies per user is skewed. The median number of ratings per user is 93. About 10 percent of the users rated 16 or fewer movies, while 25 percent of the users rated 36 or fewer. Two users rated as many as 17,000movies. Similarly, the ratings per movie are also skewed, with almost half the user base rating one of the popular movies (Miss Congeniality). About 25 percent of the movies had 190 or fewer ratings associated with them. A handful of movies were rated fewer than 10 times. The dataset doesn’t contain any personal identifiable information. It contains user ID, movie titles, star ratings, and dates—there are no text reviews in the dataset. In the absence of any content associated with the movies, there’s no option but to apply a collaborative-based approach to analyze the data.

The Netflix competition doesn’t take into account speed of implementation or scalability of the approach used. It simply focuses on the quality of the recommendation system in terms of minimizing the error between the user rating and the predicted rating. The Netflix data doesn’t contain much information to allow the use of a content-based approach; it’s for this reason that teams focused on collaborative-based techniques.

The winning team, BellKor, spent more than 2,000 combined hours poring through data to find the winning solution. The winning solution was a linear combination of 107 sets of predictions. Many of the algorithms involved either nearest neighbor method (k-nearest neighbor) or latent factor models such as SVD/factorization and Restricted Boltzmann Machines (RBMs).[15]

15 See and Google Tech Talk “Next Generation of Neural Networks” by Geoffrey Hinton of University of Toronto,

The winning solution uses k-NN to predict the rating for a user using both the Pearson-r correlation and cosine methods to compute the similarities. Similar to our discussion in section 12.3.1, they found that it was useful to remove item-specific and user-specific biases in the computation. Latent semantic models (see sections 12.3.4 and 12.3.5) were also used widely in the winning solution. RBMs are stochastic neural networks that contains two layers. The first layer corresponds to the observed ratings, while the second layer is hidden and is connected to the first layer.

The BellKor team found that it was important to utilize a variety of models that complemented each others’ shortcomings. None of the models by itself could get the BellKor team to the top of the competition. The combined set of models achieved an improvement of 8.43 percent over Cinematch, while the best model—a hybrid of applying k-NN to output from RBMs—improved on the result by 6.43 percent. The biggest improvement by LSI methods was 5.1 percent, with the best pure k-NN model scoring below that. (K for the k-NN methods was in the range of 20 to 50.) The BellKor team also applied a number of heuristics to further improve the results.

The BellKor team provides a number of recommendations on what’s needed to build a winning solution for a competition:

  • Combining complementary models helps improve the overall solution. Note that a linear combination of three models, one each for k-NN, LSI, and RBM, would have resulted in fairly good results—an improvement of 7.58 percent.
  • You need a principled approach to optimizing the solution.
  • The key to winning is to build models that can predict accurately where there’s sufficient data, without over-fitting in the absence of adequate data.

The Netflix competition—the data, the prize money, and the fame factor—have generated a large amount of interest in the field of collaborative filtering. Teams have collaborated with each other and have published their solutions, allowing others to build on their work. Figure 12.14 shows the leaderboard for the competition as of early

Figure 12.14. A screenshot of the Netflix leaderboard as of early 2008 (

August 2008. Note that the best score is an improvement of 9.15 percent over the benchmark result. The million-dollar question is, Will someone come up with the solution that’s good enough to surpass the 10 percent improvement mark? Only time will tell, but given the interest and efforts being made, it’s perhaps only a question of time.

In this section, we’ve covered the solutions for three large-scale recommendation systems—Amazon, Google News, and Netflix. By now, you should understand what a recommendation engine is, and how it can be used and built. In your application, you should be able to start with building perhaps a simple content-based item-to-item recommendation engine and then augment it with a collaborative-based approach. This should help you to personalize your application for every user. The challenge with personalization and building recommendation systems is to find the right balance between the delta improvement in quality and the additional time the user needs to wait for the response. I’ve found the strategy of starting with something simple and gradually building on top of it over time, along with lots of asynchronous pre-computation, to work well in most cases.

12.5. Summary

One of the best ways to personalize a site for a user is to use a recommendation engine. There are two main approaches to building a recommendation engine: content- based and collaborative-based. In content-based analysis, similar content is recommended by computing the similarity in term vectors between various items. But content-based algorithms can’t determine the quality of an item being recommended. For this, we need to use a collaborative approach. In collaborative filtering, each item is treated as black box, and user interactions with the item are used to compute similarities. Collaborative approaches are language-agnostic and are particularly suited for analyzing images, video, and music that might not have any content associated with them.

Approaches to building recommendation systems can be further divided into itembased and user-based analysis. In item-based analysis, items that are similar to items of interest are recommended to the user, while in user-based analysis, first similar users to a user are found, and them items liked by those users are recommended.

The recommendation problem uses a user-item matrix containing rating information (purchasing/voting/click-through) for cell values. In this matrix, a row corresponds to a user, while each column corresponds to an item. This matrix is typically of large dimension and sparse. This large dimensional matrix may be approximated to lower dimensions using singular value decomposition. The k-nearest neighbor algorithm determines the k closest users to predict the ratings for a user.

We also looked at three well-known, large-scale recommendation systems:, Google News personalization, and Netflix.

Using the concepts developed in this book, you should now be able to apply collective intelligence to your application. Good luck at providing a personalized experience to your users.

12.6. Resources

ACM Transactions on Computer-Human Interaction (TOCHI) Special Section on Recommender Systems. Volume 12, Issue 3 (September 2005).

Adomavicius, Gediminas and Alexander Tuzhilin. “Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions,” IEEE Transactions on Knowledge and Data Engineering, Vol. 17, no. 6, pp. 734–749. June, 2005.

Bell, Robert and Yehuda Koren. “Improved Neighborhood-based Collaborative Filtering.”

———“Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights.” IEEE Conference on Data Mining (ICDM’07). 2007, IEEE.

———“Improved Neighborhood Based Collaborative Filtering.” KDD 2007 Netflix Competition Workshop.

Bell, Robertl, Yehuda Koren, and Chris Volinsky. “The BellKor Solution to Netflix Prize.” 2007. and

———“Modeling relationships at multiple scales to improve accuracy of large recommender systems.”

———“Chasing $1,000,000: How We Won The Netflix Progress Prize.” ASA Statistical and Computing Graphics Newsletter. Volume 18, Number 2, December 2007.

BellKor home page.

Berry, M. W., S. T. Dumais, and G. W. O’Brien. “Using linear algebra for intelligent information retrieval.” SIAM Review 37(4):573–595. 1995.

Breese, John, David Heckerman, and Carl Kadie. “Empirical Analysis of Predictive Algorithms for Collaborative Filtering.” Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence (UAI-98), pgs 43-52. 2002. Morgan Kaufmann.

Chang, Fay, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. “Bigtable: A Distributed Storage System for Structured Data.” OSDI ’06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA. November, 2006.

COFE: Collaborative Filtering Engine.

Cofi: A Java-based collaborative filtering library.

Das, Abhinandan S., Mayur Datar, Ashutosh Garg, and Shyam Rajaram. “Google news personalization: scalable online collaborative filtering.” In Proceedings of the 16th international Conference on World Wide Web (Banff, Alberta, Canada, May 08–12, 2007). WWW ’07. ACM, New York, NY, 271-280. DOI=

Deshpande, Mukund and George Karypis. “Item-based top-n recommendation algorithms.” ACM Transactions on Information Systems, 22(1):1-34, 2004.

Fleder, Daniel M. and Kartik Hosangar. “Recommender systems and their impact on sales diversity.” Proceedings of the 8th ACM conference on Electronic commerce. 2007.

Hoffman, Thomas. “Latent semantic models for collaborative filtering.” ACM Transactions on Information Systems (TOIS), Volume 22, Issue 1 (January 2004).

The Homepage of Nearest Neighbors and Similarity Search. Maintained by Yury Lifshits.

IEEE Intelligent Systems Special Issue on Recommender Systems. vol. 22(3), 2007International Journal of Electronic Commerce Special Issue on Recommender Systems. Volume 11, Number 2 (Winter 2006-07).

Johnson, Aaron. “Using Lucene and MoreLikeThis to Show Related Content.” Nov. 2006.

KDD Cup and Workshop (KDD’07). ACM Press (2007).

Kim, Juntae. “Effect of Dimensionality Reduction in Recommendation Systems.” AI Workshop, 2002.

Lemire, Daniel and Anna Maclachlan. “Slope One Predictors for Online Rating-Based Collaborative Filtering.” Proceedings of SIAM Data Mining (SDM ’05), 2005.

Linden, Greg. Geeking with Greg. Blog.

Linden, Greg, Brent Smith, and Jeremy York. “ Recommendations: Item-to-Item Collaborative Filtering.” IEEE Internet Computing. 7, 1 (Jan, 2003), 76-80. DOI=

Porter, Joshua. “Watch and Learn: How Recommendation Systems are Redefining the Web.” 2006.

“The Present and Future of Recommender Systems.” September 12-13, 2006. Bilbao, Spain.

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD’07). ACM Press (2007).

“Reinforcing the Blockbuster Nature of Media: The Impact of Online Recommenders.”

Salakhutdinov, Ruslan, Andriy Mnih, and Geoffrey Hinton. “Restricted Boltzmann machines for collaborative filtering.” In Proceedings of the 24th International Conference on Machine Learning (Corvalis, Oregon, June 20 - 24, 2007). Z. Ghahramani, Ed. ICML ’07, vol. 227. ACM, New York, NY, 791-798. DOI=

Sarwra, Badrul, George Karypis, Joseph Konstan, and John Riedl. “Item-based collaborative filtering recommendation algorithms.” Proceedings of the 10th International World Wide Web Conference (WWW10), Hong Kong, May 2001.

———“Application of dimensionality reduction in recommender systems—a case study.” In ACM WebKDD Workshop, 2000.

SVD Recommendation System in Ruby. 2007.

Taste: Collaborative filtering for Java.


Yahoo Launchcast.