# Given a particular vacation rental property, which other properties in our inventory are the *most similar*?

Here at HomeAway, answering this question is doubly valuable. On one hand, it helps us augment our travelers’ search experiences by giving us the ability to *recommend* properties to them, based off of choices they’ve previously made while using our sites. For example, if last summer you booked a 4 bedroom house in Austin for a family reunion, we’ll be able to recommend similar venues for next year’s big get together, effectively giving you a personalized list of possibilities with which to begin your search. On the other hand, answering this question also enables us to build tools and features that help our owners analyze their properties’ competitive marketplaces. And as an owner, having your finger on the pulse of your local market can be just what you need in order to get the most out of your vacation rental.

One traditional approach to this problem is that taken by the real estate appraiser. When you make an offer to purchase a house, an appraiser will take into account all of the comparable properties that have sold in your area recently, and make a value judgment as to whether or not your sale price is supported by the activity of the local marketplace. This process of determining a property’s comparables, or *comps*, is as much an art as it is a science, with the appraiser taking into account things like the reputation and desirability of the neighborhood, the relationships between square footage, number of bedrooms and number of bathrooms, whether or not there have been any recent renovations or upgrades made - even things like the quality of the view from the bedroom windows.

While this process continues with some success in the real estate marketplace today, it certainly doesn’t meet the requirements for objectivity, or scale to the needs of an organization like HomeAway, where we would like to generate a highly accurate and non-biased list of comps for each of our 1.2+ million listings (and growing). To meet these needs, we took to a different approach. We reasoned that the traditional practice of comparing two properties side by side, in a vacuum, was inherently limiting, and instead asked, “what if we took **lots** of properties, and **lots** of our users’ interactions with those properties, and attempted to find similarity relationships that way?”.

The astute reader may already be aware that this is **exactly** the problem that collaborative filtering solves! In a nutshell, the basic idea behind collaborative filtering is as follows:

Count how many times A and B cooccur in a particular user’s interactions. Do this for all users, across all users’ interactions. The more often A and B cooccur together, the higher the probability that they’re similar.

The canonical example of this in the collaborative filtering literature is the relationship between diapers and beer. Yes, you read that correctly..

It turns out that diaper and beer purchases cooccur together surprisingly often. It’s not unusual for new fathers, making late night runs to the grocery store for much needed diaper replenishments, to also pick up a six pack on their way out. So, as the story goes, one particularly enterprising grocer noticed this relationship in their data, and, in response, moved the diapers next to the beer. Sales of both soon skyrocketed.

The lesson here is not that noticing cooccurrence relationships will make you rich, although they almost certainly will increase efficiency if acted upon appropriately, but rather that there are hidden insights lurking in your operational data, and that collaborative filtering is an effective tool for finding them. At HomeAway, we have an abundance of operational log data residing in our Hadoop data lake, making the promise of new efficiencies gleaned from collaborative filtering approaches all the more tantalizing. Big data, big payoffs.

So, in response to these needs, what we’ve built here at HomeAway is a collaborative filtering-based similar property *recommender* - a service that, given any particular vacation rental property, can *recommend* a list of other similar properties in our inventory. These recommendations are grounded in how our users search, view, and book our properties online, and also take into account aspects of the property listing that the business deems important for comparison, e.g: where it’s located, what amenities it has, etc. And finally, perhaps most importantly, these recommendations are *personalizable*. Given a particular user’s history, we can make tailored recommendations to them, giving them the most relevant information with which to make their vacation choices.

“Ok cool. This all sounds great, but what goes into building something like that?”, you might ask. Well, the answer is actually fairly direct..

**Math.** Lots and lots of math!

In particular, linear algebra. As it turns out, the math of recommenders can be very elegantly captured in terms of vectors and matrices. Let’s have a look at the anatomy of a simple recommender - one that can make product recommendations based on the purchase history of a set of users.

Here, we’re able to take a particular user’s personal history with respect to product purchases, and cross-reference that with what is known as a *correlation matrix*, built from what we know about **all** users’ purchase histories. The result determines a set of personalized product recommendations for that user. This is the basic principle behind all personalized recommendations.

Although elegantly expressed in this equation, there are quite a few complex and nuanced details that go into computing a correlation matrix and with it generating personalized product recommendations. Let’s take a closer look at the correlation matrix, [P^{T}P].

We can begin by inspecting its dimensions. P is a matrix whose rows correspond to individual users, and whose columns correspond to individual products. Therefore, P^{T} is just the opposite - a matrix whose rows correspond to products and whose columns correspond to users. Their product, [P^{T}P] would then be represented by the result of the pseudo calculation (products x users) * (users x products), yielding a matrix whose dimensions are (products x products). And this is exactly what we want! A correlation matrix which provides a means of comparing every product to **every other product** in our inventory.

Let’s recall for a minute the mechanics of matrix multiplication. To get the value in the i^{th} row, j^{th} column of the product matrix, you sum up the pairwise products of the elements from the i^{th} row of the left hand side matrix with the elements from the j^{th} column of the right hand side matrix. If that was hard to visualize, here is that rule in action for a (2x3)*(3x2) matrix product:

Under what circumstances will the value of a cell in the product matrix be non-zero? To answer this, let’s take a single cell’s value from the above example, *ag + bi + ck*. Since all of *a*, *b*, *c*, *g*, *i*, and *k* are non-negative, *ag + bi + ck* is only non-zero when at least one of the summands is non-zero. And the summands, e.g: *ag* or *bi*, are only non-zero when both of the factors, e.g: *b* and *i*, are non-zero. When we think about what *b* and *i* represent in the lhs and rhs matrices, respectively, we see that a non-zero *bi* product means that the same user (the one represented by the 2^{nd} row of P and second column of P^{T}) had a **cooccurrence** of two products - namely, those represented by the first row of the lhs matrix and the first column of the rhs matrix. If there are any other non-zero summands in *ag + bi + ck*, that means that another, different user had a cooccurrence involving those same two products. In this way, we can see that every cell of the product matrix [P^{T}P] is a value representing how many times two products cooccurred across the history of our entire user base.

This raw counting approach will yield usable similarity data, but it has the undesirable property of producing extremely high values for very popular products, skewing recommendations towards things that many people have already seen, or even purchased. In order to “smooth out” these values, we need to apply a function to each cell to generate a new, mapped value that will provide a better similarity measure with which to generate recommendations. In practice, one of the best and most popular measures for this application turns out to be the Log Likelihood Ratio, or *LLR*. LLR was part of a ‘93 paper, written by Ted Dunning, that would later become the subject of his PhD thesis, and has gained traction in the machine learning community as a practical measure for determining similarity.^{[1]} For a deeper dive, check out Pat Ferrel’s excellent blog Occam’s Machete, which was foundational to this author’s understanding of recommender systems, in particular his article on tuning recommenders.^{[2]}

Alright, great. We’ve got a firm grasp on the theory behind personalized recommendations, but how do we actually architect and deploy a system for executing this at scale? While there are doubtless many choices for doing this, one way we’ve chosen to approach the problem here at HomeAway is with Cascading, Driven, Mahout, and Spark.

We use Cascading, instrumented by Driven, to parse, join, and filter the various data sets within HomeAway that contain our users’ pertinent browsing histories. And then Mahout, on top of Spark as a computation fabric, to consume those data sets and generate LLR-weighted correlation matrices. Once we have those matrices, output by Mahout as *part* files on HDFS, we store them in a database, index them for fast lookup at search time, and correlate them with user history at runtime to generate personalized, adaptive recommendations. If you’re interested in an out-of-the-box implementation of this lambda architecture, Pat Ferrel’s action(ML) may be a good choice for you.^{[3]}

The same processes and architecture can be repeated to generate collaborative filtering recommenders for actions other than purchases, too. In the cases where these actions influence your primary action (e.g: product views in relation to product purchases), a *cross-correlation matrix* can be generated. Cross-correlation matrices are computed in a very similar fashion to regular correlation matrices, except they are effectively “anchored” by user actions with respect to the primary action. This manifests itself in the recommendation “formula” for cross-correlation:

Notice that P^{T} is the multiplier in the matrix product [P^{T}V]. This is how we’d count cooccurrence of product views *with respect to* product purchases, effectively anchoring by the primary action and generating a cross-correlation matrix for product views with respect to purchases. This can be done for any number of secondary signals, and generally makes for a stronger ensemble recommender.

Finally, there are cases in which there are too few user interactions to generate reliable recommendations. For instance, in the case when a product is new to the marketplace, and hasn’t had time to accumulate enough user interactions to make discerning similar products practical. We call this the *cold start* problem. To combat this, one approach is to look for product-to-product similarities using only intrinsic product features, not user interactions. This is called *row-based similarity*. We begin by constructing a matrix whose rows represent products, and whose columns represent product features. We then look for cooccurrence of product features across products, applying the same techniques that we used in our previous example illustrating user-based cooccurrence recommenders.

There is one small difference in the recommender equation used in this case, but an analysis of our matrices’ dimensions will lead us to that detail. Call our (product x feature) matrix F. Then, in order to arrive at a correlation matrix with the desired dimensions of (product x product), we must not calculate [F^{T}F], as in the previous example. This would yield a product matrix with dimensions (feature x feature). Rather, we should calculate [FF^{T}] and apply LLR weightings to arrive at our desired correlation matrix. Hence, the equation for row-based similarity recommendations is as follows:

Putting all of this together, we can arrive at a generalized recommender equation, indicative of the simultaneous use of user-based correlation and cross-correlation matrices, as well as row-based similarity correlators. This general equation allows us to generate personalized recommendations for users at runtime, and also addresses the cold start problem, falling back on row-based similarities when little or no user history is available.

And there you go. That one, elegant equation holds all of the magic and nuance of collaborative filtering as applied to cooccurrence recommenders. In this author’s humble opinion, it’s a thing of beauty. Something to be marveled at and contemplated. But it’s also extremely useful. So, go forth! Build recommenders and glean actionable insights from your data sets! Put more relevant products in front of customers and watch conversion soar! HomeAway has certainly seen the power of applying these theories and techniques to our data, and is actively investing in them today.

##### References

- Dunning, Ted. “Surprise and Coincidence.” tdunning.blogspot.com. blogspot, 21 Mar. 2008. Web. 26 Jan. 2016. http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html.
- Ferrel, Pat. “Part 4: Tuning Your Recommender.” Occam’s Machete. WordPress, 12 Oct. 2013. Web. 26 Jan. 2016. http://occamsmachete.com/ml/2013/10/12/part-4-tuning-your-recommender/.
- Ferrel, Pat. “A Demo Is Worth a Thousand Words.” Occam’s Machete. WordPress, 17 Oct. 2015. Web. 26 Jan. 2016. http://occamsmachete.com/actionml/2015/10/17/a-demo-is-worth-a-thousand-words/.