M41 Highway

Data science and software engineering blog



A movie recommender based on the similarity of text contents

A recommendation function is gaining popularity on many websites; it is useful for increasing a site's traffic. There are many different implementations of a recommendation engine, from item-based, user-based, item-user-based, and content-based collaborative filtering to naive Bayes-based algorithms. One of the core concepts in collaborative filtering is the measurement of similarity between entities. The degree of similarity is usually computed mathematically as the distance between two vectors. Here I will show you the basic idea (it is also at the core of the scoring implemented in Lucene) of using cosine similarity to find the most similar movie in a pool.

In the demonstration, a movie contains attributes such as title, description, category (genre), director, actors, etc. (refer to the data). First of all, we have to extract the information that best represents the specific movie. We break the whole passage or short sentences down into keywords by tokenization (with the help of IKAnalyzer in this case, because the text is Chinese; refer to the tokenizer) so that we can analyze and extract the most important information from the entity.
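As a minimal sketch of that step (the demo itself uses IKAnalyzer for Chinese text; the stop-word list and `tokenize` helper below are hypothetical stand-ins, not the demo's code):

```python
# Hypothetical stand-in for the demo's IKAnalyzer-based tokenizer:
# lower-case the text, split on whitespace, strip punctuation, and
# drop common stop words so only meaningful keywords remain.
STOP_WORDS = {"the", "a", "an", "at", "of", "and", "in", "with"}

def tokenize(text):
    """Split text into lower-cased keywords, dropping stop words."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

print(tokenize("A day at the zoo with the children and an elephant"))
# ['day', 'zoo', 'children', 'elephant']
```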

TF/IDF is a standard way to calculate the importance of each keyword, by counting the frequency of the keyword and normalizing it across the whole data set. For example, suppose movie “A” has keywords “children”, “zoo”, and “elephant”, and movie “B” has keywords “comedy”, “children”, and “zoo”. The comparison of every two movies proceeds pairwise. All the unique keywords from the two movies form an array, i.e. [“comedy”, “children”, “zoo”, “elephant”]. We initialize a vector [0, 0, 0, 0] for each of movies “A” and “B”, corresponding to the array. We check the keywords of movie “A” against the array; if a keyword appears in the array, we assign the value 1 at the corresponding position in the vector. Doing the same for movie “B”, we get [0, 1, 1, 1] for movie “A” and [1, 1, 1, 0] for movie “B”. The similarity between “A” and “B” is then the “distance” between the vectors [0, 1, 1, 1] and [1, 1, 1, 0] in a four-dimensional space: the dot product of the two vectors divided by the product of their Euclidean (Pythagorean) lengths.

(0×1 + 1×1 + 1×1 + 1×0) / (√3 × √3) = 2/3 ≈ 0.66667
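The worked example above can be reproduced in a few lines of plain Python (a sketch of the calculation only, not the demo's code):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

keywords_a = {"children", "zoo", "elephant"}
keywords_b = {"comedy", "children", "zoo"}

# Union of the two movies' keyword sets (order is arbitrary but fixed).
vocab = ["comedy", "children", "zoo", "elephant"]
vec_a = [1 if w in keywords_a else 0 for w in vocab]   # [0, 1, 1, 1]
vec_b = [1 if w in keywords_b else 0 for w in vocab]   # [1, 1, 1, 0]

print(round(cosine(vec_a, vec_b), 5))  # 0.66667
```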

The value lies between 0 (completely dissimilar) and 1 (identical).


You may get the source code to explore the details. Here are some images to illustrate the process. The first image shows the tokenized words.

[Image: tokenized words]

And this image shows the scores from the TF/IDF process.

[Image: TF/IDF scores]
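The TF/IDF weighting behind those scores can be sketched as follows, assuming the standard tf × log(N/df) formulation (the demo's exact normalization may differ, and the `tf_idf` helper and sample documents are hypothetical):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Standard tf * log(N / df) weighting over a list of token lists."""
    n_docs = len(docs)
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: (tf[w] / len(doc)) * math.log(n_docs / df[w])
                       for w in tf})
    return scores

docs = [["children", "zoo", "elephant"],
        ["comedy", "children", "zoo"],
        ["comedy", "party"]]
scores = tf_idf(docs)
# "elephant" appears in only one document, so it scores higher in
# the first document than "children", which appears in two.
print(scores[0]["elephant"] > scores[0]["children"])  # True
```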

The last image shows the similarity of the movie “Lord of the Rings” to the other movies in the pool.

[Image: similarity results]

The code discussed above is mainly for demonstration purposes for the development team. There are a few shortcomings if it is applied directly in a production system. First, the pairwise similarity calculation is O(n²), which does not scale if you have, say, 100,000 movies, so you should consider a parallel-processing approach such as Hadoop. Second, the vector calculation is implemented as a for-loop, which can be optimized with a library specialized for vector operations, such as Python's NumPy. Third, we selected only a few attributes (dimensions) in the demonstration. In reality, there is no limit on the number of attributes used in the calculation, but high dimensionality suffers from the curse of dimensionality.
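On the second point, here is a sketch of how the per-pair for-loop could be replaced with matrix operations in NumPy (the movie vectors below are hypothetical, not the demo's data):

```python
import numpy as np

def cosine_matrix(X):
    """All-pairs cosine similarity for the row vectors of X, vectorized."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / norms          # normalize each row to unit length
    return X_unit @ X_unit.T    # pairwise dot products of unit vectors

# Binary keyword vectors for three hypothetical movies.
X = np.array([[0., 1., 1., 1.],
              [1., 1., 1., 0.],
              [1., 0., 0., 0.]])
sim = cosine_matrix(X)
print(round(float(sim[0, 1]), 5))  # 0.66667
```

One matrix multiplication computes every pair at once, so there is no Python-level loop over movie pairs; the O(n²) work still exists but runs in optimized native code.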