Recommender Systems Course Project
The goal of the course project is to develop a public domain suite of
software and test codes for recommender systems research. The software
should allow users to input datasets, apply various algorithms on the
datasets, and obtain important indicators of dataset characteristics
and/or algorithm performance. Important considerations: the code
should be able to handle massive datasets (e.g., millions of entries)
and use suitable data structures to speed up computations. The ability
to do parallel computing would be a plus.
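As one illustration of the data-structure requirement, ratings could be stored sparsely and indexed both by person and by object, so that either view of the data is cheap to access. This is only a sketch; the class and method names are made up for illustration and are not part of the assignment:

```python
# Illustrative sparse ratings store, indexed both ways so that either
# the people-space or the object-space view of the data is cheap.
from collections import defaultdict

class RatingStore:
    def __init__(self):
        self.by_user = defaultdict(dict)   # user -> {item: rating}
        self.by_item = defaultdict(dict)   # item -> {user: rating}

    def add(self, user, item, rating):
        self.by_user[user][item] = rating
        self.by_item[item][user] = rating

store = RatingStore()
store.add("alice", "matrix", 5)
store.add("bob", "matrix", 4)
print(len(store.by_item["matrix"]))  # 2 users rated "matrix"
```

Storing only observed entries (rather than a dense people-by-objects array) is what makes millions of ratings tractable in memory.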
At a basic level, the software should allow experimentation with
algorithms of four types: (i) linear algebra, (ii) graph theory, (iii)
probabilistic methods, and (iv) order statistics. Below, we identify what
capabilities each of these modules must support. The modules must be
plug-and-playable: output from one should be usable as input to another
(where appropriate).
Linear algebra: This module must support the view of recommendation datasets
in terms of vectors [1] - either people represented as vectors in object space,
or vice versa, or both. It must allow similarities between vectors to be
captured in various ways - e.g., cosine, inner product, Jaccard, Dice,
pseudo-cosine [6]. It must allow new similarities to be definable by the user,
depending on their domain constraints. Keep in mind that similarities can be
in either object space or people space. Given a vector, the module must allow
the ranking of the vectors most similar to it, according to
a chosen similarity metric. The module must allow the use of matrix
decompositions (e.g., SVD [2] - you can use a public domain SVD package here)
to create rank-r approximations to given matrices and compute similarities in
the reduced basis. It must be able to identify the representation for a given
person/object in the new basis [3]. Another matrix decomposition that must
be supported is the QR decomposition [1]. The user should also be able to create
new basis representations [4] so that they can study similarity in some new way.
Suitable pre-processing routines must be built in, e.g., re-centering users'
ratings so that every user has the same mean, or re-weighting neighbors (see [7]).
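A minimal sketch of a few of these capabilities - cosine similarity between user vectors, re-centering, and a rank-r approximation via SVD - is shown below. Here NumPy's built-in SVD stands in for the public-domain SVD package mentioned above, and the toy matrix is made up for illustration:

```python
# Sketch: cosine similarity, re-centering, and a rank-r SVD approximation.
import numpy as np

# Toy ratings matrix: rows = people, columns = objects.
A = np.array([[5.0, 4.0, 0.0],
              [4.0, 5.0, 1.0],
              [0.0, 1.0, 5.0]])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pre-processing: re-center each user's ratings to mean zero.
centered = A - A.mean(axis=1, keepdims=True)

# Rank-r approximation from the SVD (keep the top r singular triplets).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 2
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Similarities in the reduced basis: project users onto the top-r
# right singular vectors, then compare the projections.
reduced = A @ Vt[:r, :].T
print(round(cosine(reduced[0], reduced[1]), 3))
```

The same projection step gives the representation of a given person in the new basis; objects are handled symmetrically using the left singular vectors.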
Graph theory: This module must support the view of recommendation datasets
in terms of graphs. The original dataset must be interpretable as a bipartite
graph (of people and objects) and we must also be able to apply a similarity
metric (see "Linear algebra" module) and interpret the results of the similarity
metric as a social network graph. Alternatively, if we are using an
"item-based" algorithm [8], we must be able to interpret the results of the
similarity metric as a graph over objects. Hence, we obtain either a
bipartite graph or a unimodal graph. The module must allow the calculation of
various indicators on this graph - e.g., shortest path length, clustering
coefficient [9] (including multiple meanings of these terms). It must allow
the fitting of power-laws to the graphs and estimate the parameters
accurately [10]. Suitable templates and useful member functions must be
provided so that users can define their own graph metrics. Particular
attention must be paid to metrics that are of the `diffusive computation'
nature, i.e., where we iteratively re-compute numbers over the graph nodes
based on graph connectivity (e.g., as done in Google's PageRank and
the HITS algorithm). See [11] for ideas on how to structure graph theoretic
code and [12] for suggestions on a programming model.
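The `diffusive computation' pattern can be sketched as a PageRank-style iteration over a small graph; the graph and damping value below are illustrative only, not data from any recommender dataset:

```python
# Sketch of a diffusive computation: PageRank-style iteration, where
# each node's score is repeatedly re-computed from its neighbors'.
def pagerank(adj, d=0.85, iters=50):
    """adj: dict mapping node -> list of out-neighbors (no dangling nodes)."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - d) / n for v in nodes}
        for v in nodes:
            share = rank[v] / len(adj[v]) if adj[v] else 0.0
            for w in adj[v]:
                new[w] += d * share
        rank = new
    return rank

g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(g)
print(max(r, key=r.get))  # "c" accumulates the most rank here
```

Clustering coefficients, shortest paths, and user-defined graph metrics would plug into the same node/edge representation.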
Probabilistic methods: This module must support the generation of new
datasets that conform to a given probabilistic model and must also be able
to estimate probabilities from a given dataset, for modeling. It must support
three classes of models: (i) simple conditional independence models [5]
(e.g., Naive Bayes), (ii) EM algorithms (for simple model-based
collaborative filtering [13]), and (iii) latent variable models [14].
Here, you can make use of other public domain software, such as Kevin
Murphy's BayesNet toolbox.
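The two directions this module must support - sampling a dataset that conforms to a model, and estimating probabilities back from data - can be sketched with a toy two-class latent model. The class names and parameter values below are invented for illustration:

```python
# Sketch: (i) generate ratings from a simple two-class latent model;
# (ii) estimate a conditional probability back from the generated data.
import random

random.seed(0)
classes = {"action_fan":  {"die_hard": 0.9, "notebook": 0.1},
           "romance_fan": {"die_hard": 0.2, "notebook": 0.8}}

def sample_user():
    # Pick a latent class, then like each item independently given the class.
    c = random.choice(list(classes))
    return {item: int(random.random() < p) for item, p in classes[c].items()}

data = [sample_user() for _ in range(5000)]

# Estimate P(likes notebook | likes die_hard) by counting.
liked_dh = [u for u in data if u["die_hard"]]
p = sum(u["notebook"] for u in liked_dh) / len(liked_dh)
print(round(p, 2))
```

The true value here is 0.125/0.55, roughly 0.23, so the counting estimate should land near that. EM and latent variable models fit the same interface: a sampler and an estimator over the same dataset format.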
Order statistics: This module must support the application of
order-theoretic notions, such as ranking ratings into a list and capturing
similarities between people by comparing lists [16] (e.g., via Kendall's
tau or Spearman's rho coefficients). It must support the application of
boosting algorithms [17] to the task of combining preferences and arriving
at better recommendations.
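As an example of the order-theoretic primitives involved, Kendall's tau between two users' rankings can be computed directly by counting concordant and discordant item pairs (the rankings below are illustrative):

```python
# Sketch: Kendall's tau between two rankings, by pair counting (no ties).
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """rank_a, rank_b: dicts mapping item -> rank position (no ties)."""
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        s = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if s > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(items)
    return (concordant - discordant) / (n * (n - 1) / 2)

a = {"m1": 1, "m2": 2, "m3": 3}
b = {"m1": 1, "m2": 3, "m3": 2}
print(round(kendall_tau(a, b), 3))  # 1 of 3 pairs disagree -> (2 - 1) / 3
```

The pair-counting loop is quadratic in the number of items; for massive datasets an O(n log n) merge-sort-based count would be the appropriate substitute.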
The score for the project will be broken up into (i) how well your module
works and (ii) how well your module fits in with the others. Extra credit
will be given for attention to a scripting language for "stringing"
together all these modules to realize concrete scenarios.
References:
1. M.W. Berry, Z. Drmac, and E.R. Jessup, Matrices, Vector Spaces, and
Information Retrieval, SIAM Review, Vol. 41, No. 2, pages 335-362, 1999.
2. M.W. Berry, S.T. Dumais, and G.W. O'Brien, Using Linear Algebra for
Intelligent Information Retrieval, SIAM Review, Vol. 37, No. 4, pages
573-595, 1995.
3. A. Booker et al., Visualizing Text Datasets, IEEE Computing in Science and
Engineering, Vol. 1, No. 4, pages 26-34, July/August 1999.
4. F. Jiang and M.L. Littman, Approximate Dimension Equalization in
Vector-Based Information Retrieval, In Proceedings of the Seventeenth
International Conference on Machine Learning, 2000.
5. B. Kitts, D. Freed, and M. Vrieze, Cross-Sell: A Fast Promotion-Tunable
Customer-Item Recommendation Method Based on Conditional Independent
Probabilities, Proceedings of the Sixth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pages 437-446, ACM Press, Aug 2000.
6. W.P. Jones and G.W. Furnas, Pictures of relevance: a geometric analysis of
similarity measures, Journal of the American Society for Information
Science, Vol. 38, No. 6, pages 420-442, Nov 1987.
7. J.L. Herlocker, J.A. Konstan, A. Borchers, and J. Riedl, An Algorithmic
Framework for Performing Collaborative Filtering, Proceedings of the 22nd
annual international ACM SIGIR conference on Research and development in
information retrieval, pages 230-237, 1999.
8. B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, Item-based collaborative
filtering recommendation algorithms, Proceedings of the Tenth international
conference on World Wide Web, pages 285-295, May 2001.
9. M.E.J. Newman, The structure and function of complex networks, SIAM Review,
Vol. 45, pages 167-256, 2003.
10. L. Adamic, Zipf, Power-law, and Pareto: A Ranking Tutorial, available at:
http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html
11. G. Jeh and J. Widom, Mining the Space of Graph Properties, Proceedings
of KDD 2004, pages 187-196, 2004.
12. J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on
Large Clusters, OSDI 2004, http://labs.google.com/papers/mapreduce.html
13. W.S. Lee, Collaborative Learning for Recommender Systems, Proc. 18th
International Conference on Machine Learning, 2001.
14. T. Hofmann and J. Puzicha, Latent Semantic Models for Collaborative
Filtering, ACM Transactions on Information Systems, Vol. 22, No. 1,
pages 89-115, Jan 2004.
16. R. Fagin, R. Kumar, and D. Sivakumar, Comparing top k lists, SIAM
Journal on Discrete Mathematics, Vol. 17, No. 1, pages 134-160, 2003.
17. Y. Freund, R. Iyer, R.E. Schapire, Y. Singer, An Efficient Boosting
Algorithm for Combining Preferences, Journal of Machine Learning Research,
Vol. 4, pages 933-969, Nov 2003.