Recommender Systems Course Project
The goal of the course project is to develop a public domain suite of
software and test codes for recommender systems research. The software
should allow users to input datasets, apply various algorithms on the
datasets, and obtain important indicators of dataset characteristics
and/or algorithm performance. Important considerations: the code
should be able to handle massive datasets (e.g., millions of entries)
and use suitable data structures to speed up computations. The ability
to do parallel computing would be a plus.
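As one illustration of the data-structure requirement, ratings could be stored sparsely and indexed both by person and by object, so that either view of the data is cheap to access. This is only a sketch; the class and method names are made up for illustration and are not part of the assignment:

```python
# Illustrative sparse ratings store, indexed both ways so that either
# the people-space or the object-space view of the data is cheap.
from collections import defaultdict

class RatingStore:
    def __init__(self):
        self.by_user = defaultdict(dict)   # user -> {item: rating}
        self.by_item = defaultdict(dict)   # item -> {user: rating}

    def add(self, user, item, rating):
        self.by_user[user][item] = rating
        self.by_item[item][user] = rating

store = RatingStore()
store.add("alice", "matrix", 5)
store.add("bob", "matrix", 4)
print(len(store.by_item["matrix"]))  # 2 users rated "matrix"
```

Storing only observed entries (rather than a dense people-by-objects array) is what makes millions of ratings tractable in memory.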
At a basic level, the software should allow experimentation with
algorithms of four types: (i) linear algebra, (ii) graph theory, (iii)
probabilistic methods, and (iv) order statistics. Below, we identify what
capabilities each of these modules must support. The modules must be
plug-and-playable: output from one should be usable as input to another
(where appropriate).
Linear algebra: This module must support the view of recommendation datasets
in terms of vectors [1] - either people represented as vectors in object space,
or vice versa, or both. It must allow similarities between vectors to be
captured in various ways - e.g., cosine, inner product, Jaccard, Dice,
pseudo-cosine [6]. It must allow new similarities to be definable by the user,
depending on their domain constraints. Keep in mind that similarities can be
in either object space or people space. Given a vector, the module must allow
the ranking of the vectors most similar to it, according to
a chosen similarity metric. The module must allow the use of matrix
decompositions (e.g., SVD [2] - you can use a public domain SVD package here)
to create rank-r approximations to given matrices and compute similarities in
the reduced basis. It must be able to identify the representation for a given
person/object in the new basis [3]. Another matrix decomposition that must
be supported is the QR decomposition [1]. The user should also be able to create
new basis representations [4] so that they can study similarity in some new way.
Suitable pre-processing routines must be built in, e.g., re-centering users'
ratings so that every user has the same mean, or re-weighting neighbors (see [7]).
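A minimal sketch of a few of these capabilities - cosine similarity between user vectors, re-centering, and a rank-r approximation via SVD - is shown below. Here NumPy's built-in SVD stands in for the public-domain SVD package mentioned above, and the toy matrix is made up for illustration:

```python
# Sketch: cosine similarity, re-centering, and a rank-r SVD approximation.
import numpy as np

# Toy ratings matrix: rows = people, columns = objects.
A = np.array([[5.0, 4.0, 0.0],
              [4.0, 5.0, 1.0],
              [0.0, 1.0, 5.0]])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pre-processing: re-center each user's ratings to mean zero.
centered = A - A.mean(axis=1, keepdims=True)

# Rank-r approximation from the SVD (keep the top r singular triplets).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 2
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Similarities in the reduced basis: project users onto the top-r
# right singular vectors, then compare the projections.
reduced = A @ Vt[:r, :].T
print(round(cosine(reduced[0], reduced[1]), 3))
```

The same projection step gives the representation of a given person in the new basis; objects are handled symmetrically using the left singular vectors.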
Graph theory: This module must support the view of recommendation datasets
in terms of graphs. The original dataset must be interpretable as a bipartite
graph (of people and objects) and we must also be able to apply a similarity
metric (see "Linear algebra" module) and interpret the results of the similarity
metric as a social network graph. Alternatively, if we are using an
"item-based" algorithm [8], we must be able to interpret the results of the
similarity metric as a graph over objects. Hence, we obtain either a
bipartite graph or a unimodal graph. The module must allow the calculation of
various indicators on this graph - e.g., shortest path length, clustering
coefficient [9] (including multiple meanings of these terms). It must allow
the fitting of power-laws to the graphs and estimate the parameters
accurately [10]. Suitable templates and useful member functions must be
provided so that users can define their own graph metrics. Particular
attention must be paid to metrics that are of the `diffusive computation'
nature, i.e., where we iteratively re-compute numbers over the graph nodes
based on graph connectivity (e.g., as done in Google's PageRank and
the HITS algorithm). See [11] for ideas on how to structure graph theoretic
code and [12] for suggestions on a programming model.
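The `diffusive computation' pattern can be sketched as a PageRank-style iteration over a small graph; the graph and damping value below are illustrative only, not data from any recommender dataset:

```python
# Sketch of a diffusive computation: PageRank-style iteration, where
# each node's score is repeatedly re-computed from its neighbors'.
def pagerank(adj, d=0.85, iters=50):
    """adj: dict mapping node -> list of out-neighbors (no dangling nodes)."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - d) / n for v in nodes}
        for v in nodes:
            share = rank[v] / len(adj[v]) if adj[v] else 0.0
            for w in adj[v]:
                new[w] += d * share
        rank = new
    return rank

g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(g)
print(max(r, key=r.get))  # "c" accumulates the most rank here
```

Clustering coefficients, shortest paths, and user-defined graph metrics would plug into the same node/edge representation.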
Probabilistic methods: This module must support the generation of new
datasets that conform to a given probabilistic model and must also be able
to estimate probabilities from a given dataset, for modeling. It must support
three classes of models: (i) simple conditional independence models [5]
(e.g., Naive Bayes), (ii) EM algorithms (for simple model-based
collaborative filtering [13]), and (iii) latent variable models [14].
Here, you can make use of other public domain software, such as Kevin
Murphy's BayesNet toolbox.
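The two directions this module must support - sampling a dataset that conforms to a model, and estimating probabilities back from data - can be sketched with a toy two-class latent model. The class names and parameter values below are invented for illustration:

```python
# Sketch: (i) generate ratings from a simple two-class latent model;
# (ii) estimate a conditional probability back from the generated data.
import random

random.seed(0)
classes = {"action_fan":  {"die_hard": 0.9, "notebook": 0.1},
           "romance_fan": {"die_hard": 0.2, "notebook": 0.8}}

def sample_user():
    # Pick a latent class, then like each item independently given the class.
    c = random.choice(list(classes))
    return {item: int(random.random() < p) for item, p in classes[c].items()}

data = [sample_user() for _ in range(5000)]

# Estimate P(likes notebook | likes die_hard) by counting.
liked_dh = [u for u in data if u["die_hard"]]
p = sum(u["notebook"] for u in liked_dh) / len(liked_dh)
print(round(p, 2))
```

The true value here is 0.125/0.55, roughly 0.23, so the counting estimate should land near that. EM and latent variable models fit the same interface: a sampler and an estimator over the same dataset format.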
Order statistics: This module must support the application of
order-theoretic notions, such as ranking ratings into a list and capturing
similarities between people by comparing lists [16] (e.g., via Kendall's
tau or Spearman's rho coefficients). It must support the application of
boosting algorithms [17] to the task of combining preferences and arriving
at better recommendations.
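As an example of the order-theoretic primitives involved, Kendall's tau between two users' rankings can be computed directly by counting concordant and discordant item pairs (the rankings below are illustrative):

```python
# Sketch: Kendall's tau between two rankings, by pair counting (no ties).
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """rank_a, rank_b: dicts mapping item -> rank position (no ties)."""
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        s = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if s > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(items)
    return (concordant - discordant) / (n * (n - 1) / 2)

a = {"m1": 1, "m2": 2, "m3": 3}
b = {"m1": 1, "m2": 3, "m3": 2}
print(round(kendall_tau(a, b), 3))  # 1 of 3 pairs disagree -> (2 - 1) / 3
```

The pair-counting loop is quadratic in the number of items; for massive datasets an O(n log n) merge-sort-based count would be the appropriate substitute.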
The score for the project will be broken up into (i) how well your module
works and (ii) how well your module fits in with the others. Extra credit
will be given for attention to a scripting language for "stringing"
together all these modules to realize concrete scenarios.
References:
1. M.W. Berry, Z. Drmac, and E.R. Jessup, Matrices, Vector Spaces, and
Information Retrieval, SIAM Review, Vol. 41, No. 2, pages 335-362, 1999.
2. M.W. Berry, S.T. Dumais, and G.W. O'Brien, Using Linear Algebra for
Intelligent Information Retrieval, SIAM Review, Vol. 37, No. 4, pages
573-595, 1995.
3. A. Booker et al., Visualizing Text Datasets, IEEE Computing in Science and
Engineering, Vol. 1, No. 4, pages 26-34, July/August 1999.
4. F. Jiang and M.L. Littman, Approximate Dimension Equalization in
Vector-Based Information Retrieval, In Proceedings of the Seventeenth
International Conference on Machine Learning, 2000.
5. B. Kitts, D. Freed, and M. Vrieze, Cross-Sell: A Fast Promotion-Tunable
Customer-Item Recommendation Method Based on Conditional Independent
Probabilities, Proceedings of the Sixth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pages 437-446, ACM Press, Aug 2000.
6. W.P. Jones and G.W. Furnas, Pictures of relevance: a geometric analysis of
similarity measures, Journal of the American Society for Information
Science, Vol. 38, No. 6, pages 420-442, Nov 1987.
7. J.L. Herlocker, J.A. Konstan, A. Borchers, and J. Riedl, An Algorithmic
Framework for Performing Collaborative Filtering, Proceedings of the 22nd
annual international ACM SIGIR conference on Research and development in
information retrieval, pages 230-237, 1999.
8. B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, Item-based collaborative
filtering recommendation algorithms, Proceedings of the Tenth international
conference on World Wide Web, pages 285-295, May 2001.
9. M.E.J. Newman, The structure and function of complex networks, SIAM Review,
Vol. 45, pages 167-256, 2003.
10. L. Adamic, Zipf, Power-law, and Pareto: A Ranking Tutorial, available at:
http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html
11. G. Jeh and J. Widom, Mining the Space of Graph Properties, Proceedings
of KDD 2004, pages 187-196, 2004.
12. J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on
Large Clusters, OSDI 2004, http://labs.google.com/papers/mapreduce.html
13. W.S. Lee, Collaborative Learning for Recommender Systems, Proc. 18th
International Conference on Machine Learning, 2001.
14. T. Hofmann and J. Puzicha, Latent Semantic Models for Collaborative
Filtering, ACM Transactions on Information Systems, Vol. 22, No. 1,
pages 89-115, Jan 2004.
16. R. Fagin, R. Kumar, and D. Sivakumar, Comparing top k lists, SIAM
Journal on Discrete Mathematics, Vol. 17, No. 1, pages 134-160, 2003.
17. Y. Freund, R. Iyer, R.E. Schapire, Y. Singer, An Efficient Boosting
Algorithm for Combining Preferences, Journal of Machine Learning Research,
Vol. 4, pages 933-969, Nov 2003.