PageRank Module


Introduction:


PageRank is a global ranking system that tries to capture the importance of pages on the World Wide Web.

The Web is very large, with great variation in the amount, quality, and type of information in Web pages.
Moreover, many Web pages are not sufficiently self-descriptive. Hence traditional information retrieval (IR) techniques are not effective enough at capturing the importance of a page.

The link structure of the Web contains important implied information, and can help in filtering or ranking Web pages. In particular, a link from page A to page B can be considered a recommendation of page B by the author of page A. Many algorithms have been proposed to exploit the link structure, such as HITS (Hypertext Induced Topic Search) by Kleinberg [1999], PageRank by Page et al. [1998], and Citation Ranking.

Background:


PageRank ranks a page by taking into consideration both the number of pages that point to it and the importance of those pages.

Let the pages on the Web be denoted by 1, 2, . . . , m. Let N(i) denote the number of forward (outgoing) links from page i. Let B(i) denote the set of pages that point to page i. For now, assume that the Web pages form a strongly connected graph (every page can be reached from any other page). We will relax this assumption shortly. The simple PageRank of page i, denoted by r(i), is given by:

    r(i) = Σ_{j ∈ B(i)} r(j) / N(j)
In linear algebra terms, this can be written as r = A^T r, where r is the m x 1 vector [r(1), r(2), . . . , r(m)], and the elements a_{i,j} of the matrix A are given by a_{i,j} = 1/N(i) if page i points to page j, and a_{i,j} = 0 otherwise. Thus the PageRank vector r is the eigenvector of the matrix A^T corresponding to the eigenvalue 1. Since the graph is strongly connected, it can be shown that 1 is an eigenvalue of A^T, and the eigenvector r is uniquely defined (once normalized).
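To make the eigenvector formulation concrete, the following Python sketch (the 3-page cycle is our own toy example, not data from this document) builds the matrix A for the cycle 1 → 2 → 3 → 1 and checks that the uniform vector satisfies r = A^T r:

```python
# Toy check of r = A^T r for the 3-page cycle 1 -> 2 -> 3 -> 1.
# A[i][j] = 1/N(i) if page i points to page j, else 0.
links = {1: [2], 2: [3], 3: [1]}
pages = sorted(links)
A = [[1.0 / len(links[i]) if j in links[i] else 0.0 for j in pages]
     for i in pages]

# On a cycle every page has the same rank, so r = (1/3, 1/3, 1/3)
# should be a fixed point of the map r -> A^T r.
r = [1.0 / 3] * 3
ATr = [sum(A[i][j] * r[i] for i in range(3)) for j in range(3)]
print(ATr)  # each component equals 1/3
```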


Design issues of PageRank:


1. Choosing a method:

One of the methods for computing the principal eigenvector of a matrix is called Power Iteration. In power iteration, an arbitrary initial vector is multiplied repeatedly with the given matrix, until it converges to the principal eigenvector. Power iteration is guaranteed to converge only if the graph is aperiodic. (A strongly connected directed graph is aperiodic if the greatest common divisor of the lengths of all closed walks is 1.) In practice, the Web is always aperiodic.
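Although the module itself will be written in Perl, power iteration is easiest to see in a short sketch. The following Python fragment (the graph and the fixed iteration count are illustrative assumptions) repeatedly redistributes each page's rank along its forward links:

```python
# Power iteration for simple PageRank on a strongly connected,
# aperiodic toy graph. `out` maps each page to its forward links.
def power_iterate(out, iters=100):
    pages = list(out)
    r = {p: 1.0 / len(pages) for p in pages}  # arbitrary initial vector
    for _ in range(iters):
        nxt = {p: 0.0 for p in pages}
        for p in pages:
            share = r[p] / len(out[p])  # r(p) split over N(p) outlinks
            for q in out[p]:
                nxt[q] += share
        r = nxt
    return r

# 1 -> {2, 3}, 2 -> {3}, 3 -> {1}: this graph is strongly connected
# and aperiodic, and the iteration converges to
# r(1) = r(3) = 0.4, r(2) = 0.2.
ranks = power_iterate({1: [2, 3], 2: [3], 3: [1]})
```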

2. Problems with PageRank Calculation:

Any strongly (internally) connected cluster of pages within the Web graph from which no links point outwards forms a Rank Sink. An individual page that does not have any outlinks constitutes a Rank Leak. In case of a rank sink, nodes not in a sink receive a zero rank. On the other hand, any rank reaching a rank leak is lost forever.

2a. Eliminating Rank Sink:

We plan to introduce a decay factor d (0 < d < 1) in the PageRank definition. In this modified definition, only a fraction d of the rank of a page is distributed among the nodes that it points to. The remaining rank is distributed equally among all the pages on the Web. Thus, the modified PageRank is

    r(i) = d Σ_{j ∈ B(i)} r(j) / N(j) + (1 - d) / m

where m is the total number of nodes in the graph.
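One step of this modified update can be sketched as follows; the decay value d = 0.85 and the toy graph are illustrative assumptions, not figures from this document:

```python
# One step of the decayed update:
#   r'(i) = d * sum_{j in B(i)} r(j)/N(j) + (1 - d)/m
def decayed_step(out, r, d=0.85):
    m = len(out)
    nxt = {p: (1.0 - d) / m for p in out}  # rank spread over all m pages
    for j, targets in out.items():
        share = d * r[j] / len(targets)    # fraction d flows along links
        for i in targets:
            nxt[i] += share
    return nxt

r = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
r = decayed_step({1: [2, 3], 2: [3], 3: [1]}, r)
print(sum(r.values()))  # ~1.0: total rank mass is conserved
```

Note that each step conserves the total rank: the fraction d of the old total flows along links, and the remaining (1 - d) is redistributed uniformly.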

2b. Eliminating Rank Leak:

We are assuming that leak nodes have links back to all the pages that point to them. This way, leak nodes that are reachable via high-rank pages will have a higher rank than leak nodes reachable through unimportant pages.
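This back-link rule can be sketched as a preprocessing pass over the link map (the two-page graph and the function name are made up for illustration):

```python
# Give every leak node (a page with no outlinks) links back to all
# pages that point to it, as described above.
def patch_leaks(out):
    backlinks = {p: [] for p in out}
    for p, targets in out.items():
        for q in targets:
            backlinks.setdefault(q, []).append(p)
    for p, targets in out.items():
        if not targets:                      # p is a rank leak
            out[p] = list(backlinks.get(p, []))
    return out

g = {1: [2], 2: []}  # page 2 is a rank leak
patch_leaks(g)
print(g[2])          # [1]
```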

3. Convergence of algorithm:

In order for the Power Iteration to be practical, it is not only necessary that it converge to the PageRank, but that it does so in a few iterations. The convergence of the Power Iteration for a matrix depends on the eigenvalue gap, which is defined as the difference between the moduli of the two largest eigenvalues of the given matrix. Page et al. [1998] claim that the gap is large enough for the Web graph, and that the Power Iteration converges reasonably fast (in around 100 iterations).

Arasu et al. [2002] made a quantitative study of various iterative techniques and found that the Gauss-Seidel method converges faster than both power iteration and the Jacobi method. We plan to implement this method as well, in order to verify their claim.
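To illustrate the difference, a Gauss-Seidel-style sweep updates the rank vector in place, so pages visited later in a sweep already see the current sweep's values. The graph, decay value, and sweep count below are our own illustrative choices, not figures from Arasu et al.:

```python
# One Gauss-Seidel sweep for decayed PageRank: r is updated in place,
# so later pages in the sweep use already-updated ranks (power
# iteration would use only the previous iteration's vector).
def gauss_seidel_sweep(inlinks, outdeg, r, d, m):
    for i in r:
        s = sum(r[j] / outdeg[j] for j in inlinks.get(i, []))
        r[i] = d * s + (1 - d) / m
    return r

inlinks = {1: [3], 2: [1], 3: [1, 2]}  # backlinks B(i)
outdeg = {1: 2, 2: 1, 3: 1}            # forward-link counts N(i)
r = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
for _ in range(100):
    gauss_seidel_sweep(inlinks, outdeg, r, 0.85, 3)
print(sum(r.values()))  # at the fixed point the ranks sum to ~1
```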

4. Reducing iterations in algorithm:

Since we are only interested in the relative ordering of the pages induced by PageRank, rather than in the actual PageRank values themselves, we plan to terminate the power iteration once the ordering of the pages becomes reasonably stable.
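The stopping test can be sketched by comparing the orderings induced by two successive iterations (the rank values below are invented for illustration):

```python
# Compare the page orderings induced by two successive rank vectors.
def same_ordering(prev, curr):
    order = lambda r: sorted(r, key=r.get, reverse=True)
    return order(prev) == order(curr)

prev = {1: 0.30, 2: 0.50, 3: 0.20}
curr = {1: 0.28, 2: 0.55, 3: 0.17}  # values moved, ordering did not
print(same_ordering(prev, curr))    # True
```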

5. Robustness against spamming:

Since PageRank is a global ranking algorithm, it is relatively robust to spamming. We also plan to put a limit on the number of links that point to a given page from within the same domain, to further reduce spamming.

6. Improving Performance:

We will be implementing the code in Perl. We also plan to profile the program so as to identify the snippets of code that take the most time. Their performance can be improved either by changing the data structures or by caching. Perl provides a module, Memoize, that caches the return value of a function: if the function is called again with the same arguments, the cached value is returned.
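For readers more familiar with Python, functools.lru_cache gives the same effect as Perl's Memoize; the lookup function and data below are a made-up example:

```python
from functools import lru_cache

calls = 0
LINKS = {"N": ["B", "C", "D", "E", "F"]}

@lru_cache(maxsize=None)
def n_outlinks(page_id):
    """Stand-in for an expensive per-page computation."""
    global calls
    calls += 1
    return len(LINKS[page_id])

print(n_outlinks("N"), n_outlinks("N"))  # 5 5 - second call is cached
print(calls)                             # 1: the body ran only once
```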

Data Structures Needed:


Inputs Required from the Storage Module:

For calculating PageRank, we must have information about each link between two pages. This information will be provided by the storage group in the following form:

Page-id (source)     No_Links     Pageids (destinations)

For example, a typical entry is

    N     5     B C D E F

where N, B, C, D, E, and F are 8-byte pageids.



Data Structures and Functions Used:

Node of a Graph:
sub new {
    my $this = {
        vertices => {},   # number of outlinks per page
        in_edges => {},   # backlinks per page
        rank     => {}    # current PageRank per page
    };
    bless $this, "graph";
    return $this;
}

Backlinks are stored per page, using the page identifier as a hash key, and inserted as elements of an anonymous array. This is done in a single pass while scanning the pageids. Along the way we also store the number of outlinks per page, keyed by the same page identifier. The PageRank of each page is likewise stored per page alongside the other attributes.

Add_Edge:


This function is called as each edge is inserted into the graph.

sub add_edge {
    my ($this, $x, $y) = @_;
    # Edge from x to y.
    # If pageid x is seen for the first time, initialize its outdegree.
    $this->{vertices}{$x} = 0 unless exists $this->{vertices}{$x};
    $this->{vertices}{$x}++;
    # Add the pageid x to the array associated with pageid y as a backlink.
    push @{ $this->{in_edges}{$y} }, $x;
}

In_Edge:

This function returns the backlinks of a given page.


sub in_edge {
    my ($this, $x) = @_;
    # x is the given pageid.
    # If x has backlinks, return a reference to the array containing them.
    return $this->{in_edges}{$x} if exists $this->{in_edges}{$x};
    # Else return 0.
    return 0;
}

PageRank Algorithm:

sub page_rank
{
    # For each pageid, do the following:
    #   get all the backlinks of that page;
    #   calculate the contribution of each backlink to the PageRank
    #   of the current page (its rank divided by its outdegree).
    # Their sum, multiplied by the decay factor d, plus the
    # redistribution term (1 - d)/m, gives the new PageRank
    # for this page.
    # Verify whether the new PageRanks are close, i.e. within a
    # threshold epsilon of the previous PageRanks.
    # If so, exit; else repeat the iteration.
}

Performance Characteristics:

Let us consider the number of pages to be ranked as 2,000,000.

Assuming each document has around 30 links on average, we require 8 + 4 + 8 + 8 + 30*8 = 268 bytes per page. For 2,000,000 pages this comes to around 536 MB, which will fit in the main memory of either of the machines Pushya and Aaslesha, each of which has 1 GB. We plan to run the computation on one of these machines for maximum efficiency.
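The arithmetic behind this estimate checks out:

```python
# Verify the per-page and total memory figures quoted above.
per_page = 8 + 4 + 8 + 8 + 30 * 8  # bytes per page
total = per_page * 2_000_000       # bytes for 2,000,000 pages
print(per_page)                    # 268
print(total / 10**6)               # 536.0 (MB)
```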


Future Work:

1. The Power Iteration needs to be parallelized to scale to ranking a larger set of pages. We intend to do this in the future.

2. A variation of the Power Iteration, the inverse Power Iteration, scales well when parallelized compared to the original method. We also plan to implement this in the future.