Lecture 11: Page Rank

Previously, we had only one metric for whether a document $d$ is a good fit for a query $q$: $\text{sim}(q,d)$. However, for huge document corpora, there may be too many documents $d$ for which $\text{sim}(q,d)$ is large.

One assumption that must be broken to tackle this problem is that all documents are equal; they are not. Wikipedia is more likely to answer your query than someone's personal web page.

Note: Page Rank is named after Google's Larry Page, not after web pages.

We assign a prior probability $P(d)$ to each document $d$ in our corpus. We can think of $P(d)$ as the probability that $d$ is relevant to $q$ before the user creates query $q$.
Whether $d$ is returned in response to query $q$ depends on both $\text{sim}(q,d)$ and $P(d)$. We treat $P(d)$ as the Page Rank of $d$.

In cases where document retrieval is based solely on $\text{sim}(q,d)$, we effectively assume $P(d)$ to be the same $\forall d$. This case is referred to as equal priors.

Constructing Priors

We can change our relevance assumption to be that the relevance of any document to a query is related to how often that document is accessed.

An example is comparing academic papers by their citation index. This is an example of a self-evaluating group, whereby each member evaluates all other group members.

Definition of a Markov Model

An $N$-state Markov model consists of an initial state probability distribution:

$$P_0 = \begin{bmatrix} P_0(1) \\ P_0(2) \\ \vdots \\ P_0(N) \end{bmatrix}$$

where $P_0(n)$ is the probability that the model starts in state $n$ and $\sum_{n=1}^{N} P_0(n) = 1$, together with an $N \times N$ state transition matrix:

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1j} & \cdots & a_{1N} \\ \vdots & \vdots & & \vdots & & \vdots \\ a_{i1} & a_{i2} & \cdots & a_{ij} & \cdots & a_{iN} \\ \vdots & \vdots & & \vdots & & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{Nj} & \cdots & a_{NN} \end{bmatrix}$$

where $a_{ij}$ is the probability of transitioning from state $i$ to state $j$, and each row sums to 1:

$$\sum_{n=1}^{N} a_{in} = 1, \quad \forall i = 1,\ldots,N$$
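As a concrete sketch, the two components can be written out for a hypothetical 3-state model (the numerical values below are assumptions chosen purely for illustration):

```python
import numpy as np

# Hypothetical 3-state Markov model (values are illustrative assumptions,
# not from the lecture).
P0 = np.array([0.5, 0.3, 0.2])            # initial state distribution
A = np.array([[0.7, 0.2, 0.1],            # a_ij = P(move to state j | in state i)
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

# Check the defining constraints: P0 sums to 1, and every row of A sums to 1.
print(np.isclose(P0.sum(), 1.0))          # True
print(np.allclose(A.sum(axis=1), 1.0))    # True
```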

From the initial state probability distribution and the state transition matrix, you can draw a transition diagram.

State probability distribution at time $t$

The state probability distribution at time $t=0$ is $P_0$; similarly, at time $t=1$ the distribution is $P_1$.

What happens to $P_t$ in general, and what happens as $t \rightarrow \infty$?

In order to calculate the state probability distribution at time 1, you must calculate and sum the probability of being in each state given a previous (or starting) state. The starting state could be any state. Diagrammatically, this can be seen in the transition diagram.

In matrix form this can be written:

$$P_1 = A^T P_0$$

and more generally,

$$P_t = A^T P_{t-1} = A^T A^T P_{t-2} = \cdots = (A^T)^t P_0$$

Suppose that $P_t \rightarrow P$ as $t \rightarrow \infty$, i.e. the probability distribution plateaus over time. Then:

$$P = A^T P$$

or, $P$ is an eigenvector of $A^T$ with an eigenvalue of 1.
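A minimal sketch of this iteration, assuming a small 3-state transition matrix (the values are illustrative, not from the lecture):

```python
import numpy as np

# Assumed row-stochastic transition matrix and initial distribution.
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
P = np.array([0.5, 0.3, 0.2])

# Iterate P_t = A^T P_{t-1}; for this matrix the distribution plateaus.
for _ in range(200):
    P = A.T @ P

# At the fixed point, P is an eigenvector of A^T with eigenvalue 1.
print(np.allclose(A.T @ P, P))  # True
```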

Analysis

Interestingly, we can encode an 'exit' state into our model by defining an initial state probability distribution such as:

$$P_0 = \begin{bmatrix} 0.6 \\ 0.2 \\ 0.2 \\ 0 \end{bmatrix}$$

with a corresponding transition matrix. Our state probability distribution will converge to:

$$P_t \rightarrow \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} \text{ as } t \rightarrow \infty$$
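This behaviour can be sketched with an assumed 4-state transition matrix in which the fourth state is absorbing (the probabilities are invented for illustration):

```python
import numpy as np

# Hypothetical 4-state model with an absorbing 'exit' state (state 4).
P0 = np.array([0.6, 0.2, 0.2, 0.0])
A = np.array([[0.5, 0.2, 0.2, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.2, 0.2, 0.5, 0.1],
              [0.0, 0.0, 0.0, 1.0]])   # exit state only transitions to itself

P = P0
for _ in range(500):
    P = A.T @ P

# All probability mass drains into the exit state.
print(np.allclose(P, [0.0, 0.0, 0.0, 1.0]))  # True
```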

Convergence

Do our state probability distributions always converge as $t \rightarrow \infty$? No.

Conditions for convergence of Markov Processes

Note: the following conditions are not expected to be known.

  1. The model must be irreducible, i.e. for every pair of states $s_0, s_1$ there must be a time $t$ and a state sequence $x_1, x_2, \ldots, x_t$ with $s_0 = x_1$, $s_1 = x_t$ and $P(x_1, x_2, \ldots, x_t) > 0$. Essentially, it must be possible to get from state $s_0$ to state $s_1$ via some state sequence with non-zero probability.
  2. The model must be aperiodic: a state is aperiodic if the HCF of the set of return times for that state is 1. A model is aperiodic if all of its states are aperiodic.
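For instance, a model that violates the aperiodicity condition fails to converge. A sketch with an assumed 2-state model that deterministically alternates:

```python
import numpy as np

# Assumed 2-state model whose states always swap: every return time is
# even, so the HCF of return times is 2 and the model is periodic.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
P = np.array([1.0, 0.0])

# The distribution oscillates between [1, 0] and [0, 1] forever,
# so P_t never settles to a fixed point.
history = []
for _ in range(4):
    P = A.T @ P
    history.append(P.copy())

print(history[0], history[1])  # [0. 1.] [1. 0.]
```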

Simplified Page Rank

Given a set of documents $D = \{ d_1,\ldots,d_N\}$ we define $pa(n)$ as the set of documents containing a hyperlink to $d_n$, and $h_m$ as the number of hyperlinks out of document $d_m$.

With this, the simplified Page Rank $x_n$ for document $d_n$ is given by:

$$x_n = \sum_{d_m \in pa(n)} \frac{x_m}{h_m}$$

As this equation is recursive, we instead compute it iteratively:

$$x_n^{(i+1)} = \sum_{d_m \in pa(n)} \frac{x_m^{(i)}}{h_m}$$

We can formulate this in matrix form:

Let $\mathbf{W}$ be the $N \times N$ matrix whose $(m,n)^{th}$ entry is given by:

$$w_{m,n} = \begin{cases} \frac{1}{h_n} & \text{if there is a hyperlink from } d_n \text{ to } d_m \\ 0 & \text{otherwise} \end{cases}$$

Let $x_i$ be the $N \times 1$ column vector whose $n^{th}$ entry is $x_n^{(i)}$; then:

$$x_{i+1} = \mathbf{W} x_i$$

We can therefore write this product out in full:

$$\mathbf{W}\vec{x}_i = \begin{bmatrix} \frac{1}{h_1} & \frac{1}{h_2} & 0 & \dots & \frac{1}{h_n} & \dots & \frac{1}{h_N} \\ 0 & \frac{1}{h_2} & \frac{1}{h_3} & \dots & \frac{1}{h_n} & \dots & \frac{1}{h_N} \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ \frac{1}{h_1} & \frac{1}{h_2} & \frac{1}{h_3} & \dots & \frac{1}{h_n} & \dots & \frac{1}{h_N} \end{bmatrix} \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \\ x_3^{(i)} \\ \vdots \\ x_N^{(i)} \end{bmatrix}$$
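A sketch of the simplified iteration on a small assumed link graph (four pages, link structure invented for illustration):

```python
import numpy as np

# Simplified Page Rank on an assumed 4-page link graph.
# links[n] lists the pages that page n hyperlinks to (0-indexed).
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
N = 4

# Build W: w[m, n] = 1/h_n if page n links to page m, else 0.
W = np.zeros((N, N))
for n, targets in links.items():
    for m in targets:
        W[m, n] = 1.0 / len(targets)

# Iterate x_{i+1} = W x_i from a uniform initial estimate.
x = np.full(N, 1.0 / N)
for _ in range(100):
    x = W @ x

# Total rank is conserved because every column of W sums to 1,
# and the iteration settles to a fixed point x = W x.
print(np.allclose(W @ x, x))  # True
```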

Simplified Page Rank: Markov model interpretation

We can implement Simplified Page Rank using a Markov model where:

$$P_0 = x_0$$
$$A = \mathbf{W}^T$$

I.e. $P_0$ is the initial estimate of Page Rank, and $\mathbf{W}$ is the transpose of the state transition probability matrix $\mathbf{A}$.

$\mathbf{A} = \mathbf{W}^T$ means that $P_t = \mathbf{W}P_{t-1}$.

"Damping Factor"

To this point, all authority $x_n$ of a page $d_n$ comes from the pages that have hyperlinks to it, i.e. a page with no hyperlinks to it will have an authority of 0. This breaks the first requirement for Markov convergence, irreducibility.
To solve this problem, we introduce a damping factor $d \in \mathbb{R}$, $0 < d < 1$, where $d$ is the proportion of authority that a page gets by default.

Our simplified Page Rank equation now becomes:

$$x_{i+1} = (1-d)\mathbf{W}x_i + \frac{d}{N}\mathbf{1}_N$$

where $\mathbf{1}_N = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}$ is an $N \times 1$ column vector of 1s.

Convergence

$$x = (1-d)\mathbf{W}x + \frac{d}{N}\mathbf{1}_N$$

is a system of $N$ equations with $N$ unknowns. We can rewrite this as a dynamic system:

$$x_{t+1} = (1-d)\mathbf{W}x_t + \frac{d}{N}\mathbf{1}_N$$

When considered in this way, we can see that it converges for any initial condition $x_0$ to a unique fixed point $x^*$ such that:

$$x^* = (1-d)\mathbf{W}x^* + \frac{d}{N}\mathbf{1}_N$$
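A sketch of this convergence, using an assumed 4-page link graph and damping factor: two different initial conditions reach the same fixed point.

```python
import numpy as np

# Damped Page Rank on an assumed 4-page link graph (illustrative data).
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
N, d = 4, 0.15   # d is the damping factor

# Build W: w[m, n] = 1/h_n if page n links to page m, else 0.
W = np.zeros((N, N))
for n, targets in links.items():
    for m in targets:
        W[m, n] = 1.0 / len(targets)

def damped_pagerank(x0, iters=200):
    # Iterate x_{t+1} = (1-d) W x_t + (d/N) 1_N.
    x = x0
    for _ in range(iters):
        x = (1 - d) * (W @ x) + (d / N) * np.ones(N)
    return x

# Any initial condition converges to the same unique fixed point x*.
xa = damped_pagerank(np.full(N, 1.0 / N))
xb = damped_pagerank(np.array([1.0, 0.0, 0.0, 0.0]))
print(np.allclose(xa, xb))  # True
```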

Dangling Pages

A page which contains no hyperlinks to other pages is referred to as a dangling page, as in a tree-based representation such pages form the leaf nodes.
If $d_n$ is a dangling page, then the $n^{th}$ column of $\mathbf{W}$ consists entirely of 0s. Therefore the $n^{th}$ column sums to 0 and $\mathbf{W}$ is no longer a column-stochastic matrix. Cases such as these can break some of our earlier analysis.

One solution to dangling pages is to introduce a dummy page $d_{N+1}$ and add a link to $d_{N+1}$ from every previously dangling page.
We then extend our transition matrix to get a new $(N+1) \times (N+1)$ matrix and create a dangling-page indicator $\vec{r} = (r_1, r_2, \ldots, r_N)$ where:

$$r_n = \begin{cases} 1 & \text{if } d_n \text{ is a dangling page} \\ 0 & \text{otherwise} \end{cases}$$

We add $\vec{r}$ as the bottom row of $\mathbf{W}$ and append a column of $N$ 0s followed by a single 1, so that our new transition matrix $\mathbf{\bar{W}}$ becomes:

$$\mathbf{\bar{W}} = \begin{bmatrix} \frac{1}{h_1} & \frac{1}{h_2} & 0 & \dots & 0 & \dots & \frac{1}{h_N} & 0 \\ 0 & \frac{1}{h_2} & \frac{1}{h_3} & \dots & 0 & \dots & \frac{1}{h_N} & 0 \\ \frac{1}{h_1} & 0 & \frac{1}{h_3} & \dots & 0 & \dots & \frac{1}{h_N} & 0 \\ \vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\ \frac{1}{h_1} & \frac{1}{h_2} & \frac{1}{h_3} & \dots & 0 & \dots & \frac{1}{h_N} & 0 \\ 0 & 0 & 0 & \dots & 1 & \dots & 0 & 1 \end{bmatrix}$$

Here you can see that each dangling page has a hyperlink to $d_{N+1}$, and $d_{N+1}$ has only a link to itself.

An alternative solution is to create links from dangling pages to all pages. Here we construct an $N \times N$ matrix $\mathbf{V}$ whose $N$ rows are each equal to $\frac{\vec{r}^{\,T}}{N}$.

Finally we construct the $N \times N$ matrix $\mathbf{W+V}$ of the form:

$$\mathbf{W + V} = \begin{bmatrix} \frac{1}{h_1} & \frac{1}{h_2} & 0 & \dots & \frac{1}{N} & \dots & \frac{1}{h_N} \\ 0 & \frac{1}{h_2} & \frac{1}{h_3} & \dots & \frac{1}{N} & \dots & \frac{1}{h_N} \\ \frac{1}{h_1} & 0 & \frac{1}{h_3} & \dots & \frac{1}{N} & \dots & \frac{1}{h_N} \\ \vdots & \vdots & \vdots & & \vdots & & \vdots \\ \frac{1}{h_1} & \frac{1}{h_2} & \frac{1}{h_3} & \dots & \frac{1}{N} & \dots & \frac{1}{h_N} \end{bmatrix}$$

With this new transition matrix, our Page Rank equation becomes:

$$x = (1-d)(\mathbf{W+V})x + \frac{d}{N}\mathbf{1}_N$$
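A sketch of this second approach, assuming a small 4-page graph in which one page is dangling (the link data is invented for illustration):

```python
import numpy as np

# Page Rank with dangling-page handling via W + V.
links = {0: [1, 2], 1: [2], 2: [0], 3: []}   # page 3 has no out-links
N, d = 4, 0.15

W = np.zeros((N, N))
r = np.zeros(N)                               # dangling-page indicator
for n, targets in links.items():
    if targets:
        for m in targets:
            W[m, n] = 1.0 / len(targets)
    else:
        r[n] = 1.0

# V spreads each dangling page's authority evenly over all pages:
# every row of V is r/N, so column n of V is 1/N wherever r[n] == 1.
V = np.outer(np.ones(N), r) / N

x = np.full(N, 1.0 / N)
for _ in range(200):
    x = (1 - d) * ((W + V) @ x) + (d / N) * np.ones(N)

# (W + V) is column-stochastic, so the total rank stays 1.
print(np.isclose(x.sum(), 1.0))  # True
```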

Probabilistic Interpretation of Page Rank

We can essentially think of each page as a state in a Markov model, or a node in a graph. Each node or state is connected via the hyperlink structure. Each connection between nodes $p$ and $q$ is weighted by the probability of its usage. These weights depend only on the current node, not on the path to it; this is the Markov property.