PageRank myths

The genesis of this post was an online conversation I had with Giles at Realwire. It inspired me to pull together some thoughts on the myths and misconceptions that exist amongst marketers about PageRank. For those of you reading this who are not fully au fait with PageRank, I will first provide a whirlwind primer, and will recommend some more in-depth reading material at the end of the post.

PageRank is a mathematical way of analysing the links between websites. It was developed at Stanford University by Sergey Brin and Larry Page. It built on work done as far back as the 1950s at the University of Pennsylvania, and on the HITS algorithm developed by Jon Kleinberg as part of his work on the CLEVER project at IBM’s Almaden Research Center. HITS is similar in nature to the foundation algorithm of the critically well-regarded Teoma search engine, purchased by Ask in 2001.

In Google’s own words:

PageRank Technology: PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results.

PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value. We have always taken a pragmatic approach to help improve search quality and create useful products, and our technology uses the collective intelligence of the web to determine a page’s importance.

Page and Brin described it in their famous academic paper:

PageRank can be thought of as a model of user behavior. We assume there is a “random surfer” who is given a web page at random and keeps clicking on links, never hitting “back” but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. And, the d damping factor is the probability at each page the “random surfer” will get bored and request another random page. One important variation is to only add the damping factor d to a single page, or a group of pages. This allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking. We have several other extensions to PageRank, again see [Page 98].

Another intuitive justification is that a page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well cited from many places around the web are worth looking at. Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not high quality, or was a broken link, it is quite likely that Yahoo’s homepage would not link to it. PageRank handles both these cases and everything in between by recursively propagating weights through the link structure of the web.
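
The random-surfer model quoted above can be sketched in a few lines of Python. This is only a toy illustration, not Google's implementation: the four-page site and its link structure are made up, the scores are normalised so that they sum to 1, and `d` is taken as the probability that the surfer follows a link (so `1 - d` is the chance of getting bored and jumping to a random page). The 0.85 default is the value the original paper says it usually uses.

```python
def pagerank(links, d=0.85, iterations=50):
    """Toy PageRank by repeated propagation of weights through links.

    links maps each page to the list of pages it links to.
    d is the probability of following a link; 1 - d is the chance
    of getting bored and jumping to a random page (assumed convention).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal weight everywhere

    for _ in range(iterations):
        # Every page gets a baseline share from bored surfers landing at random.
        new_rank = {page: (1.0 - d) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            # A page with no outbound links is treated as linking to every page
            # (one common way to handle dead ends; an assumption here).
            recipients = targets if targets else pages
            share = d * rank[page] / len(recipients)
            for target in recipients:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page site: everything links to "home", nothing links to "orphan".
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.4f}")
```

In this example "home" comes out on top because every other page links to it, while "orphan" is left with only the baseline share from random jumps. That is the recursive propagation of weights the paper describes, in miniature.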

The PageRank algorithm (along with similar algorithms used by other search engines) represented what my former colleague Eckhart Walther called the ‘tyranny of the minority’, where only webmasters got to ‘vote’, through outbound links to sites, on what was authoritative or influential on the web. In addition, not all votes were created equal: ‘votes’ from recognised high-authority sites like the BBC or academic institutions would count for more than a link from this blog.

Myths

  • PageRank is the measure of my page – No, it’s a measure of the site as a whole that the page sits on. The ‘Page’ in PageRank is actually Larry Page.
  • PageRank is life-and-death – No, it isn’t. Google claims to use some 200 ‘signals’, of which the ‘PageRank algorithm’ is just one. Two things here. First, the 0-to-10 value that consumers see in the Google Toolbar is only an approximation; in reality it is closer to a logarithmic scaling of the value likely to be churned out by the original PageRank algorithm. An integer from 0 to 10 just wouldn’t provide the granularity needed to separate the millions of possible results to a search query. The PageRank shown in the toolbar changes about four times a year; in comparison, the secret sauce of the search engine probably changes hundreds of times a year, with a major change every six weeks or so (what we used to call a ‘weather report’ at Yahoo!). Second, even Google themselves tell website owners not to focus on it as a measure, and have taken it down from their Webmaster Tools to try to prevent people obsessing needlessly about it.
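
To make the granularity point concrete, here is a rough Python sketch of what squashing a raw score onto a 0-to-10 logarithmic scale does. Google has never published how the toolbar value is derived, so the function name `toolbar_bucket`, the base of 8 and the raw scores below are all invented purely for illustration.

```python
import math


def toolbar_bucket(raw_score, base=8.0):
    """Map a raw score onto a 0-10 integer using a logarithmic scale.

    Both the base and the idea of a 'raw score' relative to 1.0 are
    hypothetical; the real toolbar formula is not public.
    """
    if raw_score <= 0:
        return 0
    bucket = math.floor(math.log(raw_score, base))
    return max(0, min(10, bucket))  # clamp to the 0-10 range shown to users


# Two made-up sites, one with five times the raw score of the other,
# land in the same bucket; a third needs to be far bigger to move up one.
for raw in (10, 50, 100):
    print(raw, "->", toolbar_bucket(raw))
```

With these made-up numbers, a site would need roughly eight times the raw score to climb a single point, which is why a difference of one on the toolbar says very little about how two sites compare for any given query.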

PageRank is useful as a rough measure, particularly if a site is being assessed for influence as part of a basket of measures, but it is not the be-all and end-all.

Further reading