Renée Ridgway – From Page Rank to RankBrain

One might ponder: is searching only about finding things one already knows to search for, because one knows about the existence of such things? Take, for instance, the recent referendum of June 23, 2016, when the UK voted to leave the EU[1]. Given the 52-48 result, one could argue that the voice of the people expressed what it really wanted and that a well-informed public went to the polls. Like many users who frequently employ search engines for information regarding businesses, medical advice or their own rankings, people used Google Search to find answers to their questions. However, the search terms ‘What does it mean to leave the EU?’ and ‘What is the EU?’ spiked after the polls had closed. It then became apparent that people were wondering what they had actually just voted for, if they had voted at all. These queries were measured by Google Trends, a so-called ‘public web facility’ of Google, Inc.,[2] which is based on Google Search data and reflects how often a keyword, or search term, is entered in the search box around the world.[3]

[Image: Top questions on the European Union, Google Trends]

In an era of ‘big data’, conclusions are often based on correlations, yet closer scrutiny of the data is needed for interpretation. The trawl of big data includes not only the queries made by users but also the search results they are given. “As these algorithms nestle into people’s daily lives and mundane information practices, users shape and rearticulate the algorithms they encounter; and algorithms impinge on how people seek information, how they perceive and think about the contours of knowledge, and how they understand themselves in and through public discourse” (Gillespie 183). This reciprocal relationship between human and machine was already mapped out by Introna and Nissenbaum in their seminal text ‘Shaping the Web: Why the Politics of Search Engines Matters’. Written at the dawn of the development of ‘gateway platforms’ for the internet, one of their key statements concerns access, for “those with something to say and offer, as well as those wishing to hear and find” (Introna and Nissenbaum 169-85). What has become clear is that corporations gather user data, yet the filtering or ‘curation’ process is not transparent.[4]

Whereas early net programmers and users, with their ‘bulletin board’ postings, chat rooms and networks of the 1990s, envisioned a ‘digital democracy’, by the early 2000s political discourse was already being censored as it emerged. Matthew Hindman’s book The Myth of Digital Democracy (2009) elucidates how political information is filtered through ‘Googlearchy’[5], and how ‘deliberative democracy’ has been inhibited by internet technologies and infrastructure itself, including “the social, economic, political and even cognitive processes that enable it” (Hindman 130). Corporations have now become complicit in this censoring, blocking the plurality of discourses as they collate users’ data. Whereas Silicon Valley companies and their ‘liberal’ approach once took a defensive posture toward state interference, nowadays they willingly hand over users’ data to the secret services of various nations, becoming actors in what is presently called ‘surveillance capitalism’ (Zuboff).[6] Platforms such as Google intervene (Gillespie) with services such as ‘AdWords’,[7] serving up ads that influence the user’s experience and detour their path to information. It is this type of ‘curation’ that I will elucidate in the following essay by looking specifically at the search algorithms responsible for such filtering of knowledge and at their potential consequences.

[Image: ‘Move to Gibraltar’ searches, Google Trends]

The rise of Page Rank

The concept of Page Rank has its basis in the Science Citation Index (SCI), a form of academic hierarchy that has now been grafted, as a conceptual paradigm, onto the way we find information and the way that information is prioritised for us, designed by a monopoly, a corporation called Google a.k.a. Alphabet. It is not surprising, then, that the present CEO of Alphabet, Larry Page, and its President, Sergey Brin, were two graduate students at Stanford who drew upon the SCI by recognising that hyperlinked structures of citations show how an article is valued by other authors. The eponymous Page Rank algorithm was developed in 1998 and is basically a popularity contest based on votes: a link coming from a node with a high rank has more value than a link coming from a node with a low rank. (Strictly speaking, the division into two scores per page, an ‘authority’ value estimating the worth of a page’s content and a ‘hub’ value estimating the worth of its links to other pages, belongs to the related HITS link-analysis scheme; PageRank condenses the link structure into a single score per page.)

“When Google developed PageRank, factoring in incoming links to a page as evidence of its value, it built in a different logic: a page with many incoming links, from high-quality sites, is seen as ‘ratified’ by other users, and is more likely to be relevant to this user as well” (Gillespie 178). The more important or worthwhile websites are likely to receive more links from other websites, and this cycle repeats itself because popular sites link to other popular sites. “The hyperlink as a key natively digital object is considered to be the fabric of the web and in this role has the capacity to create relations, constitute networks and organize and rank content” (Helmond). These connective hyperlinks are used for navigating the web, and the ‘algorithmization of the hyperlink’ turns a navigational object into an analytical device that ‘automatically submits and retrieves data’ (Helmond).[8]
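To make the ‘popularity contest’ concrete, here is a minimal sketch of the textbook PageRank calculation as iterative score-passing over a tiny, invented link graph. The damping factor of 0.85 follows Brin and Page’s paper; the four example sites, the iteration count and the output format are illustrative assumptions, not Google’s actual configuration.

```python
# Minimal PageRank sketch: each page's score is shared out along its outgoing
# links, so a link from a high-ranked page is worth more than one from a
# low-ranked page. The four-page "web" below is hypothetical.
links = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
    "d.com": ["c.com"],
}

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start with an even split
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                       # dangling page: share evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += damping * share
        rank = new_rank
    return rank

# c.com receives the most 'votes' in this toy graph and ends up ranked highest.
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```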

Secret recipes

Presently, ‘keyword search’ is still how Google Search organises the internet, by crawling and indexing,[9] determining the importance of a website based on the words it contains, how often other sites link to it, and dozens of other measures. “The process by which an index is established, and the attributes that are tracked, make up a large part of the ‘secret recipes’ of the various search engines” (Halavais 18). With Google Search the emphasis is on keeping the user’s attention and having them click, effortlessly, on the higher rankings. However, as Gillespie points out, the exact workings are opaque and vary for different users: “the criteria and code of algorithms are generally obscured—but not equally or from everyone” (Gillespie 185). Based on their history, location and search terms, the searcher is ‘personalised’ through a set of criteria.[10] Not only are the creators of web-page content kept in check by search engines; the tracking of different factors, or signals, determines the ranking of an individual page. Mostly through reverse engineering, a whole ‘Search Engine Optimisation’ (SEO) industry has developed around ‘gaming’ the algorithm to figure out its recipe or signals. These “search engine optimizers have identified their own set of signals that seem to affect search engines directly” (Fishkin & Pollard 2007 qtd. in Halavais 83).
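As a rough illustration of what ‘crawling and indexing’ produces, the sketch below builds a toy inverted index, the data structure that maps each word to the pages containing it, so that a keyword query can be answered without rescanning every document. The sample pages are invented for the example, and a real index tracks far more attributes than mere word presence.

```python
from collections import defaultdict

# Toy corpus standing in for crawled pages (hypothetical URLs and content).
pages = {
    "news.example/eu":    "what is the eu and what does it mean to leave",
    "travel.example/gib": "move to gibraltar guide",
    "blog.example/vote":  "what does it mean to leave the eu referendum",
}

# Inverted index: word -> set of pages containing that word.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

def search(query):
    """Return pages containing every word of the query (boolean AND)."""
    results = set(pages)
    for word in query.lower().split():
        results &= index.get(word, set())
    return results

print(search("leave the eu"))   # matches both EU-related pages, not the Gibraltar one
```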

[Image: Periodic table of SEO, 2015]

Signals

During the past 18 years, Google has constantly tweaked its proprietary algorithm, which contains around 200 ingredients, or ‘signals’, in the recipe.[11] “Signals are typically factors that are tied to content, such as the words on a page, the links pointing at a page, whether a page is on a secure server and so on. They can also be tied to a user, such as where a searcher is located or their search and browsing history.”[12] Links, content, keyword density, words in bold, duplicate content, domain registration duration and outbound link quality are some other examples of such factors, or ‘clues’. One major change came in 2010 with the ‘Caffeine’ update, which improved the gathering of information, or indexing, rather than the sorting. Described as a change to the indexing architecture, this new web ecosystem allows content to be searched immediately after it is crawled, providing an index that is 50% fresher. ‘Panda’, an update implemented in 2011, downranks sites considered to be of lower quality, enabling higher-quality pages to rise. In April 2012 Google launched the ‘Penguin’ update, which attempts to catch sites that are ‘spamming’, e.g. buying or obtaining links through networks in order to boost their Google rankings. It now devalues the spam itself instead of demoting (adjusting the rank of) the entire site and, as of September 30, 2016, updates in real time as part of the core algorithm.[13]
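Since the actual recipe is secret, the following is a purely illustrative sketch of how a handful of such ‘signals’ might be weighted into a single score for one page. The signal names, values and weights are invented for the example and do not reflect Google’s real factors or their relative importance.

```python
# Hypothetical per-page signals, normalised to rough 0..1 values (all invented).
signals = {
    "incoming_link_score": 0.72,   # derived from links pointing at the page
    "content_relevance":   0.64,   # how well the page's words match the query
    "is_https":            1.0,    # page served from a secure server
    "freshness":           0.30,   # how recently the page was (re)crawled
    "spam_signal":         0.05,   # Penguin-style evidence of bad link networks
}

# Invented weights: positive signals raise the score, spam lowers it.
weights = {
    "incoming_link_score": 0.40,
    "content_relevance":   0.35,
    "is_https":            0.05,
    "freshness":           0.20,
    "spam_signal":        -0.50,
}

score = sum(weights[name] * value for name, value in signals.items())
print(f"combined ranking score: {score:.3f}")
```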

Analogous to the components of an engine that has had its parts replaced, where Penguin and Panda might be the oil filter and fuel pump respectively, the launch of ‘Hummingbird’ in August 2013 was Google’s largest overhaul since 2001. With the introduction of a brand-new engine the emphasis has shifted to context: it is now less about the keyword and more about the intention behind it, and the semantic capabilities are what are at stake. Whereas previously certain keywords were the focus, now it is about the other words in the sentence and their meaning. The complexity of queries has gone up, resulting in an improvement in the indexing of web documents. Within this field of ‘semantic search’, the ‘relationality linking search queries and web documents’[14] is reflected in the ‘Knowledge Graph’[15], along with ‘conversational search’ that incorporates voice-activated enquiries.

[Image: ‘It’s merely postmodern semiotics applied to search’]

If Hummingbird is the new Google engine from 2013, the latest replacement part is ‘RankBrain’. Launched around early 2015, it ostensibly ‘interprets’ what people are searching for, even when they have not entered the exact keywords. RankBrain is rumoured to be the third most important signal, after links and content (words), and infers the intent behind a keyword by applying synonyms or stemming lists.[16] Users’ queries have also changed: they are now not only keywords but multi-word phrases and sentences that could be deemed ‘long-tail’ queries. These need to be translated, to a certain extent, from the ‘ambiguous to the specific’ or the ‘uncommon to the common’ in order to be processed and analysed.[17] This reciprocal adaptability between users and interface has been verified by previous research. It is therefore probable that Google assigns these complex queries to groups of users with similar interests in order to ‘collaboratively filter’ them.[18]
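RankBrain itself is proprietary, but the footnoted ideas of stemming and synonym expansion can be sketched in a few lines: an ambiguous, long-tail query is reduced to word stems and broadened with synonyms before being matched against documents. Everything below (the crude suffix-stripping stemmer, the tiny synonym table, the sample query) is an assumption for illustration only and is far simpler than what Google actually does.

```python
# Crude suffix-stripping stemmer (a stand-in for e.g. the Porter stemmer).
def stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Tiny hand-made synonym table, keyed on stems (hypothetical entries).
synonyms = {
    "leav": {"exit", "depart"},   # stemmed form of leave / leaving / leaves
    "exit": {"leav", "depart"},
}

def expand_query(query):
    """Translate a long-tail query into a set of stemmed, expanded terms."""
    terms = set()
    for word in query.lower().split():
        base = stem(word)
        terms.add(base)
        terms |= {stem(s) for s in synonyms.get(base, set())}
    return terms

# Stems 'leaving' to 'leav' and adds its listed synonyms 'exit' and 'depart'.
print(expand_query("what does leaving the EU mean"))
```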

Bias

“A number of commentators (e.g. Wiggins 2003) have become concerned with the potential for bias in Google’s secret ranking algorithm” (Halavais 77).

Ironically, awareness of bias started with the creators themselves. Upon reading Brin and Page’s seminal text (1998) one arrives at Appendix A, ‘Advertising and Mixed Motives’, and discovers, almost as an afterthought, a note about advertising and search engines. It is incredibly revealing, because they state that advertising-driven search engines are “inherently biased towards the advertisers and away from the needs of the consumers” (Brin and Page). They cite The Media Monopoly by Ben Bagdikian[19], where the historical experience of the media shows that concentration of ownership leads to imbalances. In turn, Alexander Halavais references both of these citations, while pointing out that their critique could instead have drawn on the writings of Robert McChesney, who described how radio was not commercialised until the RCA (Radio Corporation of America) came along and the federal government changed regulation (Halavais 77). “McChesney suggested that in the 1990s the internet seemed to be following the same pattern, and although the nature of the technology might preclude its complete privatization, the dominance of profit-oriented enterprise could make the construction of an effective public sphere impossible” (Halavais 78). To return to radio: it has only the limited bandwidth of its broadcast spectrum. With the internet, the limited bandwidth is that of the user and her ability to filter the overload of information. A certain power is then assigned to the commercial value of a search engine that delivers ‘relevant’ results, “or ‘better’ results than its provider’s competitors, which posits customer satisfaction over some notion of accuracy” (van Couvering 2007 qtd. in Gillespie 182).

[Image: Google garage]

Machine learning

“Algorithms are not always neutral. They’re built by humans, and used by humans, and our biases rub off on the technology. Code can discriminate.”[20]

In this short essay I have attempted to debunk some of the mythology surrounding Google’s proprietary ‘Page Rank’ “— as the defining feature of the tool, as the central element that made Google stand out above its then competitors, and as a fundamentally democratic computational logic—even as the algorithm was being redesigned to take into account hundreds of other criteria” (Gillespie 180). I have briefly described some of the signals involved in how this algorithm ‘ranks’, based on hyperlinks and their algorithmization, which have become devices for the collation of data that is in turn sold to third parties. “If broadcasters were providing not just content to audiences but also audiences to advertisers (Smythe 2001), digital providers are not just providing information to users but also users to their algorithms. And algorithms are made and remade in every instance of their use because every click, every query, changes the tool incrementally” (Gillespie 173). Online advertisements structure these workings, directing and ‘affecting’ the consumer, prosumer or user even if they do not click on them, since users are already personalised when using Google Search, even when they are not signed into a Google account.

As of June 2016, RankBrain is being applied to every Google Search query, and the SEO industry speculates that it is summarising each page’s content. The murmur is that the algorithm is adapting, or ‘learning’, as it were, from people’s mistakes and from its surroundings. According to Google, the algorithm learns offline: it is fed batches of historical searches, from which it makes predictions. This cycle is constantly repeated and, if the predictions prove correct, the latest version of RankBrain goes live.[21] Previously, computers were not powerful or fast enough, or the data sets were too small, to carry out this type of testing. Nowadays the computation is distributed over many machines, enabling the pace of the research to quicken. This progress in technology facilitates a constellation, or coming together, of different capabilities from various sources, through models and parameters. Eventually the subject, or learner, in this case the algorithm, is able to predict through repetition. Where is the human curator in all of this? “There is a case to be made that the working logics of these algorithms not only shape user practices, but also lead users to internalize their norms and priorities” (Gillespie 187). The question, then, is to what extent humans adapt to algorithms in this filtering or curation process, how much algorithms affect human learning, and whether not only discrimination but also agency can be contagious.[22]
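Google has described the offline learning cycle above only in broad strokes, but it can be caricatured as a loop: train a candidate model on batches of historical queries, test its predictions, and promote it to the live system only if it does better than the current version. The data, the scoring function and the threshold below are hypothetical placeholders, not a description of Google’s pipeline.

```python
import random

def train_candidate(historical_batches):
    """Stand-in for offline training on batched historical searches."""
    # A real system would fit a model here; we just return a dummy "model".
    return {"trained_on": len(historical_batches)}

def offline_accuracy(model, held_out_queries):
    """Stand-in evaluation: how well the candidate 'predicts' on held-out queries."""
    return random.random()   # placeholder metric, not a real measurement

live_model = {"trained_on": 0}
live_accuracy = 0.6          # assumed score of the currently deployed version

# The cycle is repeated constantly: retrain offline, test, deploy only if better.
for batch_round in range(5):
    historical_batches = [f"query batch {i}" for i in range(10)]
    candidate = train_candidate(historical_batches)
    score = offline_accuracy(candidate, held_out_queries=["what is the eu"])
    if score > live_accuracy:
        live_model, live_accuracy = candidate, score
        print(f"round {batch_round}: new version deployed (score {score:.2f})")
    else:
        print(f"round {batch_round}: keeping the current version")
```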

 

Works Cited

Feuz, Martin, Matthew Fuller, and Felix Stalder. “Personal Web Searching in the Age of Semantic Capitalism: Diagnosing the Mechanics of Personalisation”. First Monday, Volume 16, Number 2, 7 February 2011. Web. http://firstmonday.org/article/view/3344/2766

Fishkin, R. and J. Pollard. “Search Engine Ranking Factors Version 2.” SEOMoz.org, 2 April 2007. Web. http://www.seomoz.org/article/search-ranking-factors.

Gesenhues, Amy. “Google’s Hummingbird Takes Flight: SEOs Give Insight On Google’s New Algorithm”. Search Engine Land. 2013. Web. http://searchengineland.com/hummingbird-has-the-industry-flapping-its-wings-in-excitement-reactions-from-seo-experts-on-googles-new-algorithm-173030

Gillespie, Tarleton. “The Relevance of Algorithms”. Media Technologies, ed. Tarleton Gillespie, Pablo Boczkowski, and Kirsten Foot. Cambridge, MA: MIT Press, 2014, pp. 167-193. Print.

Gillespie, Tarleton. “Platforms Intervene”. Social Media + Society, April-June 2015, pp. 1–2. Sage Publishers. Print.

Halavais, Alexander. Search Engine Society. Cambridge: Polity, 2008. Print.

Helmond, Anne. “The Algorithmization of the Hyperlink.” Computational Culture 3(3). 2013.

Hindman, Matthew. The Myth of Digital Democracy. Princeton: Princeton University Press, 2009. Print.

Introna, Lucas D. and Nissenbaum, Helen. “Shaping the Web: Why the Politics of Search Engines Matters”. The Information Society, 2000, 16:169–185. Taylor & Francis. Print.

Brin, Sergey and Page, Lawrence. “The Anatomy of a Large-Scale Hypertextual Web Search Engine”. 1998. Web. http://infolab.stanford.edu/~backrub/google.html

Pariser, Eli. The Filter Bubble. New York: Penguin Books, 2012. Print.

Selyukh, Alina. “After Brexit Vote, Britain Asks Google: ‘What Is The EU?’” NPR. 2016. Web. http://www.npr.org/sections/alltechconsidered/2016/06/24/480949383/britains-google-searches-for-what-is-the-eu-spike-after-brexit-vote

Sullivan, Danny. “Dear Bing, We Have 10,000 Ranking Signals To Your 1,000. Love, Google.” Search Engine Land. 2010. Web. http://searchengineland.com/bing-10000-ranking-signals-google-55473

Sullivan, Danny. “FAQ: All about the Google RankBrain algorithm.” Search Engine Land. 2016. Web. http://searchengineland.com/faq-all-about-the-new-google-rankbrain-algorithm-234440

Schwartz, Barry. “Google Penguin doesn’t penalize for bad links – or does it?” Search Engine Land. 2016. Web. http://searchengineland.com/google-penguin-doesnt-penalize-bad-links-259981

Turk, Victoria. “When Algorithms are sexist”. Motherboard. 2015. Web. http://motherboard.vice.com/en_uk/read/when-algorithms-are-sexist

Wikipedia: https://en.wikipedia.org/wiki/Google

Wikipedia: https://en.wikipedia.org/wiki/Knowledge_Graph

Wikipedia: https://en.wikipedia.org/wiki/Stemming

Zuboff, Shoshana. “The Secrets of Surveillance Capitalism”. Frankfurter Allgemeine Zeitung. 2016. Web. http://www.faz.net/aktuell/feuilleton/debatten/the-digital-debate/shoshana-zuboff-secrets-of-surveillance-capitalism-14103616.html

 

[1] https://www.google.com/trends/story/GB_cu_EoBj9FIBAAAj9M_en

[2] Google is now the ‘leading subsidiary’ of the company Alphabet, Inc. as well as the ‘parent for Google’s internet interests’. https://en.wikipedia.org/wiki/Google

[3] Interestingly enough, on June 23 at 23:54 GMT after polls had closed and predictions of the outcome surfaced in the media, Londoners searching for ‘move to Gibraltar’ spiked heavily (+680%). http://www.npr.org/sections/alltechconsidered/2016/06/24/480949383/britains-google-searches-for-what-is-the-eu-spike-after-brexit-vote

[4] Eli Pariser has deemed this ‘The Filter Bubble’, which I address in more detail in my PhD.

[5] Those most heavily linked ‘rule’, in other words.

[6] http://www.faz.net/aktuell/feuilleton/debatten/the-digital-debate/shoshana-zuboff-secrets-of-surveillance-capitalism-14103616.html

[7] A complete description of AdWords is beyond the scope of this essay. AdWords is an online advertising system that enables competition between bidders based on keywords, or search terms, and on cookies, in order to display ads on certain webpages; advertisers pay when users click on the ads. It is Google’s main source of revenue, which is why Google is arguably an advertising company rather than a search engine.

[8] Ranking algorithms reduce social relations to a specific dimension of commercialisation, the placing of a reference, a hyperlink, which is modern capitalism’s current form of socialisation, networking and quite possibly the most sought after currency of the internet.

[9] Since 2013, Google.com has been the most visited website in the world, according to Alexa. “Google processes over 40,000 search queries every second which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide.” In 1999, it took Google one month to crawl and build an index of about 50 million pages. In 2012, the same task was accomplished in less than one minute. 16% to 20% of the queries asked every day have never been asked before. Every query has to travel on average 1,500 miles to a data centre and back to return the answer to the user. A single Google query uses 1,000 computers in 0.2 seconds to retrieve an answer. http://www.internetlivestats.com/google-search-statistics/.

[10] No space here to elaborate but will explain ‘personalisation’ in Chapter 3 of my thesis, or see here: http://www.aprja.net/?p=2531

[11] Google usually says that it has around 200 major ranking signals, yet there has been discussion of 1,000 or even 10,000 sub-signals. http://searchengineland.com/bing-10000-ranking-signals-google-55473

[12] http://searchengineland.com/faq-all-about-the-new-google-rankbrain-algorithm-234440

[13] “Some sites want to do this because they’ve purchased links, a violation of Google’s policies, and may suffer a penalty if they can’t get the links removed. Other sites may want to remove links gained from participating in bad link networks or for other reasons.” http://searchengineland.com/google-penguin-doesnt-penalize-bad-links-259981

[14] According to David Amerland, author of Google Semantic Search. http://searchengineland.com/hummingbird-has-the-industry-flapping-its-wings-in-excitement-reactions-from-seo-experts-on-googles-new-algorithm-173030

[15] Knowledge Graph was launched in 2012 and adds ‘semantic search’ information to search results so that users do not need to query further. However, this has led to a decrease in page views on Wikipedia in different languages. https://en.wikipedia.org/wiki/Knowledge_Graph

[16] In regard to information retrieval, ‘stemming’ is when words are reduced to their ‘stem’ or root form. “Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation”. https://en.wikipedia.org/wiki/Stemming

[17] http://searchengineland.com/faq-all-about-the-new-google-rankbrain-algorithm-234440

[18] http://firstmonday.org/article/view/3344/2766

[19] Bagdikian later published updated and revised editions as The New Media Monopoly, which subsequently became part of the ‘Amazon Noir’, a.k.a. ‘Pirates of the Amazon’, art project: http://www.amazon-noir.com/index0000.html

[20] Victoria Turk. http://motherboard.vice.com/en_uk/read/when-algorithms-are-sexist

[21] http://searchengineland.com/faq-all-about-the-new-google-rankbrain-algorithm-234440

[22] During the writing of my PhD I have used Google Search for my research and have allowed myself to be personalised on my Apple computer, without installing plugins, etc., that would attempt to prevent it.

 


4 Comments

  1. I know this is implicit in your text, but I think it’s interesting to think about how the tools used to rank web pages might be similar or different to the ones that rank, or categorise, the user. With personalisation, what a search engine is doing is using the search term to link its summary of a web page with a summary of the user. The summaries of both are then altered by how the results are responded to. If this process gets good enough, you won’t need the search term at all (google will give you the answer before you ask the question; I think google maps already does this to some extent), but then you have algorithms acting upon each other, rather than on the user.


    1. hi John,
      thanks for this. If you know any way to make this process more transparent… please share 😉 What I do know, or can draw from other research, is that personalisation is not really what it’s touted to be. You are assigned to a group that is similar to you. Is there then one singular unique user? Maybe already, but then algos are acting upon how we relate to user groups?


  2. Thanks very much for your paper, a strong summary of an area I kind of assume I know enough about – but I had no idea such a seismic update was made as recently as this summer.

    It’s also interesting that RankBrain is a kind of cousin of the citations measure of quality employed by the research excellence framework in academia – and I wonder if it would be a neat move to complete the circle somehow (as you suggest in your final footnote) as to the way that academia itself is now being re-visited by its own invention. Are there “citers of quality” for example?

    Like John, I’m also interested in this warping of cause-effect where machines anticipate what we need. It’s strange, because I’m inherently opposed to this kind of personalisation on political grounds, equality of access essentially, but I notice that I’m frustrated by the inefficiency of information from non-google search tools, such as Discover, or even an open source engine like DuckDuckGo. Some examples of the detrimental qualities of this bias-as-prediction, and weighing them against the ‘convenience factor’, would be interesting to me.


    1. thanks for your very keen comments. Yes, I have an ‘academic’ critique of completing the circle within the PhD, only not in this text I am afraid.

