Google’s Recent Black Eye: Pushing Bad Data

In September of 2005, following a “measurement contest” with rival Yahoo, Google stopped counting, or at least publicly displaying, the number of pages it had indexed. That number had reached approximately 8 billion pages before it was removed from the homepage. Recently, various SEO forums reported that Google had added a few billion more pages to its index within the space of a few weeks. This may sound like cause for celebration, but the “accomplishment” would not reflect well on the search engine that achieved it.

People were buzzing about the nature of these fresh new billions of pages. They were blatant spam, containing Pay-Per-Click (PPC) advertisements and scraped content, and they were frequently ranking highly in search results, displacing much older, more established sites in the process. A Google representative responded to the issue on the forums by labeling it a “bad data push,” a phrase that elicited groans from the SEO community.

How did someone fool Google into indexing so many spam pages in such a short amount of time? I’ll provide an overview of the procedure, but you shouldn’t get too excited. You will not be able to do it yourself after reading this article, just as a diagram of a nuclear explosive will not teach you how to make the real thing. Nonetheless, it makes for an intriguing story, one that exemplifies the ever-increasing ugliness of the most popular search engine in the world.

A Dark and Stormy Night
Our tale begins deep in the heart of Moldova, nestled between the picturesque countries of Romania and Ukraine. An enterprising local, in between fending off vampire attacks, had a brilliant idea and ran with it, presumably away from the vampires… His plan was to exploit the way Google handles subdomains, and not in a small way, but in a big way.

The heart of the problem is that Google currently treats subdomains much as it treats full domains: as unique entities. This means it will add the homepage of a subdomain to the index and return at a later time to perform a “deep crawl.” Deep crawls are simply the spider following links from the domain’s homepage deeper into the site until it finds everything, or gives up and comes back later for more.

A subdomain is, in brief, a “third-level domain.” You have probably seen them before; they look like subdomain.domain.com. The English version of Wikipedia, for example, lives at “en.wikipedia.org,” while the Dutch version lives at “nl.wikipedia.org.” Subdomains are one way of organizing large websites, as an alternative to multiple directories or even separate domain names.
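To make the structure explicit, here is how one of those Wikipedia hostnames breaks apart (standard DNS terminology, shown purely for illustration):

```
nl . wikipedia . org
│    │           └─ top-level domain
│    └─ second-level domain (the “full domain”)
└─ third-level label (the subdomain)
```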

So here we have a type of page that Google will index with essentially no questions asked. It is surprising no one took advantage of the situation sooner; some commentators speculate that this “quirk” was introduced with the most recent “Big Daddy” update. Our friend from Eastern Europe gathered servers, content scrapers, spambots, PPC accounts, and some essential, highly inspired scripts, and combined them as follows…

Five Billion Served So Far…
First, our hero created scripts for his servers that, whenever GoogleBot dropped by, would generate an almost infinite number of subdomains, each with a single page containing keyword-rich scraped content, keyword-optimized links, and PPC advertisements for those keywords. Spambots were then dispatched to tens of thousands of blogs around the world, spamming referral links and comments to put GoogleBot on the scent. The spambots did not need to do much to set the stage for the dominoes to fall.

GoogleBot picks up the spammy links and, true to its mission in life, follows them into the network. Once GoogleBot is drawn into the web, the scripts running the servers simply keep generating pages, each with a unique subdomain, all stuffed with keywords, scraped content, and PPC advertisements. These pages get indexed, and in less than three weeks the Google index swells by 3-5 billion pages.
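The mechanics are simple enough to illustrate. Below is a minimal, purely illustrative sketch in Python of the general wildcard-subdomain trick described above: a catch-all DNS record sends every subdomain to one server, and a single handler builds a “unique” keyword page out of whatever hostname the crawler asks for. The hostnames, page template, and port here are invented for the example; this is a toy, not the spammer’s actual setup, which reportedly only served these pages to GoogleBot and filled them with scraped text and live ad code rather than placeholder filler.

```python
# Toy illustration only: a catch-all web server that fabricates one keyword page
# per subdomain. A wildcard DNS record (*.example.com) would point every
# subdomain at this server; the hostname itself supplies the "keyword."
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE_TEMPLATE = """<html><head><title>{kw}</title></head>
<body><h1>{kw}</h1><p>Scraped filler text about {kw} would go here,
along with keyword-optimized links and the PPC ad code.</p></body></html>"""

class CatchAllHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The requested subdomain (e.g. "cheap-widgets.example.com") arrives in
        # the Host header; its leftmost label doubles as the page's keyword.
        host = self.headers.get("Host", "unknown.example.com")
        keyword = host.split(":")[0].split(".")[0].replace("-", " ")
        body = PAGE_TEMPLATE.format(kw=keyword).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), CatchAllHandler).serve_forever()
```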

Initial reports indicate that the PPC advertisements on these pages came from AdSense, Google’s own ad program. The ultimate irony is that Google stands to profit from all the ad clicks charged to the advertisers whose ads appear on these billions of spam pages. That, in the end, was the whole point of the exercise: flood the index with so many pages that, by sheer force of numbers, people would stumble onto them and click the advertisements, generating a tidy profit for the spammer in a very short period of time.

Billions or Millions? What is Broken?
News of this accomplishment spread like wildfire from the DigitalPoint forums throughout the rest of the SEO community. The “general public” is, for now, uninformed and will likely remain so. A Google engineer responded in a Threadwatch thread on the subject, calling it a “bad data push.” In essence, the company’s position was that it had not actually added 5 billion pages. Later statements included assurances that the problem would be resolved algorithmically. Those monitoring the situation (by tracking the known domains the spammer was using) can only see that Google is removing them from the index manually.

The tracking is done with the “site:” command. In theory, this command displays the total number of indexed pages for the site specified after the colon. Google has already acknowledged that there are issues with this command, and “5 billion pages” appears to be another symptom. These issues extend beyond the site: command to the result counts displayed for many queries, which some consider grossly inaccurate and which in some cases fluctuate erratically. Google admits it has indexed some of these spammy subdomains, but so far it has not offered alternative numbers to dispute the 3-5 billion the site: command initially displayed.
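For readers who have not used it, the operator looks like this (example.com is just a placeholder domain, and the counts it returns are exactly the estimates in dispute):

```
site:example.com         pages indexed under example.com and all of its subdomains
site:spam.example.com    pages indexed under that single subdomain only
```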

The number of indexed spammy domains and subdomains has steadily decreased over the past week as Google employees manually remove the listings. There is no official confirmation that the “loophole” has been closed. This presents the obvious problem that, since the method has been revealed, a number of imitators will rush to cash in before the algorithm can be modified to account for it.

Conclusions
There are at least two defects on display here: the site: command, and the obscure, tiny portion of the algorithm that allowed billions (or at least millions) of spam subdomains into the index. Google’s first priority should probably be to close the loophole before the imitators pile in. Advertisers who see a smaller return on their advertising budgets this month may find the issues surrounding the use, or abuse, of AdSense equally troubling.

Do we maintain “faith” in Google despite these events? Most likely, yes. It is not so much whether they deserve that faith, but rather that the majority of people will never learn about this. Even days after the story broke, the “mainstream” media has barely mentioned it. Some tech sites have mentioned it, but this is not the type of story that will make the evening news, mainly because the background knowledge required to comprehend it is beyond the capabilities of the average citizen. The story will likely become an interesting footnote in “SEO History,” the most esoteric and neoteric of worlds.
