What Is the Google Index and How Does the Indexing Process Work?
The Google index refers to the enormous database of billions of web pages that Google’s web crawler, Googlebot, has discovered and stored across thousands of machines.
Think of it as a dynamic internet library that’s continually updated: new and relevant content is added, while poor-quality and outdated content is removed.
When you search on Google, the search engine matches your query against the pages in this index and serves you the most relevant results.
However, not all pages make it into Google’s index.
Googlebot must discover and index a page before it can appear in search results. Unindexed pages are invisible to Google searchers, reachable only by visiting the URL directly.
How Many Indexed Pages Are There?
Google has never publicly disclosed how many pages are in its index, perhaps because the raw number matters little in the grand scheme of search performance.
But we can piece together estimates from several sources, including testimony from Google’s 2023 antitrust trial.
Recall that in 2016, Google updated its How Search Works page to announce that its crawler had discovered 130 trillion unique pages on the internet. That dwarfed Google’s earlier announcements of 30 trillion pages crawled in 2013 and 1 trillion in 2008.
These figures were echoed during day 24 of Google’s 2023 antitrust trial. In the morning session, Pandu Nayak, Google’s VP of Search, testified that trillions of pages exist online and that Google indexes only a portion of them, primarily pages containing information helpful to users.
In the afternoon session, Nayak gave a noncommittal response when asked to confirm whether the index held 400 billion pages in 2020. Since he neither confirmed nor denied it, we can presume 400 billion is a reasonable ballpark figure for the number of Google-indexed pages.
Nayak also said, “… I don’t know in the past three years if there’s been a specific change in the size of the index.” This suggests that the number of indexed pages changes over time, and it may balloon or shrink depending on many factors.
A nine-year longitudinal study published in 2016 confirmed this, finding that estimates of Google’s index size fluctuated wildly. The researchers’ estimates dropped as low as 1.96 billion indexed pages on November 24, 2014, then jumped to 45.7 billion by January 5, 2015, less than three months later.
The point is that Google actively crawls and analyzes trillions of pages to refine its index, aiming to serve users only the best possible pages in search results. Just as importantly, this meticulous indexing process rewards websites that create helpful, reliable, people-first content.
Why Is Google Indexing Important?
One benefit of indexing traces straight back to Google’s mission statement: to organize the world’s information and make it universally accessible and useful.
Google holds a dominant share of roughly 90% of the global search engine market and processes approximately 8.5 billion searches daily.
With countless people depending on Google as their primary gateway to information, the search engine bears a responsibility to make reliable information publicly accessible. Indexing helps Google separate quality content from spam.
From a business standpoint, indexing is the first step in search engine optimization. Only pages in Google’s index can appear and rank in search results. In other words, even with good SEO practices in place, an unindexed page remains invisible in Google Search, giving indexed pages a clear competitive advantage over those left out.
Moreover, Google is one of the best ways for people to discover your site or business. Indexed pages, especially those ranking highly on SERPs, have a far better chance of driving organic traffic to your website.
And since half of all Google searches have local intent (e.g., “businesses near me”), unindexed pages may miss out on a significant number of potential customers. There’s no question that it’s essential to show up in search results.
To understand how to leverage the power of indexing, let’s explore the technical aspects of how Googlebot discovers and indexes your website.
How Does Google Index Your Site?
In a video series called “How Search Works,” Gary Illyes explored the inner workings of Google’s search results. Below, we simplify the complex process into easy-to-understand steps.
Google Search works in three major stages, which can be further subdivided into eight distinct steps. Many web pages don’t make it through every stage. Take a look at each of them below:
Stage I: Crawling
Crawling refers to Google’s process of exploring new and updated pages. It can be divided into three steps:
Step 1: URL discovery
URL discovery is a perpetual process. Google relies on an automated program called Googlebot that traverses the internet from page to page around the clock. In 2020, Google said it discovered more than 25 billion URLs per day.
Googlebot uses an algorithm that regulates its crawling activity:
- Crawl preferences: which sites to crawl
- Crawl frequency: how often to crawl a site
- Crawl depth: how many pages to crawl in a site
Some websites, such as news sites that publish content regularly, are crawled more often than sites that change rarely. Google will also crawl a site more frequently if its server responds quickly to requests, one more reason page speed matters for SEO.
On the other hand, Googlebot may ignore some URLs entirely, especially when a robots.txt file specifies which parts of your website it may crawl and which to avoid.
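To make that concrete, here is a minimal robots.txt sketch; the paths and sitemap URL are hypothetical placeholders, not recommendations for any particular site:

```
# Served from the site root, e.g., https://www.example.com/robots.txt

# Rules for all crawlers: stay out of the hypothetical /admin/ section
User-agent: *
Disallow: /admin/

# An extra rule just for Googlebot: skip internal search result pages
User-agent: Googlebot
Disallow: /search-results/

# Tell crawlers where the sitemap lives (hypothetical URL)
Sitemap: https://www.example.com/sitemap.xml
```

Keep in mind that robots.txt controls crawling, not indexing: a disallowed URL can still end up in Google’s index if other pages link to it.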
Googlebot discovers new pages mainly by following links from pages it has already indexed. Alternatively, site owners can generate a sitemap, a file listing the URLs on their site, and submit it through Google Search Console.
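For reference, a minimal XML sitemap might look like the sketch below; every URL and date is a hypothetical placeholder:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page you want Googlebot to discover -->
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/blog/google-indexing-guide</loc>
    <lastmod>2024-02-01</lastmod>
  </url>
</urlset>
```

Upload the file to your site (commonly at the root, e.g., /sitemap.xml), reference it in robots.txt, and submit it in Search Console under Indexing > Sitemaps.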
Step 2: Fetching
After discovering a URL, Googlebot downloads, or “fetches,” all the data served on that page (a minimal page sketch follows the list), including:
- HTML content: text, meta tags, headings, links, alt text
- CSS styles: the visual presentation of the website
- JavaScript: interactive elements or dynamically loaded content
- Media: images, videos, PDFs, and other files
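To picture what a fetch pulls down, here is a bare-bones page containing all four kinds of data; every path and filename is a hypothetical placeholder:

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Example Article</title>
  <meta name="description" content="A short summary Googlebot can read.">
  <link rel="stylesheet" href="/styles/main.css">            <!-- CSS: visual presentation -->
</head>
<body>
  <h1>Example Article</h1>                                   <!-- HTML content: headings and text -->
  <a href="/related-article">Related reading</a>             <!-- A link Googlebot can follow -->
  <img src="/images/chart.png" alt="Monthly traffic chart">  <!-- Media with alt text -->
  <script src="/scripts/app.js"></script>                    <!-- JavaScript: dynamic behavior -->
</body>
</html>
```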
Step 3: Rendering
Right after fetching the data, Googlebot executes the page’s code in a recent version of Chrome and renders it into a visual representation of the page. Gary Illyes explains it as essentially what browsers do, except that a bot does the browsing instead of a human.
Stage II: Indexing
Contrary to popular belief, indexing is more than a simple yes-or-no decision about whether a page gets stored. It covers a series of processes that can be divided into three distinct steps:
Step 4: Parsing
Parsing involves analyzing text, code, or data to understand its structure and meaning. In the context of indexing, Googlebot breaks down the HTML code into its individual components, such as elements, tags, attributes, and text content.
If Google encounters errors in the markup, it will attempt to fix them automatically. Parsing aims to help Google see how the tags are nested within each other and understand the page’s hierarchical structure.
As an example, Gary Illyes highlights the importance of the <head> element. This section, found at the top of an HTML document, contains the essential metadata (machine-readable information) about the page, such as meta tags, link tags, and style sheets.
If Google finds an unsupported tag inside the <head>, Gary said, it “will forcibly close the element right before the unsupported tag.” Any essential metadata appearing after that point is left outside the <head> element, negatively impacting how Googlebot interprets the page.
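Here is a hypothetical sketch of the failure mode Gary describes; the class name and URL are placeholders:

```html
<head>
  <meta charset="utf-8">
  <title>Example Page</title>
  <!-- Invalid: <div> is not supported inside <head>. The parser may forcibly
       close <head> right here, pushing everything below into <body>. -->
  <div class="promo-banner">Subscribe!</div>
  <meta name="description" content="This description now sits outside the head.">
  <link rel="canonical" href="https://www.example.com/example-page">
</head>
```

Moving the <div> down into <body> keeps the description and canonical link where Googlebot expects to find them.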
Step 5: Canonicalization
After parsing the HTML, Google checks whether the page is a duplicate of another existing page and, if so, determines which of the duplicates should be kept in the index. The page kept in the index is called the “canonical” version.
Google first groups similar pages into a duplicate cluster, then selects the canonical version, the page that best represents the group.
Google determines this in two ways:
- By comparing signals (classified nuggets of information that Google collects about pages) across each version
- By checking for a rel="canonical" element
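For reference, the rel="canonical" hint is a single line in the <head>; the URLs below are hypothetical:

```html
<!-- Placed on https://www.example.com/shoes?color=blue to point Google
     at the preferred version of the page -->
<link rel="canonical" href="https://www.example.com/shoes">
```

It is a hint rather than a directive: Google may still choose a different canonical if other signals point elsewhere.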
On a side note, we suspect that Google decides the canonical version based on which page will likely rank competitively in SERPs.
The canonical page is the one that appears in search results. The other duplicates may still be served as alternate versions in particular contexts, such as when a user searches for a very specific page within that duplicate cluster, though this is rare (and an unsuspecting searcher wouldn’t notice).
Step 6: Index selection
Only once the canonical version of the page is selected does Google decide whether to index it. This process is called index selection, and it depends largely on the page’s quality and all the other previously collected signals.
If Google deems a page ripe for indexing, it stores the canonical page (along with its duplicate cluster) in the Google index.
Stage III: Serving
Did you know that whenever you run a query on Google, you are actually talking directly to the Google index?
As such, how you arrange the words in your query influences how Google replies, that is, how it selects and ranks the most relevant pages in its index. This sequence is called serving, and it can be divided into two steps:
Step 7: Interpretation
Google cleans up and analyzes your search query for certain “entities,” which are real-world objects or concepts with unique identities or meanings.
Consider the search query “What are the symptoms of a common cold.”
Gary explains that Google removes stop words, or words that don’t add value or affect the meaning of a search query.
In this query, removing the stop words “are,” “the,” “of,” and “a” doesn’t change the search intent. This leaves you with the following entities:
- “what” indicates that the intent is to ask or inquire
- “symptoms” refer to the physical signs, effects, or manifestations of something
- “common cold” is a type of medical condition
Depending on the query, Google may keep stop words in the analysis. For example, the idiomatic expression “turn over a new leaf” naturally depends on its stop words. Google may also expand keywords to include synonyms, treating “couch” and “sofa” as equivalent, for instance.
With the help of NLP techniques, like Bidirectional Encoder Representations from Transformers (BERT), Google has become more advanced and sophisticated in dissecting search queries.
Step 8: Ranking
Understanding queries is essential to accurately delivering relevant results to users. Relevance is influenced by many factors, with the content of the page being the most important one.
For general queries, Google ranks pages according to quality signals such as the uniqueness of the content, the relative importance of the website, and its popularity.
The searcher’s device type, location, and language may also affect the results Google serves. For example, queries with local intent will likely show local results, including the map pack and a list of Google Business Profiles.
Make Sure Your Site Remains in Google Search Results
Google’s search index can be erratic. Just because your page or website is indexed now doesn’t mean Google will keep it there.
Even the personal website of Google’s own John Mueller was briefly de-indexed during the March 2024 Core Update, a.k.a. the “March of Madness.”
Therefore, optimizing your web pages and actively monitoring their indexing status on SERPs is essential.
IndexCheckr lets site owners, SEO agencies, and link builders track whether target pages are in Google’s index.
You can configure the tool to run automated checks on your links at whatever interval you choose. If you run into indexing issues, you can submit the affected pages directly to indexer tools for help.