Gary Illyes of Google Search Central has shared new details about the inner workings of Googlebot and announced a new home for the crawler IP ranges files. This concrete information can directly affect how your site is crawled and indexed.
Key Points:
- Googlebot is not a single robot: it relies on a central infrastructure shared by dozens of Google services (Shopping, AdSense, etc.).
- Googlebot only downloads the first 2 MB of an HTML page (PDFs have a separate, higher limit): anything beyond this threshold is not retrieved, not processed, and not indexed.
- The IP ranges of Google crawlers are relocating: a transition to /crawling/ipranges/ is required within the next 6 months.
- The order of elements in your HTML matters: critical tags should be placed as high as possible in the code.
Googlebot Has Never Been a Single Robot
This is one of the most stubborn myths in SEO. In the 2000s, Google had only one product and therefore a single crawler, which kept the name "Googlebot." Today, however, Googlebot is just one of many clients of a shared, central crawling infrastructure.
When you see "Googlebot" in your server logs, you are only observing Google Search traffic. Many other services, such as Google Shopping or AdSense, use this same infrastructure under different crawler names. A list of the main crawlers is documented on the Google Crawling Infrastructure website.
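If you want to see which of these crawlers actually visit your site, a simple approach is to count user-agent tokens in your access logs. The Python sketch below is a minimal illustration: the log path is a placeholder and the short token list is an assumption, not an exhaustive inventory of Google's crawlers.

```python
from collections import Counter

# Non-exhaustive selection of user-agent tokens for crawlers that share
# Google's crawling infrastructure (check Google's crawler documentation
# for the authoritative list).
GOOGLE_CRAWLER_TOKENS = (
    "Googlebot",             # Google Search
    "Storebot-Google",       # Google Shopping
    "Mediapartners-Google",  # AdSense
    "AdsBot-Google",         # Ads landing-page checks
)

def count_google_crawler_hits(log_path: str) -> Counter:
    """Count requests per Google crawler token in an access log."""
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            for token in GOOGLE_CRAWLER_TOKENS:
                if token in line:
                    counts[token] += 1
                    break
    return counts

# "access.log" is a placeholder path for your web server's access log.
for token, hits in count_google_crawler_hits("access.log").most_common():
    print(f"{token}: {hits} requests")
```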
2 MB Limit: Understanding What Google Really Downloads
This topic, raised by Google a few weeks ago, is the most technical and likely the most critical point for webmasters. Googlebot only downloads the first 2 MB of each HTML URL, including HTTP headers. For PDFs, this limit is set at 64 MB. For crawlers that do not specify a limit, the default value is 15 MB.
The realities are:
- Downloading is cut off at 2 MB. Googlebot does not reject the page; it simply stops the download exactly at the 2 MB threshold. The retrieved portion is then sent to the indexing systems and Web Rendering Service (WRS) as if it were the complete file.
- Anything beyond this is invisible. Bytes beyond this threshold are not retrieved, processed, or indexed. For Googlebot, they simply do not exist.
- Related resources are retrieved separately. Every resource referenced in the HTML (excluding media, fonts, and some exotic files) is downloaded by WRS with its own byte counter, independently of the main page.
For most sites, 2 MB of HTML is a generous budget. However, certain practices can cause issues: base64-encoded images inlined directly in the HTML, large blocks of inline CSS or JavaScript, or oversized menus placed at the top of the code. If these elements push your text content or structured data past the threshold, Googlebot will never see them.
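To get a rough sense of whether a page approaches that budget, you can measure the weight of the response yourself. The Python sketch below is an approximation under stated assumptions: the exact way Google counts HTTP headers toward the 2 MB limit is not documented in detail, the header-size estimate is simplistic, and the URL is a placeholder.

```python
import requests  # third-party HTTP client, assumed to be installed

TWO_MB = 2 * 1024 * 1024  # the reported limit, including HTTP headers

def check_html_weight(url: str) -> None:
    """Estimate how much of a page fits under a 2 MB download budget."""
    response = requests.get(url, timeout=30)
    # Rough header size: "Name: value\r\n" per header line.
    header_bytes = sum(len(name) + len(value) + 4
                       for name, value in response.headers.items())
    body_bytes = len(response.content)
    total = header_bytes + body_bytes
    print(f"{url}: ~{total / 1024:.0f} KiB "
          f"(headers ~{header_bytes} B, body {body_bytes} B)")
    if total > TWO_MB:
        print(f"  WARNING: ~{total - TWO_MB} bytes would fall past the 2 MB cut-off")

check_html_weight("https://example.com/")  # placeholder URL
```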
Rendering: What the Web Rendering Service Does with These Bytes
Once the bytes are retrieved, WRS takes over. Like a modern browser, it executes JavaScript and applies CSS on the client side and attempts to understand the final state of the page. It also processes XHR requests to better understand the text content and structure of the page, but it does not load images or videos.
Two important points to keep in mind: WRS can only execute code that was actually fetched by the download process, and it operates statelessly. It clears local and session storage with each request, which can affect the interpretation of dynamic elements that depend on JavaScript state.
Best Practices for Optimizing the Crawling of Your Pages
Google provides several directly applicable suggestions:
- Keep your HTML lightweight. Move CSS and JavaScript into separate files. These resources are fetched independently, each with its own 2 MB quota.
- Place your critical elements at the top of the document. Meta tags, titles, canonicals, links, and essential structured data should appear as early as possible in the HTML, which removes the risk of them being cut off past the 2 MB threshold (see the sketch after this list).
- Monitor your server logs. If your response times rise, Google's crawlers automatically reduce their crawl frequency to avoid overloading your infrastructure.
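One way to sanity-check the second point is to look at the byte offset at which your critical tags appear in the raw HTML. The Python sketch below is purely illustrative: the marker list and the URL are assumptions, and matching raw bytes is only a rough proxy for what Google's parser actually sees.

```python
import requests  # third-party HTTP client, assumed to be installed

TWO_MB = 2 * 1024 * 1024

# Illustrative markers for tags you want to see early in the document.
CRITICAL_MARKERS = [
    b"<title",
    b'rel="canonical"',
    b'name="description"',
    b"application/ld+json",
]

def report_tag_offsets(url: str) -> None:
    """Print the byte offset of each critical marker in the raw HTML."""
    html = requests.get(url, timeout=30).content
    for marker in CRITICAL_MARKERS:
        offset = html.find(marker)
        if offset == -1:
            print(f"{marker.decode()}: not found")
        else:
            status = "OK" if offset < TWO_MB else "PAST THE 2 MB CUT-OFF"
            print(f"{marker.decode()}: byte {offset} ({status})")

report_tag_offsets("https://example.com/")  # placeholder URL
```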
Google indicates that this 2 MB limit is not fixed and will evolve as the web develops.
Relocation of Crawlers' IP Ranges
At the same time, Google announces that the JSON files listing the IP ranges of its crawlers are relocating. These files, previously available at /search/apis/ipranges/ on developers.google.com, are moving to a more general location: developers.google.com/crawling/ipranges/.
This change reflects a point made above: these IP ranges do not relate solely to Googlebot Search. The old path will remain accessible during the transition period, but Google plans to redirect it and then retire it within 6 months. The official documentation has already been updated to point to the new location.
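If you verify crawler traffic by IP address, any script that fetches these JSON files should be pointed at the new path. The Python sketch below assumes a googlebot.json file is published at the new location with the same prefixes structure as the current files; the exact filename is an assumption, so adjust it to whatever Google actually publishes.

```python
import ipaddress
import json
import urllib.request

# Assumed location and filename after the move; verify against Google's
# documentation before relying on it.
RANGES_URL = "https://developers.google.com/crawling/ipranges/googlebot.json"

def is_google_crawler_ip(ip: str) -> bool:
    """Check an IP address against Google's published crawler IP ranges."""
    with urllib.request.urlopen(RANGES_URL) as response:
        data = json.load(response)
    address = ipaddress.ip_address(ip)
    for prefix in data.get("prefixes", []):
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        if cidr and address in ipaddress.ip_network(cidr):
            return True
    return False

print(is_google_crawler_ip("66.249.66.1"))  # example address, not guaranteed to match
```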