What causes Google to crawl old URLs that aren’t linked from referring pages?

Understanding Google’s Crawling Behavior: Old URLs and Referring Pages

Have you ever wondered why Google continues to crawl outdated URLs, even when there are no direct links to them from referring pages? This is a question I’ve been contemplating after noticing that some old URLs from a subdomain I discontinued in 2021 are still being processed by Google.

Take, for example, the URL:

https://amp.example.com/product/acme-555

This URL now returns a 301 redirect to its updated counterpart:

https://example.com/product/acme-555

I decided to check the old AMP URL using Google Search Console (GSC). Here’s what I discovered:

  • The URL is not indexed by Google.
  • The last crawl occurred on March 3, 2024.
  • The referring page is the updated link: https://example.com/product/acme-555.
  • The sitemap associated with the old AMP subdomain shows a temporary processing error, indicating it may be empty.
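
The redirect itself can also be confirmed outside GSC. Below is a minimal Python sketch, assuming the two example URLs above. The function names and the `redirect-check/0.1` user-agent string are placeholders of my own, not anything Google-specific. It sends a single HEAD request without following redirects, then checks whether the response is a permanent redirect to the expected target.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def fetch_status_and_location(url, timeout=10):
    """Send one HEAD request without following redirects.

    Returns (status_code, Location header value or None).
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # The user-agent value is an arbitrary placeholder.
        conn.request("HEAD", path, headers={"User-Agent": "redirect-check/0.1"})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def is_permanent_redirect_to(source_url, status, location, expected_target):
    """True if the response is a 301/308 whose Location resolves to expected_target."""
    if status not in (301, 308) or not location:
        return False
    # Location may be relative, so resolve it against the requested URL.
    return urljoin(source_url, location) == expected_target


# Example call (makes a real network request, so it is left commented out):
# old = "https://amp.example.com/product/acme-555"
# new = "https://example.com/product/acme-555"
# status, location = fetch_status_and_location(old)
# print(status, location, is_permanent_redirect_to(old, status, location, new))
```

Some servers answer HEAD differently from GET, so if the result looks odd it is worth repeating the check with a GET request.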

Interestingly enough, the new URL does not contain any references or links to its predecessor. When I inspected the new URL in GSC, the findings were quite different:

  • The URL is indexed by Google.
  • The last crawl was also on March 3, 2024.
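
To double-check that the new page really carries no reference to the old subdomain, the page source can be scanned for any attribute value that mentions the old hostname. The sketch below is only an illustration: the sample HTML is invented, and the class and function names are my own. One thing it would catch is a leftover `<link rel="amphtml">` tag, which is how a canonical page normally points to its AMP version. If the new page ever carried such a tag, that might explain why GSC lists it as the referring page.

```python
from html.parser import HTMLParser


class HostReferenceFinder(HTMLParser):
    """Collect every tag attribute whose value mentions a given hostname."""

    def __init__(self, host):
        super().__init__()
        self.host = host.lower()
        self.hits = []  # list of (tag, attribute, value) tuples

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        for name, value in attrs:
            if value and self.host in value.lower():
                self.hits.append((tag, name, value))


def find_host_references(html, host):
    """Return all (tag, attribute, value) hits for `host` in an HTML string."""
    finder = HostReferenceFinder(host)
    finder.feed(html)
    finder.close()
    return finder.hits


# Hypothetical page source, for illustration only.
SAMPLE_HTML = """
<html><head>
<link rel="canonical" href="https://example.com/product/acme-555">
<link rel="amphtml" href="https://amp.example.com/product/acme-555">
</head><body><a href="/product/acme-556">Next product</a></body></html>
"""

hits = find_host_references(SAMPLE_HTML, "amp.example.com")
```

This only looks at tag attributes in the raw HTML. References inside inline scripts, structured data, or markup injected by JavaScript would need a separate check.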

This raises an intriguing question: How is Google aware of the old URL if it’s not linked anywhere on the referring page?

Several factors could contribute to this behavior, including historical crawl patterns, the presence of the old URL in previous sitemaps, or links that Google discovered in the past and still remembers. Once Google knows about a URL, it tends to recheck it from time to time, so remnants of a site’s old structure can linger in its crawl data long after the site itself has changed.
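
The sitemap angle is easy to test for any sitemap file that is still reachable. Here is a small sketch, assuming a standard sitemaps.org XML file; the sample document and the function name are made up for illustration. It lists every `<loc>` entry whose hostname matches the old subdomain.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace used by the standard sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_on_host(sitemap_xml, host):
    """Return the <loc> values in a sitemap whose hostname equals `host`."""
    root = ET.fromstring(sitemap_xml)
    matches = []
    # ".//sm:loc" covers both <urlset> files and <sitemapindex> files.
    for loc in root.iterfind(".//sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if urlsplit(url).hostname == host:
            matches.append(url)
    return matches


# Hypothetical sitemap content, for illustration only.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/product/acme-555</loc></url>
  <url><loc>https://amp.example.com/product/acme-555</loc></url>
</urlset>"""

old_entries = urls_on_host(SAMPLE_SITEMAP, "amp.example.com")
```

An empty result for every sitemap still submitted in GSC would suggest that sitemaps are not what keeps the old URL in circulation.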

If you’ve encountered similar situations or have insights into Google’s crawling practices, I invite you to share your experiences. Understanding the nuances of how Google interacts with URLs can help us navigate our SEO strategies more effectively.


2 responses to “What causes Google to crawl old URLs that aren’t linked from referring pages?”

  1. It’s quite common for Google to continue crawling old URLs even after they have been removed or redirected, and the situation you’re experiencing with your old subdomain is worth exploring. Here are several reasons why Google might still be crawling these outdated URLs, even when they’re not being linked by referring pages:

    1. Legacy Backlinks and External Issues

    Even if the referring page you checked (the new URL) does not link to the old AMP URL, it’s possible that other external sources or legacy backlinks still point to the old address, causing Google to crawl it. You can use tools like Ahrefs or Moz to investigate whether any old links remain active on other websites. If there are any, it may be beneficial to reach out to those sites and request an update to their links.

    2. Caching and Crawl Frequency

    Google does not update its index instantaneously. Crawl frequency can depend on several factors, including how often the page was previously updated, its authority, and the volume of incoming links. If the old subdomain had significant authority or traffic before its removal, it may continue to be crawled for a while until the crawl frequency decreases. Monitoring the crawl behavior over time can provide insights into how this is changing.

    3. 301 Redirects and Caching

    Your implementation of a 301 redirect from the old URL to the new one should guide Google to the correct page; however, old URLs may still show up in crawls while Google processes this change, and Google keeps rechecking known redirects from time to time afterwards. If Google previously had the old AMP URL indexed, it will take some time for the crawlers to shift their focus to the new URL. Note that GSC’s “Removals” tool would not help here: it only hides a URL from search results temporarily (for about six months) and does not stop crawling, and your old URL is already not indexed. Keeping the 301 redirects in place is the preferred method, since they also preserve SEO equity.

    4. Sitemap Issues

    You noted that the sitemap for the amp.example.com property shows a temporary processing error and may be empty. This is an important point, as sitemaps are one of the sources Google uses to discover URLs for crawling, and URLs learned from an old sitemap can stay known to Google after the file is gone. Make sure to submit an updated sitemap containing only your active URLs, and consider removing the old sitemap entry in GSC altogether if you have already migrated everything successfully.

    5. Utilizing the Robots.txt File

    Be careful with robots.txt here. A Disallow rule for the old URLs stops Googlebot from requesting them at all, which means it can no longer see the 301 redirects or pass their signals on to the new URLs. If the goal is for Google to consolidate everything onto the new pages, leave the old URLs crawlable and let the redirects do the work. Blocking the old subdomain in robots.txt only makes sense if you no longer need the redirects to be processed and simply want to reduce crawl activity.
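
    Before changing anything, it is worth checking what the old subdomain’s robots.txt currently allows, because a crawler that is disallowed from a URL never requests it and so never sees its 301. The sketch below uses Python’s standard `urllib.robotparser`; the two robots.txt bodies and the helper name are hypothetical. Python’s parser does not resolve rule precedence exactly as Google’s does, so treat it as a rough check for simple rule sets like these.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, url, user_agent="Googlebot"):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt bodies for the old subdomain.
BLOCK_EVERYTHING = "User-agent: *\nDisallow: /\n"
ALLOW_EVERYTHING = "User-agent: *\nDisallow:\n"

OLD_URL = "https://amp.example.com/product/acme-555"

blocked = can_crawl(BLOCK_EVERYTHING, OLD_URL)  # redirect would stay invisible
allowed = can_crawl(ALLOW_EVERYTHING, OLD_URL)  # redirect can be crawled
```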

    6. Monitor Search Console Data

    Continue monitoring your Google Search Console (GSC) data for updates on crawl status and errors related to the old URLs. Google will likely crawl them less often over time as it learns that they only redirect and no longer serve content of their own. Regular monitoring can also help you spot any lingering issues that need addressing.

    Conclusion

    Ultimately, even though you’re not directly linking to the old AMP page from the new one, other factors such as legacy backlinks, crawl frequency, and sitemap issues contribute to Google’s crawling behavior. Keep the 301 redirects in place, update or remove outdated sitemaps, and keep an eye on your GSC reports. Over time, this should lead to a decrease in crawls of the old URLs. It’s always beneficial to engage with the SEO community to share experiences and strategies, as these scenarios can be complex and varied.

  2. This is a fascinating topic that delves deep into Google’s crawling algorithms and how they handle outdated URLs. Your observations highlight a crucial aspect of SEO that often gets overlooked: the persistence of historical data within Google’s indexing systems.

    One possible explanation for the continued crawling of your old URL, despite the absence of direct links, could be the impact of **link equity** and the way Google values historical context. If your old AMP URL had significant inbound links in the past, those signals might cause Google to revisit the URL occasionally, even if it has been redirected. This behavior underscores the importance of maintaining a comprehensive backlink profile, as old links can still influence crawling decisions long after a page is removed or redirected.

    Moreover, as you noted, Google’s crawler can retain information about previously indexed pages for quite some time. It’s also worth considering that certain **crawling patterns** may be programmed to revisit URLs in case they were previously deemed relevant, particularly if the old URL was a high-traffic page.

    If you’re looking to optimize your current strategy, ensuring that old URLs are properly redirected, cleaning up any lingering entries in your sitemap, and consistently updating your content can help guide Google’s attention to your more current pages. Have you thought about implementing a robust 301 redirect strategy coupled with a site audit to streamline how Google interacts with your URLs going forward?

    Thanks for shedding light on such an important aspect of SEO! I look forward to hearing more from others who have encountered similar situations.
