Why does Google crawl outdated URLs with no referring page links?

Why Google Continues to Crawl Outdated URLs

I’ve noticed that Google is still crawling some outdated URLs from a subdomain I phased out in 2021. For instance, the following URL:

https://amp.example.com/product/acme-555

Now redirects to:

https://example.com/product/acme-555
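
For reference, the redirect behavior itself can be confirmed with a short script along these lines (a minimal sketch assuming the third-party requests library; the URLs are just the examples above):

    # Sketch: confirm the old AMP URL answers with a permanent redirect
    # pointing at the new URL. Assumes the "requests" library is installed.
    import requests

    old_url = "https://amp.example.com/product/acme-555"
    expected = "https://example.com/product/acme-555"

    # Don't follow the redirect automatically, so the first hop can be inspected.
    resp = requests.get(old_url, allow_redirects=False, timeout=10)

    print("Status:", resp.status_code)                  # ideally 301
    print("Location:", resp.headers.get("Location"))    # ideally the new URL
    print("Matches expected:", resp.headers.get("Location") == expected)

A single permanent hop that lands directly on the new URL is the clearest signal Googlebot can get here.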

Upon inspecting this old AMP URL using Google Search Console (GSC), here’s what I observed:

  • URL not indexed on Google
  • Last Crawl Date: March 3, 2024
  • Referring Page: https://example.com/product/acme-555 (the updated URL)
  • Sitemaps: Temporary processing error (note: an empty sitemap exists for https://amp.example.com, which might be part of the issue)
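
To see what that sitemap actually contains (or whether it parses at all), something like the following can be used. It is only a sketch, and the sitemap path below is an assumption; use whichever path was actually submitted in GSC:

    # Sketch: list whatever URLs the old subdomain's sitemap still declares.
    # The sitemap path is a guess; substitute the one submitted in Search Console.
    import xml.etree.ElementTree as ET
    import requests

    sitemap_url = "https://amp.example.com/sitemap.xml"  # hypothetical path
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    resp = requests.get(sitemap_url, timeout=10)
    resp.raise_for_status()

    # Raises ParseError if the response isn't valid XML, which is itself useful information.
    root = ET.fromstring(resp.content)
    locs = [loc.text for loc in root.findall(".//sm:loc", ns)]

    print(f"{len(locs)} URL(s) listed in {sitemap_url}")
    for url in locs:
        print(" ", url)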

Interestingly, the newer URL doesn’t reference or link back to the older AMP URL. When I inspect the updated URL through GSC, the results are:

  • URL is indexed on Google
  • Last Crawl Date: March 3, 2024

I would appreciate hearing from others who have encountered similar issues.


2 responses to “Why does Google crawl outdated URLs with no referring page links?”

  1. It’s not uncommon to encounter scenarios where Google continues to crawl old URLs, even when those URLs are not explicitly linked from the current site. Let’s break down the potential reasons why this is happening and discuss some possible steps you can take to address the issue.

    Why Google Might Be Crawling Old URLs

    1. Historical Data: Google remembers URLs it has crawled in the past. Even if these pages no longer exist or are not linked, they might still be scheduled for crawling as part of Google’s historical crawling efforts.

    2. Internal or External Links: It’s possible there are still some legacy internal or external links pointing to the old AMP URL that you are unaware of. For example, some less obvious pages (e.g., paginated results, archives) or third-party sites could be linking to the old URL.

    3. Browser Cache and Bookmarks: Users might have the old URLs bookmarked or cached in their browsers. When they visit these old URLs, Google might see traffic to them, which could prompt a crawl.

    4. Temporary Processing Errors: The sitemap error you’ve reported could be contributing to the issue. Although the sitemap is reportedly empty, Google might be encountering difficulties and choosing to revisit URLs until the temporary error is resolved.

    5. Redirection Chains: Sometimes Google re-crawls redirected URLs simply to confirm the redirects are still in place and behaving as expected. A 301 is a permanent signal, but that periodic verification still happens.

    6. Canonical Tags: If there is any discrepancy in canonical tags, particularly on pages that still link to the old URL, Google may revisit the old URL to work out which version of the content is canonical (a quick way to check this against a live page is sketched just below).
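
    A quick way to sanity-check points 2 and 6 against a live page is to fetch it and look at what it declares in its markup. The following is only a sketch using the Python standard library, with the example URLs from the question standing in for the real ones:

        # Sketch: report the page's rel="canonical" target and flag any attribute
        # value that still references the old AMP host.
        from html.parser import HTMLParser
        import urllib.request

        PAGE = "https://example.com/product/acme-555"
        OLD_HOST = "amp.example.com"

        class LinkAudit(HTMLParser):
            def __init__(self):
                super().__init__()
                self.canonical = None
                self.old_host_refs = []

            def handle_starttag(self, tag, attrs):
                attrs = dict(attrs)
                if tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
                    self.canonical = attrs.get("href")
                for name, value in attrs.items():
                    if value and OLD_HOST in value:
                        self.old_host_refs.append((tag, name, value))

        with urllib.request.urlopen(PAGE, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")

        audit = LinkAudit()
        audit.feed(html)

        print("rel=canonical:", audit.canonical)
        if audit.old_host_refs:
            for tag, attr, value in audit.old_host_refs:
                print(f"reference to {OLD_HOST}: <{tag}> {attr}={value}")
        else:
            print(f"no references to {OLD_HOST} found in the HTML")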

    Steps to Address the Issue

    To address the continued crawling of these old URLs, consider the following approaches:

    1. Check Legacy Links: Use link analysis tools (like Ahrefs, SEMrush, or Google Search Console’s own link report) to see if any pages still link to the old URL. This includes both internal and external sources.

    2. Update and Resubmit Sitemaps: Ensure that your sitemap accurately reflects your current URL structure and resubmit it through Google Search Console. This could help resolve the temporary processing error you're encountering (a minimal example of generating such a sitemap is sketched after this list).

    3. Utilize the URL Removal Tool: If you wish to expedite the removal of these old URLs from Google’s index, consider using Google’s URL removal tool to request their removal.

    4. Monitor Logs and Analytics: Monitor server logs and analytics to see how often Googlebot actually requests the old AMP URLs and which response codes it receives; that tells you whether the redirects are being served consistently and whether crawl frequency is tapering off. A small script for this is sketched below.
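
    Something along these lines works for a combined-format access log. The log path and format are assumptions, so adjust them for your server, and for certainty verify Googlebot hits via reverse DNS rather than trusting the user-agent string alone:

        # Sketch: count Googlebot requests per path/status in an access log for
        # the phased-out subdomain. Path and log format are assumptions.
        import re
        from collections import Counter

        log_path = "/var/log/nginx/amp.example.com.access.log"  # hypothetical path
        req_re = re.compile(r'"(?:GET|HEAD|POST) (\S+) HTTP/[\d.]+" (\d{3})')

        hits = Counter()
        with open(log_path, encoding="utf-8", errors="replace") as fh:
            for line in fh:
                if "Googlebot" not in line:
                    continue
                m = req_re.search(line)
                if m:
                    path, status = m.groups()
                    hits[(path, status)] += 1

        for (path, status), count in hits.most_common(20):
            print(f"{count:5d}  {status}  {path}")

    If the old AMP paths keep showing up but always receive the redirect, that is generally harmless, and the crawl frequency should taper off on its own.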
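
    And on step 2, a minimal sitemap that lists only the current URLs can be generated with the standard library. This is a sketch, and the URL list is a placeholder for your real catalogue:

        # Sketch: write a minimal sitemap.xml containing only current URLs.
        import xml.etree.ElementTree as ET

        current_urls = [
            "https://example.com/product/acme-555",
            # ...add the rest of the live URLs here
        ]

        NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
        urlset = ET.Element("urlset", xmlns=NS)
        for url in current_urls:
            loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
            loc.text = url

        ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
        print("Wrote sitemap.xml with", len(current_urls), "URL(s)")

    Only the <loc> element is required by the sitemap protocol, so the sketch leaves the optional <lastmod> and <changefreq> fields out.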

  2. This is a fascinating observation! Google's crawling behavior can indeed be perplexing, especially for outdated URLs that no longer have any direct links pointing to them. One potential reason for the continued crawling could be the way Google's algorithms interpret and prioritize URL authority and relevance. If the AMP URL had previously accumulated some authority or traffic, Google might still find it worthwhile to check in on it even after you've phased it out.

    Moreover, the empty sitemap for your old subdomain seems to play a role as well. It's possible that without clear signals indicating that those URLs are outdated or should be disregarded, Google assumes they're still valid pages worth checking on. Implementing a more comprehensive 301 redirect strategy across both the subdomain and any related domains can help mitigate this; by ensuring that all traffic consolidates to the updated URL, you signal more clearly to Google that the old entries are obsolete.

    It might also be worthwhile to consider submitting a removal request for the old URLs via Google Search Console if you want to expedite the decommissioning process. This can help reduce crawler visits over time, allowing Google to focus on your current content. Has anyone tried using the URL removal tool, and if so, did it yield results more quickly? I’d love to hear how others have navigated similar experiences!
